Incident Management and Scrum: Handling Production Issues in an Agile Team

By XNM Technologies · December 9, 2022 · 4 min read

Every Scrum team that supports a production system eventually faces the same tension. A Sprint is a fixed block of focused, planned work, often two weeks long. Incidents are unpredictable, urgent, and indifferent to Sprint boundaries. Without a deliberate approach to managing this tension, teams either chronically underdeliver on Sprint commitments or leave production issues festering because everyone is "in the Sprint."

The Core Tension

Scrum is designed for planned work. The Sprint backlog is the team's plan for the Sprint: a forecast of what they expect to deliver by the end of it, based on their capacity and past velocity. Incidents are unplanned. They arrive without warning and, if they are serious enough, they demand immediate attention regardless of what is on the board.

Teams that have not resolved this tension tend to fall into one of two failure modes. The first is that incidents pull people out of the Sprint constantly, making the team's velocity unpredictable and Sprint commitments meaningless. The second is that the team protects the Sprint fiercely and production issues pile up, degrading the user experience and accumulating technical debt.

A Dedicated Incident Response Capacity

The most reliable solution is to reserve a portion of each Sprint's capacity explicitly for unplanned work — including incidents. This is not the same as undercommitting to the Sprint backlog and hoping the difference absorbs surprises. It is a deliberate allocation: if the team's total capacity is, say, forty story points, they commit to thirty in the Sprint backlog and hold ten in reserve for incidents, bugs, and other unplanned demand.

The reserved capacity should be sized from actual historical data: how much capacity per Sprint, measured in the same unit the team plans in (story points or hours), has gone to unplanned work over the past six to twelve months? Use that number, not a guess.
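As a rough illustration of sizing the reserve from history, the sketch below splits a Sprint's total capacity into a committed part and a reserve. The function name `plan_sprint`, the sample history, and the choice of the median as the summary statistic are all assumptions made for this example; a team might reasonably prefer the mean or a higher percentile.

```python
from statistics import median


def plan_sprint(total_capacity, unplanned_history):
    """Split total capacity into a committed part and an incident reserve.

    unplanned_history holds the capacity spent on unplanned work in past
    Sprints, in the same unit as total_capacity (story points here).
    """
    if not unplanned_history:
        raise ValueError("need at least one past Sprint of data")
    # Assumption: the median is used so that a single disastrous Sprint
    # does not inflate the reserve. The reserve is capped at total capacity.
    reserve = min(round(median(unplanned_history)), total_capacity)
    return {"committed": total_capacity - reserve, "reserve": reserve}


# Hypothetical data: story points spent on unplanned work in past Sprints.
history = [8, 12, 9, 10, 11, 10, 30]
plan = plan_sprint(40, history)
print(plan)
```

With this invented history the median is 10, so a forty-point Sprint is planned as thirty committed points and ten in reserve, matching the example above.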

An On-Call Rotation

For teams that support production systems with meaningful uptime expectations, a formal on-call rotation is essential. Without one, incidents are handled by whoever is available (or whoever feels obligated), creating an uneven and unsustainable burden. The on-call person is the first responder for production issues during their rotation; other team members stay in the Sprint.

The rotation should be explicit, published, and fair. The on-call person should not be expected to carry a full Sprint workload during their rotation — they should be protected from Sprint commitments, or their Sprint capacity should be reduced to reflect the on-call overhead.
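One way to make the rotation explicit is to generate it rather than negotiate it each Sprint. The sketch below is a minimal round-robin schedule plus a capacity calculation that discounts the on-call person. The names, the per-person capacity of ten, and the 50% on-call discount are hypothetical values chosen for illustration, not recommendations.

```python
from itertools import cycle, islice


def build_rotation(members, sprint_count):
    """Assign one on-call person per Sprint in round-robin order."""
    if not members:
        raise ValueError("rotation needs at least one member")
    return list(islice(cycle(members), sprint_count))


def sprint_capacity(members, on_call, per_person, on_call_factor=0.5):
    """Team capacity for one Sprint, with the on-call person's share reduced.

    on_call_factor is the fraction of normal capacity the on-call person
    keeps for Sprint work (0.0 means fully protected from Sprint work).
    """
    total = 0.0
    for member in members:
        factor = on_call_factor if member == on_call else 1.0
        total += per_person * factor
    return total


team = ["Asha", "Ben", "Chen", "Dara"]  # hypothetical team
rotation = build_rotation(team, 6)      # one entry per upcoming Sprint
capacity = sprint_capacity(team, rotation[0], per_person=10)
print(rotation)
print(capacity)
```

Publishing the generated list covers the "explicit and published" part; fairness follows from the round-robin order, and the reduced capacity figure is what the team would plan against in the Sprint where that person is on call.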

Blameless Post-Mortems

After a significant incident, a blameless post-mortem is one of the most valuable practices an Agile team can adopt. The goal is not to find someone to hold responsible — it is to understand what happened, why it happened, and what systemic changes would prevent it from happening again.

The "blameless" framing matters. When people fear blame, they withhold information, minimise their role, and avoid the candid conversation that produces genuine learning. When the culture is explicitly blameless, people describe what they actually did and why — which is the information you need to fix the system.

Tracking Tech Debt From Incidents in the Product Backlog

Every significant incident leaves behind residue — monitoring gaps, missing error handling, brittle integrations, or configuration management debt. This residue should be captured as items in the product backlog, prioritised by the Product Owner alongside feature work, and addressed systematically.

Teams that do not do this tend to see the same categories of incidents recur. The post-mortem produces action items; the action items go on a list; the list is forgotten under the pressure of Sprint work. Putting the tech debt in the backlog, with a business rationale attached, gives it a fighting chance of being addressed.

The Relationship With the Definition of Done

A well-crafted Definition of Done (DoD) is one of the best incident prevention tools available to a Scrum team. If the DoD requires that every increment is tested in a production-like environment, that observability tooling is in place before a feature ships, and that error handling is implemented for all external dependencies, fewer incidents will occur.

Review your DoD after significant incidents. If an incident could have been prevented by something that should have been in the DoD but was not, add it.

Preventing Incidents From Becoming the Default Mode

The most dangerous outcome for a Scrum team that supports production is "firefighter mode" — a state in which so many incidents are occurring that incident response becomes the team's primary activity, planned work is essentially impossible, and the backlog grows indefinitely.

Getting out of firefighter mode requires management support, because the solution involves taking time away from feature delivery to pay down the technical debt and stability issues that are generating the incidents. It is not something a team can solve through better Sprint planning alone. The Product Owner and stakeholders need to understand and approve the investment.

XNM works with Agile teams to design processes that can handle the reality of production operations — not just the idealised version. Our program and project delivery services include Scrum coaching and process design for teams that need to balance planned delivery with operational support.
