Site Reliability Engineering and Scrum: Building for Production

By XNM Technologies · March 27, 2023 · 4 min read

In 2003, Google faced a problem that few organisations had encountered before: how do you operate software systems at a scale where traditional operations approaches — manual runbooks, reactive incident response, siloed development and ops teams — simply cannot keep up? Their answer was Site Reliability Engineering: hire software engineers into operations roles and give them the mandate to solve operational problems with software. Two decades later, SRE has become a mainstream discipline, and its core concepts — SLIs, SLOs, error budgets, toil, blameless postmortems — are relevant to far smaller teams than Google's. For Scrum teams in particular, SRE provides a set of tools for answering a question that Scrum itself leaves unanswered: how do you balance the pressure to deliver new features against the obligation to keep the system running well?

SLIs and SLOs: reliability as a product commitment

A Service Level Indicator (SLI) is a quantitative measure of a service's behaviour from the user's perspective. Common SLIs are availability (what percentage of requests succeed), latency (what proportion of requests complete within a threshold), and error rate (what fraction of requests result in an error). A Service Level Objective (SLO) is a target for an SLI: 99.9% of requests complete within 200 milliseconds; availability above 99.5% measured over a rolling 28-day window. SLOs are not aspirational — they are operational commitments that the team is responsible for maintaining. In a Scrum context, the Product Owner should own the SLOs. They represent the product's reliability promise to its users, and decisions about when to accept reliability risk in exchange for shipping features faster are fundamentally product decisions, not engineering decisions.
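
The definitions above can be made concrete with a small sketch. It assumes a hypothetical `Request` record with a success flag and a latency, and computes two SLIs over one measurement window before checking each against an SLO target. The names, the sample data, and the pairing of targets with SLIs are illustrative assumptions, not part of any particular monitoring tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    """One served request. Field names are illustrative assumptions."""
    succeeded: bool
    latency_ms: float


def availability_sli(requests: list[Request]) -> float:
    """Fraction of requests in the window that succeeded."""
    if not requests:
        raise ValueError("cannot compute an SLI over an empty window")
    return sum(r.succeeded for r in requests) / len(requests)


def latency_sli(requests: list[Request], threshold_ms: float) -> float:
    """Fraction of requests that completed within the latency threshold."""
    if not requests:
        raise ValueError("cannot compute an SLI over an empty window")
    return sum(r.latency_ms <= threshold_ms for r in requests) / len(requests)


def meets_slo(sli: float, target: float) -> bool:
    """An SLO is met when the measured SLI is at or above its target."""
    return sli >= target


# Hypothetical window: 1,000 requests, of which 2 failed and 5 were slow.
window = (
    [Request(True, 120.0)] * 993
    + [Request(True, 350.0)] * 5
    + [Request(False, 90.0)] * 2
)

availability = availability_sli(window)                # 998 of 1,000 succeeded
fast_enough = latency_sli(window, threshold_ms=200.0)  # 995 of 1,000 within 200 ms

print(meets_slo(availability, target=0.995))  # availability SLO of 99.5%
print(meets_slo(fast_enough, target=0.999))   # latency SLO of 99.9% within 200 ms
```

In this made-up window the availability SLO is met while the latency SLO is missed, which is the kind of signal a Product Owner who owns the SLOs would act on.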

Error budgets: making the trade-off explicit

The error budget is the arithmetic complement of the SLO. If your availability SLO is 99.9%, your error budget is 0.1% of the measurement window: about 43 minutes in a 30-day month during which the service can be unavailable without breaching the SLO. The error budget is spent by incidents, by releases that cause errors, and by maintenance windows. When the error budget is healthy (mostly unspent), the team has headroom to ship aggressively, take on riskier changes, and experiment. When the error budget is depleted, the right response is to freeze feature releases and invest in reliability until the budget recovers. This is what makes error budgets a powerful Scrum tool: they transform the feature-vs-reliability tension from a political argument between the PO and engineering into a data-driven policy. The policy is agreed in advance; the error budget tells you which mode you are in.
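
The arithmetic behind the error budget, and the policy switch it drives, can be sketched in a few lines. The function names and the `freeze_below` threshold are assumptions chosen for illustration; a real policy would use whatever threshold the team agreed in advance.

```python
def error_budget_minutes(slo_target: float, window_days: int) -> float:
    """Minutes of unavailability the window allows without breaching the SLO."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    return (1.0 - slo_target) * window_days * 24 * 60


def budget_remaining(slo_target: float, window_days: int,
                     downtime_minutes: float) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    budget = error_budget_minutes(slo_target, window_days)
    return (budget - downtime_minutes) / budget


def release_mode(remaining: float, freeze_below: float = 0.0) -> str:
    """Apply the pre-agreed policy: freeze features once the budget is used up.

    freeze_below is a hypothetical policy knob; 0.0 means "freeze only when
    the budget is fully spent".
    """
    return "freeze" if remaining <= freeze_below else "ship"


# A 99.9% availability SLO over a 30-day window allows about 43.2 minutes.
print(round(error_budget_minutes(0.999, 30), 1))

# 30 minutes of downtime leaves roughly 31% of the budget: keep shipping.
print(release_mode(budget_remaining(0.999, 30, downtime_minutes=30.0)))

# 50 minutes overspends the budget: freeze features and invest in reliability.
print(release_mode(budget_remaining(0.999, 30, downtime_minutes=50.0)))
```

A team that prefers to slow down before the budget is fully spent could set `freeze_below` to a positive fraction such as 0.1; that choice is a policy decision rather than something the arithmetic dictates.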

The toil budget and postmortems feeding the backlog

Toil is SRE's term for manual, repetitive, automatable operational work that scales linearly with service growth: manual deployments, routine log triage, manual scaling operations. Google's guideline is that no more than fifty percent of an SRE's time should be spent on toil; the rest should go to engineering work that reduces future toil. For a Scrum team, measuring toil creates a forcing function: if the team is spending thirty percent of every Sprint on operational toil, reducing that toil through automation becomes a backlog item that can be sized, prioritised, and planned.

Blameless postmortems are a structured process for learning from incidents without assigning personal fault. A postmortem documents the timeline of an incident, the contributing factors (systemic, not individual), and the action items that would prevent recurrence or reduce impact. Those action items are the postmortem's most important output, and they belong in the product backlog. A Scrum team that runs postmortems but does not add the resulting action items to the backlog is performing the ceremony without extracting the value. Each postmortem action item is a reliability story that the team should size and the Product Owner should prioritise like any other backlog item.
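
To make the toil measurement from this section concrete, here is a minimal sketch of how a team might total one Sprint's toil and find the first candidate for automation. The categories, hours, and capacity figure are invented for illustration, and the fifty percent cap is the Google guideline quoted above, applied here to a whole team as an assumption.

```python
TOIL_CAP = 0.5  # Google's guideline quoted above: at most half of time on toil


def toil_share(toil_hours: float, capacity_hours: float) -> float:
    """Fraction of the Sprint's capacity that went to toil."""
    if capacity_hours <= 0:
        raise ValueError("capacity_hours must be positive")
    return toil_hours / capacity_hours


# Hypothetical Sprint log: hours spent per category of operational toil.
sprint_toil_hours = {
    "manual deployments": 30.0,
    "routine log triage": 42.0,
    "manual scaling": 24.0,
}
capacity_hours = 320.0  # assumed: 4 people x 80 hours in a two-week Sprint

share = toil_share(sum(sprint_toil_hours.values()), capacity_hours)
over_cap = share > TOIL_CAP

# The largest category is the most obvious automation story for the backlog.
largest = max(sprint_toil_hours, key=sprint_toil_hours.get)

print(f"toil share: {share:.0%}, over cap: {over_cap}, automate first: {largest}")
```

With these assumed numbers the team spends thirty percent of the Sprint on toil, which is under the cap but still large enough to justify a sized and prioritised automation item.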

The tension between SRE and Scrum

The integration of SRE and Scrum is not without friction. Scrum's Sprint cadence creates pressure to deliver a shippable increment every Sprint, typically every two weeks; SRE's error budget policy may require that feature shipping pauses when reliability is degraded. The Product Owner role in Scrum is product-focused; SRE asks the PO to take explicit ownership of reliability commitments, which requires technical literacy that not all POs have. SRE also introduces metrics and measurement overhead that teams in early stages of Scrum maturity can find burdensome. The resolution is to introduce SRE concepts progressively. Start with a single SLO for the most critical user-facing service. Define the error budget. Run one blameless postmortem after the next significant incident. Measure toil for one Sprint. Each of these is a low-cost experiment that builds the muscle before committing to a full SRE operating model.

If your Scrum team is struggling to balance feature delivery with the reliability demands of a production system, XNM's program and project delivery advisory can help you design a delivery model that incorporates SRE principles without overwhelming a team still building Scrum maturity.
