
Observability in Agile: Making Your Systems Understandable

By XNM Technologies · March 11, 2023 · 4 min read

Observability is the ability to understand what is happening inside a system from its external outputs: the signals it produces without anyone having to modify or interrogate its internals. The concept originated in control theory, where an observable system is one whose internal states can be inferred from external measurements. In modern software operations, it has come to describe something more specific and more practical: the degree to which the outputs of a production system tell you what the system is doing, why it is doing it, and what is going wrong when things fail.

For Scrum teams, observability is not an infrastructure concern to be handled by a platform team after the software is built. It is a quality attribute of the software itself — and it belongs in the Definition of Done.

The Three Pillars: Logs, Metrics, and Traces

  • Logs answer the question: what happened? A log is a timestamped record of a discrete event, such as a request received, a database query executed, an error thrown, or a payment processed. Structured logs, written in a parseable format like JSON rather than as free-text strings, let engineers search and filter a system's event record with high precision. Logs carry the most per-event detail of the three signal types, and that detail makes them the most expensive to store and query at scale.

  • Metrics answer the question: how is the system performing? A metric is a numeric measurement over time — request count, error rate, latency, memory usage, queue depth. Metrics are lower-resolution than logs (they aggregate events into numbers) but much cheaper to store and faster to query. They are the primary signal for alerting and dashboarding: you set thresholds on metrics and alert when those thresholds are crossed.

  • Traces answer the question: how did a request flow through the system? A trace follows a single request as it moves through all the services, databases, and dependencies involved in handling it, recording timing and metadata at each step. Distributed tracing is essential in microservices architectures, where a single user-facing request may touch dozens of services and the latency contribution of each is invisible without instrumentation.
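To make the idea of a structured log concrete, here is a minimal sketch using only the Python standard library. The service name `checkout`, the event names, and the `fields` key passed through `extra=` are all hypothetical choices for illustration, not a convention taken from any particular logging product.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": self.formatTime(record, "%Y-%m-%dT%H:%M:%S"),
            "level": record.levelname,
            "event": record.getMessage(),
        }
        # Values passed via `extra={"fields": {...}}` become a `fields`
        # attribute on the record (an assumed convention for this sketch).
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)


def build_logger(stream) -> logging.Logger:
    logger = logging.getLogger("checkout")  # hypothetical service name
    logger.handlers.clear()
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger


# Write two events to an in-memory stream instead of a real log pipeline.
stream = io.StringIO()
log = build_logger(stream)
log.info("request_received", extra={"fields": {"request_id": "r-1", "path": "/pay"}})
log.error("payment_failed", extra={"fields": {"request_id": "r-1", "reason": "card_declined"}})

# Because every line is valid JSON, filtering is a dictionary lookup
# rather than a regular expression over free text.
records = [json.loads(line) for line in stream.getvalue().splitlines()]
errors = [r for r in records if r["level"] == "ERROR"]
```

The point of the sketch is the last two lines: once events are parseable, questions like "show every error for request r-1" become simple field filters.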

Why Observability Belongs in the Definition of Done

The Definition of Done is the Scrum team's quality gate: work that meets the Definition of Done is genuinely complete; work that does not is not shippable. Including observability in the Definition of Done is a statement that software which cannot be understood in production is not done — regardless of whether it passes functional tests in a development environment.

The practical consequences of shipping without observability are familiar to most operations teams: engineers debug production incidents with no signal other than error messages and user complaints; performance degrades invisibly until customers start leaving; subtle regressions go undetected for weeks because there is no baseline to compare against. These are not edge cases. They are the normal failure mode of software teams that treat observability as optional.

Building Observability Incrementally in Sprints

  1. Structured logging first. The highest return on investment comes from ensuring that every significant system event — requests in, errors, key state transitions, external calls — is logged in a structured, parseable format. This alone dramatically improves the ability to diagnose production problems. It can be added to existing services one at a time without a large architectural investment.

  2. Metrics second. Once structured logging is in place, the team can instrument the key performance metrics for each service: request rate, error rate, latency at the 50th, 95th, and 99th percentiles, and resource utilisation. These become the basis for alerting and dashboards. The choice of metrics should be driven by what would actually indicate a problem worth alerting on, not by what is easiest to measure.

  3. Distributed tracing third. Tracing is the most complex layer to instrument and the most valuable in architectures with multiple services. Adding distributed tracing to a service means propagating a trace context through all downstream calls, recording span data at each service boundary, and exporting that data to a tracing backend. For teams on modern cloud infrastructure, managed tracing services reduce the implementation complexity significantly.
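The metrics and tracing steps above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a production implementation: the function names are hypothetical, the percentile uses the simple nearest-rank method (real metrics backends usually estimate percentiles from histograms), and the header layout follows the W3C Trace Context `traceparent` format of version, trace id, parent span id, and flags separated by hyphens.

```python
import math
import secrets


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with pct% of values at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarise(latencies_ms: list[float], error_count: int) -> dict[str, float]:
    """Reduce raw request data to the handful of numbers a dashboard would show."""
    return {
        "error_rate": error_count / len(latencies_ms),
        "p50_ms": percentile(latencies_ms, 50),
        "p95_ms": percentile(latencies_ms, 95),
        "p99_ms": percentile(latencies_ms, 99),
    }


def child_traceparent(parent: str) -> str:
    """Build the header for a downstream call: same trace id, fresh span id."""
    version, trace_id, _parent_span_id, flags = parent.split("-")
    new_span_id = secrets.token_hex(8)  # 8 random bytes -> 16 hex characters
    return f"{version}-{trace_id}-{new_span_id}-{flags}"


# Hypothetical data: 100 requests with latencies of 1..100 ms, 3 of which failed.
latencies = [float(i) for i in range(1, 101)]
summary = summarise(latencies, error_count=3)

# Example header value in the W3C Trace Context layout.
incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
outgoing = child_traceparent(incoming)
```

Keeping the trace id constant while replacing the span id is what lets a tracing backend stitch the spans from every service back into one request timeline. In practice a team would more likely rely on an instrumentation library or a managed tracing service than hand-roll this.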

SLIs, SLOs, and Error Budgets: Reliability as a Product Conversation

Once meaningful metrics are in place, the Site Reliability Engineering framework gives the Product Owner a vocabulary for discussing reliability as a product attribute. A Service Level Indicator (SLI) is a measurement of the service as users experience it: availability, latency, error rate. A Service Level Objective (SLO) sets the target for that measurement: 99.5% availability, or 95% of requests under 200 milliseconds. The gap between the SLO and 100% is the error budget, the amount of unreliability the service is allowed to consume; a 99.5% availability SLO, for example, leaves an error budget of 0.5%. A feature that testing suggests would consume 20% of the monthly error budget is a different product decision than one that would consume 2%. These trade-offs are product decisions, not only technical ones, and observability is what makes the underlying data available to make them.
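The error budget arithmetic is simple enough to work through directly. The sketch below assumes an availability SLO measured over a 30-day window; the function name and the window length are illustrative assumptions, and the 20% and 2% figures are the ones used in the paragraph above.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of unavailability an availability SLO permits over the window."""
    if not 0 < slo <= 1:
        raise ValueError("slo must be a fraction in (0, 1]")
    total_minutes = window_days * 24 * 60  # 43,200 minutes in 30 days
    return (1 - slo) * total_minutes


budget = error_budget_minutes(0.995)  # about 216 minutes per 30 days
large_feature_cost = 0.20 * budget    # about 43 minutes of the budget
small_feature_cost = 0.02 * budget    # about 4.3 minutes of the budget
```

Expressed in minutes, the trade-off is easier to discuss in a planning meeting: one feature risks roughly three quarters of an hour of the month's allowed downtime, the other only a few minutes.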

XNM Consulting works with agile teams on delivery practices that connect technical quality disciplines — including observability, testing, and DevOps — to product and business outcomes. Learn more about our program and project delivery services.