Observability: Going Beyond Monitoring for Production Systems
Production incidents have a way of exposing the limits of traditional monitoring. The alert fires, the on-call engineer is paged, and the question immediately becomes: what is actually happening inside the system right now? Traditional monitoring was designed to answer a different question — 'is the system up?' — by checking predefined metrics against predefined thresholds. When the system fails in a way that nobody predicted, those predefined checks are often silent while users are already experiencing the impact.
Observability addresses this gap. Where monitoring is about tracking the states you already know to look for, observability is about the ability to understand any state the system might get into — including states you have never seen before — purely from its external outputs.
The Three Pillars of Observability
Observability is typically described in terms of three complementary data types that, together, give engineers the ability to investigate any production issue:
Logs are the oldest form of application telemetry. Structured logging — where log entries are emitted as machine-readable records (typically JSON) rather than unstructured text strings — transforms logs from a last-resort debugging tool into a queryable, filterable data source. Structured logs let you ask questions like 'show me all requests that failed for user ID 4821 in the last two hours' without grepping through raw text. In Scrum teams, the shift to structured logging is often one of the highest-value, lowest-cost improvements a team can make to their observability posture.
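As a minimal sketch of what this looks like in practice, the example below uses only Python's standard `logging` and `json` modules. The logger name `checkout`, the `context` key, and the `user_id` and `status` fields are assumptions made for illustration, and the time-window part of the query is left out for brevity.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record):
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Fields passed as extra={"context": {...}} become record.context.
        payload.update(getattr(record, "context", {}))
        return json.dumps(payload)


def build_logger(stream):
    """Return a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger("checkout")  # hypothetical service name
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    return logger


def failed_requests_for(lines, user_id):
    """Filter JSON log lines down to failed (5xx) requests for one user."""
    records = [json.loads(line) for line in lines if line.strip()]
    return [
        r for r in records
        if r.get("user_id") == user_id and r.get("status", 0) >= 500
    ]


buffer = io.StringIO()  # stands in for a log file or log pipeline
log = build_logger(buffer)
log.info("request completed", extra={"context": {"user_id": 4821, "status": 200}})
log.error("request failed", extra={"context": {"user_id": 4821, "status": 502}})
log.error("request failed", extra={"context": {"user_id": 77, "status": 500}})

matches = failed_requests_for(buffer.getvalue().splitlines(), 4821)
print(matches)
```

Because every entry is a record with named fields, the filter is a field comparison rather than a regular expression over free text. A log aggregation platform would typically run the same kind of query at scale.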
Metrics are aggregated, time-series representations of system state. Unlike logs, which record individual events, metrics summarise what is happening across many events over time — request rates, error rates, latency percentiles, queue depths, and resource utilisation. Metrics are efficient to store and fast to query, making them ideal for dashboards and alerting. The discipline is in choosing the right metrics: the four golden signals (latency, traffic, errors, and saturation) provide a useful starting framework for any service.
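To make the aggregation concrete, here is a rough sketch that derives the four golden signals from a window of raw request records. The `Request` type, the nearest-rank percentile method, and the definition of saturation as in-flight requests divided by capacity are all assumptions for illustration. A real metrics library would normally use histograms rather than sorting raw samples.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    latency_ms: float
    status: int


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list (pct in 1..100)."""
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def golden_signals(requests, window_seconds, in_flight, capacity):
    """Summarise one time window of requests as the four golden signals."""
    total = len(requests)
    errors = sum(1 for r in requests if r.status >= 500)
    latencies = [r.latency_ms for r in requests]
    return {
        "traffic_rps": total / window_seconds,
        "error_rate": errors / total if total else 0.0,
        "latency_p50_ms": percentile(latencies, 50) if latencies else None,
        "latency_p99_ms": percentile(latencies, 99) if latencies else None,
        # Assumed definition: share of a fixed concurrency limit in use.
        "saturation": in_flight / capacity,
    }


# 100 requests in a 10-second window, latencies 1..100 ms, two server errors.
window = [Request(latency_ms=float(i), status=200) for i in range(1, 101)]
window[10].status = 500
window[90].status = 503

signals = golden_signals(window, window_seconds=10, in_flight=40, capacity=50)
print(signals)
```

The individual requests are discarded once the summary is computed. That is what makes metrics cheap to store, and it is also why they cannot answer questions about one specific request.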
Traces are the observability primitive that is most specific to distributed systems. A trace records the path of a single request as it propagates through multiple services, capturing the time spent at each hop. When a request is slow or fails in a microservices architecture, traces answer the question that logs and metrics cannot: which specific service, database call, or external dependency was the bottleneck? Distributed tracing requires trace IDs to be propagated consistently across service boundaries — a discipline that needs to be built in from the start, not retrofitted.
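The propagation discipline can be illustrated with a deliberately simplified sketch. It borrows the layout of the W3C `traceparent` header (version, trace ID, span ID, flags) but is not the OpenTelemetry API. The `span` helper, the in-memory `FINISHED_SPANS` list, and the three service functions are hypothetical stand-ins, and plain function calls stand in for network hops.

```python
import secrets
import time
from contextlib import contextmanager

FINISHED_SPANS = []  # in-memory stand-in for a tracing backend


def new_traceparent():
    """Start a new trace: fresh 16-byte trace ID and 8-byte span ID."""
    return f"00-{secrets.token_hex(16)}-{secrets.token_hex(8)}-01"


def child_traceparent(parent):
    """Keep the caller's trace ID, but mint a new span ID for this hop."""
    version, trace_id, _parent_span_id, flags = parent.split("-")
    return f"{version}-{trace_id}-{secrets.token_hex(8)}-{flags}"


@contextmanager
def span(name, incoming_headers):
    """Time one unit of work and yield the headers to send downstream."""
    incoming = incoming_headers.get("traceparent")
    current = child_traceparent(incoming) if incoming else new_traceparent()
    start = time.perf_counter()
    try:
        yield {"traceparent": current}
    finally:
        trace_id, span_id = current.split("-")[1:3]
        FINISHED_SPANS.append({
            "name": name,
            "trace_id": trace_id,
            "span_id": span_id,
            "parent_id": incoming.split("-")[2] if incoming else None,
            "duration_ms": (time.perf_counter() - start) * 1000,
        })


def query_database(headers):
    with span("db.query", headers):
        time.sleep(0.01)  # simulated slow dependency


def orders_service(headers):
    with span("orders.handle", headers) as outgoing:
        query_database(outgoing)


def gateway(headers):
    with span("gateway.handle", headers) as outgoing:
        orders_service(outgoing)


gateway({})  # an external request arrives with no trace context
for s in FINISHED_SPANS:
    print(s["name"], s["trace_id"][:8], round(s["duration_ms"], 1))
```

Each hop forwards the headers it was given, so all three spans share one trace ID and can be stitched into a single timeline. If any service in the chain drops the header, the trace splits at that point. This is why propagation has to be consistent across every boundary.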
How Scrum Teams Build Observable Systems
Observability is not a feature that gets added to a system after it is built. It needs to be woven into the development process from the first Sprint. Scrum teams that build highly observable systems tend to share several practices:
Observability in the Definition of Done. When the Definition of Done (DoD) requires that every new feature emits structured logs, key metrics, and propagated trace IDs before it can be accepted, observability becomes the default rather than an afterthought. Teams that add observability to the DoD report that instrumentation time averages three to five per cent of feature development time, far less than the cost of a single major production incident investigated without adequate telemetry.
Feature flags for gradual rollout visibility. Feature flags allow teams to release to a controlled subset of users before full rollout. Combined with the observability stack, feature flags make it possible to compare error rates, latency, and user behaviour between the flag-on and flag-off populations in near real-time. This turns production release from a binary event into a measured, observable process.
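One way to picture the flag-on versus flag-off comparison is the sketch below. It assumes each request event already carries a `flag_on` field recording which variant served it, alongside `status` and `latency_ms`. Those field names and the sample data are invented for illustration.

```python
def compare_cohorts(events):
    """Aggregate error rate and mean latency per feature-flag cohort."""
    totals = {
        "flag_on": {"requests": 0, "errors": 0, "latency_ms": 0.0},
        "flag_off": {"requests": 0, "errors": 0, "latency_ms": 0.0},
    }
    for event in events:
        bucket = totals["flag_on" if event["flag_on"] else "flag_off"]
        bucket["requests"] += 1
        bucket["errors"] += 1 if event["status"] >= 500 else 0
        bucket["latency_ms"] += event["latency_ms"]

    summary = {}
    for cohort, t in totals.items():
        n = t["requests"]
        summary[cohort] = {
            "requests": n,
            "error_rate": t["errors"] / n if n else None,
            "mean_latency_ms": t["latency_ms"] / n if n else None,
        }
    return summary


# Invented sample: 20 requests with the flag on, 80 with it off.
events = (
    [{"flag_on": True, "status": 200, "latency_ms": 120.0}] * 18
    + [{"flag_on": True, "status": 500, "latency_ms": 300.0}] * 2
    + [{"flag_on": False, "status": 200, "latency_ms": 100.0}] * 79
    + [{"flag_on": False, "status": 500, "latency_ms": 100.0}] * 1
)

summary = compare_cohorts(events)
print(summary)
```

The comparison only works if the flag state is recorded on the telemetry itself. With cohorts this small, a gap in error rate is a prompt to investigate rather than proof of a regression.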
Runbooks as living documents updated each Sprint. Runbooks that describe how to investigate and resolve known failure modes are valuable, but only if they are current. Scrum teams that treat runbooks as Sprint artefacts — updated during the Sprint Review when production incidents reveal gaps — consistently have shorter mean time to recovery on repeated failure patterns.
Alerting on symptoms, not causes. A common anti-pattern is alerting on the internal state of the system (CPU usage over 80%, heap usage over 70%) rather than on the symptoms that users experience (error rate above 0.5%, p99 latency above 2 seconds). Internal-state thresholds can trip while users are unaffected and stay quiet while users are seeing failures. Symptom-based alerting reduces alert fatigue and makes it far more likely that an alert that fires is worth waking someone up for.
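A small sketch of a symptom-based paging rule, using the example thresholds mentioned above (error rate above 0.5%, p99 latency above 2 seconds). The `ServiceSnapshot` type and `pages_to_send` function are hypothetical. In practice a rule like this would normally live in the alerting platform's own rule language rather than in application code.

```python
from dataclasses import dataclass

ERROR_RATE_LIMIT = 0.005    # 0.5% of requests failing
P99_LATENCY_LIMIT_S = 2.0   # p99 latency in seconds


@dataclass
class ServiceSnapshot:
    error_rate: float        # symptom: users see failures
    p99_latency_s: float     # symptom: users see slowness
    cpu_utilisation: float   # internal state: kept for dashboards
    heap_utilisation: float  # internal state: kept for dashboards


def pages_to_send(snapshot):
    """Return the pages to raise; only user-facing symptoms can page."""
    pages = []
    if snapshot.error_rate > ERROR_RATE_LIMIT:
        pages.append(f"error rate {snapshot.error_rate:.1%} above 0.5%")
    if snapshot.p99_latency_s > P99_LATENCY_LIMIT_S:
        pages.append(f"p99 latency {snapshot.p99_latency_s:.1f}s above 2s")
    # CPU and heap are deliberately not checked here: they help explain
    # a page during investigation but do not trigger one by themselves.
    return pages


busy_but_healthy = ServiceSnapshot(0.001, 0.8, cpu_utilisation=0.93, heap_utilisation=0.85)
quiet_but_failing = ServiceSnapshot(0.02, 3.4, cpu_utilisation=0.40, heap_utilisation=0.30)

print(pages_to_send(busy_but_healthy))
print(pages_to_send(quiet_but_failing))
```

The first snapshot would have tripped both of the cause-based thresholds while users were unaffected. The second would have passed them while users were seeing errors and slow responses.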
Observability as a Team Practice
The tools for observability — logging frameworks, metrics collectors, distributed tracing libraries, and the platforms that aggregate them — have never been more accessible. OpenTelemetry has emerged as the open standard for instrumentation, and managed observability platforms have made it practical for teams of any size to build production-grade observability without managing complex infrastructure.
What remains difficult is the cultural and process dimension: building a team that treats production visibility as a shared responsibility, that invests consistently in telemetry as part of feature delivery, and that improves its observability posture incrementally through the feedback loops that Scrum provides. The teams that do this well are the ones that spend the least time in major incident response — because they find and fix problems before they become incidents, or recover from them in minutes rather than hours when they do.