Observability Changes How You Debug

There's a conversation happening in engineering teams about observability versus monitoring, and it actually matters. The distinction shapes how you instrument systems, what questions you can answer in production, and how fast you can debug when things break.

Traditional monitoring watches known metrics and alerts when they cross thresholds: CPU above 80%, error rate above 1%, or p99 latency above 500 ms. This approach is effective for known failure modes. You define the signals you care about in advance, set thresholds, and get paged when something crosses them. The limitation is that it only answers questions you thought to ask before the incident.
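To make the limitation concrete, here is a minimal sketch of threshold-based alerting using the three example thresholds above. The metric names and the `breached` helper are made up for illustration and are not taken from any particular monitoring tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    metric: str   # hypothetical metric name
    limit: float  # alert when the sampled value is strictly above this


# The three example thresholds from the text.
THRESHOLDS = [
    Threshold("cpu_percent", 80.0),
    Threshold("error_rate_percent", 1.0),
    Threshold("latency_p99_ms", 500.0),
]


def breached(sample: dict[str, float]) -> list[str]:
    """Return the names of metrics whose sampled value exceeds its threshold."""
    return [t.metric for t in THRESHOLDS if sample.get(t.metric, 0.0) > t.limit]


sample = {"cpu_percent": 91.0, "error_rate_percent": 0.4, "latency_p99_ms": 620.0}
print(breached(sample))  # ['cpu_percent', 'latency_p99_ms']
```

Anything that is not in `THRESHOLDS` is invisible to this check, which is exactly the "questions you thought to ask" problem.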

Observability, as defined by Charity Majors and others, means the system emits enough high-cardinality, high-dimensionality telemetry that you can ask arbitrary questions about its behavior in production and get answers. Instead of alerting on predefined thresholds, you explore the data to understand what happened. The key capability is debugging problems you've never seen before without deploying new instrumentation.

In practice, the shift to observability requires a significant investment in instrumentation. For example, at my previous company, we instrumented every API request with a unique identifier, allowing us to track the request across multiple services. This added about 10-15% overhead to our API latency, but the benefits far outweighed the costs. We were able to debug issues in minutes that previously took hours.
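As a rough sketch of what that kind of request identifier propagation can look like (not our actual implementation), the snippet below reuses an incoming ID or mints a new one, then attaches it to outgoing calls. The `X-Request-ID` header name and both function names are assumptions chosen for illustration.

```python
import uuid
from contextvars import ContextVar

# Assumed header name; a plain dict lookup like this one is case-sensitive.
REQUEST_ID_HEADER = "X-Request-ID"

# Holds the ID for whichever request the current task or thread is handling.
_request_id: ContextVar[str] = ContextVar("request_id", default="")


def begin_request(incoming_headers: dict[str, str]) -> str:
    """Reuse the caller's request ID if present, otherwise generate one."""
    rid = incoming_headers.get(REQUEST_ID_HEADER) or uuid.uuid4().hex
    _request_id.set(rid)
    return rid


def outgoing_headers() -> dict[str, str]:
    """Headers to attach to downstream calls so the ID follows the request."""
    return {REQUEST_ID_HEADER: _request_id.get()}


# Edge service: no ID on the way in, so one is generated.
rid = begin_request({})
# Downstream service: receives the same ID and keeps using it.
assert begin_request(outgoing_headers()) == rid
```

Every log line or event emitted while handling the request can then include the same ID, which is what makes cross-service lookups possible.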

The practical difference between observability and traditional monitoring comes down to structured events versus pre-aggregated metrics and unstructured logs. A structured event is a single record, typically a JSON document, emitted at the boundary of a request. It carries the service name, endpoint, response time, user ID, feature flags, database query count, external call latencies, and any other fields relevant to that request. With structured events stored in a columnar store, you can query along any dimension: slow requests for a specific user, error rates broken down by feature flag variant, or latency for requests that touched a specific external service.
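Here is a minimal sketch of emitting one structured event per request, assuming the handler itself adds whatever fields it knows about. The `handle_with_event` wrapper, the `checkout` handler, and the field names are hypothetical. In a real system the `emit` callable would ship the event to an event store rather than print it.

```python
import json
import time
from typing import Any, Callable


def handle_with_event(
    service: str,
    endpoint: str,
    user_id: str,
    handler: Callable[[dict[str, Any]], None],
    emit: Callable[[str], Any] = print,
) -> dict[str, Any]:
    """Run a handler and emit exactly one JSON event describing the request."""
    event: dict[str, Any] = {"service": service, "endpoint": endpoint, "user_id": user_id}
    start = time.perf_counter()
    try:
        handler(event)  # the handler enriches the event with its own fields
        event["status"] = "ok"
    except Exception as exc:
        event["status"] = "error"
        event["error"] = type(exc).__name__
        raise
    finally:
        # Emitted on both success and failure, so failed requests are queryable too.
        event["duration_ms"] = round((time.perf_counter() - start) * 1000, 3)
        emit(json.dumps(event))
    return event


def checkout(event: dict[str, Any]) -> None:
    # Hypothetical handler: record the context it has while doing its work.
    event["feature_flags"] = {"new_cart": True}
    event["db_query_count"] = 3


handle_with_event("shop-api", "/checkout", "u-42", checkout)
```

The design choice worth noting is one wide event per request rather than many narrow log lines: every field lands on the same record, so any combination of fields can be filtered or grouped later.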

For example, we used Honeycomb to store and query our structured events. We were able to ask questions like "What is the error rate for requests from users in region X?" or "What is the average latency for requests that touch our payment gateway?" With traditional monitoring tools, these questions were only answerable if we had defined the matching metric and its dimensions before the incident. We also used OpenTelemetry to standardize our instrumentation layer across vendors, which made it easier to switch between different observability tools.
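The two questions above can be sketched in plain Python over a handful of made-up events. This is not Honeycomb's query interface, only an illustration of grouping and filtering on fields nobody had to predefine as a metric. The event data and helper names are invented.

```python
from collections import defaultdict
from typing import Any, Callable, Optional

# Invented sample events, shaped like one record per request.
EVENTS: list[dict[str, Any]] = [
    {"region": "eu-west", "status": "ok", "duration_ms": 120.0, "external_calls": ["payments"]},
    {"region": "eu-west", "status": "error", "duration_ms": 900.0, "external_calls": ["payments"]},
    {"region": "us-east", "status": "ok", "duration_ms": 80.0, "external_calls": []},
    {"region": "us-east", "status": "ok", "duration_ms": 100.0, "external_calls": ["search"]},
]


def error_rate_by(events: list[dict[str, Any]], field: str) -> dict[Any, float]:
    """Fraction of events with status 'error', grouped by an arbitrary field."""
    totals: dict[Any, int] = defaultdict(int)
    errors: dict[Any, int] = defaultdict(int)
    for e in events:
        key = e.get(field)
        totals[key] += 1
        if e.get("status") == "error":
            errors[key] += 1
    return {k: errors[k] / totals[k] for k in totals}


def mean_latency_where(
    events: list[dict[str, Any]], predicate: Callable[[dict[str, Any]], bool]
) -> Optional[float]:
    """Mean duration_ms over events matching the predicate, or None if none match."""
    matched = [e["duration_ms"] for e in events if predicate(e)]
    return sum(matched) / len(matched) if matched else None


print(error_rate_by(EVENTS, "region"))  # {'eu-west': 0.5, 'us-east': 0.0}
print(mean_latency_where(EVENTS, lambda e: "payments" in e["external_calls"]))  # 510.0
```

Swapping `"region"` for any other field, or changing the predicate, answers a different question without touching the instrumentation.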

The tools that matter today for observability start with Honeycomb, co-founded by Charity Majors and Christine Yen, which is often treated as the reference point for this style of event-based debugging. Lightstep and Grafana Tempo provide distributed tracing. The OpenTelemetry project standardizes the instrumentation layer across vendors. The shift from custom metrics and log parsers to structured event emission and trace-based debugging shows up in how quickly incidents are resolved: from hours of grep and switching between metric dashboards to minutes of structured query exploration.

This shift lets teams debug production issues faster. It's not just about having more data; it's about having the right kind of data, data that supports flexible querying and exploration. That is why the distinction between observability and monitoring is more than academic: it determines how teams instrument their systems and what they can learn from them when something unfamiliar breaks.

As more teams adopt observability, the tooling will keep changing. What stays constant is the core principle: emit rich, queryable telemetry, and keep evaluating whether your instrumentation can answer the questions your last incident raised.