There's a conversation happening in engineering teams about observability vs monitoring, and it matters. The distinction shapes how you instrument systems, what questions you can actually answer in production, and ultimately how fast you can debug when things break.

The traditional monitoring approach

Traditional monitoring watches known metrics and alerts when they cross thresholds: CPU above 80%, error rate above 1%, p99 latency above 500 ms. This works well for known failure modes: you define the signals you care about in advance, set thresholds, and get paged when one is crossed. The limitation is that it can only answer questions you thought to ask before the incident.
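As a sketch, the threshold model reduces to checking incoming metrics against a table of pre-declared limits. The limits below are the hypothetical examples from above, not any monitoring product's defaults:

```python
# Minimal sketch of threshold-based alerting. Thresholds are the
# illustrative values from the text, not a real product's config.
THRESHOLDS = {
    "cpu_percent": 80.0,      # CPU > 80%
    "error_rate": 0.01,       # error rate > 1%
    "latency_p99_ms": 500.0,  # p99 latency > 500 ms
}

def check_thresholds(metrics: dict) -> list[str]:
    """Return the names of metrics that crossed their alert threshold."""
    return [name for name, limit in THRESHOLDS.items()
            if metrics.get(name, 0.0) > limit]

# A snapshot with high CPU and high p99 latency fires two alerts.
alerts = check_thresholds({"cpu_percent": 91.5, "error_rate": 0.002,
                           "latency_p99_ms": 640.0})
```

Note what the sketch makes obvious: only the dimensions declared up front in `THRESHOLDS` can ever fire, which is exactly the limitation described above.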

What observability actually gives you

Observability, as defined by Charity Majors and others, means the system emits enough high-cardinality, high-dimensionality telemetry that you can ask arbitrary questions about its behaviour in production and get answers. Instead of alerting on pre-defined thresholds, you explore the data to understand what happened. The key capability is debugging problems you've never seen before without deploying new instrumentation.

Structured events make the difference

The practical difference is structured events vs metrics and logs. A structured event is a JSON document emitted at the boundary of a request: service name, endpoint, response time, user ID, feature flags, database query count, external call latencies, and any other fields relevant to that request. With structured events stored in a columnar store, you can query any dimension: slow requests for a specific user, error rates broken down by feature flag variant, latency for requests that touched a specific external service.
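The emit-then-query pattern can be sketched in a few lines. Field names here are illustrative, not any vendor's schema, and a real system would ship events to a columnar store rather than a Python list:

```python
import json

# Hypothetical structured event emitted at the boundary of one request.
def emit_event(**fields) -> str:
    return json.dumps(fields)

# Two example request events with wide, high-cardinality fields.
events = [
    json.loads(emit_event(service="checkout", endpoint="/pay",
                          duration_ms=740, user_id="u-123",
                          flag_new_pricing=True, db_query_count=12)),
    json.loads(emit_event(service="checkout", endpoint="/pay",
                          duration_ms=95, user_id="u-456",
                          flag_new_pricing=False, db_query_count=3)),
]

# Because every field is queryable, an ad-hoc question needs no new
# instrumentation: "slow requests where the new-pricing flag was on".
slow_flagged = [e for e in events
                if e["duration_ms"] > 500 and e["flag_new_pricing"]]
```

The design point is that the event is wide (many fields per request) rather than narrow (one metric per counter), so any dimension can become a filter or a group-by after the fact.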

The tools that matter today

Honeycomb (founded by Charity Majors) is the reference implementation of observability. Lightstep and Grafana Tempo provide distributed tracing. The OpenTelemetry project standardises the instrumentation layer across vendors. The shift from custom metrics and log parsing to structured events and trace-based debugging shows up directly in incident resolution time: hours of grepping and dashboard-hopping become minutes of structured query exploration.
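The core primitive these tracing tools standardise is the span: a timed unit of work carrying arbitrary key/value attributes. A stdlib-only sketch of the pattern (names are illustrative, not the OpenTelemetry API):

```python
import time
from contextlib import contextmanager

# Collected spans; a real SDK would export these to a tracing backend.
spans = []

@contextmanager
def span(name: str, **attributes):
    """Record a named, timed unit of work with arbitrary attributes."""
    start = time.monotonic()
    record = {"name": name, "attributes": attributes}
    try:
        yield record
    finally:
        record["duration_ms"] = (time.monotonic() - start) * 1000
        spans.append(record)

# Wrap a piece of work and attach whatever context is relevant to it.
with span("db.query", statement="SELECT 1", rows=1):
    pass  # real work would happen here

# spans[0] now holds the name, attributes, and measured duration_ms.
```

An instrumentation standard like OpenTelemetry adds what this sketch omits: propagating trace context across service boundaries and exporting spans in a vendor-neutral format.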