Observability has evolved from a "three pillars" infrastructure concern (metrics, logs, and traces) into a discipline centred on understanding system behaviour. The tools have matured and the practices have solidified.

OpenTelemetry reaches production maturity

OpenTelemetry, the CNCF project that standardises instrumentation across metrics, logs, and traces, reached stable releases for its core languages in 2023. The value is vendor-neutral instrumentation: you instrument your application once with OpenTelemetry and route the data to whichever backend you choose (for example Datadog, Jaeger, Prometheus, or Elastic). Replacing your observability backend then means changing where the data is exported, not re-instrumenting your applications, which substantially reduces the switching cost of observability tooling.
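
The "instrument once, choose the backend later" idea can be illustrated with a small standard-library sketch. This is not the OpenTelemetry API: `Span`, `Exporter`, `Tracer`, `InMemoryExporter`, `LineExporter`, and `reserve_stock` are all hypothetical names invented here to show the shape of the pattern, in which application code talks only to a tracer and the backend sits behind an exporter interface.

```python
import time
from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import Iterator, Protocol


@dataclass
class Span:
    """One timed unit of work (hypothetical type, not OpenTelemetry's)."""
    name: str
    attributes: dict = field(default_factory=dict)
    duration_ms: float = 0.0


class Exporter(Protocol):
    """Anything that can receive a finished span."""
    def export(self, span: Span) -> None: ...


class InMemoryExporter:
    """Stand-in for one backend: keeps finished spans in a list."""
    def __init__(self) -> None:
        self.spans: list[Span] = []

    def export(self, span: Span) -> None:
        self.spans.append(span)


class LineExporter:
    """Stand-in for a second backend: renders each span as a text line."""
    def __init__(self) -> None:
        self.lines: list[str] = []

    def export(self, span: Span) -> None:
        self.lines.append(f"{span.name} {span.attributes}")


class Tracer:
    """The only object application code touches."""
    def __init__(self, exporter: Exporter) -> None:
        self._exporter = exporter

    @contextmanager
    def span(self, name: str, **attributes) -> Iterator[Span]:
        current = Span(name, dict(attributes))
        start = time.perf_counter()
        try:
            yield current
        finally:
            # Record the elapsed time and hand the span to the backend.
            current.duration_ms = (time.perf_counter() - start) * 1000
            self._exporter.export(current)


def reserve_stock(tracer: Tracer, sku: str, quantity: int) -> bool:
    # Instrumented application code: no backend-specific imports or calls.
    with tracer.span("reserve_stock", sku=sku, quantity=quantity) as span:
        ok = quantity <= 10  # placeholder business rule for the sketch
        span.attributes["reserved"] = ok
        return ok
```

Swapping `Tracer(InMemoryExporter())` for `Tracer(LineExporter())` changes where the data goes without touching `reserve_stock`, which is the property the paragraph above attributes to vendor-neutral instrumentation.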

The observability vs monitoring distinction

Monitoring asks: is the system healthy? It checks known conditions against defined thresholds. Observability asks: why is the system behaving this way? It provides the data needed to answer questions you have not anticipated. The practical difference: monitoring alerts you when CPU is above 80%. Observability lets you trace a cascade failure during a Black Friday traffic spike back through the chain of events that caused it, down to the slow database query on line 847 of your inventory service.

Logs as structured events

Unstructured log strings are difficult to query at scale. Structured logging, the practice of emitting log events as JSON with consistent fields, lets you filter and aggregate logs by field, much as you would query database records. The shift from free-text log messages to structured log events is a prerequisite for useful log analysis in distributed systems. Libraries such as Serilog for .NET and Zap for Go make structured logging straightforward.
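
As a rough illustration of structured log events, the sketch below uses only Python's standard `logging` and `json` modules. The `JsonFormatter` class, the `build_logger` helper, the `"fields"` key passed through `extra`, and the field names (`sku`, `duration_ms`) are assumptions made for this example rather than an established convention.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        event = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Merge caller-supplied structured fields, if any were passed
        # via extra={"fields": {...}} (a convention assumed for this sketch).
        event.update(getattr(record, "fields", {}))
        return json.dumps(event)


def build_logger(stream) -> logging.Logger:
    """Create a logger that writes JSON events to the given stream."""
    logger = logging.getLogger("inventory")
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    return logger


stream = io.StringIO()
log = build_logger(stream)
log.info("stock reserved", extra={"fields": {"sku": "sku-1", "quantity": 3}})
log.warning("slow query", extra={"fields": {"sku": "sku-2", "duration_ms": 840}})

# Because every event is JSON with consistent fields, filtering is a
# field comparison rather than a regular expression over free text.
events = [json.loads(line) for line in stream.getvalue().splitlines()]
slow = [e for e in events if e.get("duration_ms", 0) > 500]
```

Here `slow` ends up holding only the event that carries a `duration_ms` above 500, selected by field value rather than by pattern-matching on message text.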

SLOs as the contract between reliability and product

Service level objectives, the agreed reliability targets for a service, are the mechanism for having honest conversations between engineering and product about the tradeoff between feature velocity and system reliability. Each SLO implies an error budget: the amount of unreliability the target permits, so a 99.9% success target leaves a budget of 0.1% of requests. If a service is meeting its SLO and error budget remains, new features can be deployed. If the error budget is being consumed faster than planned, or is exhausted, reliability work takes priority. The SLO framework makes the tradeoff explicit and measurable.
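
The error-budget arithmetic can be sketched in a few lines. This is a minimal, request-based illustration; the `ErrorBudget` class, its field names, and the "ship features while the budget is not exhausted" rule are assumptions made for the example, and real policies are usually defined over a rolling time window.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorBudget:
    slo_target: float      # e.g. 0.999 means "99.9% of requests succeed"
    total_requests: int    # requests served in the SLO window
    failed_requests: int   # requests that violated the SLO in that window

    @property
    def allowed_failures(self) -> float:
        # The budget is the share of requests the SLO permits to fail.
        return (1 - self.slo_target) * self.total_requests

    @property
    def consumed(self) -> float:
        # Fraction of the budget used so far (1.0 means fully spent).
        if self.allowed_failures == 0:
            return float("inf") if self.failed_requests else 0.0
        return self.failed_requests / self.allowed_failures

    @property
    def exhausted(self) -> bool:
        return self.consumed >= 1.0


def can_ship_features(budget: ErrorBudget) -> bool:
    # Hypothetical policy: feature work proceeds while budget remains.
    return not budget.exhausted
```

With a 99.9% target over 1,000,000 requests the budget is roughly 1,000 failed requests, so `ErrorBudget(0.999, 1_000_000, 250)` has used about a quarter of it and `can_ship_features` returns `True`, while 1,200 failures would exhaust the budget and shift priority to reliability work.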