Observability has matured beyond monitoring

Observability has come a long way from its early framing as three pillars: metrics, logs, and traces. It is now a discipline focused on understanding system behaviour, with mature tooling and settled practices.

OpenTelemetry, a CNCF project that standardises instrumentation across metrics, logs, and traces, reached stable releases for its core languages in 2023. The result is vendor-neutral instrumentation: you instrument your application once and route the data to whichever backend you choose, such as Datadog, Jaeger, Prometheus, or Elastic. Because the instrumentation no longer depends on the backend, switching observability tooling becomes far cheaper.
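To make the decoupling concrete, here is a minimal sketch of the idea in plain Python. It is not the OpenTelemetry API: `Backend`, `InMemoryBackend`, `PrintingBackend`, `Telemetry`, and `reserve_stock` are hypothetical names invented for illustration. The point is only that application code talks to one interface, and the backend is chosen at configuration time.

```python
from typing import Protocol


class Backend(Protocol):
    """Anything that can receive a finished span (hypothetical interface)."""

    def export(self, span: dict) -> None: ...


class InMemoryBackend:
    """Stand-in for one vendor: keeps spans in a list."""

    def __init__(self) -> None:
        self.spans = []

    def export(self, span: dict) -> None:
        self.spans.append(span)


class PrintingBackend:
    """Stand-in for another vendor: writes spans to stdout."""

    def export(self, span: dict) -> None:
        print(f"span {span['name']} took {span['duration_ms']} ms")


class Telemetry:
    """The single instrumentation surface the application depends on."""

    def __init__(self, backend: Backend) -> None:
        self._backend = backend

    def record_span(self, name: str, duration_ms: float) -> None:
        self._backend.export({"name": name, "duration_ms": duration_ms})


def reserve_stock(telemetry: Telemetry) -> None:
    # Application code: written once, unaware of which backend is configured.
    telemetry.record_span("reserve_stock", 12.5)


# Swapping the backend requires no change to reserve_stock.
memory = InMemoryBackend()
reserve_stock(Telemetry(memory))
reserve_stock(Telemetry(PrintingBackend()))
```

In a real deployment the same role is played by the OpenTelemetry SDK and its exporters, so the application code stays put when the backend changes.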

The terms observability and monitoring are often used interchangeably, but they serve distinct purposes. Monitoring checks whether the system is healthy by comparing known signals against predefined thresholds. Observability, on the other hand, seeks to explain why the system is behaving in a certain way, providing the data to answer questions nobody anticipated. In practical terms, monitoring alerts you when CPU usage exceeds 80%, whereas observability lets you trace the specific sequence of events that led to a cascade failure during a Black Friday traffic spike, down to the slow database query on line 847 of your inventory service.
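The contrast can be sketched in a few lines of Python. The 80% threshold comes from the example above; everything else (the span records, their field names, and the `self_times` helper) is hypothetical data made up for illustration. The sketch also assumes child spans run one after another, so a span's own time is its duration minus its children's.

```python
CPU_ALERT_THRESHOLD = 80.0  # percent


def cpu_alert(cpu_percent: float) -> bool:
    """Monitoring: a known condition checked against a fixed threshold."""
    return cpu_percent > CPU_ALERT_THRESHOLD


# Observability: raw events from one request, kept so that new questions
# can be asked of them after the fact. All values are invented.
spans = [
    {"span_id": "a", "parent_id": None, "service": "checkout",
     "operation": "POST /order", "duration_ms": 2350},
    {"span_id": "b", "parent_id": "a", "service": "inventory",
     "operation": "reserve_stock", "duration_ms": 2290},
    {"span_id": "c", "parent_id": "b", "service": "inventory",
     "operation": "SELECT stock", "duration_ms": 2240},
    {"span_id": "d", "parent_id": "a", "service": "payments",
     "operation": "authorize", "duration_ms": 40},
]


def self_times(spans: list) -> dict:
    """Time spent in each span itself, excluding time spent in its children."""
    child_total = {}
    for span in spans:
        parent = span["parent_id"]
        if parent is not None:
            child_total[parent] = child_total.get(parent, 0) + span["duration_ms"]
    return {
        span["span_id"]: span["duration_ms"] - child_total.get(span["span_id"], 0)
        for span in spans
    }


times = self_times(spans)
culprit_id = max(times, key=times.get)
culprit = next(s for s in spans if s["span_id"] == culprit_id)
print(culprit["service"], culprit["operation"], times[culprit_id])
```

The alert function can only answer the question it was written for. The span data can answer "where did the time go in this request?" even though nobody wrote that question down in advance.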

In my experience, the shift to observability has been driven by the need to manage complex distributed systems, where interactions between microservices produce emergent behaviour that is difficult to predict. For example, I have seen a change to a downstream service cause a 30% increase in latency, which in turn caused a cascade failure in the upstream services. With observability tools like New Relic and Lightstep, we were able to identify the root cause and make targeted changes to resolve it. Traditional monitoring would have told us only that latency was high, with no insight into why.

Logs are often emitted as unstructured strings, which can be challenging to query at scale. Structured logging, which involves emitting JSON log events with consistent fields, allows logs to be queried with the same operators as database records. This shift from free-text log messages to structured log events is essential for useful log analysis in distributed systems. Libraries such as Serilog for .NET and Zap for Go make structured logging straightforward to implement.
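As a rough illustration, here is a minimal sketch using only Python's standard `logging` and `json` modules rather than Serilog or Zap. The field names (`order_id`, `latency_ms`), the `inventory` logger name, and the convention of passing fields under `extra={"fields": ...}` are assumptions chosen for the example.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        event = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Fields passed via extra={"fields": {...}} become top-level keys.
        event.update(getattr(record, "fields", {}))
        return json.dumps(event)


def build_logger(stream) -> logging.Logger:
    logger = logging.getLogger("inventory")  # hypothetical service name
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    return logger


stream = io.StringIO()  # stands in for stdout or a log shipper
log = build_logger(stream)
log.info("order placed", extra={"fields": {"order_id": "A-1001", "latency_ms": 42}})
log.warning("stock lookup slow", extra={"fields": {"order_id": "A-1002", "latency_ms": 910}})

# Because every event carries the same fields, filtering is a plain comparison
# instead of a regular expression over free text.
events = [json.loads(line) for line in stream.getvalue().splitlines()]
slow = [e["order_id"] for e in events if e["latency_ms"] > 500]
print(slow)  # ['A-1002']
```

The query at the end is the payoff: once `latency_ms` is a number in a known field, "show me slow orders" is a comparison rather than a parsing exercise.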

One of the key challenges in implementing structured logging is the tradeoff between log verbosity and performance. Log too much and you can overwhelm your logging pipeline and slow down your application. Log too little and you may not have enough information to debug issues when they arise. I have found that a good approach is to use a logging framework that lets you adjust the log level dynamically to suit the environment. For example, with a framework like Log4j you can log debug-level messages in development and switch to a higher level in production to reduce log volume.
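The same idea can be sketched with Python's standard `logging` module rather than Log4j. The `LOG_LEVEL` environment variable and the `configure_level` helper are assumptions for this example, not a convention of any particular framework.

```python
import logging
import os


def configure_level(logger: logging.Logger, default: str = "WARNING") -> int:
    """Set the logger's level from the LOG_LEVEL environment variable.

    LOG_LEVEL is an assumed variable name. Unrecognised values fall back
    to the default instead of crashing the application at startup.
    """
    name = os.environ.get("LOG_LEVEL", default).upper()
    try:
        logger.setLevel(name)  # setLevel accepts level names such as "DEBUG"
    except ValueError:
        logger.setLevel(default)
    return logger.level


logger = logging.getLogger("inventory")  # hypothetical service name

# Development: verbose output.
os.environ["LOG_LEVEL"] = "debug"
configure_level(logger)
debug_in_dev = logger.isEnabledFor(logging.DEBUG)

# Production: the same code, a quieter setting.
os.environ["LOG_LEVEL"] = "warning"
configure_level(logger)
debug_in_prod = logger.isEnabledFor(logging.DEBUG)

print(debug_in_dev, debug_in_prod)  # True False
```

Calling `configure_level` again at runtime, for instance from a signal handler or an admin endpoint, would let you raise verbosity temporarily while investigating an incident.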

Service level objectives, or SLOs, are agreed-upon reliability targets for a service. They facilitate honest conversations between engineering and product teams about the tradeoff between feature velocity and system reliability. The gap between the target and 100% is the error budget. While the service is within its SLO and budget remains, new features can be deployed. When the budget is close to exhausted, reliability work takes priority. The SLO framework makes this tradeoff explicit and measurable. For instance, a service with an SLO of 99.9% uptime is allowed 1.44 minutes of downtime per day (about 43 minutes over 30 days), a budget that can be spent on maintenance or on the risk that comes with deploying new features. This approach allows teams to make data-driven decisions about when to prioritise reliability work over new feature development.
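The arithmetic behind that figure is short enough to write out. This is a minimal sketch; the function name `allowed_downtime_minutes` is my own.

```python
MINUTES_PER_DAY = 24 * 60


def allowed_downtime_minutes(slo_percent: float, window_minutes: float) -> float:
    """Downtime permitted in a window while still meeting the SLO."""
    if not 0 <= slo_percent <= 100:
        raise ValueError("slo_percent must be between 0 and 100")
    return (1 - slo_percent / 100) * window_minutes


per_day = allowed_downtime_minutes(99.9, MINUTES_PER_DAY)
per_30_days = allowed_downtime_minutes(99.9, 30 * MINUTES_PER_DAY)

print(f"{per_day:.2f} minutes per day")          # 1.44 minutes per day
print(f"{per_30_days:.2f} minutes per 30 days")  # 43.20 minutes per 30 days
```

Each extra nine cuts the budget by a factor of ten: at 99.99% the same 30-day window allows a little over four minutes.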

In practice, implementing SLOs requires a significant amount of data and analysis. You need to be able to measure the reliability of your service, which can be challenging in complex distributed systems. I have found that tools like Google Cloud Monitoring and AWS CloudWatch can provide the data needed to measure service reliability and inform SLO decisions. Additionally, the approach described in Google's Site Reliability Engineering book provides a structured way to define and manage SLOs.
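Once request counts are available from whichever monitoring backend you use, the measurement itself is simple. The sketch below assumes a request-based availability indicator (good requests divided by total requests); the `BudgetReport` type, the `error_budget_report` function, and the sample numbers are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class BudgetReport:
    sli: float               # measured fraction of successful requests
    allowed_failures: float  # failures the SLO permits for this traffic volume
    budget_consumed: float   # 1.0 means the error budget is fully spent


def error_budget_report(total: int, failed: int, slo_target: float) -> BudgetReport:
    """Compare measured availability against an SLO target such as 0.999."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= failed <= total:
        raise ValueError("failed must be between 0 and total")
    sli = (total - failed) / total
    allowed = (1 - slo_target) * total
    consumed = failed / allowed if allowed > 0 else float("inf")
    return BudgetReport(sli=sli, allowed_failures=allowed, budget_consumed=consumed)


# Invented figures: one million requests in the window, 400 of them failed.
report = error_budget_report(total=1_000_000, failed=400, slo_target=0.999)
print(f"SLI: {report.sli:.4%}")
print(f"Budget consumed: {report.budget_consumed:.0%}")
```

With these numbers the service is inside its SLO and has used roughly two fifths of its budget, which is the kind of figure that turns the feature-versus-reliability conversation into a concrete one.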