I've seen it happen too many times: observability retrofitted into a production system is expensive and incomplete. It's much better to design for observability from the start, because systems built that way are significantly easier to operate and debug.
Structured logging is a good foundation. Every application log line should be a structured JSON event, not a formatted string. That way you can query logs by field value, for example to find all errors from service X with user ID Y in the last hour. Formatted strings require regex parsing that often misses context and breaks when the format changes. The investment is twofold: choosing a structured logging library such as Serilog for .NET, Zap for Go, or Pino for Node.js, and defining standard fields (service name, version, environment, correlation ID, and user ID) as logging context.
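To make the idea concrete, here is a minimal sketch using only Python's standard logging and json modules rather than any of the libraries named above. The JsonFormatter class, the build_logger helper, and the field values are all hypothetical placeholders for illustration.

```python
import json
import logging
from datetime import datetime, timezone

# Hypothetical standard fields; a real service would load these from config.
STANDARD_FIELDS = {
    "service": "checkout",
    "version": "1.4.2",
    "environment": "production",
}


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        event = {
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "message": record.getMessage(),
            **STANDARD_FIELDS,
        }
        # Per-request fields arrive through the `extra` argument of the log call.
        for key in ("correlation_id", "user_id"):
            if hasattr(record, key):
                event[key] = getattr(record, key)
        return json.dumps(event)


def build_logger(name: str) -> logging.Logger:
    """Return a logger whose only handler writes JSON events to stderr."""
    logger = logging.getLogger(name)
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


logger = build_logger("checkout")
logger.error("payment failed", extra={"correlation_id": "abc-123", "user_id": "u-42"})
```

Because every event is a flat JSON object, a query like "errors from service X with user ID Y" becomes a field match instead of a regex.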
In my experience, choosing a structured logging library is just the first step. The real challenge is defining the standard fields that will be used across all services. For instance, I once worked on a system where multiple services were logging with different versions of the same library. To fix this, we created a central logging configuration that defined the standard fields, packaged it as a shared library that every service used to log events, and updated our CI/CD pipeline to enforce the use of that library.
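One way a pipeline could enforce this kind of consistency is to check sample log output for the agreed fields. The sketch below is an assumption about how such a check might look, not the check we actually ran; REQUIRED_FIELDS and missing_fields are hypothetical names, and user ID is left out on the assumption that not every request has a user.

```python
import json
from typing import List

# Assumed required set, drawn from the standard fields discussed above.
REQUIRED_FIELDS = ("service", "version", "environment", "correlation_id")


def missing_fields(log_line: str) -> List[str]:
    """Return the required fields absent from one JSON log line.

    A line that is not a JSON object is treated as missing every field.
    """
    try:
        event = json.loads(log_line)
    except json.JSONDecodeError:
        return list(REQUIRED_FIELDS)
    if not isinstance(event, dict):
        return list(REQUIRED_FIELDS)
    return [name for name in REQUIRED_FIELDS if name not in event]
```

A CI step could run a service's test suite, feed the captured log lines through missing_fields, and fail the build if any line returns a non-empty list.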
Correlation IDs help track requests across service boundaries. A correlation ID is a UUID generated at the request entry point and propagated through every log, span, and metric associated with that request. When an incident requires understanding what happened during a specific request, the correlation ID is the query key that surfaces all relevant telemetry. To implement this, generate the ID at the API gateway or load balancer, propagate it in an HTTP header such as X-Correlation-ID or the W3C traceparent header, and inject it into the logging context in every service.
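The three steps above (generate, propagate, inject) can be sketched with the standard library alone. This is an illustrative outline, not tied to any particular web framework; begin_request, outbound_headers, and CorrelationFilter are hypothetical names.

```python
import contextvars
import logging
import uuid
from typing import Dict, Mapping

CORRELATION_HEADER = "X-Correlation-ID"

# Holds the ID for the request currently being handled.
_correlation_id = contextvars.ContextVar("correlation_id", default="")


def begin_request(headers: Mapping[str, str]) -> str:
    """Reuse the caller's correlation ID, or mint one at the entry point."""
    # HTTP header names are case-insensitive, so normalise before the lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    correlation_id = lowered.get(CORRELATION_HEADER.lower()) or str(uuid.uuid4())
    _correlation_id.set(correlation_id)
    return correlation_id


def outbound_headers() -> Dict[str, str]:
    """Headers to attach to every downstream call made during this request."""
    return {CORRELATION_HEADER: _correlation_id.get()}


class CorrelationFilter(logging.Filter):
    """Stamp the current correlation ID onto every log record."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.correlation_id = _correlation_id.get()
        return True
```

Attaching CorrelationFilter to a logger means individual log calls never have to pass the ID by hand, which removes the most common way it goes missing.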
But correlation IDs are not enough on their own. To really understand what's happening in your system, you need to drill down into an individual request and see its entire lifecycle. This is where distributed tracing comes in. With a trace ID injected into every log and metric, you can follow the whole request path, including any errors that occurred along the way. In our system, we use OpenTracing (a project since archived in favor of OpenTelemetry) to inject the trace ID into every log and metric, and then use a tracing tool such as Zipkin to visualize the request lifecycle.
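When the trace ID travels in the W3C traceparent header, a service has to extract it before it can attach it to logs. The sketch below parses the four-field layout (version, trace ID, parent ID, flags) that the W3C Trace Context specification defines for version 00. It is not how OpenTracing or Zipkin do it internally, and TraceContext and parse_traceparent are hypothetical names.

```python
import re
from typing import NamedTuple, Optional

# version - trace-id - parent-id - trace-flags, all lowercase hex.
_TRACEPARENT = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


class TraceContext(NamedTuple):
    trace_id: str
    parent_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[TraceContext]:
    """Return the trace context, or None if the header is malformed."""
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # The specification forbids version ff and all-zero IDs.
    if version == "ff" or set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return None
    # The lowest bit of the flags byte marks the request as sampled.
    return TraceContext(trace_id, parent_id, bool(int(flags, 16) & 0x01))


ctx = parse_traceparent("00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01")
```

The trace_id from the returned context is the value to add to the logging context, so that log lines and trace spans can be joined on the same key.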
Error budget dashboards make the health of the system visible to the entire team, not just the on-call engineer. A dashboard showing current availability versus the SLO, the remaining error budget for the month, and burn-rate alerts creates shared awareness of production health. Teams that see error budget consumption in real time make better architectural trade-offs between feature velocity and reliability investment.
One of the key benefits of error budget dashboards is that they give the team a clear, actionable metric to work towards. A target availability of 99.99%, for example, leaves a budget of roughly 4.3 minutes of downtime in a 30-day month, which gives the team a concrete number to manage against when deciding where to reduce errors and improve reliability. In our system, we use Prometheus to collect metrics on error rates and availability, and Grafana to build a dashboard that shows the current error budget and burn rate.
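The numbers on such a dashboard come from simple arithmetic. In practice the calculation would live in Prometheus queries and Grafana panels, so the Python below is only a sketch of the underlying math, with hypothetical function names and a request-based SLO assumed.

```python
def _budget_fraction(slo: float) -> float:
    """Share of requests (or time) that the SLO allows to fail."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be strictly between 0 and 1")
    return 1.0 - slo


def allowed_downtime_minutes(slo: float, days: int = 30) -> float:
    """Total downtime the SLO permits over a window of `days` days."""
    return _budget_fraction(slo) * days * 24 * 60


def budget_remaining(slo: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of the error budget still unspent; negative once overspent."""
    allowed_failures = _budget_fraction(slo) * total_requests
    if allowed_failures == 0:
        return 1.0  # no traffic yet, so nothing has been spent
    return 1.0 - failed_requests / allowed_failures


def burn_rate(slo: float, window_requests: int, window_failures: int) -> float:
    """Observed error rate relative to the budgeted rate; 1.0 is exactly on budget."""
    if window_requests == 0:
        return 0.0
    return (window_failures / window_requests) / _budget_fraction(slo)
```

A burn rate that stays above 1.0 means the budget will run out before the window ends, which is why burn rate is a more useful alert signal than the raw error count.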
Runbook-driven alerting is essential. Every production alert should have a runbook: a documented procedure for investigation and remediation. Alerts without runbooks produce slow, inconsistent incident response and on-call fatigue. A runbook should include the alert trigger condition, likely causes, investigation steps, remediation actions, and escalation criteria. Runbooks are living documents, so update them after every incident in which the runbook proved insufficient.
In my experience, creating runbooks is a team effort. The on-call engineer should work with other teams to document the procedures for investigation and remediation. That means not just writing down the steps, but also understanding the likely causes and the escalation criteria. For instance, if the alert is triggered by a high error rate, the runbook should include steps for investigating the root cause, such as checking the logs and metrics for signs of a larger issue. It should also state when to involve other teams or escalate to a higher level of support.
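The rule that every alert needs a complete runbook can itself be checked mechanically. The sketch below models the five runbook sections listed above as a small data structure. The Runbook class, the helper function, and the example alert contents are all made up for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Runbook:
    trigger_condition: str
    likely_causes: List[str]
    investigation_steps: List[str]
    remediation_actions: List[str]
    escalation_criteria: List[str]

    def empty_sections(self) -> List[str]:
        """Names of sections that have not been filled in yet."""
        return [name for name, value in vars(self).items() if not value]


def alerts_without_runbooks(alert_names: List[str], runbooks: Dict[str, Runbook]) -> List[str]:
    """Alerts that would page someone with no documented procedure."""
    return [name for name in alert_names if name not in runbooks]


# Hypothetical runbook for a high-error-rate alert.
high_error_rate = Runbook(
    trigger_condition="5xx rate above 5% for 10 minutes",
    likely_causes=["recent deploy", "downstream dependency failing"],
    investigation_steps=["check error logs by correlation ID", "compare metrics to last deploy"],
    remediation_actions=["roll back the most recent deploy"],
    escalation_criteria=["page the owning team if not mitigated in 30 minutes"],
)
```

Running a check like alerts_without_runbooks against the alert definitions in CI, or during the review that follows an incident, keeps the gap visible instead of leaving it for the next person who gets paged.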