Google's Reliability Standard Hits the Mainstream

Service Level Objectives (SLOs) have been Google's reliability standard for over a decade, and they are now within reach for teams of all sizes. Widespread industry adoption has produced a practical toolkit for organisations that want to manage reliability systematically.

A Service Level Indicator (SLI) is a metric that measures a crucial aspect of service behaviour from the user's perspective, such as availability, latency, or error rate. The choice of SLI matters: a metric that is easy to measure but does not reflect user experience will produce misleading results.
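As a minimal sketch of what such an indicator can look like, the snippet below computes a latency SLI as the share of "good" requests. The `Request` record, its field names, and the sample values are assumptions made for illustration, not a real schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    """One observed request; both fields are illustrative, not a real schema."""
    latency_ms: float
    ok: bool  # True if the request succeeded from the user's point of view


def latency_sli(requests: list[Request], threshold_ms: float) -> float:
    """Fraction of requests that succeeded AND finished within threshold_ms."""
    if not requests:
        raise ValueError("cannot compute an SLI over zero requests")
    good = sum(1 for r in requests if r.ok and r.latency_ms <= threshold_ms)
    return good / len(requests)


sample = [
    Request(120.0, True),
    Request(180.0, True),
    Request(450.0, True),   # succeeded, but too slow to count as good
    Request(90.0, False),   # fast, but failed
]
print(latency_sli(sample, threshold_ms=200.0))  # 2 good requests out of 4
```

Counting a slow success and a fast failure as equally "bad" is a design choice: it keeps the indicator tied to what the user experienced rather than to what was easiest to count.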

In my early attempts to instrument SLIs, I learned that a metric that is easy to collect can carry a hidden cost. We started with a per-request latency histogram in Prometheus that emitted a bucket for every millisecond. Within a week the time-series database hit 80% memory utilisation on a 16 GB node and query latency doubled. The fix was to coarsen the buckets to 5 ms intervals and to roll up older data into Thanos stores. The lesson is that an SLI must be both observable and affordable; otherwise the monitoring layer itself becomes a source of outages.
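The arithmetic behind that memory pressure can be sketched roughly as follows. A Prometheus histogram exposes one series per bucket (including `+Inf`) plus `_sum` and `_count`, for every label combination. The 2,000 ms upper bound and the 50 label combinations are hypothetical numbers chosen for illustration, not figures from the incident.

```python
import math


def bucket_count(max_latency_ms: int, bucket_width_ms: int) -> int:
    """Number of finite buckets needed to cover 0..max_latency_ms."""
    return math.ceil(max_latency_ms / bucket_width_ms)


def series_count(max_latency_ms: int, bucket_width_ms: int,
                 label_combinations: int) -> int:
    """Series for one histogram: finite buckets, +Inf, _sum and _count per label set."""
    per_label_set = bucket_count(max_latency_ms, bucket_width_ms) + 3
    return per_label_set * label_combinations


# Hypothetical inputs: 2,000 ms upper bound, 50 label combinations.
fine = series_count(2_000, bucket_width_ms=1, label_combinations=50)
coarse = series_count(2_000, bucket_width_ms=5, label_combinations=50)
print(fine, coarse)
```

Under these assumptions, moving from 1 ms to 5 ms buckets cuts the series count to roughly a fifth, which is where most of the saving comes from.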

In Google's error budget model, an SLO sets a target for an SLI, such as 99.9% of requests completing within 200 ms over a 30-day window. The remaining 0.1% of requests is the error budget. The budget makes the cost of unreliability visible and forces teams to prioritise reliability work when it is being consumed.
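A short sketch of that arithmetic, using the 99.9% target from the example above. The request volume of ten million per window is an assumed figure for illustration only.

```python
import math


def error_budget(slo_target: float) -> float:
    """Fraction of events allowed to be bad, e.g. 0.999 -> about 0.001."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    return 1.0 - slo_target


def allowed_bad_events(total_events: int, slo_target: float) -> int:
    """Whole number of bad events the window can absorb without a breach."""
    # round() guards against float noise such as 0.0010000000000000009
    return math.floor(round(total_events * error_budget(slo_target), 6))


# Assumed volume: 10 million requests in the 30-day window.
print(allowed_bad_events(10_000_000, slo_target=0.999))
```

Expressing the budget as a count of requests, rather than as a percentage, tends to make it easier to reason about during an incident.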

One of the toughest moments was a 0.07% error budget burn on a payment API during a Black Friday promotion. At that rate the budget would have been consumed in under an hour, and our alert fired at 02:17 UTC. The on-call engineer discovered a downstream dependency throttling requests after a sudden spike. We rolled back the new feature flag, stopped the budget burn, and then instituted a guardrail that caps rollout velocity when the remaining budget falls below 20%. That guardrail saved us from a cascade of failures that would have breached the SLA.
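A guardrail of this kind might look roughly like the sketch below. Only the 20% floor comes from the account above; the function names, the step sizes (25% and 1% of traffic), and the event counts are hypothetical.

```python
def burn_rate(observed_error_rate: float, slo_target: float) -> float:
    """Budget spend relative to an even spend; 1.0 lasts exactly the window."""
    return observed_error_rate / (1.0 - slo_target)


def budget_remaining(bad_events: int, expected_events: int,
                     slo_target: float) -> float:
    """Fraction of the window's error budget still unspent (negative if overspent)."""
    allowed_bad = expected_events * (1.0 - slo_target)
    return 1.0 - bad_events / allowed_bad


def rollout_step_cap(remaining: float, normal_step_pct: int = 25,
                     capped_step_pct: int = 1, floor: float = 0.20) -> int:
    """Largest rollout increment (percent of traffic) allowed right now."""
    return capped_step_pct if remaining < floor else normal_step_pct


# Hypothetical window: 10 million expected requests, 9,000 already bad.
remaining = budget_remaining(bad_events=9_000, expected_events=10_000_000,
                             slo_target=0.999)
print(round(remaining, 3))                # roughly a tenth of the budget left
print(rollout_step_cap(remaining))        # below the floor, so the cap applies
print(round(burn_rate(0.01, 0.999), 1))   # 1% errors against a 0.1% budget
```

Capping the step size instead of blocking rollouts outright is one possible design: it slows change when the budget is thin without freezing delivery entirely.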

When setting SLOs, start with realistic targets based on observed performance rather than aspirational goals. Most services should begin with an SLO at their current level of performance and then work towards improvement. A target that starts too high leads to perpetual SLO breaches and demoralised teams.
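One possible heuristic for picking that first target is sketched below: choose the highest conventional target that the service already meets. The candidate list and the function name are assumptions for illustration, not an established rule.

```python
from typing import Optional

# Conventional "round" targets to choose from; the list itself is an assumption.
CANDIDATE_TARGETS = (0.90, 0.95, 0.99, 0.995, 0.999, 0.9995, 0.9999)


def suggest_initial_target(observed_sli: float) -> Optional[float]:
    """Highest candidate target the service already meets, or None if none fits."""
    achievable = [t for t in CANDIDATE_TARGETS if t <= observed_sli]
    return max(achievable) if achievable else None


# A service that has historically served 99.72% good requests.
print(suggest_initial_target(0.9972))
```

For an observed SLI of 99.72%, this picks 99.5% rather than 99.9%, which leaves some budget to spend from day one instead of starting in breach.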

We also found that setting an SLO at the service boundary without looking at downstream impact can be misleading. After a few months we added an SLO for the aggregate latency of a user journey that spanned three micro-services. The combined SLO was tighter than that of any individual component, because a slow or failed call in any one of the three services counts against the whole journey, and the error budget started to burn faster. The trade-off was clear: tighter user-experience SLOs drive more cross-team coordination but require more sophisticated tracing. We adopted OpenTelemetry tracing across the stack, which added about 5 ms of overhead per request, a cost we accepted because it gave us the visibility to keep the budget in check.
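The effect can be illustrated with a simplified model. The sketch assumes each journey calls the three services once, in series, and that their failures are independent. Real services rarely satisfy both assumptions, and the 99.9% figures are illustrative.

```python
import math


def journey_success_rate(component_rates: list[float]) -> float:
    """Success rate of a journey that needs every component call to be good."""
    return math.prod(component_rates)


def per_component_target(journey_target: float, n_components: int) -> float:
    """Equal per-component target needed to hit journey_target end to end."""
    return journey_target ** (1.0 / n_components)


# Three services that each meet 99.9% on their own.
journey = journey_success_rate([0.999, 0.999, 0.999])
print(round(journey, 6))

# What each service would need in order to deliver 99.9% for the journey.
needed = per_component_target(0.999, n_components=3)
print(needed > 0.999)
```

Under these assumptions, three components at 99.9% give a journey of about 99.7%, so holding the journey itself to 99.9% demands more of every component than its own SLO did.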

While SLO reporting tools like Azure Monitor workbooks, Prometheus, and Grafana offer valuable insights, the tooling choice is secondary to getting the SLI measurement right and getting teams aligned on the error budget model. Dedicated SLO tools like Nobl9 and Sloth can also be effective, but only on top of a sound SLI and an agreed error budget model.