Site Reliability Engineering (SRE) practices, developed by Google for hyperscale operations, are increasingly being adopted by organisations with far smaller engineering budgets. The challenge is to apply SRE principles without a hyperscaler's resources.

SLIs and SLOs as the starting point

Service Level Indicators (the metrics you measure: request latency, error rate, availability) and Service Level Objectives (the targets: 99.9% availability, p99 latency < 200ms) are the foundational SRE practice and the most transferable. Defining what good looks like for your service, measuring it, and alerting when it degrades is achievable for any team. The discipline of defining SLOs forces the team to be explicit about the quality they are committing to deliver.
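
As a rough illustration of the measure-then-compare loop described above, the following Python sketch computes two SLIs from a batch of request records and checks them against the example targets (99.9% availability, p99 latency under 200 ms). The `Request` record, the function names, and the nearest-rank percentile method are assumptions made for this example; a real service would normally read these values from its metrics system.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    """Hypothetical request record used only for this sketch."""
    latency_ms: float
    ok: bool  # True if the request succeeded


def availability(requests):
    """Availability SLI: fraction of requests that succeeded."""
    return sum(r.ok for r in requests) / len(requests)


def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def check_slos(requests, availability_target=0.999, p99_target_ms=200.0):
    """Compare the measured SLIs against the SLO targets."""
    avail = availability(requests)
    p99 = percentile([r.latency_ms for r in requests], 99)
    return {
        "availability": avail,
        "p99_latency_ms": p99,
        "availability_ok": avail >= availability_target,
        "latency_ok": p99 < p99_target_ms,
    }
```

Calling `check_slos` on the requests seen in a reporting window gives the team one explicit answer per objective, which is the point of writing the SLO down in the first place.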

Error budgets without the bureaucracy

At Google, error budgets (the amount of unreliability the SLO permits; a 99.9% availability SLO leaves a 0.1% budget) govern the pace of feature releases. When the error budget is exhausted, releases stop until reliability is restored. For smaller teams, the error budget concept applies without the formal governance: SLO dashboards visible to the team, a weekly review of error budget consumption, and a shared understanding that SLO breaches require reliability investment before new features. The concept scales down even when the formal process around it does not.
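
The arithmetic behind an error budget is simple enough to show directly. The sketch below assumes a 30-day window and a request-based budget; both function names are hypothetical and chosen for this example only.

```python
def error_budget_minutes(slo, window_days=30):
    """Minutes of full downtime the SLO permits over the window."""
    return (1 - slo) * window_days * 24 * 60


def budget_consumed(slo, total_requests, failed_requests):
    """Fraction of a request-based error budget already spent.

    A value of 1.0 means the budget is exhausted.
    """
    allowed_failures = (1 - slo) * total_requests
    return failed_requests / allowed_failures
```

For a 99.9% SLO, `error_budget_minutes(0.999)` comes to roughly 43.2 minutes per 30 days. If a service has handled 1,000,000 requests in the window and 250 of them failed, `budget_consumed(0.999, 1_000_000, 250)` is about 0.25, meaning a quarter of the budget is gone. This is the number a weekly review would look at.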

Toil reduction as engineering investment

Toil is manual, repetitive operational work that grows linearly with service scale. SRE practice holds that engineers should spend no more than 50% of their time on toil. For mid-size teams, identifying the top three toil sources and automating them provides immediate relief. Typical toil candidates: manual deployment steps, certificate renewals, log triage for known alert types, and recurring data cleanup jobs.
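
One lightweight way to find the top three toil sources is to have the team log manual operational tasks for a few weeks and then total the time per task. The sketch below assumes such a log exists as (task name, minutes spent) pairs; the task names in the usage note are illustrative, not measured data.

```python
from collections import Counter


def top_toil_sources(entries, n=3):
    """Return the n tasks with the most total minutes logged.

    entries is an iterable of (task_name, minutes_spent) pairs.
    """
    totals = Counter()
    for task, minutes in entries:
        totals[task] += minutes
    return totals.most_common(n)
```

For example, given entries such as `("manual deploy", 45)`, `("cert renewal", 30)`, and `("manual deploy", 60)`, the function ranks "manual deploy" first with 105 minutes. The ranking only says where time goes; whether a task is worth automating still depends on how hard the automation is.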

On-call sustainability

Sustainable on-call requires four things: high-quality alerts (paging only on SLO-impacting events rather than noisy low-severity conditions), a documented runbook for every alert, a blameless postmortem culture (incident analysis focused on systemic improvement), and a large enough rotation (at least five to seven engineers, so that shifts leave a reasonable work-life balance). Alert fatigue from low-quality alerts is the fastest way to degrade on-call engagement and incident response quality.
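
One common way to alert only on SLO-impacting events is to page on error budget burn rate instead of on raw error counts. The sketch below is a minimal version of that idea, not a complete alerting policy. The function names are hypothetical. The default threshold of 14.4 is taken from the multiwindow burn-rate approach described in Google's SRE Workbook, where a burn rate of 14.4 sustained for one hour uses up about 2% of a 30-day budget; treat it as a starting assumption to tune.

```python
def burn_rate(error_ratio, slo):
    """How fast the error budget is being spent.

    1.0 means the budget would last exactly the SLO window;
    higher values mean it runs out sooner.
    """
    return error_ratio / (1 - slo)


def should_page(long_window_error_ratio, short_window_error_ratio,
                slo=0.999, threshold=14.4):
    """Page only if both windows show a high burn rate.

    The long window (for example, one hour) shows the problem is
    significant; the short window (for example, five minutes) shows
    it is still happening, so resolved incidents stop paging.
    """
    return (burn_rate(long_window_error_ratio, slo) >= threshold
            and burn_rate(short_window_error_ratio, slo) >= threshold)
```

With a 99.9% SLO, a sustained 2% error ratio is a burn rate of about 20 and would page, while a 0.1% error ratio is a burn rate of about 1 and would not. Conditions below the threshold can go to a ticket queue or dashboard instead of a pager.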