SRE For Smaller Teams

I've seen Google's Site Reliability Engineering (SRE) practices adopted by organisations that are not Google, which is interesting because they don't have a hyperscaler's resources. The challenge for these organisations is applying SRE principles in a way that fits their own scale.

Service Level Indicators (SLIs) and Service Level Objectives (SLOs) are a good starting point for any team. You define the indicators you measure, such as request latency, error rate, and availability, and then set a target for each one, such as 99.9% availability or a p99 latency below an agreed threshold.
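To make the indicators concrete, here is a minimal sketch of computing an availability SLI and a p99 latency SLI from raw request records. The `Request` type and the sample data are hypothetical, and in practice your monitoring system would usually compute these for you.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    latency_ms: float
    ok: bool  # True if the request succeeded


def availability(requests):
    """Fraction of requests that succeeded."""
    return sum(r.ok for r in requests) / len(requests)


def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample with at least
    pct percent of all samples at or below it."""
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Hypothetical sample: 1000 requests with latencies of 1..1000 ms, one failure.
requests = [Request(latency_ms=float(i), ok=(i != 500)) for i in range(1, 1001)]

sli_availability = availability(requests)
sli_p99_ms = percentile([r.latency_ms for r in requests], 99)

print(f"availability: {sli_availability:.3%}, p99 latency: {sli_p99_ms} ms")
```

The SLO is then just a comparison of each indicator against the target you agreed on.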

For example, I've worked with a team that used Prometheus and Grafana to set up their SLO dashboards, with metrics like request latency and error rate. They were able to set targets and track their performance over time, which helped them identify areas for improvement. They aimed for a 99.9% availability target, which meant they could have around 1 minute and 26 seconds of downtime per day, or roughly 10 minutes per week.
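The downtime allowance behind an availability target is simple arithmetic: the allowed fraction of unavailability multiplied by the length of the window. A small sketch, where `allowed_downtime` is a helper name I've made up for illustration:

```python
from datetime import timedelta


def allowed_downtime(slo, window):
    """Downtime permitted by an availability SLO over a time window.

    slo is a fraction, e.g. 0.999 for 99.9%.
    """
    return timedelta(seconds=(1 - slo) * window.total_seconds())


per_day = allowed_downtime(0.999, timedelta(days=1))     # roughly 1 min 26 s
per_week = allowed_downtime(0.999, timedelta(days=7))    # roughly 10 min 5 s
per_month = allowed_downtime(0.999, timedelta(days=30))  # roughly 43 min 12 s

for label, value in [("day", per_day), ("week", per_week), ("30 days", per_month)]:
    print(f"99.9% over one {label}: {value}")
```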

Error budgets are another key concept from Google's SRE practice. The idea is that your Service Level Objective implies a certain amount of allowed unreliability (for a 99.9% target, the remaining 0.1%), and when you've used it up, you stop releasing new features until you've restored reliability. Smaller teams can apply this more informally, with SLO dashboards that everyone can see, a weekly review of error budget consumption, and a shared understanding that SLO breaches require investment in reliability.
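For the weekly review, error budget consumption can be tracked with a few lines of code. This is a rough sketch under my own assumptions: the budget is counted in failed requests rather than minutes, and the 25% "slow down" threshold and the policy wording are hypothetical, not part of Google's practice.

```python
def error_budget_remaining(slo, total_requests, failed_requests):
    """Fraction of the error budget still unspent (negative if overspent)."""
    allowed_failures = (1 - slo) * total_requests
    return 1 - failed_requests / allowed_failures


def release_policy(remaining):
    """Map remaining budget to an informal team policy (thresholds are assumptions)."""
    if remaining <= 0:
        return "freeze features, invest in reliability"
    if remaining < 0.25:
        return "slow down, review risky releases"
    return "release normally"


# Hypothetical week: 1,000,000 requests against a 99.9% SLO allows about 1000 failures.
healthy = error_budget_remaining(0.999, 1_000_000, 400)
overspent = error_budget_remaining(0.999, 1_000_000, 1200)

print(f"{healthy:.0%} left: {release_policy(healthy)}")
print(f"{overspent:.0%} left: {release_policy(overspent)}")
```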

I've also seen teams use tools like PagerDuty to manage their on-call rotations and incident response. This helps ensure that the right people are alerted at the right time, and that everyone knows their role in responding to an incident. It's also important to have a clear escalation policy in place, so that if an incident is not resolved within a certain timeframe, it gets escalated to the next level of support.
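An escalation policy is essentially a list of time thresholds and the people to page at each one. The sketch below models that idea in plain Python; it is not the PagerDuty API, and the delays and role names are hypothetical.

```python
from datetime import timedelta

# Hypothetical escalation ladder: (time since the incident opened, who gets paged).
ESCALATION_STEPS = [
    (timedelta(minutes=0), "primary on-call"),
    (timedelta(minutes=15), "secondary on-call"),
    (timedelta(minutes=45), "engineering manager"),
]


def who_to_page(unresolved_for):
    """Return everyone who should have been paged by this point in the incident."""
    return [target for delay, target in ESCALATION_STEPS if unresolved_for >= delay]


print(who_to_page(timedelta(minutes=5)))
print(who_to_page(timedelta(minutes=20)))
print(who_to_page(timedelta(hours=1)))
```

Writing the ladder down as data, even informally, makes it easy to review and hard to misremember at 3 a.m.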

I've found that toil reduction is a great way for engineering teams to invest in themselves. Toil is all the manual, repetitive work that comes with running a service, and it can really add up. Google's guideline is to spend no more than 50% of your time on toil, and for mid-size teams, identifying the top three sources of toil and automating them can make a big difference. Manual deployment steps, certificate renewals, and log triage are all great candidates for automation.
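Finding the top three sources of toil only needs a rough log of where the time goes. A minimal sketch, assuming a hand-kept list of (task, hours) entries; the task names and numbers are made up.

```python
from collections import Counter


def toil_summary(entries, total_hours):
    """Return (fraction of time spent on toil, top three toil sources by hours)."""
    hours = Counter()
    for task, spent in entries:
        hours[task] += spent
    toil_fraction = sum(hours.values()) / total_hours
    return toil_fraction, hours.most_common(3)


# Hypothetical week for one engineer, out of 40 working hours.
week = [
    ("manual deploy", 6),
    ("cert renewal", 2),
    ("log triage", 5),
    ("manual deploy", 4),
    ("access requests", 1),
]

fraction, top_three = toil_summary(week, total_hours=40)
print(f"toil: {fraction:.0%} of the week")
for task, spent in top_three:
    print(f"  {task}: {spent} h")
```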

Automation can be achieved through various means, such as using Ansible for deployment automation, or using a tool like Certbot for certificate renewals. I've seen teams automate around 30% of their toil in the first 6 months, which frees up a significant amount of time for more strategic work. However, it's also important to consider the trade-offs, such as the upfront cost of implementing automation, and the potential risks of over-automating.
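One way to weigh the upfront cost is a simple break-even estimate: how many weeks until the hours saved repay the hours spent building the automation. This is a back-of-the-envelope sketch with hypothetical numbers, and it ignores harder-to-measure factors like risk and context switching.

```python
def break_even_weeks(build_hours, manual_hours_per_week, upkeep_hours_per_week=0.0):
    """Weeks until automation pays for itself, or None if it never does."""
    saved_per_week = manual_hours_per_week - upkeep_hours_per_week
    if saved_per_week <= 0:
        return None  # maintaining the automation costs as much as the manual work
    return build_hours / saved_per_week


# Hypothetical: 40 hours to build, replaces 2.5 h/week of manual work,
# and needs 0.5 h/week of maintenance.
worth_it = break_even_weeks(40, 2.5, 0.5)
never = break_even_weeks(10, 1.0, 1.0)

print(f"pays off after {worth_it:.0f} weeks")
print(f"over-automated case: {never}")
```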

Sustainable on-call is also crucial for any team. You need high-quality alerts that only go off when something is really wrong, documented runbooks for every alert, a culture of blame-free postmortems, and a rotation that's big enough to give engineers a reasonable work-life balance. If you're getting alerted all the time for low-severity issues, it's going to burn out your team and make it harder to respond to real incidents.

I think one of the biggest mistakes teams make is not prioritising alert quality. If you're getting a lot of low-severity alerts, it's going to be hard to take any of them seriously, and you'll start to ignore them. But if you can filter out the noise and only alert on things that really matter, you'll be able to respond faster and more effectively. For instance, a team I worked with was able to reduce their alert noise by 40% by implementing a more sophisticated alert filtering system, which used machine learning to identify and suppress low-severity alerts.
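You don't need machine learning to get started on alert quality. A much simpler approach than the one that team used is to record whether each alert led to real action, and flag the rules that rarely do. The sketch below assumes a hypothetical history of (alert name, was it actionable) pairs and an arbitrary 50% threshold.

```python
from collections import defaultdict


def noisy_alerts(history, min_actionable_ratio=0.5):
    """Return alert names whose share of actionable firings is below the threshold."""
    fired = defaultdict(int)
    actionable = defaultdict(int)
    for name, was_actionable in history:
        fired[name] += 1
        actionable[name] += int(was_actionable)
    return sorted(
        name for name in fired
        if actionable[name] / fired[name] < min_actionable_ratio
    )


# Hypothetical review of last month's pages.
history = [
    ("disk_80_percent", False),
    ("disk_80_percent", False),
    ("disk_80_percent", True),
    ("api_error_rate_high", True),
    ("api_error_rate_high", True),
    ("cpu_spike", False),
]

candidates = noisy_alerts(history)
print("review or demote:", candidates)
```

Alerts on this list are candidates for retuning, demoting to a ticket, or deleting.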

Documented runbooks are also essential for sustainable on-call. You need to have a clear plan for every possible alert, so that anyone on the team can respond, not just the experts. This takes some upfront work, but it pays off in the long run. I've seen teams create runbooks using tools like Confluence or Google Docs, and make them easily accessible to the entire team.
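A cheap way to keep this honest is a check that every alert has a runbook link. The sketch below assumes you can export your alert names and keep a mapping from alert name to runbook URL; the names and URLs are placeholders.

```python
def alerts_missing_runbooks(alert_names, runbooks):
    """Return alerts with no runbook link (missing key or empty value)."""
    return sorted(name for name in alert_names if not runbooks.get(name))


# Hypothetical alert names and runbook locations.
alert_names = ["api_error_rate_high", "disk_80_percent", "cert_expiring"]
runbooks = {
    "api_error_rate_high": "https://wiki.example.com/runbooks/api-errors",
    "disk_80_percent": "",
}

missing = alerts_missing_runbooks(alert_names, runbooks)
print("alerts without a runbook:", missing)
```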

Finally, I think it's worth noting that on-call rotation size is critical. If you've only got two or three people on rotation, they're going to get burned out fast. In my view you need a minimum of 5-7 engineers to provide a reasonable work-life balance, and even that may not be enough depending on the complexity of your service and how often it pages.
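The effect of rotation size is easy to quantify. Assuming week-long shifts, one person on call at a time, and an even split (all simplifying assumptions on my part), each engineer's load looks like this:

```python
def on_call_weeks_per_year(engineers, weeks_per_year=52):
    """Weeks each engineer spends on call per year with an evenly shared rotation."""
    if engineers < 1:
        raise ValueError("rotation needs at least one engineer")
    return weeks_per_year / engineers


for size in (2, 3, 5, 7):
    weeks = on_call_weeks_per_year(size)
    print(f"{size} engineers: {weeks:.1f} weeks on call each, "
          f"{size - 1} week(s) off between shifts")
```

With two people, each engineer is on call every other week, which is half the year; with seven, it drops to roughly seven or eight weeks a year.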

I've seen teams that have implemented these SRE practices and it's made a huge difference. They're able to respond faster to incidents, and they're able to spend more time on real engineering work, rather than just keeping the lights on.