Chaos Engineering Gets Real

I've watched chaos engineering move from a Netflix innovation to a practice with real tooling, frameworks, and adoption in organisations that aren't hyperscalers. Chaos engineering in 2021 is far more accessible than it was in 2015, and that's a big deal for teams that don't operate at Netflix's scale.

The core idea behind chaos engineering is to intentionally inject failures into a production or production-like system to validate its resilience properties. You make a prediction based on the system's design, inject a failure, and then compare the prediction to what actually happens. Any discrepancies reveal gaps in resilience.
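To make that loop concrete, here is a minimal sketch of the predict, inject, compare cycle in Python. The `Experiment` class, `run_experiment` function, and the toy two-replica "system" are all made up for illustration; a real harness would call out to your fault injection tool and your monitoring stack instead.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Experiment:
    name: str
    hypothesis: str                    # what the design says should happen
    inject: Callable[[], None]         # starts the fault
    rollback: Callable[[], None]       # undoes the fault
    steady_state: Callable[[], bool]   # True while the system looks healthy


def run_experiment(exp: Experiment) -> dict:
    """Check steady state, inject the fault, re-check, and always roll back."""
    if not exp.steady_state():
        # Never inject into a system that is already unhealthy.
        return {"name": exp.name, "status": "skipped", "gap": False}
    try:
        exp.inject()
        held = exp.steady_state()
    finally:
        exp.rollback()
    # A gap is any case where the prediction did not hold under the fault.
    return {"name": exp.name, "status": "completed", "gap": not held}


# Toy system: two replicas, healthy while at least one is up (hypothetical).
replicas = {"a": True, "b": True}

demo = Experiment(
    name="kill-one-replica",
    hypothesis="service stays healthy with one replica down",
    inject=lambda: replicas.update(a=False),
    rollback=lambda: replicas.update(a=True),
    steady_state=lambda: any(replicas.values()),
)

result = run_experiment(demo)
print(result)
```

The two details I'd keep in any real version are the steady-state check before injection and the `finally` block, so the fault is removed even if the check itself throws.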

Netflix's Chaos Monkey is the best-known example. It randomly terminates production EC2 instances during business hours to verify that the Netflix service can survive individual instance failures. The Simian Army extends the idea: Latency Monkey injects artificial delays, Conformity Monkey flags instances that don't follow best practices, and Chaos Gorilla simulates the loss of an entire availability zone.

The key point is timing. Running chaos in production during business hours, when engineers are available to respond, surfaces real resilience gaps while someone is there to contain them. Run the same experiments in the middle of the night and you're more likely to cause an incident than to learn from one.

For Kubernetes, Chaos Mesh and LitmusChaos provide native chaos engineering capabilities. You define chaos experiments as Kubernetes custom resources covering faults such as pod failure, network delay, network partition, CPU stress, memory stress, and filesystem failure. Because the experiments are declarative resources, they can be applied from a CI/CD pipeline for automated resilience validation.
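As a rough illustration of what such a custom resource looks like, the sketch below builds a Chaos Mesh `NetworkChaos` object as a Python dict and prints it as JSON, which `kubectl apply -f -` accepts just like YAML. The field names follow the `chaos-mesh.org/v1alpha1` API as I understand it, so check them against the CRD version installed in your cluster. The resource name, namespace, and `app` label are hypothetical.

```python
import json


def network_delay_manifest(name, namespace, app_label, latency, percent, duration):
    """Build a Chaos Mesh NetworkChaos resource as a plain dict."""
    return {
        "apiVersion": "chaos-mesh.org/v1alpha1",
        "kind": "NetworkChaos",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "action": "delay",
            "mode": "fixed-percent",   # target a percentage of the matching pods
            "value": str(percent),     # the percentage is passed as a string
            "selector": {
                "namespaces": [namespace],
                "labelSelectors": {"app": app_label},
            },
            "delay": {"latency": latency},
            "duration": duration,      # the fault is removed after this long
        },
    }


manifest = network_delay_manifest(
    name="checkout-delay",     # hypothetical resource name
    namespace="shop",          # hypothetical namespace
    app_label="checkout",      # hypothetical pod label
    latency="500ms",
    percent=5,
    duration="3m",
)
print(json.dumps(manifest, indent=2))
```

Generating the manifest in code rather than hand-editing YAML is only one option. It is handy in a pipeline because the latency, percentage, and duration become parameters you can vary per environment.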

On one of my services, which runs on a 200-node EKS cluster, we wired Chaos Mesh to inject a 500 ms network delay on 5 % of the pods every hour. The first few runs tripped alerts in our Prometheus stack because the injected delay pushed request latency past our SLO threshold of 200 ms. We quickly realised we needed a separate metric that measured latency under fault injection. Without it, we could not tell normal load spikes from chaos-induced degradation. The lesson: isolate the chaos signal in your observability pipeline, or you will waste time chasing false alarms.
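Here is a small sketch of the idea of isolating the chaos signal, in plain Python rather than in Prometheus itself. It splits latency samples by whether a fault window was active and computes a percentile for each group. The sample data, window times, and function names are illustrative, not measurements from the cluster described above.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile; returns None for empty input."""
    ordered = sorted(values)
    if not ordered:
        return None
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def split_by_chaos(samples, windows):
    """Partition (timestamp, latency_ms) samples by whether a fault was active."""
    baseline, under_chaos = [], []
    for ts, latency_ms in samples:
        active = any(start <= ts < end for start, end in windows)
        (under_chaos if active else baseline).append(latency_ms)
    return baseline, under_chaos


# Illustrative samples: (seconds since start, request latency in ms).
samples = [(0, 80), (10, 95), (20, 90), (65, 610), (70, 590), (75, 120), (130, 85)]
windows = [(60, 120)]  # one fault window, in seconds

baseline, under_chaos = split_by_chaos(samples, windows)
SLO_MS = 200

print("baseline p95:", percentile(baseline, 95), "ms")
print("chaos p95:", percentile(under_chaos, 95), "ms")
print("baseline breaches SLO:", percentile(baseline, 95) > SLO_MS)
```

In a Prometheus setup the same split would more likely be done with a label or a separate series that alert rules can exclude, which this sketch does not attempt to show.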

I've found that running a GameDay is a great way to make chaos engineering a team practice. It's a structured chaos experiment session where the engineering team runs failure scenarios in a controlled environment and observes the system response. You define a set of failure scenarios in advance, run each one with the team watching dashboards and runbooks, and document any gaps between expected and actual behaviour.

During a GameDay we used a combination of Grafana dashboards and a dedicated Slack channel to surface the chaos events. We scripted the experiment start with a simple Helm hook so that the chaos pod spun up, ran for exactly three minutes, and then terminated. The team watched the error rates climb from 0.02 % to 2 % in real time, and our runbook instructed us to disable the downstream cache for the duration of the test. After the exercise we recorded a 30 % drop in mean time to recovery for similar incidents that later occurred in production, simply because the engineers had rehearsed the steps under pressure.
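The time-boxing is the part I'd copy first. The sketch below is not our Helm hook. It is a Python illustration of the same idea: hold a fault for a fixed duration, always clean it up, and abort early if the error rate crosses a limit. The abort threshold, the polling interval, and the fake clock used in the demo are assumptions for the example.

```python
import time
from contextlib import contextmanager


@contextmanager
def fault(start, stop):
    """Run `start`, and guarantee `stop` runs even if the body raises."""
    start()
    try:
        yield
    finally:
        stop()


def run_timeboxed(start, stop, error_rate, duration_s, abort_above,
                  poll_s=1.0, clock=time.monotonic, sleep=time.sleep):
    """Hold a fault for at most duration_s; abort if errors exceed the limit."""
    with fault(start, stop):
        deadline = clock() + duration_s
        while clock() < deadline:
            if error_rate() > abort_above:
                return "aborted"   # `stop` still runs via the context manager
            sleep(poll_s)
    return "completed"


class FakeClock:
    """Deterministic clock so the demo finishes instantly."""

    def __init__(self):
        self.now = 0.0

    def __call__(self):
        return self.now

    def sleep(self, seconds):
        self.now += seconds


events = []
fake = FakeClock()
outcome = run_timeboxed(
    start=lambda: events.append("start"),
    stop=lambda: events.append("stop"),
    error_rate=lambda: 0.02,   # 2 %, expressed as a fraction
    duration_s=180,            # three minutes
    abort_above=0.05,          # hypothetical abort threshold
    poll_s=5,
    clock=fake,
    sleep=fake.sleep,
)
print(outcome, events)
```

With real callables you would leave `clock` and `sleep` at their defaults and pass a function that reads the error rate from your monitoring system.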

GameDays also give the team a shared, documented list of gaps, which makes it easier to prioritise fixes and keeps chaos engineering from being an individual exercise. By running these experiments in a controlled environment, you can identify and fix resilience gaps before they become incidents.

Extending chaos beyond a single cloud required us to treat the blast radius as a first‑class concern. We used Terraform to spin up identical chaos experiments in both AWS us‑east‑1 and GCP europe‑west1, then used a simple Go wrapper to coordinate a simultaneous network partition between the two regions. The coordination added latency to our test run, but it revealed a hidden dependency on a DNS resolver that was only reachable from AWS. The fix was to add a secondary resolver in GCP, a change that would have been missed without a multi‑cloud fault injection.
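Our wrapper was written in Go and driven by Terraform outputs, so the following is not that code. It is a Python sketch of just the coordination step, using a barrier so that no region injects until every region is ready. The injector callables are placeholders that stand in for whatever actually starts the partition in each cloud.

```python
import threading
import time


def coordinated_start(injectors, timeout_s=10.0):
    """Start every injector at nearly the same moment.

    `injectors` maps a region name to a zero-argument callable.
    Returns the monotonic start time recorded for each region that ran.
    """
    barrier = threading.Barrier(len(injectors))
    started = {}
    lock = threading.Lock()

    def worker(region, inject):
        try:
            barrier.wait(timeout=timeout_s)   # block until all regions are ready
        except threading.BrokenBarrierError:
            return                            # a region was not ready: do nothing
        t = time.monotonic()
        inject()
        with lock:
            started[region] = t

    threads = [
        threading.Thread(target=worker, args=(region, inject))
        for region, inject in injectors.items()
    ]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return started


calls = []
started = coordinated_start({
    "aws-us-east-1": lambda: calls.append("aws"),       # placeholder injector
    "gcp-europe-west1": lambda: calls.append("gcp"),    # placeholder injector
})
skew = max(started.values()) - min(started.values())
print(sorted(calls), f"start skew: {skew:.4f}s")
```

The barrier also limits blast radius in a small way: if one side never becomes ready, the wait times out and neither side injects, so you don't end up with a one-sided partition you didn't plan for.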

The fact that chaos engineering has moved beyond hyperscalers is a testament to its effectiveness. It's no longer just a niche practice, but a mainstream way to validate system resilience and improve overall reliability.

As I see it, the future of chaos engineering is all about making it more accessible and integrated into everyday engineering practice. With the right tooling and frameworks, any organisation can adopt chaos engineering and improve its system resilience.