Chaos engineering has moved from a Netflix innovation to an established practice with dedicated tooling, frameworks, and adoption in organisations that are not hyperscalers. In 2021 it is considerably more accessible than it was in 2015.

The chaos engineering hypothesis

Chaos engineering is the practice of intentionally injecting failures into a production or production-like system to validate that the system's resilience properties hold. The approach is hypothesis-driven: before injecting a failure, predict what will happen based on the system's design; after injection, compare the prediction to the observed behaviour. Any discrepancy reveals a resilience gap.
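The predict, inject, compare loop can be sketched in a few lines of Python. This is a minimal illustration against a toy in-memory system, not a real chaos tool: `run_experiment`, `ExperimentResult`, and the three-replica "service" are all hypothetical names invented for this example.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ExperimentResult:
    steady_state_before: bool
    # None means the experiment was aborted before any failure was injected.
    hypothesis_held: Optional[bool]


def run_experiment(
    steady_state: Callable[[], bool],
    inject: Callable[[], None],
    rollback: Callable[[], None],
) -> ExperimentResult:
    """Run one experiment; the hypothesis is that steady_state() stays True
    while the injected failure is active."""
    # Do not inject into a system that is already unhealthy.
    if not steady_state():
        return ExperimentResult(steady_state_before=False, hypothesis_held=None)
    inject()
    try:
        held = steady_state()  # compare the prediction with observed behaviour
    finally:
        rollback()  # always undo the failure, even if the check raises
    return ExperimentResult(steady_state_before=True, hypothesis_held=held)


# Toy system (assumption): three replicas, "healthy" with at least two up.
replicas = {"a": True, "b": True, "c": True}


def healthy() -> bool:
    return sum(replicas.values()) >= 2


def kill_one() -> None:
    replicas["a"] = False


def restore() -> None:
    replicas["a"] = True


result = run_experiment(healthy, kill_one, restore)
print(result)
```

In a real experiment the steady-state check would more likely be a query against a metrics or health endpoint, and the injection would be performed by a chaos tool rather than a dictionary update.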

Chaos Monkey and the Simian Army

Netflix's Chaos Monkey randomly terminates production EC2 instances during business hours to verify that the Netflix service survives individual instance failures. The broader Simian Army (Latency Monkey, Conformity Monkey, and Chaos Gorilla, which takes out a whole availability zone) extended the practice to other failure modes. The key principle: chaos injected in production during business hours, when engineers are available to respond, surfaces real resilience gaps; chaos injected in the middle of the night mostly produces incidents.
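The business-hours principle amounts to a guard in front of the random selection. The sketch below is not Chaos Monkey's actual implementation: the 09:00 to 15:00 weekday window, the function names, and the instance IDs are assumptions made for illustration, and no cloud API is called.

```python
import random
from datetime import datetime, time
from typing import Optional, Sequence

# Assumed window for this example, not Netflix's actual schedule.
BUSINESS_START = time(9, 0)
BUSINESS_END = time(15, 0)


def in_business_hours(now: datetime) -> bool:
    # weekday() returns 0 for Monday, so values below 5 are Monday to Friday.
    return now.weekday() < 5 and BUSINESS_START <= now.time() < BUSINESS_END


def pick_victim(
    instances: Sequence[str], now: datetime, rng: random.Random
) -> Optional[str]:
    """Return one randomly chosen instance ID, or None outside the window."""
    if not instances or not in_business_hours(now):
        return None
    return rng.choice(instances)


instances = ["i-aaa", "i-bbb", "i-ccc"]  # hypothetical instance IDs
rng = random.Random(42)

tuesday_morning = datetime(2021, 6, 15, 10, 30)
tuesday_night = datetime(2021, 6, 15, 3, 0)

print(pick_victim(instances, tuesday_morning, rng))  # one of the three IDs
print(pick_victim(instances, tuesday_night, rng))    # None: nobody is on hand
```

Passing `now` and `rng` in as arguments, rather than reading the clock and global random state inside the function, keeps the guard easy to test.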

Chaos Mesh and LitmusChaos for Kubernetes

Chaos Mesh (originally developed by PingCAP) and LitmusChaos (originally developed by MayaData) are both CNCF projects that provide Kubernetes-native chaos engineering. Experiments are defined as Kubernetes custom resources covering pod failure (for example, killing a randomly selected pod), network delay (added latency on traffic to a service), network partition, CPU stress, memory stress, and filesystem faults. Because experiments are ordinary Kubernetes resources, they can be applied from CI/CD pipelines for automated resilience validation.
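To make the custom-resource model concrete, here is a small Python sketch that builds a pod-kill experiment as a plain dictionary and prints it as JSON. The field names follow the Chaos Mesh `PodChaos` resource (`chaos-mesh.org/v1alpha1`) as commonly documented, but they should be checked against the version installed in the cluster. The experiment name, namespaces, and `app` label are hypothetical.

```python
import json


def pod_kill_experiment(
    name: str, namespace: str, target_namespace: str, app_label: str
) -> dict:
    """Build a PodChaos manifest that kills one pod matching the selector."""
    return {
        "apiVersion": "chaos-mesh.org/v1alpha1",
        "kind": "PodChaos",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "action": "pod-kill",  # kill the pod rather than pause it
            "mode": "one",         # pick a single pod from the matches
            "selector": {
                "namespaces": [target_namespace],
                "labelSelectors": {"app": app_label},
            },
        },
    }


# Hypothetical names: adjust to the cluster under test.
manifest = pod_kill_experiment(
    name="kill-one-checkout-pod",
    namespace="chaos-testing",
    target_namespace="staging",
    app_label="checkout",
)
print(json.dumps(manifest, indent=2))
```

Since `kubectl apply -f -` accepts JSON as well as YAML, a pipeline step could pipe this output straight into the cluster and then run its usual health checks against the target service.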

The GameDay format

A GameDay is a structured chaos experiment session in which the engineering team runs failure scenarios in a controlled environment and observes the system's response. The format: define a set of failure scenarios in advance, run each scenario with the team watching dashboards and following runbooks, document gaps between expected and actual behaviour, and prioritise fixes. GameDays are the organisational mechanism for making chaos engineering a team practice rather than an individual exercise.
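The documentation step of a GameDay can be as simple as a structured log of expected versus observed behaviour. The sketch below assumes a hypothetical `ScenarioRun` record and a 1 to 5 severity scale assigned by the team; the scenarios and outcomes are invented purely to show the shape of the data.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ScenarioRun:
    scenario: str
    expected: str
    observed: str
    matched_expectation: bool  # judged by the team during the session
    severity: int = 0          # hypothetical 1-5 scale, only set for gaps


def prioritised_gaps(runs: List[ScenarioRun]) -> List[ScenarioRun]:
    """Return the runs that missed expectations, most severe first."""
    gaps = [r for r in runs if not r.matched_expectation]
    return sorted(gaps, key=lambda r: r.severity, reverse=True)


# Invented example data, not results from a real GameDay.
runs = [
    ScenarioRun("queue consumer crash", "alert fires within 5 min",
                "no alert fired", matched_expectation=False, severity=3),
    ScenarioRun("one cache node lost", "latency rises, no errors",
                "latency rose, no errors", matched_expectation=True),
    ScenarioRun("primary database failover", "replica promoted within 30 s",
                "writes failed for 4 min", matched_expectation=False, severity=5),
]

for gap in prioritised_gaps(runs):
    print(f"[sev {gap.severity}] {gap.scenario}: "
          f"expected '{gap.expected}', observed '{gap.observed}'")
```

Keeping the expectation next to the observation makes the gap explicit, and the sorted list gives the team a starting order for fixes.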