Netflix popularised chaos engineering with Chaos Monkey in 2011. A decade later, the practice has evolved from a Netflix-specific innovation to a discipline with standardised practices that organisations at various scales can implement.

What chaos engineering is for

Chaos engineering is not about randomly breaking things. It is about verifying that your system behaves correctly when components fail, and discovering failure modes before they cause an incident in production. The approach is hypothesis-driven: define the expected behaviour under a specific failure condition, inject that failure, and observe whether the system behaves as expected. The value is not the chaos but what you learn from how the system responds.
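The define, inject, observe loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a real chaos tool: `Experiment`, `run_experiment`, `ToyService`, and `replica_loss_experiment` are hypothetical names invented for this example, and the "system" is an in-memory stand-in for a replicated service.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Experiment:
    name: str
    hypothesis: str                    # expected behaviour, stated up front
    steady_state: Callable[[], bool]   # probe: is the system behaving as expected?
    inject: Callable[[], None]         # introduce the failure
    rollback: Callable[[], None]       # undo the failure


@dataclass
class Result:
    name: str
    hypothesis_held: bool
    note: str


def run_experiment(exp: Experiment) -> Result:
    # Never inject into a system that is already unhealthy.
    if not exp.steady_state():
        return Result(exp.name, False, "aborted: not in steady state before injection")
    exp.inject()
    try:
        held = exp.steady_state()
    finally:
        exp.rollback()  # always restore, even if the probe raises
    note = "hypothesis held" if held else "hypothesis refuted: investigate"
    return Result(exp.name, held, note)


class ToyService:
    """Hypothetical in-memory stand-in for a replicated service."""

    def __init__(self, replicas: int) -> None:
        self.healthy: List[bool] = [True] * replicas

    def kill(self, index: int) -> None:
        self.healthy[index] = False

    def restore(self, index: int) -> None:
        self.healthy[index] = True

    def can_serve(self) -> bool:
        # The service can answer as long as at least one replica is up.
        return any(self.healthy)


def replica_loss_experiment(service: ToyService) -> Experiment:
    return Experiment(
        name="lose-one-replica",
        hypothesis="requests are still served when replica 0 is down",
        steady_state=service.can_serve,
        inject=lambda: service.kill(0),
        rollback=lambda: service.restore(0),
    )


if __name__ == "__main__":
    print(run_experiment(replica_loss_experiment(ToyService(replicas=2))))
    print(run_experiment(replica_loss_experiment(ToyService(replicas=1))))
```

With two replicas the hypothesis holds; with one it is refuted, which is the useful outcome: the experiment has surfaced a single point of failure before a real outage did.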

Azure Chaos Studio

Azure Chaos Studio (public preview in late 2021, generally available since November 2023) provides managed chaos experiments for Azure resources: VM shutdowns, network disconnection and isolation, CPU and memory pressure, and AKS pod failures. Because it integrates with Azure Monitor, the effects of an experiment show up in the same dashboards you use for production monitoring. For teams that want structured chaos experiments without building the tooling themselves, Chaos Studio is the managed path on Azure.

Game days and their limitations

A game day is a planned chaos exercise where the engineering team injects failures and observes the response. Game days are valuable for discovering failure modes and building team confidence in the runbooks. Their limitation is that they are episodic. A system that passes a game day in March might develop a new fragility in June as code changes accumulate. Continuous chaos, meaning fault injection that runs automatically in staging environments as part of the deployment pipeline, catches such regressions when they are introduced rather than at the next scheduled exercise.
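One way to picture continuous chaos is as a gate in the pipeline: a step that runs a fixed set of experiments against staging and fails the deployment if any of them fails. The sketch below assumes each experiment is wrapped as a function returning `True` when the system behaved as expected. `run_chaos_gate` and `gate_exit_code` are hypothetical names for this illustration, not part of any CI product.

```python
from typing import Callable, Dict, List


def run_chaos_gate(experiments: Dict[str, Callable[[], bool]]) -> List[str]:
    """Run every experiment and return the names of those that failed."""
    failed: List[str] = []
    for name, experiment in experiments.items():
        try:
            passed = experiment()
        except Exception:
            # An experiment that crashes proves nothing reassuring: count it as a failure.
            passed = False
        if not passed:
            failed.append(name)
    return failed


def gate_exit_code(failed: List[str]) -> int:
    """Map the result to a process exit code; non-zero fails the pipeline step."""
    return 1 if failed else 0
```

A pipeline step would pass the result of `gate_exit_code` to `sys.exit`. Running every experiment rather than stopping at the first failure is a deliberate choice here, so that one deployment reports all the regressions it introduced.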

What to inject first

The highest-value chaos experiments for most distributed systems are: terminating individual service instances (verifies load balancing and auto-scaling), injecting latency into dependencies (verifies timeout and circuit breaker behaviour), and disrupting the connection to the primary database (verifies failover and recovery). Start with the failure modes that have already affected production, so that each chaos experiment doubles as a regression test for a previous incident.
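To make the latency experiment concrete, here is a small, deterministic sketch of what "verifies timeout and circuit breaker behaviour" can mean in practice. Everything in it is an assumption made for illustration: `FakeDependency` simulates latency as a number instead of sleeping, `CircuitBreaker` is a deliberately simplified breaker with no half-open state, and the thresholds are arbitrary.

```python
from typing import Callable


class TimeoutExceeded(Exception):
    pass


class CircuitOpen(Exception):
    pass


class FakeDependency:
    """Hypothetical downstream service; latency is simulated, not slept."""

    def __init__(self, latency_ms: int = 20) -> None:
        self.latency_ms = latency_ms
        self.calls = 0

    def call(self, timeout_ms: int) -> str:
        self.calls += 1
        if self.latency_ms > timeout_ms:
            raise TimeoutExceeded(f"{self.latency_ms}ms exceeds {timeout_ms}ms budget")
        return "ok"


class CircuitBreaker:
    """Simplified breaker: opens after N consecutive timeouts, never half-opens."""

    def __init__(self, failure_threshold: int = 3) -> None:
        self.failure_threshold = failure_threshold
        self.consecutive_failures = 0

    @property
    def is_open(self) -> bool:
        return self.consecutive_failures >= self.failure_threshold

    def call(self, fn: Callable[[], str]) -> str:
        if self.is_open:
            raise CircuitOpen("failing fast without calling the dependency")
        try:
            result = fn()
        except TimeoutExceeded:
            self.consecutive_failures += 1
            raise
        self.consecutive_failures = 0
        return result


def fetch(dep: FakeDependency, breaker: CircuitBreaker, timeout_ms: int = 100) -> str:
    # Degrade to a fallback instead of propagating the failure to the caller.
    try:
        return breaker.call(lambda: dep.call(timeout_ms))
    except (TimeoutExceeded, CircuitOpen):
        return "fallback"


def latency_experiment(injected_latency_ms: int = 500, requests: int = 5):
    dep = FakeDependency()
    breaker = CircuitBreaker(failure_threshold=3)
    dep.latency_ms = injected_latency_ms  # the fault injection
    responses = [fetch(dep, breaker) for _ in range(requests)]
    return responses, dep.calls, breaker.is_open
```

The hypothesis under test is twofold: every caller gets a fallback rather than an error, and once the breaker opens the slow dependency stops receiving traffic. In this sketch that means five requests produce five fallbacks but only three calls reach the dependency.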