Chaos engineering isn't just for game days

Netflix popularised chaos engineering with Chaos Monkey in 2011. A decade later, the practice has grown from a Netflix-specific innovation into a discipline with established practices that organisations of any size can adopt.

In 2022, teams using Gremlin or Chaos Toolkit often hit roadblocks when staging environments lacked sufficient traffic to reproduce production-scale failures. For example, a fintech team injected latency into their payment gateway dependency during a game day but missed the cascading timeouts because staging carried 60% less traffic than production. The false negative persisted until they set up traffic mirroring to replay live traffic patterns in staging, which increased the experiment’s realism by 40%.
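One way to see why lower traffic can hide a cascading timeout is Little's law: the average number of in-flight requests equals the arrival rate multiplied by the time each request spends in the system. The sketch below is only an illustration. The pool size, request rate, and injected latency are invented numbers, not figures from the team described above; only the 60% traffic gap comes from the example.

```python
def in_flight(requests_per_second: float, latency_seconds: float) -> float:
    """Little's law: average concurrent requests = arrival rate x time in system."""
    return requests_per_second * latency_seconds


# All figures below are illustrative assumptions, not measurements.
POOL_SIZE = 50                        # hypothetical connection pool limit
PROD_RPS = 100.0                      # hypothetical production request rate
STAGING_RPS = PROD_RPS * (1 - 0.60)   # staging carries 60% less traffic
INJECTED_LATENCY_S = 1.0              # latency the experiment adds to the dependency

prod_load = in_flight(PROD_RPS, INJECTED_LATENCY_S)
staging_load = in_flight(STAGING_RPS, INJECTED_LATENCY_S)

# The pool is exhausted only when in-flight requests exceed its size.
prod_saturated = prod_load > POOL_SIZE
staging_saturated = staging_load > POOL_SIZE

print(f"production: {prod_load:.0f} in flight, saturated={prod_saturated}")
print(f"staging:    {staging_load:.0f} in flight, saturated={staging_saturated}")
```

With these assumed numbers the same injected latency exhausts the pool at production volume but leaves headroom in staging, so the experiment passes in staging even though production would fail.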

Chaos engineering isn’t just about tools; it’s about operational discipline. A retail company with 30 microservices discovered that 75% of their game day findings stemmed from misconfigured health checks. By codifying chaos experiments into their CI/CD pipeline using Spacelift (for Terraform-based infrastructure) and integrating with Datadog for metrics, they reduced mean time to detect (MTTD) for failure modes by 30% over six months.

Azure Chaos Studio, which entered public preview in late 2021 and became generally available in November 2023, provides managed chaos experiments for Azure resources, including VM shutdowns, network partition simulation, CPU and memory pressure, and AKS pod failures. The integration with Azure Monitor means experiment effects are visible in the same dashboards used for production monitoring. For teams that want structured chaos experiments without building their own tooling, Chaos Studio offers a managed path on Azure.

A game day is a planned chaos exercise where the engineering team injects failures and observes the response. Game days are valuable for discovering failure modes and building team confidence in runbooks. However, they are episodic. A system that passes a game day in March might develop a new fragility in June as code changes accumulate. Continuous chaos, running fault injection automatically in staging environments as part of the deployment pipeline, provides ongoing resilience assurance.
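A pipeline gate for continuous chaos can be as small as a function that runs each experiment and turns the results into an exit code. The following is a minimal sketch, not a reference implementation: `run_chaos_gate` and the experiment names are hypothetical, and the lambdas stand in for real checks that would inject a fault in staging and verify the service stayed healthy.

```python
from typing import Callable, Dict


def run_chaos_gate(experiments: Dict[str, Callable[[], bool]]) -> int:
    """Run each experiment and return a process exit code (0 = all passed)."""
    failed = []
    for name, experiment in experiments.items():
        try:
            passed = experiment()
        except Exception as exc:
            # An experiment that crashes is treated as a failure, not skipped.
            print(f"{name}: ERROR ({exc})")
            failed.append(name)
            continue
        print(f"{name}: {'PASS' if passed else 'FAIL'}")
        if not passed:
            failed.append(name)
    return 1 if failed else 0


# Hypothetical experiments: placeholders for real fault injection checks.
experiments = {
    "remove-one-instance": lambda: True,
    "dependency-latency": lambda: True,
}
exit_code = run_chaos_gate(experiments)
# In a deployment pipeline the step would end with sys.exit(exit_code),
# so a failed experiment blocks promotion to the next stage.
```

Returning an exit code rather than raising keeps the gate independent of any particular CI system, since most pipelines treat a non-zero exit as a failed step.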

The highest-value chaos experiments for most distributed systems are removing individual service instances to verify load balancing and auto-scaling; injecting latency into dependencies to verify timeout and circuit breaker behaviour; and disrupting the connection to the primary database to verify failover and recovery. Start with failure modes that have affected production in the past, using chaos engineering as a regression test for previous incidents.
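The latency experiment can be illustrated with a toy simulation. This sketch makes several simplifying assumptions: `FakeDependency` simulates latency instead of sleeping, the `CircuitBreaker` class is a deliberately minimal stand-in with no half-open state or reset timer, and all thresholds are invented. It shows what the experiment is meant to verify: once latency exceeds the timeout, the caller should stop hammering the dependency.

```python
class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without trying the dependency."""


class CircuitBreaker:
    """Minimal breaker: opens after `max_failures` consecutive failures."""

    def __init__(self, max_failures: int = 3):
        self.max_failures = max_failures
        self.failures = 0

    @property
    def is_open(self) -> bool:
        return self.failures >= self.max_failures

    def call(self, func, *args, **kwargs):
        if self.is_open:
            raise CircuitOpenError("dependency calls are short-circuited")
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            raise
        self.failures = 0  # any success resets the failure count
        return result


class FakeDependency:
    """Stand-in for a remote service; latency is simulated, not slept."""

    def __init__(self):
        self.injected_latency_s = 0.0
        self.calls = 0

    def fetch(self, timeout_s: float) -> str:
        self.calls += 1
        latency = 0.01 + self.injected_latency_s
        if latency > timeout_s:
            raise TimeoutError(f"no response within {timeout_s}s")
        return "ok"


dependency = FakeDependency()
breaker = CircuitBreaker(max_failures=3)

# Baseline: the call succeeds before any fault is injected.
baseline = breaker.call(dependency.fetch, timeout_s=0.2)

# Inject latency above the timeout and record what the caller experiences.
dependency.injected_latency_s = 0.5
outcomes = []
for _ in range(5):
    try:
        breaker.call(dependency.fetch, timeout_s=0.2)
        outcomes.append("ok")
    except TimeoutError:
        outcomes.append("timeout")
    except CircuitOpenError:
        outcomes.append("rejected")

print(outcomes)
print("calls that reached the dependency:", dependency.calls)
```

The useful signal is the gap between attempted calls and calls that reach the dependency. If every attempt still reaches the slow service, the timeout or breaker is not doing its job.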

Chaos engineering has come a long way since its inception at Netflix. It's no longer just about randomly breaking things, but about systematically verifying system behaviour under failure conditions. With tools like Azure Chaos Studio, teams can now run managed chaos experiments and integrate them with their existing monitoring setup.

The key to successful chaos engineering is to focus on learning from the system's response to failures. This involves defining clear hypotheses, injecting failures, and observing the system's behaviour. By doing so, teams can identify potential failure modes and build confidence in their runbooks.
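The hypothesis, inject, observe loop can be sketched as a small data structure. The shape loosely echoes the steady-state hypothesis, method, and rollback sections of a Chaos Toolkit experiment, but `Experiment` and `run_experiment` here are hypothetical names, and the in-memory set of instances is a toy stand-in for a real service.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Experiment:
    title: str
    steady_state: Callable[[], bool]  # hypothesis: True while the system is healthy
    inject: Callable[[], None]        # introduces the fault
    rollback: Callable[[], None]      # restores the system afterwards


def run_experiment(exp: Experiment) -> str:
    """Check the hypothesis, inject the fault, re-check, and always roll back."""
    if not exp.steady_state():
        return "aborted"  # do not inject faults into an already unhealthy system
    try:
        exp.inject()
        held = exp.steady_state()
    finally:
        exp.rollback()
    return "hypothesis held" if held else "weakness found"


# Toy system: a set of running instance names (an assumption for illustration).
instances = {"a", "b", "c"}

experiment = Experiment(
    title="Service stays healthy with at least two instances after losing one",
    steady_state=lambda: len(instances) >= 2,
    inject=lambda: instances.discard("a"),
    rollback=lambda: instances.add("a"),
)
result = run_experiment(experiment)
print(f"{experiment.title}: {result}")
```

Checking the steady state before injecting anything matters: a failed check at that point says nothing about the fault, so the run is aborted rather than reported as a finding.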

As organisations continue to adopt chaos engineering practices, it's essential to move beyond game days and towards continuous chaos. This means integrating fault injection into the deployment pipeline and running it automatically in staging environments, so that resilience is checked with every change rather than a few times a year.

Chaos engineering is not a one-time exercise but an ongoing process that requires continuous effort and attention. Teams that build it into their workflow keep finding weaknesses as the system changes, learn how it behaves under stress, and gain confidence that it can withstand failures when they occur.