Resilience patterns for distributed systems are well-established in theory. Applying them in production, however, requires consistent implementation across a service mesh that grows to dozens or hundreds of services.
Circuit breakers
The circuit breaker pattern prevents cascading failures: when a downstream service starts failing, the circuit opens and subsequent calls fail fast without attempting the downstream call. After a configured cooldown, the circuit moves to half-open and allows a test request through; if the test succeeds, the circuit closes. Polly (for .NET) and Resilience4j (for the JVM) implement the pattern, as did Hystrix before Netflix placed it in maintenance mode. A circuit breaker requires tuning: the failure threshold for opening, the cooldown before half-open, and which error types count as failures (a 404 usually should not trip the breaker; a timeout usually should).
Bulkheads and isolation
The bulkhead pattern limits the blast radius of a failing component by isolating resources. Dedicated thread pools, semaphores, or connection pool limits per downstream dependency ensure that a single slow or failing service can exhaust only its own allocation, not the resources shared by every other call path. In Kubernetes, resource quotas and limits per namespace are the bulkhead at the infrastructure level.
Retry policies with exponential backoff
Retrying failed requests recovers from transient failures. The naive implementation (retry immediately on failure) turns a brief service hiccup into a thundering herd problem. Exponential backoff with jitter (wait on the order of base × 2^n plus a random jitter before attempt n) reduces retry storms. The retry policy must be calibrated to the expected transient failure duration: too few retries leave recoverable errors unrecovered; too many overload the failing service.
Timeouts everywhere
Every service call must have a timeout. Without explicit timeouts, a slow downstream service can hold goroutines or threads indefinitely, eventually exhausting the pool. The timeout should be set from the expected response time plus a margin, not the maximum theoretical response time. Services with multi-second p99 latencies create timeout chains: each caller's budget must cover the sum of its downstream timeouts, so slack compounds as it cascades up the call graph. Measuring actual downstream latencies and setting timeouts at 2-3x the p95 is a reasonable starting point.