Fault-Tolerant Microservices

Resilience patterns for distributed systems are well established in theory, but implementing them in production is a different story. As a service mesh grows to dozens or hundreds of services, applying these patterns consistently becomes crucial.

The circuit breaker pattern prevents cascading failures by opening the circuit when a downstream service starts failing. While the circuit is open, calls are rejected immediately instead of being sent downstream, so callers fail fast and the struggling service gets room to recover. After a timeout the breaker moves to a half-open state and lets a trial call through to check whether the service has recovered. A circuit breaker needs tuning: the failure threshold for opening, the timeout before half-open, and which error types count as failures. Polly, Hystrix, and Resilience4j implement this pattern.
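
To make the three tuning knobs concrete, here is a minimal sketch of a breaker in Python. It is an illustration under simplifying assumptions, not how Polly, Hystrix, or Resilience4j are implemented: it counts consecutive failures instead of an error rate over a rolling window, it is not thread-safe, and the names (CircuitBreaker, failure_threshold, reset_timeout, counted_errors) are my own.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    """Minimal consecutive-failure circuit breaker (illustrative, not thread-safe)."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0,
                 counted_errors=(Exception,), clock=time.monotonic):
        self.failure_threshold = failure_threshold  # consecutive failures before opening
        self.reset_timeout = reset_timeout          # seconds before moving to half-open
        self.counted_errors = counted_errors        # error types that count as failures
        self.clock = clock                          # injectable so tests can control time
        self.failures = 0
        self.opened_at = None                       # None means the circuit is closed

    @property
    def state(self):
        if self.opened_at is None:
            return "closed"
        if self.clock() - self.opened_at >= self.reset_timeout:
            return "half-open"
        return "open"

    def call(self, func, *args, **kwargs):
        if self.state == "open":
            # Fail fast: do not touch the downstream service at all.
            raise CircuitOpenError("circuit is open")
        try:
            result = func(*args, **kwargs)
        except self.counted_errors:
            self.failures += 1
            # A failed trial call in half-open state, or too many failures
            # while closed, (re)opens the circuit and restarts the timer.
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        # Any success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

Wrapping a call then looks like breaker.call(fetch_profile, user_id), where fetch_profile is whatever function talks to the downstream service. Errors that are not listed in counted_errors pass through without affecting the breaker, which is the "which errors count" knob from the paragraph above.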

In my experience, the hardest part is finding the right values for those thresholds. With Hystrix we started with a 5% error threshold over a 10-second window and a 30-second half-open timeout, but the breaker flapped whenever a brief GC pause pushed the error rate just over the limit. We ended up adding a 2-second grace period and widening the window to 30 seconds, which cut the false positives by half while still protecting the downstream service during real outages. The metrics dashboard showed the open-circuit count dropping from dozens per hour to single digits after the change.

Bulkheads limit the blast radius of a failing component by giving each dependency its own bounded share of resources. This is typically done with separate thread pools, semaphores, or connection pool limits per downstream dependency. In Kubernetes, per-namespace resource quotas and limits serve as bulkheads at the infrastructure level.
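
The semaphore variant is the easiest to sketch. The example below is a minimal illustration in Python, assuming one semaphore per downstream dependency and rejection instead of queueing when the budget is used up. The Bulkhead class, the dependency names, and the limits are all hypothetical.

```python
import threading


class BulkheadFullError(Exception):
    """Raised when a dependency's concurrency budget is exhausted."""


class Bulkhead:
    """Caps concurrent calls to one downstream dependency with a semaphore."""

    def __init__(self, name, max_concurrent):
        self.name = name
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, func, *args, **kwargs):
        # Non-blocking acquire: reject immediately instead of queueing,
        # so a slow dependency cannot tie up the caller's threads.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFullError(f"bulkhead '{self.name}' is full")
        try:
            return func(*args, **kwargs)
        finally:
            # Always give the slot back, even if the call raised.
            self._slots.release()


# One bulkhead per downstream dependency (names and limits are hypothetical).
bulkheads = {
    "orders-db": Bulkhead("orders-db", max_concurrent=2),
    "payments-api": Bulkhead("payments-api", max_concurrent=2),
}
```

Because each dependency has its own semaphore, saturating "orders-db" only produces BulkheadFullError for calls to that dependency; calls through "payments-api" still get their slots.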

We tried to rely solely on namespace quotas and quickly learned why that is not enough. A single microservice that opened 200 HTTP connections to a database exhausted the pod's connection pool, and every other service in the same namespace started seeing 503 errors. The fix was to add per-deployment connection pool limits in Istio's DestinationRule and to set a 100m CPU request with a 200m CPU limit per pod. That way the noisy neighbor stayed confined and the rest of the mesh kept serving traffic.

Retrying failed requests recovers from transient failures, but a naive implementation can turn a brief hiccup into a thundering herd: every client retries at once and hammers a service that is already struggling. Exponential backoff with jitter reduces retry storms by waiting roughly 2^n seconds plus a random jitter before the n-th retry, so clients spread out instead of retrying in lockstep. The retry policy must be calibrated to the expected duration of transient failures: too few retries leave recoverable errors unrecovered, while too many overload the failing service.
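
As a rough sketch of that policy in Python: the delay before retry n is 2^n times a base of one second, plus a random amount of jitter. The cap on the delay, the default values, and the function names are assumptions of mine for illustration, not a recommendation.

```python
import random
import time


def backoff_delay(attempt, base=1.0, cap=30.0, jitter=0.5, rng=random.random):
    """Delay in seconds before retry number `attempt` (0-based).

    Computes min(cap, base * 2**attempt) and adds up to `jitter` seconds
    of random noise so that clients do not retry in lockstep.
    """
    exponential = min(cap, base * (2 ** attempt))
    return exponential + rng() * jitter


def retry(func, max_attempts=3, retry_on=(Exception,), sleep=time.sleep, **backoff):
    """Call func up to max_attempts times, backing off between failures."""
    for attempt in range(max_attempts):
        try:
            return func()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(backoff_delay(attempt, **backoff))
```

Passing a narrow retry_on tuple (for example only connection and timeout errors) keeps the loop from retrying failures that will never succeed, such as validation errors.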

One lesson that sticks with me is that retried operations must be idempotent. We had a POST endpoint that wrote a row to MySQL; a client retry on a timeout caused duplicate rows because the operation was not safe to repeat. The fix was to move the write behind a Kafka topic and make the consumer safe under at-least-once delivery, so that a redelivered message does not write a second row, which eliminated the duplicate problem. After we limited retries to three attempts with 150 ms of jitter, the observed success rate rose to 99.9% and the added latency stayed under 200 ms.
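
One common way to make a write safe to repeat is to key it on a unique request ID and let the database reject the second attempt. The sketch below shows the idea only; it is not our actual consumer. It assumes every message carries a request_id (a hypothetical field), and it uses an in-memory SQLite table as a stand-in for the real store.

```python
import sqlite3


def make_store():
    """In-memory SQLite table; the PRIMARY KEY on request_id enforces uniqueness."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders (request_id TEXT PRIMARY KEY, payload TEXT NOT NULL)"
    )
    return conn


def handle_message(conn, request_id, payload):
    """Apply the write at most once per request_id.

    Returns True if a row was inserted, False if this request_id was
    already processed (a redelivered message or a client retry).
    """
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO orders (request_id, payload) VALUES (?, ?)",
                (request_id, payload),
            )
    except sqlite3.IntegrityError:
        # Duplicate key: the write already happened, so treat this as a no-op.
        return False
    return True
```

With this shape, processing the same message twice leaves exactly one row, so retries and redeliveries stop being dangerous.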

Every service call needs a timeout so that slow downstream services cannot hold goroutines or threads indefinitely. Set the timeout from the expected response time plus a margin, not from the maximum theoretical response time. Services with multi-second p99 latencies create timeout chains that cascade through the call graph, because every caller upstream has to wait at least as long. Measuring actual downstream latencies and setting timeouts at 2-3x the p95 is a reasonable starting point.
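
The 2-3x p95 rule is easy to turn into a small helper. The sketch below assumes you already have a list of measured latencies in milliseconds. It uses a nearest-rank percentile, and the 2.5x multiplier and 50 ms floor are illustrative defaults of mine, not measured values.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of latency samples."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    # Nearest-rank method: the smallest value with at least pct% of samples at or below it.
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def suggested_timeout(latencies_ms, multiplier=2.5, floor_ms=50.0):
    """Starting-point timeout: multiplier times the observed p95, never below floor_ms."""
    return max(floor_ms, multiplier * percentile(latencies_ms, 95))
```

For example, if the observed p95 is 95 ms, the suggested timeout is 237.5 ms. The result is a starting point to revisit as real latency data comes in, not a value to set once and forget.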