Cloud applications fail. Any individual operation on cloud infrastructure is less reliable than the same operation in a well-maintained on-premises data centre, but the cloud offers more building blocks for resilience. Designing for failure is therefore the prerequisite for high availability.

The retry pattern with jitter

Transient failures in cloud services are common: network blips, service throttling, momentary unavailability. Retrying the failed operation usually recovers from them. A naive retry (an immediate retry in a tight loop) amplifies load on an already-stressed service. Exponential backoff with jitter reduces retry storms: the base delay doubles on each attempt (1s before the first retry, 2s before the second, 4s before the third, and so on), and a random offset is applied to each delay (for example 2s +/- 0.5s) so that clients that failed at the same moment do not all retry at the same moment. The Polly library for .NET provides retry, circuit breaker, and timeout policies.
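The backoff schedule described above can be sketched in a few lines. This is a minimal illustration in Python rather than a Polly policy; `TransientError`, `backoff_delay`, and `retry` are hypothetical names, and the +/- 25% jitter is an assumption chosen so that the second delay is 2s +/- 0.5s.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for failures that are worth retrying."""


def backoff_delay(attempt, base=1.0, cap=30.0, jitter=0.25):
    """Delay in seconds before retry number `attempt` (1-based).

    The base delay doubles each attempt (1s, 2s, 4s, ...) up to `cap`,
    then a random offset of +/- `jitter` (a fraction) is applied.
    """
    delay = min(cap, base * 2 ** (attempt - 1))
    return delay * (1 + random.uniform(-jitter, jitter))


def retry(operation, max_attempts=4, sleep=time.sleep):
    """Call `operation` until it succeeds or `max_attempts` calls have failed."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # give up: let the caller see the failure
            sleep(backoff_delay(attempt))
```

Only `TransientError` is retried here; in a real client the set of retryable errors (timeouts, throttling responses) would have to be chosen per dependency. Passing `sleep` as a parameter keeps the function testable without real waiting.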

The health endpoint pattern

Every service should expose a health endpoint that returns its operational status. A health check verifies that the application process is running, that its dependencies (database, cache, downstream services) are reachable, and that any application-specific readiness conditions are met. The endpoint is consumed by load balancers (to remove unhealthy instances from rotation), container orchestrators (for example, Kubernetes readiness probes), and monitoring systems (to alert on unhealthy status).
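As a rough sketch of such an endpoint, the following uses only the Python standard library. The `/healthz` path, the port, and the `check_database` and `check_cache` probes are assumptions for illustration; real probes would query the actual dependencies.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def check_database():
    """Hypothetical probe; a real one might run a trivial query."""
    return True


def check_cache():
    """Hypothetical probe; a real one might send a ping to the cache."""
    return True


CHECKS = {"database": check_database, "cache": check_cache}


def health_report(checks):
    """Run every check and return (http_status, body_dict)."""
    results = {}
    for name, check in checks.items():
        try:
            results[name] = "ok" if check() else "failed"
        except Exception as exc:  # a probe that raises counts as unhealthy
            results[name] = f"error: {exc}"
    healthy = all(value == "ok" for value in results.values())
    body = {"status": "healthy" if healthy else "unhealthy", "checks": results}
    return (200 if healthy else 503), body


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/healthz":
            self.send_error(404)
            return
        status, body = health_report(CHECKS)
        payload = json.dumps(body).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def serve(port=8080):
    """Start the health server; blocks until interrupted."""
    HTTPServer(("0.0.0.0", port), HealthHandler).serve_forever()
```

Returning 503 when any check fails lets a load balancer or orchestrator act on the status code alone, while the JSON body tells an operator which dependency is at fault.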

Idempotent operations as the safety net

Idempotent operations can be safely retried: executing the same operation multiple times produces the same result as executing it once. Making operations idempotent therefore enables retries without duplicate side effects. Common techniques are idempotency keys for financial operations, conditional writes (if-not-exists or if-version-matches) for state updates, and deduplication in message consumers. Every operation in a distributed system that may be retried (which is every operation) should be designed for idempotency.
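Two of these techniques can be illustrated with an in-memory sketch. `PaymentProcessor`, `VersionedStore`, and `VersionConflict` are hypothetical names; a production system would keep the key and version records in durable storage shared by all instances, which this example does not attempt.

```python
import threading


class PaymentProcessor:
    """Hypothetical payment handler that deduplicates on an idempotency key."""

    def __init__(self):
        self._results = {}  # idempotency key -> response already returned
        self._lock = threading.Lock()
        self.charges_made = 0

    def charge(self, idempotency_key, account, amount_cents):
        with self._lock:
            if idempotency_key in self._results:
                # A retry of a request we already handled: replay the response.
                return self._results[idempotency_key]
            self.charges_made += 1  # stands in for the real side effect
            result = {
                "charge_id": self.charges_made,
                "account": account,
                "amount_cents": amount_cents,
            }
            self._results[idempotency_key] = result
            return result


class VersionConflict(Exception):
    """Raised when a conditional write sees an unexpected version."""


class VersionedStore:
    """Hypothetical key-value store with if-version-matches writes."""

    def __init__(self):
        self._items = {}  # key -> (version, value)

    def get(self, key):
        # Version 0 means "does not exist yet".
        return self._items.get(key, (0, None))

    def put_if_version(self, key, value, expected_version):
        current_version, _ = self.get(key)
        if current_version != expected_version:
            raise VersionConflict(
                f"{key}: expected v{expected_version}, found v{current_version}"
            )
        self._items[key] = (current_version + 1, value)
        return current_version + 1
```

With an idempotency key, a client that times out and resends the same request receives the original response instead of triggering a second charge. With a conditional write, a replayed update fails with `VersionConflict` rather than being applied twice.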

Graceful degradation

A gracefully degrading system provides partial functionality when a dependency is unavailable. An e-commerce checkout that cannot reach the recommendation service still completes the checkout without recommendations. A search page that cannot reach the personalisation service returns unranked results rather than an error page. Graceful degradation requires: identifying which features depend on which services, designing fallback behaviour for each dependency failure, and testing that the fallbacks actually work.
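The checkout example might look roughly like this in code. `render_checkout`, `with_fallback`, and `RecommendationUnavailable` are hypothetical names, and the recommendation client is passed in as a plain function so that the fallback path can be exercised in a test.

```python
import logging

logger = logging.getLogger("checkout")


class RecommendationUnavailable(Exception):
    """Hypothetical error raised when the recommendation service is unreachable."""


def with_fallback(call, fallback, handled):
    """Return (result, degraded): the result of call(), or `fallback` on a handled error."""
    try:
        return call(), False
    except handled as exc:
        logger.warning("dependency failed, using fallback: %s", exc)
        return fallback, True


def render_checkout(cart, fetch_recommendations):
    """Build the checkout page; recommendations are optional, the total is not."""
    recommendations, degraded = with_fallback(
        lambda: fetch_recommendations(cart),
        fallback=[],
        handled=(RecommendationUnavailable, TimeoutError),
    )
    return {
        "cart": cart,
        "total_cents": sum(item["price_cents"] for item in cart),
        "recommendations": recommendations,
        "degraded": degraded,
    }
```

Only the errors listed in `handled` trigger the fallback, so a programming error in the checkout itself still surfaces. The `degraded` flag gives monitoring something to count, which helps confirm that the fallbacks are exercised and actually work.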