Distributed systems engineering has a set of hard-won principles that distinguish systems that remain operable at scale from systems that accumulate operational debt.
In a distributed system, partial failure is the normal case. Some components are slow or unavailable while others are healthy. Systems that treat partial failure as exceptional compound failures. Instead, systems designed for partial failure implement timeouts on every external call, circuit breakers that prevent overloading failing services, bulkheads that limit the blast radius of a single dependency failure, and graceful degradation for non-critical features.
For instance, I recall an incident with a popular e-commerce platform where a third-party payment gateway went down, causing the platform to accumulate a large number of failed payment attempts. The platform's failure to implement circuit breakers and bulkheads resulted in a cascading failure that took hours to resolve. This incident highlighted the importance of designing for partial failure and implementing robust circuit breakers and bulkheads.
Every service interaction has failure modes. Making them explicit in code, runbooks, and operational procedures prevents the surprises that cause extended outages. For every external call, document the failure modes and the application's response to each. Code that handles these explicitly is more resilient and more debuggable.
Measuring the right metrics is crucial for understanding the behavior of a distributed system. For example, I worked with a team that implemented a distributed logging system. Initially, they measured the latency within each service, but this didn't provide insights into the end-to-end user experience. They later added metrics for end-to-end latency and error rates, which revealed issues with the system's performance and helped them optimize it.
You can't improve what you don't measure. For distributed systems, measure the end-to-end latency that users experience, not just the latency within each service. Measure the error rate for each user-visible operation, not just internal error rates. Measure the correlation between internal system health and user experience. The measurements that matter are those that change when user experience changes.
Complex distributed protocols are difficult to implement correctly and expensive to debug. Prefer simpler protocols with well-understood semantics. At-most-once delivery with explicit idempotency at the application layer, at-least-once delivery with application-level deduplication, or asynchronous messaging with poison message handling are good options. The simplest protocol that meets the requirements has the fewest failure modes.
For example, we used Apache Kafka for messaging in a high-throughput system. We chose it because of its simplicity and well-understood semantics. However, we had to implement additional features to handle poison messages, which added complexity to the system. In hindsight, we could have chosen a simpler messaging system like RabbitMQ, which has built-in support for poison message handling.
Designing for partial failure means anticipating that some parts of the system will fail. This approach helps prevent cascading failures and makes the system more resilient.
Explicitly handling failure modes helps prevent surprises during outages. It's essential to document and implement these failure modes in code and operational procedures.
Simpler protocols are easier to implement and debug. They also have fewer failure modes, making the system more reliable.