Microservices Data Patterns

I've seen it time and time again: a monolithic application with a single database, and then the shift to microservices. Suddenly, that database becomes a bottleneck, unable to handle the increased load and complexity of distributed services. The monolith database that served a monolithic application does not serve a distributed services architecture.

In my experience, the simplest way to manage microservices data is for each service to own its data and be the sole access point for it. Other services access the data through the owning service's API, or via event-driven data synchronisation. This pattern, where each service is responsible for its own data, is the baseline for microservices data isolation.

One of the immediate challenges of this approach is replacing queries that joined tables across domain boundaries. These queries must now be replaced with API calls or event-driven data synchronisation. However, the benefit is clear: services can be deployed independently because their data schema changes do not affect other services.

For example, I've seen cases where a simple query that joined two tables in a monolithic database turned into a complex API call that required coordination between three different services. Using tools like Apache Kafka or Amazon Kinesis for event-driven data synchronisation can help to simplify this process, but it still requires careful planning and management. In one instance, we reduced the average query response time from 500 milliseconds to 50 milliseconds by using event-driven data synchronisation with Apache Kafka, but it required significant investment in tooling and training for our development team.

When it comes to distributed transactions, I've found that the saga pattern is often the best solution. This pattern replaces distributed transactions with a sequence of local transactions, each publishing an event to trigger the next step. Compensating transactions undo the work of earlier steps if a later step fails. However, sagas add significant complexity compared to local transactions, and should only be used when the business process genuinely spans service boundaries.

In practice, implementing sagas requires careful consideration of the trade-offs between consistency, availability, and performance. For instance, using a saga to manage a distributed transaction that involves multiple services can provide high consistency, but it may also reduce availability due to the increased complexity of the transaction. In one case, we had to balance the need for consistency in a distributed transaction with the need for high availability, and we ended up using a combination of sagas and local transactions to achieve the desired trade-off. We used a tool like Netflix's Conductor to manage the workflow and ensure that the saga was properly executed.

Another key pattern I've encountered is event-driven data synchronisation. Services that need data from other domains can maintain a local cache derived from events published by the owning service. The Order service, for example, subscribes to Customer events and maintains a local read model of customer data it needs. The cache is eventually consistent, and while this approach has its challenges, it can greatly reduce the need for synchronous service-to-service API calls.

One challenge with event-driven data synchronisation is handling event replay for cache rebuild, which can be particularly difficult when dealing with high-volume event streams. For example, if an event stream has 100,000 events per second, rebuilding a cache can take significant time and resources. To address this challenge, we used a tool like Apache Flink to manage the event stream and provide a scalable solution for event replay. We also had to carefully manage schema evolution in events to ensure that the local cache remained consistent with the owning service's data.

One of the biggest challenges of event-driven data synchronisation is handling event replay for cache rebuild, managing schema evolution in events, and accepting eventual consistency. However, when done correctly, this approach can provide a significant performance boost and improve the overall resilience of the system.

Finally, I've found that the CQRS read model is often the best solution for microservices queries that span multiple services. This pattern involves a dedicated query service that subscribes to events from multiple upstream services and maintains a denormalised read model optimised for the query. The query service is a read-only projection, and while it requires careful management to maintain eventual consistency between the upstream write models and the query service's read model, it can greatly simplify complex queries and improve overall system performance.