I've seen distributed tracing become a must-have for organisations running microservices. The insight it provides into how requests flow across services is crucial for debugging issues that affect multiple services.
So how does distributed tracing work? A distributed trace tracks a request as it passes through multiple services. Each service adds its own record, called a span, containing the operation name, start time, duration, and relevant tags. Spans are linked together using trace context (a shared trace ID plus the ID of the parent span), which is propagated in request headers.
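To make the propagation step concrete, here is a minimal sketch of how a service might create and forward a W3C Trace Context `traceparent` header, which is one common way of carrying trace context. The helper names (`start_trace`, `continue_trace`, `new_span_id`) are my own for illustration, and the sketch skips some validation the specification asks for (such as rejecting all-zero IDs), so treat it as a teaching aid rather than a replacement for a real propagator.

```python
import re
import secrets

# traceparent layout: version-traceid-parentid-flags, all lowercase hex
TRACEPARENT_RE = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


def new_span_id() -> str:
    """Return a random 8-byte span ID as 16 hex characters."""
    return secrets.token_hex(8)


def start_trace(sampled: bool = True) -> str:
    """Build the traceparent value for the first span of a new trace."""
    trace_id = secrets.token_hex(16)  # 16 random bytes -> 32 hex characters
    flags = "01" if sampled else "00"  # lowest bit is the "sampled" flag
    return f"00-{trace_id}-{new_span_id()}-{flags}"


def continue_trace(incoming: str) -> str:
    """Return the traceparent value to send on outgoing calls.

    The trace ID and flags are carried over unchanged; only the parent
    span ID is replaced with the ID of this service's own span.
    """
    match = TRACEPARENT_RE.match(incoming)
    if match is None:
        # Header missing or malformed: start a fresh trace instead.
        return start_trace()
    _version, trace_id, _parent_id, flags = match.groups()
    return f"00-{trace_id}-{new_span_id()}-{flags}"


if __name__ == "__main__":
    inbound = start_trace()
    print("received:", inbound)
    print("forwarded:", continue_trace(inbound))
```

Because every hop keeps the same trace ID and swaps in its own span ID, the backend can later stitch the spans back into a single tree.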
OpenTelemetry has become the standard for distributed tracing instrumentation. It was formed by merging OpenTracing and OpenCensus under the CNCF umbrella and provides automatic instrumentation for common languages and runtimes such as .NET, Node.js, Java, Python, and Go, though coverage varies by language. This means you can often instrument HTTP clients, databases, and message queue clients with little or no change to your application code.
The OpenTelemetry Collector receives telemetry data from your services, processes it, and routes it to your chosen backend. This makes it easy to get started with distributed tracing: you use the SDK to instrument your services, point them at the Collector, and can change backends later without touching the services themselves.
One of the challenges with distributed tracing is the volume of data it produces. To reduce this volume, you can use sampling strategies. Head-based sampling decides whether to trace a request at the entry point, while tail-based sampling buffers spans and makes the decision after the full trace is assembled. There's also adaptive sampling, which adjusts its rates based on error rates and latency outliers.
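As a rough illustration of head-based sampling, the sketch below makes the keep-or-drop decision from the trace ID alone, so every service that sees the same trace ID reaches the same answer. The function name `head_sample` and the choice of using the last 16 hex characters of the ID are assumptions for this example, not the behaviour of any particular SDK.

```python
import secrets


def head_sample(trace_id_hex: str, rate: float) -> bool:
    """Decide at the entry point whether to record a trace.

    The decision depends only on the trace ID, so it is made before
    anything is known about errors or latency later in the request.
    """
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0 and 1")
    # Treat the low 64 bits of the trace ID as a uniform random number.
    value = int(trace_id_hex[-16:], 16)
    return value < rate * (1 << 64)


if __name__ == "__main__":
    # Rough check: with a 1% rate, about 1 in 100 random trace IDs is kept.
    ids = [secrets.token_hex(16) for _ in range(100_000)]
    kept = sum(head_sample(t, 0.01) for t in ids)
    print(f"kept {kept} of {len(ids)} traces")
```

The point to notice is what the function does not take as input: there is no error flag and no duration, because neither exists yet when the decision is made.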
Head-based sampling is simpler to implement, but because the decision is made before the outcome of the request is known, it risks dropping exactly the traces you care about. For example, we once missed a 500ms latency spike in a downstream service because the head-based sample rate was set to 1% and the spike only showed up in a 0.01% tail of requests. Tail-based sampling avoids this but requires buffering spans in memory until the trace is complete, which can increase latency and memory usage by 20-30% in high-throughput systems. Adaptive sampling also tends to fall short in production at scale if its error thresholds aren't tuned carefully. Teams typically spend 2-3 days calibrating these thresholds to avoid false positives.
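For contrast, here is a minimal sketch of a tail-based decision. It buffers spans per trace and only decides once the root span arrives, keeping the trace if any span errored or was slow. The `Span` and `TailSampler` classes and the 500ms default threshold are hypothetical, and the assumption that the root span always finishes last is a simplification; real systems usually wait for a timeout instead.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Span:
    trace_id: str
    name: str
    duration_ms: float
    error: bool = False
    is_root: bool = False


class TailSampler:
    """Buffer spans per trace and decide once the root span has arrived."""

    def __init__(self, latency_threshold_ms: float = 500.0) -> None:
        self.latency_threshold_ms = latency_threshold_ms
        self._buffer: Dict[str, List[Span]] = defaultdict(list)

    def buffered_spans(self) -> int:
        """Number of spans currently held in memory awaiting a decision."""
        return sum(len(spans) for spans in self._buffer.values())

    def add(self, span: Span) -> Optional[List[Span]]:
        """Buffer a span; on the root span, return the trace if it is kept."""
        self._buffer[span.trace_id].append(span)
        if not span.is_root:
            return None  # trace still in flight, keep buffering
        spans = self._buffer.pop(span.trace_id)
        interesting = any(
            s.error or s.duration_ms >= self.latency_threshold_ms
            for s in spans
        )
        return spans if interesting else None


if __name__ == "__main__":
    sampler = TailSampler()
    sampler.add(Span("t1", "SELECT orders", 620.0))
    kept = sampler.add(Span("t1", "GET /orders", 650.0, is_root=True))
    print("kept spans:", [s.name for s in kept or []])
```

The `_buffer` dictionary is where the memory cost comes from: every in-flight trace sits there until its decision is made.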
When it comes to storing and querying your tracing data, Jaeger and Zipkin are two popular open-source backends. Jaeger provides a query UI for exploring traces, visualising dependency graphs, and comparing traces.
Jaeger's deployment model is flexible: an all-in-one setup suits development, while a distributed setup suits production. In production, you can run the collector, query, and storage components separately, and use Elasticsearch or Cassandra as your storage backend, which lets you scale each part as your services grow. However, teams often underestimate the storage cost. Elasticsearch clusters for Jaeger typically require 3-5x the data size in disk space due to indexing overhead. Cassandra-based setups are cheaper at scale but introduce 20-30% higher query latency.
Zipkin is another option for storing and querying your tracing data. While it doesn't have all the features of Jaeger, it's still a popular choice for many organisations. One difference we ran into is very large traces: Zipkin struggles with traces exceeding 10,000 spans, whereas Jaeger handles them with a 10-15% performance drop. We once migrated from Zipkin to Jaeger because our API traces had 30,000+ spans due to nested message queues. Ultimately, the choice between the two will depend on your specific needs.
In my experience, distributed tracing is a powerful tool for understanding how your services interact with each other. By combining OpenTelemetry, a backend such as Jaeger or Zipkin, and a sampling strategy that fits your traffic, you can get the insight you need to debug production issues and improve the performance of your microservices.