Distributed tracing is no longer optional for organisations running microservices. The visibility it provides into cross-service request flows is essential for debugging production issues that span multiple services.

The basics of distributed tracing

A distributed trace tracks a request as it flows through multiple services. Each service contributes one or more spans to the trace: records of the service name, operation, start time, duration, and relevant tags. Spans are linked by trace context (a trace ID and parent span ID) propagated in request headers. The assembled trace shows the full call tree for a request: which services were called, in what order, and how long each took.
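The header propagation described above is standardised by the W3C Trace Context specification, whose `traceparent` header carries the trace ID, parent span ID, and sampling flags. The sketch below shows the mechanics with stdlib Python only; the helper names are illustrative, not from any particular SDK.

```python
import re
import secrets

# A W3C Trace Context "traceparent" header has the shape:
#   version-traceid-spanid-flags, e.g. 00-<32 hex>-<16 hex>-01
TRACEPARENT_RE = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")

def extract_context(headers):
    """Return (trace_id, parent_span_id) from incoming headers, or None."""
    m = TRACEPARENT_RE.match(headers.get("traceparent", ""))
    if not m:
        return None
    return m.group(2), m.group(3)

def child_traceparent(trace_id, new_span_id):
    """Build the header to propagate to the next downstream service."""
    return f"00-{trace_id}-{new_span_id}-01"

# A service receives a request, starts its own span, and propagates context:
incoming = {"traceparent": "00-" + "a" * 32 + "-" + "b" * 16 + "-01"}
trace_id, parent_span_id = extract_context(incoming)
span_id = secrets.token_hex(8)  # new 16-hex-char span ID for this service's span
outgoing_header = child_traceparent(trace_id, span_id)
```

The trace ID stays constant across the whole request; only the span ID changes hop by hop, which is what lets the backend stitch the spans into one call tree.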

OpenTelemetry as the standard

OpenTelemetry (OTel), formed in 2019 from the merger of OpenTracing and OpenCensus under the CNCF umbrella, has become the de-facto standard for distributed tracing instrumentation; its tracing specification reached 1.0 in 2021. SDKs for the common languages (.NET, Node.js, Java, Python, Go) provide automatic instrumentation that covers HTTP clients, databases, and message-queue clients, often with little or no code change. The OTel Collector aggregates telemetry and routes it to the backend of choice.
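As a rough sketch of the routing role the Collector plays, a minimal traces pipeline config might look like the following; the endpoint address is a placeholder for wherever your backend listens, and whether to disable TLS depends on your deployment.

```yaml
receivers:
  otlp:                         # accept OTLP from instrumented services
    protocols:
      grpc:
      http:

processors:
  batch:                        # batch spans before export to cut overhead

exporters:
  otlp:
    endpoint: jaeger-collector:4317   # placeholder backend address
    tls:
      insecure: true                  # assumption: plaintext inside the cluster

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp]
```

The pipeline abstraction is what decouples instrumentation from the backend: services only ever speak OTLP to the Collector, and swapping Jaeger for another backend is an exporter change.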

Sampling strategies

Tracing every request at full fidelity produces enormous data volumes and storage costs. Sampling reduces the trace volume. Head-based sampling decides at the entry point whether to trace a request (simple and cheap, but the decision is made before the outcome is known, so it can miss errors and tail-latency outliers). Tail-based sampling buffers spans and makes the sampling decision after the full trace is assembled (captures anomalies at the cost of buffering latency and complexity). Adaptive sampling adjusts rates based on error rates and latency outliers.
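Head-based samplers are typically deterministic over the trace ID, so that every service seeing the same trace makes the same keep-or-drop decision and traces are never half-sampled. A minimal sketch of that idea (the function name and hashing choice are illustrative, not a specific SDK's implementation):

```python
import hashlib

def head_sample(trace_id: str, rate: float) -> bool:
    """Deterministic head-based sampler: hash the trace ID into [0, 1)
    and keep the trace iff it falls under the sampling rate."""
    digest = hashlib.sha256(trace_id.encode()).digest()
    # First 8 bytes as an unsigned integer, normalised into [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < rate

# Keeping roughly 10% of 10,000 traces:
kept = sum(head_sample(f"trace-{i:032x}", 0.10) for i in range(10_000))
```

Because the decision is a pure function of the trace ID, no coordination between services is needed; this cheapness is exactly why head-based sampling cannot react to errors or latency discovered later in the request, which is the gap tail-based sampling fills.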

The Jaeger and Zipkin backends

Jaeger (a graduated CNCF project) and Zipkin (originated at Twitter) are the two most established open-source distributed tracing backends. Jaeger provides a query UI for trace exploration, dependency graph visualisation, and trace comparison. Its deployment model supports both an all-in-one binary for development and a distributed production deployment with separate collector, query, and storage components; Elasticsearch and Cassandra are the primary supported production storage backends.