OpenTelemetry in .NET for Distributed Tracing

OpenTelemetry's tracing support for .NET reached 1.0 in 2021, with stable metrics following in 2022. Its main advantage is that you can change your observability backend without re-instrumenting your entire application.

OpenTelemetry standardizes three observability signals: traces, which map the complete path of a request through distributed services; metrics, which are numeric measurements over time; and logs, which are discrete events with timestamps and attributes. Because the signals are standardized, a single set of instrumentation can produce data that any compatible backend can ingest. You instrument your application once and then export to Jaeger, Zipkin, Datadog, Azure Monitor, or any other backend that accepts OTLP or has a dedicated exporter.

For .NET applications, OpenTelemetry provides instrumentation libraries for common frameworks such as ASP.NET Core, HttpClient, and Entity Framework Core, along with several third-party libraries. Adding OpenTelemetry to an ASP.NET Core application takes three NuGet packages and a few lines of startup configuration. That setup produces distributed traces that include the service name, operation name, HTTP status, and duration, without any changes to your request-handling code. As a rough rule of thumb, this automatic instrumentation delivers about 80% of the observability value.
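As a rough illustration of that startup configuration, here is a minimal sketch of a Program.cs for an ASP.NET Core app. It assumes the three packages are OpenTelemetry.Extensions.Hosting, OpenTelemetry.Instrumentation.AspNetCore, and OpenTelemetry.Exporter.OpenTelemetryProtocol; the service name "orders-api" and the root endpoint are made up for the example, and your exporter choice may differ.

```csharp
// Program.cs (ASP.NET Core web project with implicit usings enabled).
// Assumed NuGet packages:
//   OpenTelemetry.Extensions.Hosting
//   OpenTelemetry.Instrumentation.AspNetCore
//   OpenTelemetry.Exporter.OpenTelemetryProtocol
using OpenTelemetry.Resources;
using OpenTelemetry.Trace;

var builder = WebApplication.CreateBuilder(args);

builder.Services.AddOpenTelemetry()
    // "orders-api" is a hypothetical service name shown on every span.
    .ConfigureResource(resource => resource.AddService("orders-api"))
    .WithTracing(tracing => tracing
        // Creates a span for each incoming HTTP request.
        .AddAspNetCoreInstrumentation()
        // Sends spans over OTLP; with no options this targets a local
        // collector or backend on the default OTLP port.
        .AddOtlpExporter());

var app = builder.Build();

// Placeholder endpoint so there is something to trace.
app.MapGet("/", () => "Hello");

app.Run();
```

Switching backends in this sketch would mean replacing the AddOtlpExporter() line (or just pointing OTLP at a different endpoint) while the instrumentation lines stay as they are.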

I've seen this play out in production, where we used OpenTelemetry with Azure Monitor to track down performance bottlenecks in our application. In one instance, we reduced the average response time of our API by 30% just by optimizing a single slow database query. The detailed tracing information from OpenTelemetry is what let us pinpoint that query as the source of the problem.

Another example that comes to mind is when we were using OpenTelemetry with Jaeger to monitor a microservices-based application. The tracing data showed us an issue in one of the services that was causing a cascading failure and a significant outage. With the traces in hand, we quickly identified the root cause and implemented a fix, cutting the downtime by several hours. This experience highlights the value of having a standardized observability framework like OpenTelemetry.

When you need to go beyond the auto-instrumented frameworks, .NET's System.Diagnostics.Activity API provides a clean and efficient model for custom instrumentation. An Activity is the .NET representation of an OpenTelemetry span, and the tags you add to it become span attributes. The API, including the ActivitySource type used to create activities, is built into the Base Class Library (BCL) starting from .NET 5, so custom instrumentation has no dependency on the OpenTelemetry packages; you only need them for the export configuration. For instance, you can use the Activity API to instrument a custom repository layer in your application and track the performance of specific database operations.
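To make the repository example concrete, here is a minimal sketch that uses only BCL types. OrderRepository, the source name "Shop.Orders", the tag names, and the hard-coded count are all hypothetical; RepositoryTracingDemo registers a plain ActivityListener so the sketch works without the OpenTelemetry SDK installed.

```csharp
using System.Collections.Generic;
using System.Diagnostics;

// Hypothetical repository, used only for illustration.
public sealed class OrderRepository
{
    // The source name is an assumption. To export these spans, the same name
    // would be registered with the OpenTelemetry SDK via AddSource("Shop.Orders").
    private static readonly ActivitySource Source = new ActivitySource("Shop.Orders");

    public int CountOrders(string customerId)
    {
        // StartActivity returns null when nothing is listening or the span
        // is not sampled, hence the null-conditional calls below.
        using var activity = Source.StartActivity("OrderRepository.CountOrders");
        activity?.SetTag("customer.id", customerId);

        int count = 3; // stand-in for a real database query

        activity?.SetTag("orders.count", count);
        return count;
    }
}

public static class RepositoryTracingDemo
{
    // Listens to the source with a BCL listener and returns the names of
    // the activities that were started and stopped during the call.
    public static List<string> Run()
    {
        var finished = new List<string>();

        using var listener = new ActivityListener
        {
            ShouldListenTo = source => source.Name == "Shop.Orders",
            Sample = (ref ActivityCreationOptions<ActivityContext> _) =>
                ActivitySamplingResult.AllDataAndRecorded,
            ActivityStopped = activity => finished.Add(activity.OperationName),
        };
        ActivitySource.AddActivityListener(listener);

        new OrderRepository().CountOrders("customer-42");
        return finished;
    }
}
```

In a real application the listener would not be needed: calling AddSource("Shop.Orders") on the tracer provider builder plays that role, and the repository code itself stays free of any OpenTelemetry package reference.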

The main trade-off to consider with OpenTelemetry is the overhead of tracing. Per request the overhead is generally minimal, but it adds up at high request volumes. In one of our applications we traced every single request, which generated a large volume of trace data and drove up both storage costs and processing time. To mitigate this, we implemented sampling. That cut the volume, but it also means some requests are never recorded, so a rare failure may not have a trace to look at. Weigh cost against completeness before deciding how much to trace.
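One common way to set up sampling is sketched below, using the samplers that ship with the OpenTelemetry .NET SDK. The 10% ratio is an arbitrary example value, not the figure we used, and the package list is the same assumption as for any basic ASP.NET Core setup.

```csharp
// Program.cs (ASP.NET Core web project with implicit usings enabled).
// Assumed NuGet packages:
//   OpenTelemetry.Extensions.Hosting
//   OpenTelemetry.Instrumentation.AspNetCore
//   OpenTelemetry.Exporter.OpenTelemetryProtocol
using OpenTelemetry.Trace;

var builder = WebApplication.CreateBuilder(args);

builder.Services.AddOpenTelemetry()
    .WithTracing(tracing => tracing
        .AddAspNetCoreInstrumentation()
        // Keep roughly 10% of traces that start in this service (example
        // ratio). ParentBasedSampler makes a request that arrives with an
        // existing trace context follow the upstream decision, so a trace
        // is either recorded in every service or in none.
        .SetSampler(new ParentBasedSampler(new TraceIdRatioBasedSampler(0.10)))
        .AddOtlpExporter());

var app = builder.Build();
app.MapGet("/", () => "Hello");
app.Run();
```

The ratio is the knob for the cost versus completeness trade-off: a higher value keeps more traces at the price of more data to store and process.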

Another useful feature of OpenTelemetry is baggage propagation. This mechanism passes key-value context along with a distributed request as it crosses service boundaries. Common use cases include passing a tenant ID, user ID, or feature flag state. Baggage travels in HTTP headers and is readable in every service that handles the request, with no explicit parameter passing. This is particularly useful for correlating activity per tenant in a multi-tenant system.
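As a small sketch of the tenant ID case, the helper below wraps the Baggage type from the OpenTelemetry.Api package. TenantContext and the key "tenant.id" are hypothetical names chosen for the example, and it assumes the instrumentation on both sides of each HTTP call is configured to propagate baggage.

```csharp
using OpenTelemetry; // Baggage is defined in the OpenTelemetry.Api package

// Hypothetical helper, used only for illustration.
public static class TenantContext
{
    private const string TenantKey = "tenant.id"; // example key name

    // Call in the service that first learns the tenant, for example right
    // after authentication. The value is stored in the current baggage.
    public static void Set(string tenantId) =>
        Baggage.SetBaggage(TenantKey, tenantId);

    // Call in any service handling the same request. Returns null when no
    // upstream service set the value.
    public static string Current() =>
        Baggage.GetBaggage(TenantKey);
}
```

A downstream service could then call TenantContext.Current() inside a request handler and, for example, copy the value onto the current span as a tag so traces can be filtered by tenant.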

I've also found that pairing OpenTelemetry with tools like Prometheus and Grafana adds even more value. For example, you can have Prometheus scrape metrics from your application and then visualize them in Grafana. That gives you a clearer picture of how the application performs and behaves, and a factual basis for deciding what to improve. Combined with these tools, OpenTelemetry becomes the foundation of an observability stack that helps you build and operate more reliable and efficient applications.