Observing Azure Workloads with Monitor and Log Analytics

Azure Monitor is the standard observability platform for Azure workloads, with Log Analytics workspaces as its log store and Kusto Query Language as its query language. It’s where you go to understand what’s happening in your cloud applications and infrastructure.

Kusto Query Language (KQL) is central to Azure Monitor. Its pipe-based syntax, in which a query flows as table | where | summarize | render, keeps log and telemetry queries concise. KQL is built for log analytics, with optimized support for full‑text search, time‑series summarization, joins across different telemetry sources, and even geospatial queries.
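To make the pipe-based shape concrete, here is a minimal sketch that assembles a KQL query as a string in Python. The helper build_kql is hypothetical and only joins stages with the pipe operator. The AppRequests table is the workspace-based Application Insights requests table, and the one-hour window and 5-minute bins are arbitrary choices for illustration.

```python
def build_kql(table, *stages):
    """Join a table name and its pipeline stages with the KQL pipe operator."""
    return "\n| ".join([table, *stages])


# Each stage consumes the rows produced by the stage before it.
query = build_kql(
    "AppRequests",
    "where TimeGenerated > ago(1h)",
    "summarize requests = count() by bin(TimeGenerated, 5m)",
    "render timechart",
)

print(query)
```

The resulting text can be pasted into the Logs blade of a workspace. Reading it top to bottom mirrors how the data flows: pick a table, filter rows, aggregate, then visualize.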

To use Azure observability effectively, you need to learn KQL. Workbooks, log-based alerts, and most log visualizations you build in the Azure portal are powered by KQL queries. Without understanding KQL, you're essentially flying blind.

For application telemetry, Application Insights is the go-to. It supports auto-instrumentation for applications written in .NET, Node.js, Python, and Java, although what is available depends on the hosting environment. It collects key metrics such as HTTP request rates, latency, and failures, along with dependency calls to databases or other services, exceptions, and custom events you define.

In practice, the biggest challenge with Application Insights is the volume of telemetry. In a microservices deployment with 50 services, a baseline of 100 requests per second produces about 8.6 million request events per day (100 × 86,400 seconds), before counting dependency calls, traces, and exceptions. If you enable all dependency traces, the ingestion cost can exceed $200 per month, and queries slow down as the data set grows. We usually limit the level of detail by sampling production traffic at 10% or 5%, while keeping full detail in a staging environment. Sampling is configured in the SDK and can be tuned per environment; the trade‑off is that rare error patterns may be sampled out and never appear in your queries.
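The volume arithmetic is easy to check with a few lines of Python. This is a rough sketch: it assumes the 100 requests per second is the total across all services and that each request yields exactly one telemetry item, which understates real volume once dependencies and traces are included. The function name and its parameters are illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def daily_events(requests_per_second, events_per_request=1.0, sampling_rate=1.0):
    """Estimate telemetry items ingested per day.

    events_per_request and sampling_rate are assumptions of this sketch:
    1.0 means one item per request and no sampling.
    """
    if not 0.0 < sampling_rate <= 1.0:
        raise ValueError("sampling_rate must be in (0, 1]")
    return requests_per_second * SECONDS_PER_DAY * events_per_request * sampling_rate


full = daily_events(100)
at_10_percent = daily_events(100, sampling_rate=0.10)
at_5_percent = daily_events(100, sampling_rate=0.05)

print(f"unsampled: {full:,.0f} items/day")
print(f"10% sampling: {at_10_percent:,.0f} items/day")
print(f"5% sampling: {at_5_percent:,.0f} items/day")
```

Under these assumptions the unsampled figure is 8,640,000 items per day, dropping to 864,000 at 10% and 432,000 at 5%. Raising events_per_request to reflect dependency calls scales all three numbers proportionally.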

Another pitfall is the default retention period. Log Analytics keeps data for 30 days by default, but many compliance regimes require 90 or 365 days. In our experience, extending retention to 90 days adds roughly 20% to the storage cost, and extending it to 365 days can double it. We mitigate this by creating separate workspaces for production and dev, and by holding older data in cheaper storage, either in the workspace's archive tier or in blob storage populated by data export rules. Data export works continuously on newly ingested data, so the export rule has to be in place from the start rather than applied once data has aged. In production, we keep the first 90 days in the primary workspace for quick queries and rely on the cold tier for everything older.
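The 90-day split can be expressed as a small age-based rule. The sketch below is only a model of the policy described above, not an Azure API: the tier names, the storage_tier function, and the 365-day total retention are assumptions chosen for illustration.

```python
from datetime import datetime, timedelta, timezone

HOT_RETENTION = timedelta(days=90)     # kept in the primary workspace
TOTAL_RETENTION = timedelta(days=365)  # hypothetical compliance requirement


def storage_tier(record_time, now):
    """Return where a record of the given age would live under this policy."""
    age = now - record_time
    if age < timedelta(0):
        raise ValueError("record_time is in the future")
    if age <= HOT_RETENTION:
        return "workspace"
    if age <= TOTAL_RETENTION:
        return "cold"
    return "expired"


now = datetime(2024, 6, 1, tzinfo=timezone.utc)
for days in (10, 90, 91, 400):
    print(days, storage_tier(now - timedelta(days=days), now))
```

Writing the policy down this way makes the boundary cases explicit, for example that a record exactly 90 days old is still queryable from the primary workspace.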

When you start wiring diagnostics from AKS, you’ll notice that the kube‑audit logs are the most verbose. A single node can generate 10 MB of audit logs per hour if you keep the default verbosity. We usually drop the 'watch' and 'list' verbs for non‑critical resources, and keep only 'create', 'update', and 'delete' for RBAC changes. This reduces the log volume by roughly 70% and keeps the ingestion cost manageable. On the other hand, if you need to debug a security incident, you can temporarily lift the filter and capture the full audit stream for a few hours.
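The filtering rule can be sketched as a predicate over audit events. This is a local illustration of the logic, not an AKS configuration: the verb and objectRef fields follow the Kubernetes audit event schema, while keep_audit_event and the CRITICAL_RESOURCES list are hypothetical names chosen for this example.

```python
RBAC_API_GROUP = "rbac.authorization.k8s.io"
RBAC_KEPT_VERBS = {"create", "update", "delete"}
NOISY_VERBS = {"watch", "list"}
CRITICAL_RESOURCES = {"secrets"}  # hypothetical; choose your own list


def keep_audit_event(event):
    """Return True if the audit event should be ingested under the filter."""
    ref = event.get("objectRef", {})
    verb = event.get("verb", "")
    # RBAC objects: keep only changes, not reads.
    if ref.get("apiGroup") == RBAC_API_GROUP:
        return verb in RBAC_KEPT_VERBS
    # Everything else: drop high-volume reads unless the resource is critical.
    if verb in NOISY_VERBS and ref.get("resource") not in CRITICAL_RESOURCES:
        return False
    return True


events = [
    {"verb": "list", "objectRef": {"resource": "pods"}},
    {"verb": "list", "objectRef": {"resource": "secrets"}},
    {"verb": "create", "objectRef": {"resource": "rolebindings", "apiGroup": RBAC_API_GROUP}},
    {"verb": "get", "objectRef": {"resource": "roles", "apiGroup": RBAC_API_GROUP}},
    {"verb": "delete", "objectRef": {"resource": "deployments", "apiGroup": "apps"}},
]
kept = [e for e in events if keep_audit_event(e)]
print([(e["verb"], e["objectRef"]["resource"]) for e in kept])
```

Lifting the filter during an incident then amounts to bypassing the predicate for a limited time, which keeps the normal policy in one place.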

The Application Map feature in Application Insights visualizes the connections between your services based on the telemetry it collects. Beyond that, availability tests, which are HTTP ping tests run from multiple Azure regions, allow you to monitor your application's endpoint availability from the outside.

Most Azure resources can emit diagnostic logs and metrics, and diagnostic settings let you send them directly to a Log Analytics workspace. For example, enabling these settings on AKS can capture kube‑audit and cluster‑autoscaler logs, while for Azure SQL, you can collect query performance and blocking information.
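As a rough illustration, a diagnostic setting boils down to a target workspace plus a list of enabled categories. The sketch below builds a request body in that shape. The diagnostic_setting helper and the placeholder workspace ID are assumptions for this example, and category names vary by resource type, so the ones used here should be checked against the categories your resources actually expose.

```python
import json

# Placeholder resource ID; substitute your own subscription, group, and workspace.
WORKSPACE_ID = (
    "/subscriptions/<subscription-id>/resourceGroups/<resource-group>"
    "/providers/Microsoft.OperationalInsights/workspaces/<workspace-name>"
)


def diagnostic_setting(workspace_id, log_categories, metric_categories=()):
    """Build a diagnostic settings body that routes categories to a workspace."""
    properties = {
        "workspaceId": workspace_id,
        "logs": [{"category": name, "enabled": True} for name in log_categories],
    }
    if metric_categories:
        properties["metrics"] = [
            {"category": name, "enabled": True} for name in metric_categories
        ]
    return {"properties": properties}


# AKS: audit and autoscaler logs. Azure SQL: query performance and blocking.
aks = diagnostic_setting(WORKSPACE_ID, ["kube-audit", "cluster-autoscaler"], ["AllMetrics"])
sql = diagnostic_setting(WORKSPACE_ID, ["QueryStoreRuntimeStatistics", "Blocks"])

print(json.dumps(aks, indent=2))
```

Generating these bodies from one function makes it easier to keep settings consistent across resources, whether they are then applied through templates, the CLI, or the REST API.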

Similarly, Azure Front Door can send access logs and health probe logs to Log Analytics. The baseline for a well‑configured Azure environment is to have diagnostic settings enabled on all critical resources and to configure data retention to meet any compliance needs.

Azure Monitor alerts can be configured to trigger based on metric thresholds, the results of log queries, or specific resource health events. Action groups define what happens when an alert fires, such as sending an email, invoking a webhook, calling a Logic App for automated remediation, or creating an incident in an ITSM tool.

While static metric alerts are straightforward, they tend to fire during normal workload fluctuations. Dynamic thresholds, which use machine learning to adapt alert thresholds based on historical patterns, are much better at avoiding false positives caused by expected traffic spikes.
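The difference can be illustrated with a toy comparison. The sketch below is a simplified stand-in and not the model Azure Monitor uses: it compares each time slot against the mean and standard deviation of the same slot on previous days. The sample data, the function names, and the sensitivity of 3 standard deviations are all assumptions made for illustration.

```python
from statistics import mean, stdev


def static_alerts(values, threshold):
    """Slots where the value exceeds a fixed threshold."""
    return [slot for slot, value in enumerate(values) if value > threshold]


def seasonal_alerts(history, today, sensitivity=3.0):
    """Slots where today's value exceeds mean + sensitivity * stdev of the
    same slot on previous days. A simplified stand-in for dynamic thresholds."""
    alerts = []
    for slot, value in enumerate(today):
        past = [day[slot] for day in history]
        limit = mean(past) + sensitivity * stdev(past)
        if value > limit:
            alerts.append(slot)
    return alerts


# Requests per slot over four previous days; slot 2 is the expected daily peak.
history = [
    [100, 400, 900, 300],
    [110, 420, 950, 310],
    [90, 380, 880, 290],
    [105, 410, 920, 305],
]
today = [100, 405, 930, 700]  # slot 3 is unusually high for that time of day

print("static:", static_alerts(today, threshold=800))
print("seasonal:", seasonal_alerts(history, today))
```

With this data the fixed threshold of 800 fires on the routine peak in slot 2 and misses the unusual value in slot 3, while the per-slot baseline does the opposite. That is the behavior the paragraph above describes, in miniature.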