Running AKS in production is different from running a sample application. The operational patterns that matter: node pool design, upgrade cadence, monitoring, and cost management.

Node pool architecture

AKS node pools define the VM size, OS configuration, and autoscaling parameters for a group of nodes. The production pattern: a system node pool for Kubernetes system components (CoreDNS, metrics-server), separate user node pools for workloads, and potentially a spot instance node pool for batch workloads that can tolerate interruptions. Separating workload types into different node pools enables independent scaling and different VM SKUs optimised for each workload type.
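A sketch of that three-pool layout with the Azure CLI. The cluster name, resource group, pool names, and VM sizes below are placeholders, not values from this article; the flags are standard `az aks nodepool` options.

```shell
# Keep workloads off the system pool: taint it so only critical system
# pods (CoreDNS, metrics-server) tolerate scheduling there.
az aks nodepool update \
  --cluster-name prodcluster --resource-group prod-rg \
  --name systempool \
  --node-taints CriticalAddonsOnly=true:NoSchedule

# General-purpose user pool with the cluster autoscaler enabled.
az aks nodepool add \
  --cluster-name prodcluster --resource-group prod-rg \
  --name userpool --mode User \
  --node-vm-size Standard_D4s_v5 \
  --enable-cluster-autoscaler --min-count 3 --max-count 10

# Spot pool for interruptible batch work; AKS taints spot nodes
# automatically, so batch pods need a matching toleration.
az aks nodepool add \
  --cluster-name prodcluster --resource-group prod-rg \
  --name spotpool --mode User \
  --priority Spot --eviction-policy Delete --spot-max-price -1 \
  --enable-cluster-autoscaler --min-count 0 --max-count 20 \
  --node-vm-size Standard_D4s_v5
```

`--spot-max-price -1` means "pay up to the current on-demand price", which avoids evictions triggered purely by price movement.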

Upgrade cadence

AKS supports the three most recent Kubernetes minor versions; when a new minor version is released, the oldest falls out of support. Since upstream Kubernetes ships roughly three minor versions a year, the practical implication is that clusters must be upgraded about once every 12 months to stay in support. Automatic upgrade channels can automate this, but most production environments want manual control over upgrade timing. The upgrade strategy: upgrade the control plane first, then node pools. Test with non-production workloads before upgrading production.
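The control-plane-first sequence maps directly onto the CLI. Names and the target version below are placeholders; `az aks get-upgrades` shows the versions actually available to a given cluster.

```shell
# See which Kubernetes versions this cluster can upgrade to.
az aks get-upgrades --name prodcluster --resource-group prod-rg --output table

# Step 1: upgrade only the control plane.
az aks upgrade --name prodcluster --resource-group prod-rg \
  --kubernetes-version 1.29.7 --control-plane-only

# Step 2: upgrade each node pool to match, one pool at a time.
az aks nodepool upgrade --cluster-name prodcluster --resource-group prod-rg \
  --name userpool --kubernetes-version 1.29.7
```

Upgrading pools individually lets you verify workloads on the new version pool by pool instead of rolling the whole cluster at once.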

Azure Monitor for containers

Container Insights in Azure Monitor collects CPU, memory, storage, and network metrics from AKS nodes and pods, aggregated by cluster, node, namespace, and container. The Prometheus metrics integration scrapes pod metrics exposed on /metrics endpoints. Container Insights alerts can notify when node CPU or memory exceeds thresholds that indicate the need to scale or right-size node pools. The integration with Log Analytics provides correlation between cluster metrics and application logs.
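Enabling the addon and querying the collected data can both be done from the CLI. The workspace resource ID and workspace GUID below are placeholders; the `Perf` table and its `K8SNode` / `cpuUsageNanoCores` fields are the schema Container Insights writes to Log Analytics.

```shell
# Enable Container Insights on an existing cluster, sending data to a
# Log Analytics workspace (resource ID is a placeholder).
az aks enable-addons --name prodcluster --resource-group prod-rg \
  --addons monitoring \
  --workspace-resource-id "/subscriptions/<sub-id>/resourceGroups/prod-rg/providers/Microsoft.OperationalInsights/workspaces/prod-logs"

# Example KQL query: average node CPU over the last hour, per node.
az monitor log-analytics query \
  --workspace "<workspace-guid>" \
  --analytics-query "Perf | where ObjectName == 'K8SNode' and CounterName == 'cpuUsageNanoCores' | summarize avg(CounterValue) by Computer" \
  --timespan PT1H
```

The same KQL works in alert rules, which is how the threshold-based node CPU and memory alerts mentioned above are typically defined.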

Cost management for AKS

AKS cost has three components: compute (node VM SKUs and count), storage (persistent volumes), and networking (load balancers, egress). Optimisation levers: right-sizing node pools (monitoring actual vs requested CPU and memory), cluster autoscaler for elastic compute (scale down during off-peak), spot instance node pools for fault-tolerant batch workloads (60-80% discount), and reserved instances for the baseline node count that runs 24/7.
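The spot and baseline levers come down to simple arithmetic. A back-of-envelope sketch, using assumed prices (the per-node rate and discount here are illustrative, not Azure quotes):

```shell
# Back-of-envelope node cost math; all prices are assumptions.
hours_per_month=730
node_hourly_cents=20          # assumed on-demand price per node
baseline_nodes=3              # runs 24/7 -> candidate for reserved instances
spot_discount_pct=70          # within the 60-80% range for spot pools

# Monthly cost of the always-on baseline at on-demand rates.
on_demand_monthly=$(( baseline_nodes * node_hourly_cents * hours_per_month ))

# Effective hourly rate of a spot node at the assumed discount.
spot_hourly=$(( node_hourly_cents * (100 - spot_discount_pct) / 100 ))

echo "baseline on-demand: ${on_demand_monthly} cents/month"
echo "spot node: ${spot_hourly} cents/hour vs ${node_hourly_cents} on-demand"
```

The baseline figure is what a reserved-instance commitment should be sized against; anything above it belongs to the autoscaler or the spot pool.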