Running AKS in Production

Running AKS in production is different from running a sample application. It's the operational patterns that matter most, such as upgrade cadence, node pool design, monitoring, and cost management.

I've found that good node pool architecture is key to a well-run AKS cluster. A node pool is a group of nodes that share a VM size, OS configuration, and autoscaling parameters, so separating workload types into different node pools lets each one be sized and scaled on its own terms.

In a production environment, I use a system node pool for Kubernetes system components, separate user node pools for workloads, and potentially a spot node pool for batch workloads that can tolerate interruptions. This separation enables independent scaling and a VM SKU optimised for each workload type.
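As a rough illustration of that layout, the sketch below builds `az aks nodepool add` command lines for a user pool and a spot pool. The pool names, VM sizes, counts, and the `rg-demo` / `aks-demo` identifiers are hypothetical placeholders, not values from my clusters, and the helper only assembles the arguments rather than running them.

```python
import shlex
from dataclasses import dataclass, field


@dataclass
class NodePool:
    """Hypothetical description of one AKS node pool."""
    name: str                 # lowercase alphanumeric, short (AKS limits the length)
    vm_size: str
    mode: str = "User"        # "System" or "User"
    spot: bool = False
    min_count: int = 1
    max_count: int = 3
    taints: list[str] = field(default_factory=list)


def nodepool_add_command(resource_group: str, cluster: str, pool: NodePool) -> list[str]:
    """Build the argument list for `az aks nodepool add` with autoscaling enabled."""
    cmd = [
        "az", "aks", "nodepool", "add",
        "--resource-group", resource_group,
        "--cluster-name", cluster,
        "--name", pool.name,
        "--node-vm-size", pool.vm_size,
        "--mode", pool.mode,
        "--enable-cluster-autoscaler",
        "--min-count", str(pool.min_count),
        "--max-count", str(pool.max_count),
    ]
    if pool.spot:
        # A max price of -1 means "pay up to the on-demand price".
        cmd += ["--priority", "Spot", "--eviction-policy", "Delete", "--spot-max-price", "-1"]
    if pool.taints:
        cmd += ["--node-taints", ",".join(pool.taints)]
    return cmd


if __name__ == "__main__":
    pools = [
        NodePool("apps", "Standard_D4s_v5", min_count=2, max_count=6),
        NodePool("batchspot", "Standard_D8s_v5", spot=True, min_count=0, max_count=10,
                 taints=["workload=batch:NoSchedule"]),
    ]
    for pool in pools:
        print(shlex.join(nodepool_add_command("rg-demo", "aks-demo", pool)))
```

Keeping the definitions as data makes it easy to review pool changes before anything is applied; the same definitions could equally feed Bicep or Terraform instead of the CLI.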

For example, I had a cluster with five node pools, each with a different VM size, and I defined and managed them with the Azure CLI. With Azure CNI in its traditional mode, every pod takes an IP address from the subnet, so I had to ensure that the subnet was large enough for the number of pods I planned to run, which was around 2000 pods per node pool.
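To make the subnet sizing concrete, here is a minimal sketch of the arithmetic. It assumes traditional Azure CNI (pods draw IPs from the node subnet), a maximum of 30 pods per node, one extra node of headroom for upgrade surge, and the five addresses Azure reserves in every subnet. The function names are my own, and the numbers should be checked against the current Azure CNI planning guidance before being relied on.

```python
import math


def required_ips(pods: int, max_pods_per_node: int = 30) -> int:
    """IPs needed for a pool: one per node plus one per potential pod.

    One extra node is added as headroom for the surge node created
    during upgrades (an assumption of this sketch).
    """
    nodes = math.ceil(pods / max_pods_per_node)
    return (nodes + 1) + (nodes + 1) * max_pods_per_node


def smallest_prefix(ip_count: int, reserved: int = 5) -> int:
    """Smallest IPv4 prefix length whose subnet holds ip_count plus reserved addresses."""
    needed = ip_count + reserved
    host_bits = (needed - 1).bit_length()  # ceil(log2(needed)) without floats
    return 32 - host_bits


if __name__ == "__main__":
    ips = required_ips(2000)
    print(f"IPs needed: {ips}, subnet prefix: /{smallest_prefix(ips)}")
```

System pods such as daemonsets also occupy pod slots on every node, so the real node count for 2000 application pods will be somewhat higher than this estimate.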

Additionally, when designing node pools, it's essential to weigh the number of node pools against management complexity: every extra pool is another thing to upgrade, scale, and monitor, so too many node pools increase overhead. Using Azure Policy, I was able to enforce configuration standards across node pools, such as required labels and taints.

AKS supports the three most recent Kubernetes minor versions. With a new minor version arriving roughly every four months, that means clusters must be upgraded at least once every 12 months to stay in support. Although AKS offers automatic upgrade channels, most production environments want manual control over upgrade timing.
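The support window is easy to check mechanically. The sketch below assumes you already know the latest generally available minor version in your region (for example from `az aks get-versions`); the function names and version strings are illustrative only.

```python
def parse_minor(version: str) -> tuple[int, int]:
    """Return (major, minor) from a version string such as '1.29.4'."""
    major, minor, *_ = version.split(".")
    return int(major), int(minor)


def is_supported(cluster_version: str, latest_ga_version: str, window: int = 3) -> bool:
    """True if the cluster's minor version is among the `window` most recent minors."""
    c_major, c_minor = parse_minor(cluster_version)
    l_major, l_minor = parse_minor(latest_ga_version)
    if c_major != l_major:
        # This sketch does not handle a major version change.
        return False
    return 0 <= l_minor - c_minor < window


if __name__ == "__main__":
    for version in ("1.30.1", "1.28.5", "1.27.9"):
        print(version, is_supported(version, latest_ga_version="1.30.1"))
```

Running a check like this on a schedule gives early warning before a cluster drops out of the supported window.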

My upgrade strategy is to upgrade the control plane first, then the node pools, and to test with non-production workloads before upgrading production. I've also found that a canary release approach, where a small percentage of traffic is routed to the upgraded cluster, helps to identify and mitigate potential issues, such as incompatibilities with custom components or network policies.
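The ordering can be captured as a small plan builder. This is a sketch that only assembles Azure CLI argument lists (control plane first, then one command per node pool); the resource group, cluster, version, and pool names are hypothetical, and a real pipeline would add health checks between steps.

```python
import shlex


def upgrade_plan(resource_group: str, cluster: str, version: str,
                 node_pools: list[str]) -> list[list[str]]:
    """Return the upgrade commands in order: control plane, then each node pool."""
    plan = [[
        "az", "aks", "upgrade",
        "--resource-group", resource_group,
        "--name", cluster,
        "--kubernetes-version", version,
        "--control-plane-only",   # leave node pools on the old version for now
        "--yes",
    ]]
    for pool in node_pools:
        plan.append([
            "az", "aks", "nodepool", "upgrade",
            "--resource-group", resource_group,
            "--cluster-name", cluster,
            "--name", pool,
            "--kubernetes-version", version,
        ])
    return plan


if __name__ == "__main__":
    # Hypothetical names; list the least critical pools first.
    for step in upgrade_plan("rg-demo", "aks-demo", "1.30.1", ["batchspot", "apps", "system"]):
        print(shlex.join(step))
```

Running the same plan against a non-production cluster first, with the same pool order, keeps the rehearsal representative of the production upgrade.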

Monitoring the cluster's performance during the upgrade is crucial. Using Grafana and Prometheus, I tracked key metrics such as node CPU usage, pod latency, and network throughput, and I set alerts for unexpected changes so that problems surfaced quickly.

For monitoring, I use Container insights (formerly Azure Monitor for containers), which collects CPU, memory, storage, and network metrics from AKS nodes and pods, aggregated by cluster, node, namespace, and container. The Prometheus metrics integration scrapes pod metrics exposed on /metrics endpoints.
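For readers who have not looked at what a /metrics endpoint actually returns, the sketch below renders one metric in the Prometheus text exposition format. The metric name, labels, and value are made up, and label values are not escaped here; in a real application a Prometheus client library would normally produce this output.

```python
def render_metric(name: str, help_text: str, metric_type: str,
                  samples: list[tuple[dict[str, str], float]]) -> str:
    """Render one metric family in the Prometheus text exposition format."""
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} {metric_type}"]
    for labels, value in samples:
        # Sort labels so the output is stable between scrapes.
        label_str = ",".join(f'{key}="{val}"' for key, val in sorted(labels.items()))
        series = f"{name}{{{label_str}}}" if label_str else name
        lines.append(f"{series} {value}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_metric(
        "http_requests_total",
        "Total HTTP requests.",
        "counter",
        [({"method": "GET", "code": "200"}, 1027),
         ({"method": "POST", "code": "500"}, 3)],
    ), end="")
```

Each line after the HELP and TYPE comments is one time series, which is what the scraper stores on every collection interval.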

Container Insights alerts can notify when node CPU or memory exceeds thresholds that indicate the need to scale or right-size node pools. The integration with Log Analytics provides correlation between cluster metrics and application logs, which helps with troubleshooting. I've found that using a combination of metrics and logs helps to identify issues such as resource bottlenecks, network connectivity problems, and application errors.

When it comes to cost management for AKS, there are three main components to consider: compute, storage, and networking. Optimisation levers include right-sizing node pools, cluster autoscaler for elastic compute, spot instance node pools for fault-tolerant batch workloads, and reserved instances for the baseline node count that runs 24/7.

Using reserved instances, I was able to save around 30% on compute costs, and right-sizing node pools reduced waste and improved resource utilisation. I used the Azure Pricing Calculator to estimate costs and plan for future growth, which helped to keep the cluster within budget.
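A back-of-the-envelope model shows how the baseline and burst portions combine. The sketch assumes a 30% reservation discount (the figure I saw), 730 hours per month, and a made-up hourly rate; actual discounts depend on the VM family, term, and region.

```python
HOURS_PER_MONTH = 730  # common approximation for monthly billing estimates


def monthly_compute_cost(baseline_nodes: int, avg_burst_nodes: float,
                         hourly_rate: float, reserved_discount: float = 0.30) -> float:
    """Estimate monthly compute cost with the baseline on reserved pricing.

    baseline_nodes run 24/7 at the discounted rate; avg_burst_nodes is the
    average number of extra autoscaled nodes billed at the on-demand rate.
    """
    reserved = baseline_nodes * hourly_rate * (1 - reserved_discount) * HOURS_PER_MONTH
    on_demand = avg_burst_nodes * hourly_rate * HOURS_PER_MONTH
    return reserved + on_demand


if __name__ == "__main__":
    rate = 0.20  # hypothetical on-demand price per node-hour
    with_reservation = monthly_compute_cost(10, 2.5, rate)
    without = monthly_compute_cost(10, 2.5, rate, reserved_discount=0.0)
    print(f"with reservation: {with_reservation:.2f}, without: {without:.2f}")
```

Only the nodes that genuinely run around the clock belong in the baseline; reserving capacity that the autoscaler would otherwise remove erodes the saving.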

In terms of networking costs, I've found that using Azure Load Balancer and Azure Application Gateway helps to reduce egress costs. By optimising network traffic flow, I was able to reduce costs by around 20%. I also used Azure Network Watcher to monitor and troubleshoot network issues, which helped to keep the cluster running smoothly and efficiently.