Running AKS in production is different from running a sample application. What matters most are the operational patterns: upgrade cadence, node pool design, monitoring, and cost management.

I've found that a good node pool architecture is key to a well-run AKS cluster. A node pool defines the VM size, OS configuration, and autoscaling parameters for a group of nodes, so separating workload types into different node pools lets each type get the settings it needs.

In a production environment, I use a system node pool for Kubernetes system components, separate user node pools for application workloads, and potentially a spot node pool for batch workloads that can tolerate interruptions. This layout enables independent scaling and different VM SKUs optimised for each workload type.
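
To make that layout concrete, here is a minimal sketch that turns a pool description into an `az aks nodepool add` argument list. The `NodePool` class, the pool names, and the VM sizes are hypothetical placeholders of mine, not values from a real cluster; the CLI flags are the standard ones, but check them against your CLI version before running anything.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class NodePool:
    """One AKS node pool; every value used below is an illustrative placeholder."""
    name: str
    vm_size: str
    mode: str = "User"          # "System" or "User"
    min_count: int = 1
    max_count: int = 3
    spot: bool = False
    taints: List[str] = field(default_factory=list)


def nodepool_add_command(resource_group: str, cluster: str, pool: NodePool) -> List[str]:
    """Build the `az aks nodepool add` argument list for one pool."""
    cmd = [
        "az", "aks", "nodepool", "add",
        "--resource-group", resource_group,
        "--cluster-name", cluster,
        "--name", pool.name,
        "--mode", pool.mode,
        "--node-vm-size", pool.vm_size,
        "--enable-cluster-autoscaler",
        "--min-count", str(pool.min_count),
        "--max-count", str(pool.max_count),
    ]
    if pool.spot:
        # Spot pools can be evicted at any time; -1 caps the price at the on-demand rate.
        cmd += ["--priority", "Spot", "--eviction-policy", "Delete", "--spot-max-price", "-1"]
    if pool.taints:
        cmd += ["--node-taints", ",".join(pool.taints)]
    return cmd


if __name__ == "__main__":
    pools = [
        NodePool("system", "Standard_D4s_v5", mode="System",
                 taints=["CriticalAddonsOnly=true:NoSchedule"]),
        NodePool("apps", "Standard_D8s_v5", max_count=10),
        NodePool("batchspot", "Standard_D8s_v5", min_count=0, max_count=20, spot=True),
    ]
    for pool in pools:
        print(" ".join(nodepool_add_command("rg-demo", "aks-demo", pool)))
```

Keeping the pool definitions as data makes it easy to review the whole layout in one place before any command is run.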

For example, I had a cluster with 5 node pools, each with a different VM size, which I defined and managed with the Azure CLI. Because the cluster used Azure CNI for networking, where in the traditional model every pod takes an IP address from the subnet, I had to ensure that the subnet was large enough for the number of pods I planned to run, which was around 2000 pods per node pool.
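
The subnet arithmetic is easy to get wrong, so here is a rough sizing sketch. It assumes traditional Azure CNI (not Overlay or dynamic IP allocation), where each node reserves one IP for itself plus one per possible pod, one extra node appears during upgrades, and Azure keeps five addresses per subnet. The 30 pods per node is my assumption (the usual Azure CNI default), not a figure from the cluster above.

```python
AZURE_RESERVED_IPS = 5  # Azure reserves five addresses in every subnet


def nodes_for_pods(pods: int, max_pods_per_node: int) -> int:
    """Nodes needed to host `pods` pods at the configured per-node maximum."""
    return -(-pods // max_pods_per_node)  # ceiling division


def required_ips(nodes: int, max_pods_per_node: int, surge_nodes: int = 1) -> int:
    """IPs consumed: one per node plus one per pod slot, including surge nodes."""
    total_nodes = nodes + surge_nodes
    return total_nodes * (1 + max_pods_per_node)


def smallest_prefix(ip_count: int) -> int:
    """Longest IPv4 prefix (smallest subnet) that still fits `ip_count` usable IPs."""
    needed = ip_count + AZURE_RESERVED_IPS
    host_bits = (needed - 1).bit_length()  # smallest n with 2**n >= needed
    return 32 - host_bits


if __name__ == "__main__":
    nodes = nodes_for_pods(2000, 30)   # one pool of about 2000 pods
    ips = required_ips(nodes, 30)
    print(f"{nodes} nodes, {ips} IPs, subnet /{smallest_prefix(ips)}")
```

Multiply this per pool if several pools share a subnet; pre-allocated pod IPs add up faster than the node count suggests.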

When designing node pools, it's also essential to weigh the number of pools against the complexity of managing them, as too many node pools can lead to increased overhead and decreased scalability. Using Azure Policy, I was able to enforce configuration standards across node pools, such as required labels and taints.
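
As an illustration of the kind of standard I mean, here is a small local check for labels and taints. To be clear, this is not Azure Policy and not how Azure Policy is written; it is a plain Python sketch of the rule itself, and the required label names are hypothetical.

```python
from typing import Dict, List

# Hypothetical standards for illustration; substitute your own.
REQUIRED_LABELS = {"team", "environment"}
SYSTEM_TAINT = "CriticalAddonsOnly=true:NoSchedule"


def pool_violations(name: str, mode: str, labels: Dict[str, str], taints: List[str]) -> List[str]:
    """Return human-readable violations for one node pool (empty list if compliant)."""
    problems = []
    # Every pool must carry all required labels.
    for label in sorted(REQUIRED_LABELS - set(labels)):
        problems.append(f"{name}: missing required label '{label}'")
    # System pools should repel ordinary application pods.
    if mode == "System" and SYSTEM_TAINT not in taints:
        problems.append(f"{name}: system pool lacks taint '{SYSTEM_TAINT}'")
    return problems


if __name__ == "__main__":
    for line in pool_violations("system", "System", {"team": "platform"}, []):
        print(line)
```

A check like this is handy in a pipeline before the pool is created, while the policy does the enforcement on the Azure side.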

AKS supports the three most recent Kubernetes minor versions, which means clusters must be upgraded at least once every 12 months to stay in support. Automatic cluster upgrade is available, but most production environments want manual control over upgrade timing.
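
The support window is simple enough to check in code. This sketch assumes you already know the latest generally available version in your region (for example from `az aks get-versions`); the version strings below are made up for illustration.

```python
from typing import List, Tuple


def parse_minor(version: str) -> Tuple[int, int]:
    """Extract (major, minor) from a version string such as '1.29.4'."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def supported_minors(latest_ga: str, window: int = 3) -> List[Tuple[int, int]]:
    """The `window` most recent minor versions, newest first."""
    major, minor = parse_minor(latest_ga)
    return [(major, minor - i) for i in range(window)]


def is_supported(cluster_version: str, latest_ga: str) -> bool:
    """True if the cluster's minor version is inside the support window."""
    return parse_minor(cluster_version) in supported_minors(latest_ga)


def minors_behind(cluster_version: str, latest_ga: str) -> int:
    """How many minor versions the cluster trails the latest (same major assumed)."""
    return parse_minor(latest_ga)[1] - parse_minor(cluster_version)[1]


if __name__ == "__main__":
    latest = "1.30.3"  # hypothetical latest GA version
    for current in ("1.29.4", "1.27.7"):
        print(current, is_supported(current, latest), minors_behind(current, latest))
```

Running a check like this on a schedule gives early warning before a cluster drops out of the window.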

My upgrade strategy is to upgrade the control plane first, then the node pools, and to test with non-production workloads before upgrading production, which helps ensure a smooth transition. I've also found that a canary release approach, where a small percentage of traffic is routed to the upgraded cluster, helps to identify and mitigate potential issues such as incompatibilities with custom components or network policies.
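
The ordering can be sketched as a list of Azure CLI invocations: one control-plane-only upgrade, followed by one upgrade per node pool. The resource group, cluster, and pool names are hypothetical, and the sketch only builds the commands; it does not run them or wait for each step to finish.

```python
from typing import List


def upgrade_commands(resource_group: str, cluster: str, version: str,
                     node_pools: List[str]) -> List[List[str]]:
    """Return az commands in the order they should run: control plane, then pools."""
    control_plane = [
        "az", "aks", "upgrade",
        "--resource-group", resource_group,
        "--name", cluster,
        "--kubernetes-version", version,
        "--control-plane-only",
        "--yes",
    ]
    pools = [
        [
            "az", "aks", "nodepool", "upgrade",
            "--resource-group", resource_group,
            "--cluster-name", cluster,
            "--name", pool,
            "--kubernetes-version", version,
        ]
        for pool in node_pools
    ]
    return [control_plane] + pools


if __name__ == "__main__":
    # Non-production first, production last (names are placeholders).
    for cluster in ("aks-staging", "aks-prod"):
        for cmd in upgrade_commands("rg-demo", cluster, "1.30.3", ["system", "apps"]):
            print(" ".join(cmd))
```

In practice I would put a validation gate between the staging and production runs rather than looping straight through.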

Monitoring the cluster's performance during the upgrade is just as crucial. Using Prometheus and Grafana, I monitored key metrics such as node CPU usage, pod latency, and network throughput, and set alerts for any unexpected changes, which helped me quickly identify and address potential issues.

For monitoring, I use Azure Monitor for containers (Container insights), which collects CPU, memory, storage, and network metrics from AKS nodes and pods, aggregated by cluster, node, namespace, and container. Its Prometheus metrics integration scrapes pod metrics exposed on /metrics endpoints.

Container Insights alerts can notify me when node CPU or memory exceeds thresholds that indicate the need to scale or right-size node pools. The integration with Log Analytics correlates cluster metrics with application logs, which helps with troubleshooting. I've found that combining metrics and logs helps to identify issues such as resource bottlenecks, network connectivity problems, and application errors.
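
The threshold logic behind such an alert is straightforward. This sketch is not the Container Insights API; it only shows the comparison an alert rule performs, using made-up samples and limits (80% CPU, 85% memory) that I picked for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class NodeSample:
    """One utilisation reading for a node, as percentages of capacity."""
    node: str
    cpu_percent: float
    memory_percent: float


def breaches(samples: List[NodeSample], cpu_limit: float = 80.0,
             memory_limit: float = 85.0) -> Dict[str, List[str]]:
    """Map each node over a limit to the list of resources it exceeded."""
    result: Dict[str, List[str]] = {}
    for sample in samples:
        exceeded = []
        if sample.cpu_percent > cpu_limit:
            exceeded.append("cpu")
        if sample.memory_percent > memory_limit:
            exceeded.append("memory")
        if exceeded:
            result[sample.node] = exceeded
    return result


if __name__ == "__main__":
    readings = [
        NodeSample("node-a", 91.0, 60.0),
        NodeSample("node-b", 40.0, 45.0),
        NodeSample("node-c", 88.0, 93.0),
    ]
    print(breaches(readings))
```

Sustained breaches across most nodes in a pool suggest scaling out, while a single hot node usually points at scheduling or a noisy workload.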

When it comes to cost management for AKS, there are three main components to consider: compute, storage, and networking. Optimisation levers include right-sizing node pools, the cluster autoscaler for elastic compute, spot node pools for fault-tolerant batch workloads, and reserved instances for the baseline node count that runs 24/7.

Using reserved instances, I was able to save around 30% on compute costs, and by right-sizing node pools, I was able to reduce waste and optimise resource utilisation. Using the Azure Pricing Calculator, I was able to estimate costs and plan for future growth, which helped to ensure that the cluster was running within budget.
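
A back-of-the-envelope model shows where the saving comes from: the always-on baseline is billed at the reserved rate, and only the autoscaled burst hours pay full price. The 30% discount mirrors the figure above; the hourly rate, node count, and burst hours are hypothetical, and real reservation pricing varies by VM size, term, and region.

```python
HOURS_PER_MONTH = 730  # common approximation for monthly billing


def monthly_compute_cost(baseline_nodes: int, burst_node_hours: float,
                         hourly_rate: float, reserved_discount: float = 0.30) -> float:
    """Estimated monthly compute cost for a reserved baseline plus on-demand burst."""
    # Baseline nodes run all month at the discounted reserved rate.
    reserved = baseline_nodes * HOURS_PER_MONTH * hourly_rate * (1 - reserved_discount)
    # Burst capacity added by the autoscaler is billed at the full rate.
    on_demand = burst_node_hours * hourly_rate
    return reserved + on_demand


if __name__ == "__main__":
    with_reservation = monthly_compute_cost(6, 500, 0.20)
    without_reservation = monthly_compute_cost(6, 500, 0.20, reserved_discount=0.0)
    print(f"with reservation:    {with_reservation:.2f}")
    print(f"without reservation: {without_reservation:.2f}")
```

The model also makes the right-sizing point visible: a reservation only pays off for nodes that really do run all month.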

In terms of networking costs, I've found that using Azure Load Balancer and Azure Application Gateway helps to reduce egress costs, and by optimising network traffic flow, I was able to reduce costs by around 20%. Using Azure Network Watcher, I was able to monitor and troubleshoot network issues, which helped to ensure that the cluster was running smoothly and efficiently.