AKS in Production

I've seen AKS mature considerably since its preview in 2017, and by 2021 it's a production-ready Kubernetes service with a well-understood set of operational patterns. The rough edges of early AKS have been smoothed out, and the remaining challenges are Kubernetes challenges, not AKS-specific.

Node pool architecture in AKS allows for different VM SKUs for different workload types within a cluster. A common pattern is to have a system node pool with smaller VMs for Kubernetes system components, and one or more user node pools for application workloads. For example, CPU-intensive ML workloads can go on GPU node pools, while spot instance node pools can reduce cost significantly for batch or fault-tolerant workloads.

In one production cluster I've worked with, we saw a 30% reduction in costs by using spot instances for batch workloads, and a 25% increase in throughput for ML workloads on GPU node pools. This was achieved by carefully monitoring resource utilization and adjusting node pool sizes accordingly, using tools like Prometheus and Grafana to track performance metrics.

In 2021, managed identities were the auth model of choice for AKS, giving pods access to Azure resources without embedding credentials. This was achieved through pod identity, which has since been superseded by Workload Identity in 2022. AKS workload identity allows pods to authenticate as Azure managed identities, getting scoped access to Azure Key Vault, Storage, SQL, and other services, eliminating the need for client secrets in environment variables or Kubernetes secrets.

Private cluster networking is a must for production AKS clusters. This means the Kubernetes API server has no public IP endpoint, and API access requires network adjacency, such as VPN, ExpressRoute, or Azure Bastion. By combining this with Azure Container Registry with private endpoint and an internal load balancer for ingress, a properly configured production AKS cluster can have no publicly accessible surfaces except those explicitly required. For example, we used Azure Firewall to restrict incoming traffic to the cluster, and Azure Monitor to track network security group rules and detect potential security threats.

A trade-off to consider when implementing private cluster networking is the added complexity of managing network policies and security groups. In one case, we spent several hours debugging a connectivity issue that turned out to be caused by a misconfigured network security group rule. However, the benefits of private cluster networking far outweigh the costs, and tools like Azure Network Watcher can help simplify network troubleshooting.

AKS supports in-place cluster upgrades and node pool upgrades, which is crucial for maintaining a production-ready service. The operational pattern that works at scale is to run node pools on N-1 of the latest Kubernetes version, upgrade the system node pool first, validate, and then upgrade user node pools. This approach ensures a smooth upgrade process. We've also found that using tools like kubectl and kubeadm can help automate the upgrade process and reduce downtime.

To ensure rolling upgrades do not violate minimum availability, PodDisruptionBudgets can be used. Additionally, automating upgrade validation with a canary environment that tracks the current version and fails if post-upgrade smoke tests do not pass can provide an extra layer of assurance. For instance, we used a canary environment with a subset of pods to test upgrades before rolling them out to the entire cluster, using tools like Istio and Linkerd to manage traffic and detect potential issues.

I've found that a well-configured production AKS cluster can be highly secure and efficient. By using private cluster networking, managed identities, and node pool architecture, you can create a robust and scalable Kubernetes service that meets the needs of your organization. With careful planning and monitoring, production AKS clusters can achieve uptime of 99.99% or higher, and support thousands of concurrent users.

The key to a successful AKS deployment is to understand the operational patterns that work at scale. By following these patterns and using the right tools and features, you can create a production-ready Kubernetes service that is both secure and efficient.