Kubernetes Networking Choices and Pitfalls

Kubernetes networking hides complexity by design. Any CNI plugin can provide pod networking, but choosing the right one defines your cluster's security posture and resource usage.

For instance, I recall a production cluster where we used Azure CNI. It assigns each pod an IP directly from the Azure VNet, which keeps routing fast, but we burned through IP space faster than anticipated. We had to plan allocation meticulously and eventually adopted a multi-tier IP allocation strategy so we would not run out. Kubenet, in contrast, gives pods addresses from a separate range and NATs their traffic behind the node IP. That saves VNet address blocks, but pods are no longer directly routable from the VNet. The trade-off between IP space and routing efficiency was a constant consideration.

Other plugins shift the focus toward policy: Calico adds policy enforcement to networking, and Cilium replaces iptables with eBPF for faster policy checks.
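
The IP pressure is easy to underestimate, so here is a rough sizing sketch. It assumes the traditional Azure CNI model in which every node reserves one address per pod slot plus one for itself, and it adds one spare node for rolling upgrades. The function names, the surge default, and the example numbers are illustrative assumptions, not values from the cluster described above.

```python
# Azure reserves 5 addresses in every subnet.
AZURE_RESERVED_PER_SUBNET = 5


def required_ips(nodes: int, max_pods_per_node: int, surge_nodes: int = 1) -> int:
    """Addresses consumed when each node pre-allocates one IP per pod slot."""
    total_nodes = nodes + surge_nodes
    # One IP for the node itself plus one for every pod it may run.
    return total_nodes * (max_pods_per_node + 1)


def smallest_prefix(ip_count: int) -> int:
    """Longest IPv4 prefix (smallest subnet) that still fits ip_count addresses."""
    needed = ip_count + AZURE_RESERVED_PER_SUBNET
    # Number of host bits required so that 2**host_bits >= needed.
    host_bits = (needed - 1).bit_length()
    return 32 - host_bits


if __name__ == "__main__":
    ips = required_ips(nodes=50, max_pods_per_node=30)
    print(f"{ips} addresses, subnet of at least /{smallest_prefix(ips)}")
```

With 50 nodes and 30 pods per node this already calls for a /21, which is why the subnet size is worth settling before the cluster grows.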

Microsegmentation starts with a default-deny rule: no network access unless you grant it explicitly. At a major financial services company, we began there and then created allow policies for each permitted communication path between namespaces and pods. It was a painstaking process, and I recall we had over 2000 policies to manage across 50 namespaces, but it significantly improved the cluster's security posture.
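
As a minimal sketch of that pattern, the snippet below builds a default-deny NetworkPolicy and one explicit allow rule as plain dictionaries and prints them as JSON, which kubectl apply -f accepts. The namespaces, labels, policy names, and port are hypothetical. The namespace selector relies on the kubernetes.io/metadata.name label that recent Kubernetes versions set on every namespace.

```python
import json


def default_deny(namespace: str) -> dict:
    """Deny all ingress and egress for every pod in the namespace."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "default-deny-all", "namespace": namespace},
        "spec": {
            "podSelector": {},  # an empty selector matches every pod
            "policyTypes": ["Ingress", "Egress"],
        },
    }


def allow_ingress(namespace, name, to_labels, from_namespace, from_labels, port):
    """Allow TCP traffic on one port from selected pods in another namespace."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "podSelector": {"matchLabels": to_labels},
            "policyTypes": ["Ingress"],
            "ingress": [{
                # Both selectors in one entry: the source pod must match both.
                "from": [{
                    "namespaceSelector": {
                        "matchLabels": {"kubernetes.io/metadata.name": from_namespace}
                    },
                    "podSelector": {"matchLabels": from_labels},
                }],
                "ports": [{"protocol": "TCP", "port": port}],
            }],
        },
    }


if __name__ == "__main__":
    # Hypothetical example: only the shop frontend may reach the payments API.
    policies = [
        default_deny("payments"),
        allow_ingress("payments", "allow-frontend-to-api",
                      {"app": "api"}, "shop", {"app": "frontend"}, 8443),
    ]
    for policy in policies:
        print(json.dumps(policy, indent=2))
```

Keeping one small allow policy per communication path makes each rule easy to review, at the cost of the policy count growing quickly.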

CoreDNS can bottleneck quickly. We saw this firsthand in a large-scale e-commerce deployment. Add caching with the cache plugin, and watch the ndots setting in each pod's resolv.conf: the Kubernetes default of ndots:5 sends any name with fewer than five dots through every search domain before trying it as written, so external lookups generate several wasted queries. Lowering ndots in the pod's dnsConfig, or using fully qualified names with a trailing dot, avoids them. Scale CoreDNS replicas with your cluster size as well; in our experience the default 2 replicas choke clusters over 500 nodes. For example, we scaled CoreDNS to 8 replicas for a 1000-node cluster and saw a significant reduction in latency.
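
To see how the ndots value changes the number of lookups, here is a simplified model of how a glibc-style resolver orders its attempts. It is a sketch rather than a full resolver: the search list below is the typical one for a pod in the default namespace, the external hostname is made up, and other resolver implementations may behave differently.

```python
def candidate_queries(name: str, search: list[str], ndots: int) -> list[str]:
    """Names the resolver tries, in order, until one of them resolves."""
    if name.endswith("."):
        return [name]  # fully qualified: the search list is skipped
    as_is = name + "."
    expanded = [f"{name}.{domain}." for domain in search]
    if name.count(".") >= ndots:
        return [as_is] + expanded  # enough dots: tried as written first
    return expanded + [as_is]  # too few dots: search domains come first


if __name__ == "__main__":
    search = ["default.svc.cluster.local", "svc.cluster.local", "cluster.local"]
    for ndots in (5, 2):
        print(f"ndots:{ndots}")
        for query in candidate_queries("api.example.com", search, ndots):
            print("  ", query)
```

With ndots:5 the external name is only tried as written after three failed cluster-suffix lookups. With ndots:2 it is tried first, and the resolver stops there if it resolves.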

Kubernetes Services are how traffic reaches your workloads, but they're not all the same. ClusterIP stays internal, NodePort opens a port on every node, and LoadBalancer uses cloud provider infrastructure. Ingress controllers like Nginx or Traefik add HTTP routing on top. We used Nginx for a SaaS application and had to carefully configure rate limiting and WAF integration to protect against abuse.

TLS termination isn't optional for public endpoints. Use cert-manager for automatic certificate management. Rate limiting and WAF integration protect against abuse. Ingress controllers without these become attack vectors.
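
A sketch of what this can look like for the ingress-nginx controller with cert-manager: the function below builds an Ingress manifest that requests a certificate through the cert-manager.io/cluster-issuer annotation and caps requests per second per client with the nginx.ingress.kubernetes.io/limit-rps annotation. The hostname, service, issuer name, and the "nginx" ingress class are assumptions for illustration. WAF integration is not shown.

```python
import json


def tls_ingress(name, namespace, host, service, port, issuer, limit_rps):
    """Ingress with TLS from cert-manager and a per-client rate limit."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "Ingress",
        "metadata": {
            "name": name,
            "namespace": namespace,
            "annotations": {
                # cert-manager issues and renews the certificate for spec.tls.
                "cert-manager.io/cluster-issuer": issuer,
                # Annotation values must be strings.
                "nginx.ingress.kubernetes.io/limit-rps": str(limit_rps),
            },
        },
        "spec": {
            "ingressClassName": "nginx",  # assumed class name
            "tls": [{"hosts": [host], "secretName": f"{name}-tls"}],
            "rules": [{
                "host": host,
                "http": {"paths": [{
                    "path": "/",
                    "pathType": "Prefix",
                    "backend": {"service": {"name": service, "port": {"number": port}}},
                }]},
            }],
        },
    }


if __name__ == "__main__":
    # Hypothetical names throughout.
    manifest = tls_ingress("web", "shop", "shop.example.com",
                           "frontend", 80, "letsencrypt-prod", 20)
    print(json.dumps(manifest, indent=2))
```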

Network Policies and Service definitions often diverge in practice. Audit your allow rules quarterly. Misconfigured policies can let traffic through while logs show blocked connections, creating a false sense of security. I recall a situation where a misconfigured policy allowed traffic on a non-standard port, which was only discovered during a quarterly audit.
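
Part of such an audit can be automated. The sketch below compares the ports each workload's Services expose with the ports its policies allow and reports the differences. The input shape is a hypothetical simplification: in practice you would first extract both maps from the cluster, for example from kubectl get output in JSON form.

```python
def audit_ports(service_ports: dict[str, set[int]],
                policy_ports: dict[str, set[int]]) -> dict[str, dict[str, set[int]]]:
    """Report, per workload, where allowed ports and exposed ports disagree."""
    findings = {}
    for workload in sorted(service_ports.keys() | policy_ports.keys()):
        exposed = service_ports.get(workload, set())
        allowed = policy_ports.get(workload, set())
        extra = allowed - exposed      # open in policy, but no Service uses it
        blocked = exposed - allowed    # Service port that policy never allows
        if extra or blocked:
            findings[workload] = {
                "allowed_but_unused": extra,
                "exposed_but_blocked": blocked,
            }
    return findings


if __name__ == "__main__":
    # Hypothetical data: the api policy still allows a stray port 9999.
    services = {"api": {8443}, "frontend": {80, 443}}
    policies = {"api": {8443, 9999}, "frontend": {443}}
    for workload, result in audit_ports(services, policies).items():
        print(workload, result)
```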

Test policy enforcement with pod-to-pod traffic from all network planes. Simulate egress to external services too. Policies that block internal traffic but ignore outbound paths leave escape routes for compromised pods.
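
One way to run such a test is a small probe executed from inside a pod: list the connections that should and should not succeed, attempt each one, and report every mismatch. This is a minimal sketch using plain TCP connects. The expectation format and function names are assumptions, and a real test would run the probe from several source pods and namespaces.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, timed out, unreachable, or name not resolved
        return False


def check_expectations(expectations):
    """Compare actual reachability with expectations.

    expectations is a list of (host, port, should_connect) tuples.
    Returns a list of (host, port, expected, actual) for every mismatch.
    """
    failures = []
    for host, port, expected in expectations:
        actual = can_connect(host, port)
        if actual != expected:
            failures.append((host, port, expected, actual))
    return failures
```

A run from a frontend pod might pass entries such as ("api.payments.svc.cluster.local", 8443, True) for an allowed internal path and ("example.com", 443, False) for egress that should be blocked. Both hostnames are placeholders. An empty result means every path behaved as the policies intend.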