How to size Kubernetes clusters so you don't overpay

A node costs the same whether its capacity is used or idle, and the pods that sit on it often use far less than they reserve.

Kubernetes schedules by the request you declare, not by what the pod actually eats. A pod that asks for four CPUs will only land on a node with four unreserved CPUs, and it holds that reservation even if it burns just 200 millicores (200m). No other pod can claim the idle remainder.
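
To make the request-versus-usage point concrete, here is a minimal sketch of the fit check. It is a toy model, not the real scheduler: the Node class and its fields are names I made up for illustration, and it only tracks CPU.

```python
from dataclasses import dataclass, field


def parse_cpu(quantity: str) -> int:
    """Convert a Kubernetes CPU quantity ("4", "0.5", "200m") to millicores."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return round(float(quantity) * 1000)


@dataclass
class Node:
    """Hypothetical stand-in for a node; tracks CPU reservations only."""
    allocatable_m: int                                # allocatable CPU in millicores
    requested_m: list = field(default_factory=list)   # requests of pods already placed

    def fits(self, request: str) -> bool:
        # Only declared requests are summed; live usage is never consulted.
        return sum(self.requested_m) + parse_cpu(request) <= self.allocatable_m

    def place(self, request: str) -> bool:
        if not self.fits(request):
            return False
        self.requested_m.append(parse_cpu(request))
        return True


node = Node(allocatable_m=parse_cpu("8"))
print(node.place("4"))     # True: 4 of 8 CPUs reserved
print(node.place("4"))     # True: node is now fully reserved
print(node.place("200m"))  # False: no room left, even if both pods sit idle
```

The third pod is turned away purely on paper. That is the moment a new node gets added to the bill.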

If you over‑request for safety, nodes fill up faster and the bill climbs. The cost inflates because the scheduler thinks the pod needs the full request, not the little it really needs.

For instance, I have seen pods that request 2 CPUs but use only 0.5 CPUs on average, which leaves 75 percent of the reservation idle. Monitoring pod usage with Prometheus and Grafana helps you find such cases and adjust the requests accordingly. In one case, adjusting the requests of a few pods cut the node count by 30 percent without affecting performance.
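
The waste figure is simple arithmetic, sketched below. The pod names and numbers in the sample are made up for illustration; in practice the averages would come from your monitoring stack.

```python
def request_waste(requested_cpu: float, avg_used_cpu: float) -> float:
    """Fraction of a CPU request that is reserved but unused on average."""
    if requested_cpu <= 0:
        raise ValueError("requested_cpu must be positive")
    # Clamp at zero: a pod that bursts above its request wastes nothing.
    return max(0.0, (requested_cpu - avg_used_cpu) / requested_cpu)


print(f"{request_waste(2.0, 0.5):.0%}")  # 75%

# Hypothetical sample: pod name -> (requested CPUs, average used CPUs).
pods = {"checkout": (2.0, 0.5), "search": (1.0, 0.8), "reports": (4.0, 0.4)}

# Worst offenders first, so you know which manifests to fix before the others.
ranked = sorted(pods.items(), key=lambda kv: request_waste(*kv[1]), reverse=True)
for name, (requested, used) in ranked:
    print(f"{name}: {request_waste(requested, used):.0%} of the request is idle")
```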

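The Vertical Pod Autoscaler in recommendation mode (updateMode "Off") watches usage and publishes suggested request values without touching running pods. In production I run it that way, watch the numbers, then update the manifests. Auto mode applies the numbers itself, and it does so by evicting pods so they restart with the new requests, which can kick a pod off its node at a bad moment.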
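
A recommendation-only VPA object is small. The sketch below builds one as a Python dict and prints it as JSON, which kubectl accepts. It assumes the VPA components are installed in the cluster; the object, namespace, and Deployment names are placeholders.

```python
import json


def vpa_recommend_only(name: str, namespace: str, deployment: str) -> dict:
    """Build a VerticalPodAutoscaler that only recommends and never evicts."""
    return {
        "apiVersion": "autoscaling.k8s.io/v1",
        "kind": "VerticalPodAutoscaler",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "targetRef": {
                "apiVersion": "apps/v1",
                "kind": "Deployment",
                "name": deployment,
            },
            # "Off" means: compute recommendations, change nothing.
            "updatePolicy": {"updateMode": "Off"},
        },
    }


# Placeholder names; substitute your own workload.
manifest = vpa_recommend_only("checkout-vpa", "shop", "checkout")
print(json.dumps(manifest, indent=2))
```

Pipe the output into kubectl apply -f - and, after the recommender has seen some traffic, read the suggestions with kubectl describe vpa.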

Another approach is to read usage from the Kubernetes Metrics Server (the data behind kubectl top) and use it to inform node sizing. Metrics Server only reports current usage, so sample it over days or weeks, or lean on Prometheus for history, before acting. For example, if a node pool consistently runs at 50 percent CPU utilization, you may be able to reduce the node count or switch to a smaller instance type, such as from an n1-standard-8 to an n1-standard-4, provided the pods' requests still fit. I have seen this approach cut costs by 25 percent without affecting performance.
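
As a rough sketch of that sizing decision: given the current node count and a sustained utilization, how many same-sized nodes would carry the load at a higher target? The 70 percent target below is my assumption, not a rule, and the function ignores requests, which still have to fit on the smaller pool.

```python
import math


def nodes_needed(current_nodes: int, current_util: float, target_util: float) -> int:
    """Same-sized nodes needed to carry the current load at target utilization."""
    if not 0 <= current_util <= 1:
        raise ValueError("current_util must be between 0 and 1")
    if not 0 < target_util <= 1:
        raise ValueError("target_util must be in (0, 1]")
    load = current_nodes * current_util  # load expressed in whole-node units
    # Round before ceil so float noise (e.g. 10.000000000000002) does not add a node.
    return max(1, math.ceil(round(load / target_util, 9)))


# Ten nodes idling at 50 percent, repacked to an assumed 70 percent target.
print(nodes_needed(10, 0.50, 0.70))  # 8
```

Halving the instance size is the same calculation with the load spread over nodes that are half as big, so the node count roughly doubles while the bill per node halves.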

Quotas cap the total CPU and memory a namespace can request, while LimitRanges set per-container defaults and caps. Once a quota covers CPU or memory requests, a pod that declares no request is rejected unless a LimitRange supplies a default, so every workload carries an explicit request and teams see its cost footprint. Without quotas, one team can grow a pod until it swallows a node, leaving nothing for others.
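
Here is a minimal sketch of the pair for one namespace, again built as Python dicts and printed as JSON for kubectl. Every number and name is a placeholder I picked for illustration; size them from your own usage data.

```python
import json


def namespace_guardrails(namespace: str) -> list:
    """Return a ResourceQuota and a LimitRange for one namespace."""
    quota = {
        "apiVersion": "v1",
        "kind": "ResourceQuota",
        "metadata": {"name": "compute-quota", "namespace": namespace},
        "spec": {
            "hard": {  # placeholder ceilings for the whole namespace
                "requests.cpu": "20",
                "requests.memory": "40Gi",
                "limits.cpu": "40",
                "limits.memory": "80Gi",
            }
        },
    }
    limit_range = {
        "apiVersion": "v1",
        "kind": "LimitRange",
        "metadata": {"name": "container-defaults", "namespace": namespace},
        "spec": {
            "limits": [
                {
                    "type": "Container",
                    # Applied when a container declares no request or limit.
                    "defaultRequest": {"cpu": "250m", "memory": "256Mi"},
                    "default": {"cpu": "500m", "memory": "512Mi"},
                    # Hard per-container ceiling.
                    "max": {"cpu": "2", "memory": "4Gi"},
                }
            ]
        },
    }
    return [quota, limit_range]


bundle = {"apiVersion": "v1", "kind": "List", "items": namespace_guardrails("team-a")}
print(json.dumps(bundle, indent=2))
```

I keep the defaults deliberately small. A team that needs more has to say so in its manifest, which is exactly the visibility you want.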

The VM SKU for a node pool should match the dominant workload. A Java service with a 2 GB heap belongs on a memory-optimized instance, while a Spark batch job belongs on a compute-optimized one. Mixing them in a general-purpose pool forces one side to sit idle. For example, an n1-highmem-4 (4 vCPUs, 26 GB) carries almost twice the memory of an n1-standard-4 (4 vCPUs, 15 GB). Each highmem node costs more, but a memory-bound Java service needs fewer of them, which can cut the pool's cost by around 20 percent.
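
The reasoning is easier to see as arithmetic. The sketch below counts nodes by whichever resource runs out first. The vCPU and memory figures are the published n1 shapes, but the hourly prices are made-up relative units, not real list prices, and the workload totals are hypothetical. It also ignores the capacity each node reserves for the system.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class MachineType:
    name: str
    vcpus: int
    memory_gb: float
    hourly_price: float  # hypothetical relative units, NOT real pricing


def nodes_for(total_cpu: float, total_mem_gb: float, machine: MachineType) -> int:
    """Nodes needed when the scarcer resource decides the count."""
    by_cpu = math.ceil(total_cpu / machine.vcpus)
    by_mem = math.ceil(total_mem_gb / machine.memory_gb)
    return max(by_cpu, by_mem)


def pool_cost(total_cpu: float, total_mem_gb: float, machine: MachineType) -> float:
    return nodes_for(total_cpu, total_mem_gb, machine) * machine.hourly_price


standard = MachineType("n1-standard-4", vcpus=4, memory_gb=15, hourly_price=1.00)
highmem = MachineType("n1-highmem-4", vcpus=4, memory_gb=26, hourly_price=1.25)

# Hypothetical memory-heavy workload: 16 vCPUs and 200 GB of requests in total.
for machine in (standard, highmem):
    count = nodes_for(16, 200, machine)
    cost = pool_cost(16, 200, machine)
    print(f"{machine.name}: {count} nodes, {cost:.2f} units per hour")
```

With these inputs memory decides the node count on both machine types, so the general-purpose pool pays for CPUs that nothing will ever request.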

Managing several node pools adds operational overhead, but the savings outweigh the cost. In my experience the extra cost of running a second pool is less than the CPU or memory wasted by forcing everything into a single pool. Tools like the Kubernetes Cluster Autoscaler and, on GKE, node auto-provisioning reduce this overhead by adding and removing nodes automatically. Both react to pending pods and their requests rather than to measured usage, so accurate requests matter here too.