Kubernetes introduces observability requirements that differ from traditional VM monitoring. Workloads are ephemeral, scheduling is dynamic, and the abstractions (Deployment, ReplicaSet, Pod) are layered, so metrics have to be aggregated at the right level: per Deployment or per namespace rather than per short-lived Pod.

The four golden signals for Kubernetes

The four golden signals are traffic (request rate, in requests per second), errors (error rate), latency (request duration), and saturation (resource utilisation). In Kubernetes they apply at multiple layers: the cluster level (node CPU/memory saturation), the workload level (pod error rate, latency), and the application level (business-level request rates). Dashboards that aggregate these signals at each level provide a drill-down path from cluster health to individual pod behaviour.
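As a rough illustration of how the four signals reduce to numbers for one workload, the sketch below computes them from a window of request records plus a CPU usage reading. The Request type, the golden_signals function, and the choice of HTTP 5xx as the error condition are assumptions made for this example; in practice these values would usually come from Prometheus queries rather than raw records.

```python
from dataclasses import dataclass


@dataclass
class Request:
    duration_s: float  # time taken to serve the request, in seconds
    status: int        # HTTP status code (assumption: 5xx counts as an error)


def golden_signals(requests, window_s, cpu_used_cores, cpu_capacity_cores):
    """Summarise one observation window for a single workload.

    window_s and cpu_capacity_cores are assumed to be greater than zero.
    """
    n = len(requests)
    errors = sum(1 for r in requests if r.status >= 500)
    durations = sorted(r.duration_s for r in requests)
    # Nearest-rank 95th percentile, using integer arithmetic for the rank.
    p95 = durations[(95 * n + 99) // 100 - 1] if n else 0.0
    return {
        "traffic_rps": n / window_s,
        "error_ratio": errors / n if n else 0.0,
        "latency_p95_s": p95,
        "saturation": cpu_used_cores / cpu_capacity_cores,
    }
```

The same function can be applied per pod, per Deployment, or per cluster by changing which records and capacity figures are passed in, which mirrors the drill-down path described above.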

kube-state-metrics for object state

The kube-state-metrics exporter provides Prometheus metrics about the state of Kubernetes objects: the number of desired vs available replicas in a Deployment, the phase of each Pod (for example Pending, Running, or Failed), PersistentVolumeClaim bound/unbound status, and CronJob last schedule time. These metrics show how the Kubernetes control plane is managing workloads, not how much CPU or memory the workloads themselves consume.
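The desired vs available comparison can be sketched as follows. The example reads a small sample in Prometheus text format and reports Deployments with fewer available replicas than desired. The metric names kube_deployment_spec_replicas and kube_deployment_status_replicas_available are the ones kube-state-metrics publishes; the sample data, the degraded_deployments function, and the deliberately simplified parser (no escaped quotes, no timestamps) are assumptions for illustration only.

```python
import re

# Simplified matchers for lines such as: name{label="value",...} 3
SAMPLE_RE = re.compile(r'^(\w+)\{([^}]*)\}\s+(\S+)$')
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')

# Hypothetical scrape output, trimmed to the two metrics of interest.
SAMPLE = """
# TYPE kube_deployment_spec_replicas gauge
kube_deployment_spec_replicas{namespace="shop",deployment="web"} 3
kube_deployment_spec_replicas{namespace="shop",deployment="api"} 2
kube_deployment_status_replicas_available{namespace="shop",deployment="web"} 1
kube_deployment_status_replicas_available{namespace="shop",deployment="api"} 2
"""


def degraded_deployments(exposition_text):
    """Return (namespace, deployment, desired, available) where available < desired."""
    desired, available = {}, {}
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_RE.match(line)
        if not match:
            continue
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels))
        key = (labels.get("namespace"), labels.get("deployment"))
        if None in key:
            continue  # not a per-Deployment series
        if name == "kube_deployment_spec_replicas":
            desired[key] = float(value)
        elif name == "kube_deployment_status_replicas_available":
            available[key] = float(value)
    return sorted(
        (ns, dep, int(want), int(available.get((ns, dep), 0)))
        for (ns, dep), want in desired.items()
        if available.get((ns, dep), 0) < want
    )
```

Calling degraded_deployments(SAMPLE) flags only the web Deployment, which wants 3 replicas and has 1 available. In a real setup the same comparison would normally be written as an alerting rule rather than a script.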

Container resource metrics

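The kubelet exposes container CPU and memory metrics in two ways: through the resource metrics API (served by metrics-server and used by kubectl top) and in Prometheus format from its embedded cAdvisor endpoint. Key metrics: container_cpu_usage_seconds_total (a cumulative counter of CPU seconds consumed, so it has to be turned into a rate to show cores in use), container_memory_working_set_bytes (memory in use excluding inactive file cache, which makes it a better comparison against limits than RSS), and container_oom_events_total (a count of out-of-memory kills). Alerting on OOM events flags memory limit misconfiguration at the first kill, before repeated restarts cause wider production instability.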
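Because the CPU metric is a cumulative counter, a usable "cores in use" figure comes from its rate of change. The sketch below shows that calculation on raw (timestamp, value) samples, with a simple treatment of counter resets (the value drops when a container restarts), plus a working-set-to-limit ratio. The function names and sample values are hypothetical, and this is a simplification of what a query engine such as Prometheus does with rate(), not a reimplementation of it.

```python
def counter_rate(samples):
    """Per-second rate of a cumulative counter.

    samples: list of (timestamp_s, value) tuples ordered by time.
    A drop in value is treated as a counter reset, so the post-reset
    value is counted as the increase for that interval.
    """
    if len(samples) < 2:
        return 0.0
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        increase += cur - prev if cur >= prev else cur
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed if elapsed > 0 else 0.0


def memory_limit_ratio(working_set_bytes, limit_bytes):
    """Fraction of the memory limit in use; None when no limit is set."""
    if not limit_bytes:
        return None
    return working_set_bytes / limit_bytes
```

For example, samples of container_cpu_usage_seconds_total reading 100, 115, and 130 at 30-second intervals give counter_rate a result of 0.5, meaning the container used half a core on average over that minute.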

Persistent volume monitoring

PersistentVolumes and StorageClasses need monitoring that traditional infrastructure monitoring does not cover. Key metrics: kubelet_volume_stats_capacity_bytes vs kubelet_volume_stats_used_bytes (disk usage per PVC), and PVC binding status from kube-state-metrics. Stateful workloads (databases, message queues) with PVCs need disk usage alerting with sufficient lead time for PVC expansion before the disk fills and the workload fails.
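One way to reason about lead time is to extrapolate recent usage growth and estimate when the volume would fill. The sketch below fits a least-squares line to (timestamp, used bytes) samples and divides the remaining space by the slope. The seconds_until_full function and its inputs are assumptions for illustration; the linear-growth model is itself an assumption that will not hold for bursty workloads.

```python
def seconds_until_full(samples, capacity_bytes):
    """Estimate seconds until a volume fills, assuming linear growth.

    samples: list of (timestamp_s, used_bytes) tuples ordered by time,
             e.g. readings of kubelet_volume_stats_used_bytes for one PVC.
    capacity_bytes: the matching kubelet_volume_stats_capacity_bytes value.
    Returns None when there is too little data or usage is not growing.
    """
    if len(samples) < 2:
        return None
    mean_t = sum(t for t, _ in samples) / len(samples)
    mean_u = sum(u for _, u in samples) / len(samples)
    # Least-squares slope: growth in bytes per second.
    numerator = sum((t - mean_t) * (u - mean_u) for t, u in samples)
    denominator = sum((t - mean_t) ** 2 for t, _ in samples)
    if denominator == 0:
        return None
    slope = numerator / denominator
    if slope <= 0:
        return None  # flat or shrinking usage: no projected fill time
    remaining = capacity_bytes - samples[-1][1]
    return max(remaining, 0) / slope
```

An alert could then fire when the estimate drops below the time needed to expand the PVC, for instance a few days for a database volume, rather than at a fixed percentage threshold.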