Kubernetes Observability

Kubernetes workloads are ephemeral and scheduling is dynamic, a far cry from traditional VM monitoring, where hosts are long-lived and their identities are stable. Monitoring has to follow pods as they come and go, so we need to rethink how we collect and aggregate metrics.

The four golden signals are essential for Kubernetes: latency, traffic, errors, and saturation. (The closely related RED method tracks rate, errors, and duration.) These signals apply at multiple layers: cluster, workload, and application. To make sense of them, we need dashboards that drill down from cluster health to individual pod behaviour.
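To make the signals concrete, here is a minimal sketch that summarises one observation window for a single workload into request duration (latency), request rate (traffic), errors, and saturation. The `Request` record, the function name, and the choice of CPU usage against its limit as the saturation measure are all illustrative assumptions, not part of any Kubernetes API.

```python
from dataclasses import dataclass


@dataclass
class Request:
    duration_s: float  # time taken to serve the request, in seconds
    status: int        # HTTP status code returned


def golden_signals(requests, window_s, cpu_used_cores, cpu_limit_cores):
    """Summarise one observation window into the four signals (hypothetical helper)."""
    if not requests:
        raise ValueError("need at least one request in the window")
    durations = sorted(r.duration_s for r in requests)
    # Nearest-rank 95th percentile, computed with integer arithmetic: ceil(0.95 * n).
    rank = (95 * len(durations) + 99) // 100
    errors = sum(1 for r in requests if r.status >= 500)
    return {
        "latency_p95_s": durations[rank - 1],
        "traffic_rps": len(requests) / window_s,
        "error_ratio": errors / len(requests),
        # Assumption: saturation is approximated by CPU usage relative to the limit.
        "saturation": cpu_used_cores / cpu_limit_cores,
    }
```

The same summary could be computed per pod, per workload, or per cluster, which is what lets a dashboard drill down from one layer to the next.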

In practice, implementing these golden signals means trading off metric collection frequency against data retention and storage cost. Collecting metrics every 10 seconds gives better visibility into transient issues, but it also increases the load on the metrics pipeline and the amount of data to store. We have seen overly aggressive collection cause performance problems in Prometheus, with query latencies increasing by as much as 30% on large datasets. Tools like VictoriaMetrics or Thanos can help mitigate these issues by providing more efficient storage and querying capabilities.
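A rough back-of-the-envelope calculation gives a feel for this trade-off. The sketch below assumes a fixed number of active series and an average of about 2 bytes per compressed sample. Both figures and the function names are illustrative assumptions, not measurements from any particular cluster.

```python
SECONDS_PER_DAY = 86_400


def samples_per_day(active_series, scrape_interval_s):
    """Samples ingested per day for a fixed number of active series."""
    return active_series * SECONDS_PER_DAY // scrape_interval_s


def storage_bytes(active_series, scrape_interval_s, retention_days, bytes_per_sample=2.0):
    """Rough on-disk size; bytes_per_sample is an assumed post-compression average."""
    return samples_per_day(active_series, scrape_interval_s) * retention_days * bytes_per_sample


if __name__ == "__main__":
    # Compare three scrape intervals for a hypothetical one million active series.
    for interval in (10, 30, 60):
        gib = storage_bytes(1_000_000, interval, retention_days=15) / 2**30
        print(f"{interval:>2}s interval: {gib:,.1f} GiB for 15 days")
```

Under these assumptions, moving from a 60-second to a 10-second interval multiplies the ingested samples, and therefore the approximate storage, by six.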

When designing our metrics pipeline, we also need to consider the limitations of our tools and the blind spots they leave in our data. The kubelet's container metrics come from its embedded cAdvisor, which observes resource usage from outside the container and has no visibility into the container's internal state, so the resulting metrics can be inaccurate or incomplete. We have seen cases where container crashes or restarts were not properly captured by the kubelet, resulting in incomplete or misleading metrics. Because cAdvisor is already the source of these metrics, closing the gap generally takes complementary sources such as container-level monitoring agents, which provide more comprehensive visibility into container performance and help identify potential issues before they become incidents.

Another important consideration is the need to correlate metrics across different layers of the system. For example, a spike in errors at the application layer may be related to a saturation issue at the cluster level, or a disk usage issue with a Persistent Volume. Using tools like Grafana or Kibana can help us create dashboards that bring together metrics from different sources and provide a more complete picture of system performance. We have seen cases where correlated metrics have helped us identify issues that would have been difficult or impossible to detect using a single metric or dashboard.

kube-state-metrics provides Prometheus metrics about Kubernetes object states. This includes the number of desired versus available replicas, pod states, and PersistentVolumeClaim status. These metrics give us visibility into the Kubernetes control plane's management of workloads.
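As an illustration of how these object-state metrics can be used, the sketch below reads Prometheus text-format output and reports Deployments whose available replicas fall short of the desired count. It assumes the kube-state-metrics series `kube_deployment_spec_replicas` and `kube_deployment_status_replicas_available` with `namespace` and `deployment` labels. The parser is deliberately simplified (no escaped quotes or braces inside label values), and the function name is hypothetical.

```python
import re

# Matches sample lines such as: metric_name{label="value",...} 3
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)'
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')

DESIRED = "kube_deployment_spec_replicas"
AVAILABLE = "kube_deployment_status_replicas_available"


def replica_gaps(exposition_text):
    """Return {(namespace, deployment): desired - available} for Deployments with a shortfall."""
    values = {DESIRED: {}, AVAILABLE: {}}
    for line in exposition_text.splitlines():
        match = SAMPLE_RE.match(line.strip())
        if not match or match["name"] not in values:
            continue  # skips comments, blank lines, and unrelated metrics
        labels = dict(LABEL_RE.findall(match["labels"]))
        key = (labels.get("namespace", ""), labels.get("deployment", ""))
        values[match["name"]][key] = float(match["value"])
    gaps = {}
    for key, desired in values[DESIRED].items():
        # A Deployment with no availability sample is treated as having zero available.
        available = values[AVAILABLE].get(key, 0.0)
        if available < desired:
            gaps[key] = desired - available
    return gaps
```

In a real pipeline the same comparison would more likely be written as a query or alert rule in the metrics backend. The Python form is only meant to show the logic.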

The kubelet exposes container CPU and memory metrics in Prometheus format, and the same usage data feeds the Kubernetes Metrics API, typically through metrics-server. Key metrics include container CPU usage, memory in use, and out-of-memory (OOM) kills. Alerting on OOM events can flag memory limit misconfiguration early, before repeated kills cause production instability.
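OOM kills are usually exposed as a cumulative counter (for example cAdvisor's `container_oom_events_total`, where that metric is available), so an alert needs the increase over a window rather than the raw value. The sketch below shows one way to compute that increase while tolerating counter resets. The function names and the threshold are assumptions for illustration.

```python
def counter_increase(samples):
    """Total increase of a cumulative counter over a window, tolerating resets."""
    total = 0.0
    for previous, current in zip(samples, samples[1:]):
        if current >= previous:
            total += current - previous
        else:
            # The counter restarted (for example after a kubelet restart),
            # so everything counted since the reset is new.
            total += current
    return total


def oom_alerts(series_by_container, threshold=1):
    """Return the containers whose OOM counter rose by at least `threshold`."""
    return sorted(
        name
        for name, samples in series_by_container.items()
        if counter_increase(samples) >= threshold
    )
```

For example, `oom_alerts({"api": [0, 0, 0], "worker": [2, 3]})` flags only `worker`, because its counter rose by one during the window.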

When monitoring container memory usage, we also need to consider memory leaks and other issues that are not immediately apparent from point-in-time metrics. For example, a container may be using a large and steadily growing amount of memory while still staying within its configured limits, so no threshold alert fires. Tools like Sysdig or Falco can help us detect anomalous container behaviour and identify potential issues before they become incidents. We have seen cases where these tools helped us detect memory leaks and other issues that would have been difficult or impossible to detect using metrics alone.
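A simple complement to threshold alerts is to look at the trend of memory usage rather than its current value. The sketch below splits a series of memory samples into equal windows and flags the container if every window's mean is higher than the previous one by a minimum fraction. The function name, the number of windows, and the 2% step are arbitrary assumptions that would need tuning per workload.

```python
from statistics import mean


def sustained_growth(samples, windows=4, min_step=0.02):
    """True if each window's mean exceeds the previous window's by at least min_step."""
    if windows < 2 or len(samples) < windows:
        raise ValueError("need at least two windows and one sample per window")
    size = len(samples) // windows
    # Drop the oldest leftover samples so every window has the same length.
    recent = samples[len(samples) - size * windows:]
    means = [mean(recent[i * size:(i + 1) * size]) for i in range(windows)]
    return all(
        later >= earlier * (1 + min_step)
        for earlier, later in zip(means, means[1:])
    )
```

A series that oscillates around a stable level does not trigger this check, while one that climbs steadily does, even if it never approaches the configured limit.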

Persistent volume monitoring is critical for stateful workloads. Key metrics include disk usage per PersistentVolumeClaim (PVC) and PVC binding status. We need to monitor disk usage and alert with enough lead time to expand the PVC before the disk fills and the workload fails.
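One way to express "sufficient lead time" is to extrapolate recent usage and alert when the projected time until the volume is full drops below the time needed to expand it. The sketch below fits a least-squares line to `(timestamp_s, used_bytes)` samples, which could come from kubelet volume statistics such as `kubelet_volume_stats_used_bytes` and `kubelet_volume_stats_capacity_bytes`. The function names and the linear-growth assumption are illustrative.

```python
def seconds_until_full(samples, capacity_bytes):
    """Extrapolate (timestamp_s, used_bytes) samples linearly to capacity.

    Returns None when usage is flat or shrinking, and 0.0 if already at capacity.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_u = sum(u for _, u in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        raise ValueError("samples must span more than one timestamp")
    # Least-squares slope, in bytes per second.
    slope = sum((t - mean_t) * (u - mean_u) for t, u in samples) / var_t
    if slope <= 0:
        return None
    last_t = samples[-1][0]
    fitted_now = mean_u + slope * (last_t - mean_t)  # fitted usage at the latest sample
    return max(0.0, (capacity_bytes - fitted_now) / slope)


def needs_expansion(samples, capacity_bytes, lead_time_s):
    """True if the volume is projected to fill within the required lead time."""
    remaining = seconds_until_full(samples, capacity_bytes)
    return remaining is not None and remaining <= lead_time_s
```

Linear extrapolation is a simplification. Bursty writers or periodic compaction may call for a longer sample window or a more conservative lead time.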