I've found that Kubernetes observability is a unique beast. Unlike monitoring a single application, you're dealing with infrastructure, workloads, and the cluster itself. Fortunately, the tooling has come a long way, so you don't have to start from scratch.

Prometheus is the de facto standard for Kubernetes metrics. Its pull model fits naturally with Kubernetes service discovery: once the scrape configuration honors the conventional prometheus.io annotations (or you define ServiceMonitor objects through the Prometheus Operator), Prometheus finds your pods automatically, with no per-pod configuration. With kube-state-metrics and Node Exporter you get a solid foundation for cluster state and host metrics.
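
To make the annotation convention concrete, here is a minimal Python sketch of the decision that the usual relabel rules encode. It is not Prometheus code: the pod dictionaries, the `scrape_target` helper, and the fallback port are hypothetical, and only the `prometheus.io/*` annotation keys come from the common convention.

```python
# Sketch of the opt-in logic behind annotation-based discovery.
# The pods, the helper name, and the default port are illustrative assumptions.
from typing import Optional

SCRAPE_KEY = "prometheus.io/scrape"
PATH_KEY = "prometheus.io/path"
PORT_KEY = "prometheus.io/port"


def scrape_target(pod: dict) -> Optional[str]:
    """Return the URL to scrape, or None if the pod has not opted in."""
    annotations = pod.get("annotations", {})
    if annotations.get(SCRAPE_KEY) != "true":
        return None  # plays the role of a "keep" rule dropping the target
    path = annotations.get(PATH_KEY, "/metrics")
    port = annotations.get(PORT_KEY, "8080")  # hypothetical fallback port
    return f"http://{pod['ip']}:{port}{path}"


pods = [
    {"ip": "10.0.0.5", "annotations": {SCRAPE_KEY: "true", PORT_KEY: "9102"}},
    {"ip": "10.0.0.6", "annotations": {}},  # no annotation, so it is skipped
]
targets = [t for t in (scrape_target(p) for p in pods) if t is not None]
print(targets)
```

Only the first pod ends up as a target, which is the whole point of the convention: workloads opt in themselves and the scrape configuration never names them.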

I ran the Prometheus Operator in a 200-node cluster with about 4,000 pods and quickly learned that the 15-second scrape interval we started with was eating more than a gigabyte of RAM on the server. Relaxing the interval to 30 seconds for low-frequency services and using relabel rules to drop unused metrics cut the memory footprint in half. Adding a Thanos sidecar let us ship blocks to S3 and keep a 30-day retention without blowing local disks, but the extra network traffic meant we had to provision a dedicated bandwidth slice or risk back-pressure on the scrape jobs.
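
The arithmetic behind that tuning is simple enough to sketch. The series count and the share of dropped metrics below are made-up numbers, not measurements from that cluster, and the sketch applies the 30-second interval to everything, which is a simplification of what we actually did.

```python
# Back-of-the-envelope ingestion estimate. All inputs are hypothetical.

def samples_per_second(active_series: int, scrape_interval_s: float) -> float:
    """Each active series yields one sample per scrape interval."""
    return active_series / scrape_interval_s


series_before = 2_000_000   # assumed active series before tuning
dropped_fraction = 0.25     # assumed share removed by relabel drop rules
series_after = int(series_before * (1 - dropped_fraction))

rate_15s = samples_per_second(series_before, 15)
rate_30s = samples_per_second(series_after, 30)

print(f"before: {series_before} series, {rate_15s:.0f} samples/s")
print(f"after:  {series_after} series, {rate_30s:.0f} samples/s")
```

The two levers do different things: a longer interval lowers the sample rate, while dropping metrics lowers the number of active series, and the latter tends to matter more for memory.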

Grafana sits on top of Prometheus, Loki, and Tempo as data sources, providing dashboards, alerting, and multi-source queries. The community has already done some of the work for you with pre-built Kubernetes dashboards on grafana.com/dashboards. Start with those and customize them to fit your team's needs.

Loki is a Prometheus-inspired log aggregation tool that uses the same label model. Logs are stored as compressed streams and indexed only by labels, which makes it much cheaper to run at scale than Elasticsearch. You do lose some full-text search power, but for structured log queries with LogQL it's competitive with Elasticsearch for most production workloads.

Deploying Loki at scale forced me to rethink label design. In one deployment we indexed every pod name, namespace, and container image tag, which resulted in a label cardinality of over 200,000, and queries started timing out even on simple LogQL filters. Switching to a model that only kept service and environment labels, and moving the raw chunk storage to an S3 bucket, reduced the index size by 70 percent and restored sub-second query latency. The trade-off is that you lose the ability to search by arbitrary pod name, so we added a small sidecar that writes a separate index for debugging rare cases.
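
A rough way to see why the label set matters: every distinct combination of label values is its own stream, so the worst case is the product of the per-label value counts. The counts below are hypothetical and chosen only to land in the same order of magnitude as the deployment described above.

```python
# Upper bound on Loki stream count for a given label schema.
# The per-label value counts are illustrative assumptions.
from math import prod
from typing import Mapping


def max_streams(distinct_values: Mapping[str, int]) -> int:
    """Worst case: every combination of label values actually occurs."""
    return prod(distinct_values.values())


wide_schema = {"namespace": 20, "pod": 500, "image_tag": 20}
narrow_schema = {"service": 50, "environment": 3}

print("wide:  ", max_streams(wide_schema))
print("narrow:", max_streams(narrow_schema))
```

Real clusters never hit the full product, since not every pod runs in every namespace, but high-churn labels such as pod names push the real number toward it quickly.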

A common mistake I see in alerting is paging on every possible metric. The result is noise, and the on-call engineer ends up ignoring all of it. Instead, define your SLOs and create multi-burn-rate alerts that fire when you're consuming your error budget too fast. Treat low-level resource alerts, such as CPU spikes, as debugging tools, not on-call pages.
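
Here is a minimal sketch of the multi-burn-rate check, written in Python rather than PromQL so the logic is easy to follow. The window pairs and thresholds (14.4 over 1h/5m, 6 over 6h/30m) are the commonly cited values from the Google SRE Workbook for a 30-day SLO; the error ratios and the helper names are hypothetical.

```python
# Multi-window, multi-burn-rate paging decision. Input ratios are made up.
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class BurnRateRule:
    long_window: str
    short_window: str
    threshold: float


# Commonly cited paging thresholds for a 30-day SLO window.
PAGE_RULES = [
    BurnRateRule("1h", "5m", 14.4),
    BurnRateRule("6h", "30m", 6.0),
]


def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' errors are arriving."""
    return error_ratio / (1.0 - slo_target)


def should_page(ratios: Dict[str, float], slo_target: float,
                rules: List[BurnRateRule]) -> bool:
    """Page only if both the long and the short window exceed the threshold."""
    for rule in rules:
        long_burn = burn_rate(ratios[rule.long_window], slo_target)
        short_burn = burn_rate(ratios[rule.short_window], slo_target)
        if long_burn > rule.threshold and short_burn > rule.threshold:
            return True
    return False


slo = 0.999
sustained = {"5m": 0.02, "30m": 0.012, "1h": 0.016, "6h": 0.004}
brief_spike = {"5m": 0.05, "30m": 0.004, "1h": 0.003, "6h": 0.001}

print(should_page(sustained, slo, PAGE_RULES))
print(should_page(brief_spike, slo, PAGE_RULES))
```

Requiring both windows is what keeps this quiet: the long window proves the problem is significant, and the short window proves it is still happening.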

The Alertmanager configuration is where the noise either dies or explodes. In one production rollout we were getting 250 alerts per minute during a rolling upgrade, most of them transient CPU spikes. By defining inhibition rules that suppress high-severity alerts when a lower-severity upgrade alert is firing, and by grouping alerts by service and severity, we trimmed the on-call stream to under 15 actionable alerts per hour. The downside is that you have to maintain the inhibition matrix, otherwise a critical failure can be hidden behind a benign upgrade alert.
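
The inhibition and grouping behavior can be modeled in a few lines of Python. This is a simplified simulation, not Alertmanager's implementation: the alert names, the services, and the rule itself are hypothetical, and matching is plain label equality.

```python
# Simplified model of one inhibition rule plus grouping by service and severity.
from collections import defaultdict

# Hypothetical rule: an upgrade alert mutes CPU alerts for the same service.
INHIBIT_RULE = {
    "source": {"alertname": "RollingUpgradeInProgress"},
    "target": {"alertname": "HighCPU"},
    "equal": ["service"],
}


def matches(labels: dict, matchers: dict) -> bool:
    return all(labels.get(k) == v for k, v in matchers.items())


def is_inhibited(alert: dict, firing: list, rule: dict) -> bool:
    """True if some other firing alert matches the source side of the rule
    and agrees with this alert on every label listed under 'equal'."""
    if not matches(alert, rule["target"]):
        return False
    return any(
        other is not alert
        and matches(other, rule["source"])
        and all(other.get(l) == alert.get(l) for l in rule["equal"])
        for other in firing
    )


firing = [
    {"alertname": "RollingUpgradeInProgress", "service": "checkout", "severity": "info"},
    {"alertname": "HighCPU", "service": "checkout", "severity": "critical"},
    {"alertname": "HighCPU", "service": "search", "severity": "critical"},
    {"alertname": "HighErrorRate", "service": "checkout", "severity": "critical"},
]

active = [a for a in firing if not is_inhibited(a, firing, INHIBIT_RULE)]

groups = defaultdict(list)
for alert in active:
    groups[(alert["service"], alert["severity"])].append(alert["alertname"])

for key, names in groups.items():
    print(key, names)
```

In this sketch the target side names a specific alert rather than a whole severity level, so the error-rate alert for the upgrading service still gets through. That is one way to limit the risk of an upgrade alert hiding a real failure, though it means more rules to maintain.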

The key to effective alerting is to focus on what matters: error budgets, not every low-level signal. Do that and your on-call engineer gets alerts they will actually respond to, rather than a stream of noise they learn to ignore.

When setting up your Kubernetes observability stack, start with Prometheus, Grafana, and Loki. Together they provide a solid foundation for monitoring your cluster, workloads, and applications. Don't forget to customize your dashboards to fit your team's needs.

With a well-configured observability stack, you'll be able to respond to issues quickly and make data-driven decisions to improve your applications and services. It's not just about monitoring. It's about understanding your system and making it better.

In my experience, a good Kubernetes observability setup is crucial for the success of your applications and services. It's not something you can just bolt on later; it needs to be part of your design from the start. With the right tools and mindset, you can achieve a high level of visibility and control.