Kubernetes observability is its own discipline, distinct from monitoring a single application: you're watching infrastructure, workloads, and the cluster itself. The good news is that the tooling has matured enough that you don't have to build any of it from scratch.
Prometheus is the standard
Nearly everyone uses Prometheus for Kubernetes. The pull model, where Prometheus scrapes targets over HTTP, fits naturally with Kubernetes service discovery: with the widely used annotation convention (plus a matching scrape config, which the community Helm chart ships by default), Prometheus finds your pods automatically. kube-state-metrics adds cluster-state metrics: pod counts, deployment availability, resource requests versus limits. Node Exporter adds host-level metrics. Together, this is your foundation.
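The annotation flow is a convention honored by a relabeling scrape config, not a Prometheus built-in. A pod opting in might look like this minimal sketch, where the pod name, image, port, and path values are illustrative assumptions:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: my-app                       # illustrative name
  annotations:
    prometheus.io/scrape: "true"     # opt this pod into scraping
    prometheus.io/port: "8080"       # port serving the metrics endpoint
    prometheus.io/path: "/metrics"   # /metrics is the usual default; shown for clarity
spec:
  containers:
    - name: app
      image: my-app:latest           # illustrative image
      ports:
        - containerPort: 8080
```

The scrape job's relabeling rules read these annotations to decide whether, and where, to scrape each pod.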
Grafana builds the dashboard layer
Grafana sits on top of Prometheus, Loki, and Tempo, giving you dashboards, alerting, and queries that span all three sources. The Grafana community publishes ready-to-use Kubernetes dashboards at grafana.com/dashboards covering cluster health, individual nodes, namespaces, and workloads. Start with those, customize them for the signals that matter to your team, and add application-level dashboards alongside the infrastructure ones. The real investment is making them actually useful for your context.
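Wiring Grafana to its sources is typically done with provisioning files rather than clicking through the UI. A sketch of a datasource provisioning file, assuming in-cluster service URLs that will differ in your setup:

```yaml
# grafana/provisioning/datasources/datasources.yaml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    url: http://prometheus-server.monitoring.svc:9090   # assumed in-cluster URL
    access: proxy
    isDefault: true
  - name: Loki
    type: loki
    url: http://loki-gateway.monitoring.svc:3100        # assumed in-cluster URL
    access: proxy
```

Keeping this in version control means a rebuilt Grafana comes up already connected to its backends.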
Loki for logs
Grafana Loki is a Prometheus-inspired log aggregation system that uses the same label model. Logs are stored as compressed streams and indexed only by their labels, not by content, which makes Loki dramatically cheaper to run at scale than Elasticsearch. You lose some full-text search power, but for structured, label-scoped queries with LogQL it's competitive for most production workloads.
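LogQL queries start from a label selector, then filter and parse log lines. Two sketches, with the namespace and app label values as illustrative assumptions:

```logql
# Error lines from one workload, parsed as JSON, filtered on a parsed field
{namespace="prod", app="checkout"} |= "error" | json | status >= 500

# Per-pod rate of error lines over 5 minutes, usable in dashboards and alerts
sum by (pod) (rate({namespace="prod", app="checkout"} |= "error" [5m]))
```

The label selector narrows the search to a few streams before any content filtering happens, which is exactly why the label-only index stays cheap.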
Alert on error budgets, not symptoms
The mistake I see constantly: alerting on every metric that could possibly go wrong. You end up with so many alerts that the oncall engineer ignores them all, because they're mostly noise. Do this instead: define your SLOs (availability, latency, error rate), then create multi-burn-rate alerts that fire when you're consuming your error budget dangerously fast: a long window confirms the burn is sustained, and a short window confirms it's still happening. Treat symptom-based alerts as debugging tools, not pages. Your oncall engineer will actually respond to alerts that matter.
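A fast-burn rule in Prometheus rule format might look like the following sketch. The metric name `http_requests_total`, the 99.9% availability target (an error budget of 0.001), and the 14.4x burn factor (which would exhaust a 30-day budget in about two days) are assumptions to adapt to your SLO:

```yaml
groups:
  - name: slo-burn-rate
    rules:
      - alert: ErrorBudgetFastBurn
        # Fires only when the 1h error ratio (sustained) AND the 5m error
        # ratio (still happening) both exceed 14.4x the budgeted error rate.
        expr: |
          (
            sum(rate(http_requests_total{code=~"5.."}[1h]))
              / sum(rate(http_requests_total[1h]))
          ) > (14.4 * 0.001)
          and
          (
            sum(rate(http_requests_total{code=~"5.."}[5m]))
              / sum(rate(http_requests_total[5m]))
          ) > (14.4 * 0.001)
        for: 2m
        labels:
          severity: page
```

Pair it with a slower-burn rule on longer windows (for example 6h and 30m) that opens a ticket instead of paging.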