When Operators Make Sense

Kubernetes operators extend the Kubernetes API with custom resources and controllers that automate complex application lifecycle management. Deciding when to write an operator rather than use Helm or simpler deployment tooling is clearer than it was three years ago.

What sets operators apart from Helm is the control loop they add. An operator watches a custom resource that declares the desired state, observes the actual state of the cluster, and reconciles the two. This lets it automate multi-step operations such as taking a database backup before an upgrade or managing leader election for stateful applications.
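The control loop can be sketched in plain Go without any Kubernetes dependency. This is a minimal illustration, not how any particular operator is written: `ClusterSpec`, `ClusterStatus`, the action names, and the `converge` helper are all hypothetical, and a real operator would read and write these values through the Kubernetes API.

```go
package main

import "fmt"

// ClusterSpec is a hypothetical desired state, as a user would declare it
// in a custom resource.
type ClusterSpec struct {
	Replicas int
	Version  string
}

// ClusterStatus is a hypothetical snapshot of what is actually running.
type ClusterStatus struct {
	Replicas int
	Version  string
}

// reconcile compares desired and observed state and returns the next single
// action to take, or "" when they already match. Deciding one step per pass
// keeps the loop level-triggered: every pass looks at the current state
// again instead of replaying a fixed script.
func reconcile(spec ClusterSpec, status ClusterStatus) string {
	switch {
	case status.Version != spec.Version:
		return "backup-then-upgrade"
	case status.Replicas < spec.Replicas:
		return "scale-up"
	case status.Replicas > spec.Replicas:
		return "scale-down"
	default:
		return ""
	}
}

// apply simulates carrying out an action against the running system.
func apply(action string, spec ClusterSpec, status ClusterStatus) ClusterStatus {
	switch action {
	case "backup-then-upgrade":
		status.Version = spec.Version
	case "scale-up":
		status.Replicas++
	case "scale-down":
		status.Replicas--
	}
	return status
}

// converge runs reconcile passes until nothing is left to do or maxPasses
// is reached, and returns the final state plus the actions taken.
func converge(spec ClusterSpec, status ClusterStatus, maxPasses int) (ClusterStatus, []string) {
	var actions []string
	for i := 0; i < maxPasses; i++ {
		action := reconcile(spec, status)
		if action == "" {
			break
		}
		actions = append(actions, action)
		status = apply(action, spec, status)
	}
	return status, actions
}

func main() {
	spec := ClusterSpec{Replicas: 3, Version: "15"}
	status := ClusterStatus{Replicas: 1, Version: "14"}
	final, actions := converge(spec, status, 10)
	fmt.Println(actions, final)
}
```

Note that the upgrade is handled before any scaling, which is the kind of ordering decision a static template cannot make.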

The operator pattern is best suited to stateful applications with complex operational requirements. I've seen this with databases (PostgreSQL operators, Elastic's ECK operator for Elasticsearch, Redis operators), message brokers (Strimzi for Kafka), and certificate management tools like cert-manager.

These stateful applications have lifecycle operations that require application-specific knowledge that cannot be expressed in static Helm templates. For instance, scaling, backing up, restoring, upgrading, and failing over all need custom logic that operators can provide.

Operator SDK from Red Hat and Kubebuilder, a Kubernetes SIG project, are the two primary frameworks for writing operators in Go. Both provide scaffolding, code generation, and testing utilities that make it easier to write a correct, production-grade operator.

Both are built on the controller-runtime library, and both have improved significantly over time. They reduce the boilerplate required to write an operator, which makes custom operators more feasible for development teams.

However, writing and maintaining an operator requires ongoing Go development and Kubernetes expertise. For stateless applications that can be deployed with standard Kubernetes resources, Helm provides sufficient lifecycle management.

So when should you write an operator? The decision heuristic is simple: does the application have operational logic that requires a controller loop and custom resources? If the answer is no, Helm or Kustomize is sufficient.

When we built a PostgreSQL operator for a fintech customer, the reconciliation loop automated daily backups, point-in-time recovery, and read replica scaling. This reduced their RTO from 45 minutes to under 5 minutes during failover, but required 200+ hours of testing to avoid cascading failures during leader elections. The same team initially tried Helm for database upgrades, only to discover that rolling updates would corrupt their sharded data unless the operator enforced sequential pod restarts.
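The sequential-restart rule is simple to state in code. The sketch below is an assumption-laden illustration, not the operator we shipped: `Pod`, `restartSequentially`, and `fakeRestart` are hypothetical names, and readiness is simulated with a boolean rather than read from the Kubernetes API.

```go
package main

import (
	"errors"
	"fmt"
)

// Pod is a hypothetical stand-in for a database pod managed by the operator.
type Pod struct {
	Name  string
	Ready bool
}

var errNotReady = errors.New("pod did not become ready")

// restartSequentially restarts pods strictly one at a time and stops at the
// first pod that fails to come back, so later pods are never touched while
// an earlier one is unhealthy. It returns the names restarted successfully.
func restartSequentially(pods []Pod, restart func(*Pod) error) ([]string, error) {
	var restarted []string
	for i := range pods {
		p := &pods[i]
		if err := restart(p); err != nil {
			return restarted, fmt.Errorf("restart %s: %w", p.Name, err)
		}
		if !p.Ready {
			return restarted, fmt.Errorf("%s: %w", p.Name, errNotReady)
		}
		restarted = append(restarted, p.Name)
	}
	return restarted, nil
}

// fakeRestart simulates a restart: every pod comes back ready except the
// one named in failing. It exists only to make the sketch runnable.
func fakeRestart(failing string) func(*Pod) error {
	return func(p *Pod) error {
		p.Ready = p.Name != failing
		return nil
	}
}

func main() {
	pods := []Pod{{Name: "db-0", Ready: true}, {Name: "db-1", Ready: true}, {Name: "db-2", Ready: true}}
	restarted, err := restartSequentially(pods, fakeRestart("db-1"))
	fmt.Println(restarted, err)
}
```

In a real controller the wait for readiness would be a requeue rather than a blocking check, but the invariant is the same: never move to the next pod until the previous one is healthy.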

Operator SDK's CRD generation tooling (controller-gen) is invaluable, but it exposes a silent gotcha: if you regenerate CRDs after modifying your Go structs, you can break existing clusters unless you handle schema migrations carefully. We spent three days debugging a production outage caused by mishandled field pruning on one struct field (controller-gen's marker for this is `// +kubebuilder:pruning:PreserveUnknownFields`), which allowed stale configuration to linger in etcd and trigger a reconciliation loop storm.
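One way to keep regenerated CRDs compatible with objects already stored in a cluster is to make schema changes additive and optional. The sketch below shows that idea on a hypothetical `PostgresClusterSpec`; the type and field names are illustrative, and only the marker comments (`+kubebuilder:validation:Minimum`, `+kubebuilder:default`, `+optional`) are real controller-gen markers. The decode step uses plain `encoding/json` as a stand-in for reading a stored object.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// PostgresClusterSpec is a hypothetical spec type of the kind controller-gen
// turns into a CRD schema.
type PostgresClusterSpec struct {
	// Replicas existed in the first release of the schema.
	// +kubebuilder:validation:Minimum=1
	// +kubebuilder:default=1
	Replicas int32 `json:"replicas"`

	// Version existed in the first release of the schema.
	Version string `json:"version"`

	// BackupSchedule was added in a later release. Marking it optional and
	// using omitempty means objects stored before the field existed still
	// decode and validate against the regenerated schema.
	// +optional
	BackupSchedule string `json:"backupSchedule,omitempty"`
}

// decodeSpec parses a stored object's spec. It stands in for what the API
// machinery does when an older object is read with the newer Go type.
func decodeSpec(raw string) (PostgresClusterSpec, error) {
	var spec PostgresClusterSpec
	err := json.Unmarshal([]byte(raw), &spec)
	return spec, err
}

func main() {
	// An object written before BackupSchedule was introduced.
	old := `{"replicas": 3, "version": "15"}`
	spec, err := decodeSpec(old)
	fmt.Println(spec, err)
}
```

Renaming or removing a field, or making a new field required, is where existing objects start failing, and that is the case that needs an explicit migration plan.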

For stateless apps, I've seen teams waste months building operators for applications that could have used a simple Deployment with preStop hooks. One example was a microservices team writing an operator to manage Redis cache invalidation, when a 15-line Kubernetes CronJob with a time-based TTL would have sufficed. Operators shine when you need to track state across multiple resources, such as ensuring a Kafka topic exists before deploying a consumer, which is hard to express in pure YAML.
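The Kafka example comes down to a reconcile step that refuses to proceed until its dependency exists and asks to be retried later. A minimal sketch, assuming a hypothetical `TopicChecker` interface and a locally defined `Result` type (this is not controller-runtime's `Result`, just the same idea in miniature):

```go
package main

import (
	"fmt"
	"time"
)

// Result is a hypothetical outcome of one reconcile pass.
type Result struct {
	Deployed     bool
	RequeueAfter time.Duration // non-zero means "try again later"
}

// TopicChecker is a hypothetical view of the broker; a real operator would
// query Kafka or a topic custom resource instead.
type TopicChecker interface {
	TopicExists(name string) bool
}

// fakeBroker is an in-memory TopicChecker used to make the sketch runnable.
type fakeBroker map[string]bool

func (b fakeBroker) TopicExists(name string) bool { return b[name] }

// reconcileConsumer deploys the consumer only once its topic exists.
// Until then it does nothing and requests a retry, which is the
// cross-resource ordering that static manifests cannot express.
func reconcileConsumer(topic string, broker TopicChecker, deploy func()) Result {
	if !broker.TopicExists(topic) {
		return Result{RequeueAfter: 10 * time.Second}
	}
	deploy()
	return Result{Deployed: true}
}

func main() {
	broker := fakeBroker{}
	deploy := func() { fmt.Println("deploying consumer") }

	fmt.Println(reconcileConsumer("orders", broker, deploy)) // topic missing
	broker["orders"] = true
	fmt.Println(reconcileConsumer("orders", broker, deploy)) // topic present
}
```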

The Prometheus Operator's ability to turn ServiceMonitor resources into scrape configuration and keep it current across 100+ clusters is a classic use case. But it also illustrates a trade-off: the operator's tight coupling to Prometheus means you're locked into that ecosystem unless you build an adapter layer. This rigidity cost us three weeks when a customer wanted to switch to VictoriaMetrics for cost reasons, forcing a custom operator rewrite to avoid duplicating logic.