Kubernetes Storage for Stateful Workloads

I quickly learned that running stateful workloads on Kubernetes requires understanding the PersistentVolume subsystem and the StatefulSet workload type. These are more complex than stateless Deployment workloads but necessary for databases, message queues, and other stateful systems.

The storage abstraction in Kubernetes is the PersistentVolume. A pod requests storage through a PersistentVolumeClaim, which binds to a PersistentVolume. With dynamic provisioning, a StorageClass creates the PersistentVolume automatically when the claim is made. In AKS, the default StorageClass provisions Azure Managed Disks, which are mounted ReadWriteOnce, meaning by a single node at a time. For workloads that need shared storage across pods, Azure Files StorageClasses provide the ReadWriteMany access mode, allowing multiple pods to mount the same volume simultaneously.
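To make the claim-to-class relationship concrete, here is a minimal sketch that builds two PersistentVolumeClaim manifests as plain Python dictionaries. The claim names and sizes are hypothetical, and the StorageClass names managed-csi and azurefile-csi are assumptions about what the cluster offers; check `kubectl get storageclass` for the real names.

```python
import json


def build_pvc(name, storage_class, size, access_mode="ReadWriteOnce"):
    """Return a PersistentVolumeClaim manifest as a plain dict."""
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": name},
        "spec": {
            "accessModes": [access_mode],
            # The StorageClass decides which kind of volume gets provisioned.
            "storageClassName": storage_class,
            "resources": {"requests": {"storage": size}},
        },
    }


# Disk-backed claim for a single pod (class name is an assumption).
db_claim = build_pvc("db-data", "managed-csi", "128Gi")

# File-share-backed claim that several pods can mount at once
# (class name is an assumption).
shared_claim = build_pvc("shared-assets", "azurefile-csi", "100Gi", "ReadWriteMany")

# kubectl accepts JSON as well as YAML, so this output can be applied as-is.
print(json.dumps(db_claim, indent=2))
```

Only the StorageClass name and the access mode differ between the two claims, which is the point of the abstraction: the workload asks for storage and the class decides what backs it.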

In my experience, choosing the right StorageClass is critical for performance. Azure Managed Disks provide high-performance storage, but they cost more than Azure Blob Storage. Azure Files StorageClasses are less expensive, but may have higher latency. I have seen deployments where the StorageClass was chosen on cost alone, only for the performance to turn out unacceptable. In one case, a Cassandra cluster that required low-latency storage was initially deployed on Azure Blob Storage and had to be migrated to Azure Managed Disks because of high latency. That experience taught me to evaluate the trade-off between cost and performance before selecting a StorageClass, not after.

StatefulSets provide guarantees that Deployments do not. They give each pod a stable identity: pods are named after the StatefulSet plus an ordinal index (for example web-0 and web-1 for a StatefulSet named web) rather than receiving random suffixes. They also provide stable network identity, with DNS names that persist across pod restarts. Finally, by default they deploy and scale in order, so the pod with ordinal 0 starts before ordinal 1, which starts before ordinal 2. These guarantees are required for clustered databases and message queues that need to know which instance is which and need to bootstrap in order.
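The naming and ordering rules can be sketched in a few lines. This is an illustration of the convention, not controller code: pod names are the StatefulSet name plus an ordinal, pods start in ascending ordinal order under the default ordered policy, and scale-down removes the highest ordinal first. The StatefulSet name kafka is hypothetical.

```python
def statefulset_pod_names(statefulset_name, replicas):
    """Pod names in startup order: <statefulset-name>-0, -1, ..."""
    return [f"{statefulset_name}-{ordinal}" for ordinal in range(replicas)]


def scale_down_order(statefulset_name, replicas):
    """Pods are removed in reverse ordinal order, highest first."""
    return list(reversed(statefulset_pod_names(statefulset_name, replicas)))


# Hypothetical three-replica StatefulSet named "kafka".
startup = statefulset_pod_names("kafka", 3)
teardown = scale_down_order("kafka", 3)

print("start order:", startup)
print("scale-down order:", teardown)
```

Because the ordinal is part of the name, a restarted pod comes back with the same name, which is what lets a database treat it as the same cluster member.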

I have also found that monitoring StatefulSet workloads with tools like Prometheus and Grafana is crucial for understanding their behavior. For example, tracking latency in a Kafka cluster can expose problems in the underlying storage: I have seen cases where high Kafka latency was traced back to a poorly performing StorageClass only because the latency was being monitored. A tool like Kubernetes Dashboard also gives a view of the state of the pods and of the StatefulSet as a whole, which makes issues easier to debug.

StatefulSets use headless Services for stable DNS-based pod addressing. A headless Service (one with clusterIP set to None) creates a DNS record for each pod individually, rather than a single DNS record for the service IP. Applications that need to address specific replicas directly, such as Cassandra's seed nodes or Kafka's brokers, use headless Service DNS for peer discovery.
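As a sketch of what those per-pod records look like, the helper below builds the DNS names that a headless Service gives to StatefulSet pods, following the pattern pod-name.service-name.namespace.svc.cluster-domain. The StatefulSet, Service, and namespace names are hypothetical, and cluster.local is assumed as the cluster domain because it is the usual default.

```python
def pod_dns_names(statefulset_name, service_name, namespace, replicas,
                  cluster_domain="cluster.local"):
    """Return the stable per-pod DNS names behind a headless Service."""
    return [
        f"{statefulset_name}-{ordinal}.{service_name}.{namespace}.svc.{cluster_domain}"
        for ordinal in range(replicas)
    ]


# Hypothetical Cassandra StatefulSet "cassandra" behind a headless
# Service of the same name in namespace "db".
peers = pod_dns_names("cassandra", "cassandra", "db", 3)

# Use the first two pods as seed nodes, joined the way a seed list
# is commonly passed to the application (comma-separated).
seeds = ",".join(peers[:2])
print(seeds)
```

Since the names depend only on the StatefulSet name, Service name, and ordinal, a seed list like this can be written into configuration before any pod exists.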

Another important consideration for StatefulSet workloads is the need for reliable backup and restore procedures. I have seen cases where a StatefulSet workload was not properly backed up, resulting in data loss when the underlying storage failed. Using tools like Velero to automate backup and restore procedures can help mitigate this risk. For example, Velero can be configured to take regular snapshots of PersistentVolumes, which can then be used to restore the workload in case of a failure. This experience has taught me to prioritize backup and restore procedures when deploying StatefulSet workloads.
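A recurring backup of this kind could be described with a Velero Schedule resource. The sketch below builds one as a Python dict; the schedule name, namespace list, cron expression, and retention period are hypothetical, and it assumes Velero is installed in a namespace called velero with a volume snapshot location already configured.

```python
import json


def build_velero_schedule(name, cron, namespaces, ttl="720h0m0s"):
    """Return a Velero Schedule manifest as a plain dict."""
    return {
        "apiVersion": "velero.io/v1",
        "kind": "Schedule",
        # Assumption: Velero runs in the "velero" namespace.
        "metadata": {"name": name, "namespace": "velero"},
        "spec": {
            "schedule": cron,  # standard cron syntax
            "template": {
                "includedNamespaces": namespaces,
                "snapshotVolumes": True,  # also snapshot PersistentVolumes
                "ttl": ttl,  # how long each backup is retained
            },
        },
    }


# Hypothetical nightly backup of the "db" namespace at 02:00.
nightly = build_velero_schedule("db-nightly", "0 2 * * *", ["db"])
print(json.dumps(nightly, indent=2))
```

A schedule only proves that backups are being taken. Restoring one into a scratch namespace from time to time is what shows that they can actually be used.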

Kubernetes volume snapshots provide point-in-time snapshots of PersistentVolumes. On Azure, a snapshot of a disk-backed volume creates an Azure Managed Disk snapshot. Velero, from VMware, provides Kubernetes-native backup and restore and can use volume snapshots to back up stateful workloads. I think production stateful workloads on Kubernetes need a tested backup and restore procedure, because while the Kubernetes control plane can be rebuilt, the data on a PersistentVolume cannot.
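For reference, a snapshot and a restore from it could look like the sketch below, again built as Python dicts. All resource names are hypothetical, including the VolumeSnapshotClass csi-azuredisk-vsc and the StorageClass managed-csi, and the cluster is assumed to have the CSI snapshot controller and a matching snapshot class installed.

```python
import json


def build_volume_snapshot(name, pvc_name, snapshot_class):
    """Return a VolumeSnapshot manifest that snapshots an existing PVC."""
    return {
        "apiVersion": "snapshot.storage.k8s.io/v1",
        "kind": "VolumeSnapshot",
        "metadata": {"name": name},
        "spec": {
            "volumeSnapshotClassName": snapshot_class,
            "source": {"persistentVolumeClaimName": pvc_name},
        },
    }


def build_restore_pvc(name, snapshot_name, storage_class, size):
    """Return a PVC manifest whose volume is pre-populated from a snapshot."""
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": name},
        "spec": {
            "accessModes": ["ReadWriteOnce"],
            "storageClassName": storage_class,
            "resources": {"requests": {"storage": size}},
            "dataSource": {
                "name": snapshot_name,
                "kind": "VolumeSnapshot",
                "apiGroup": "snapshot.storage.k8s.io",
            },
        },
    }


# Hypothetical names throughout.
snapshot = build_volume_snapshot("db-data-snap", "db-data", "csi-azuredisk-vsc")
restored = build_restore_pvc("db-data-restored", "db-data-snap", "managed-csi", "128Gi")
print(json.dumps([snapshot, restored], indent=2))
```

The restored claim requests at least the size of the original volume and points at the snapshot through dataSource, so the new disk comes up already containing the data.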