Secrets Management at Scale

I've seen it time and time again: hundreds of passwords, API keys, certificates, and tokens scattered across a system, each with its own rotation schedule and access requirements. If you get this wrong, your entire infrastructure is at risk.

As your system grows, the problem only gets worse. What works for a small team or a single environment quickly breaks down when you have multiple teams, environments, and CI/CD pipelines to manage.

The baseline for secrets management is now set by tools like HashiCorp Vault, AWS Secrets Manager, and Azure Key Vault. These systems store secrets, control access, rotate them on schedule, and log everything – and if you're not using one, you probably should be.

When we first moved from a handful of static files to a Vault-backed store, the first thing that broke at 2 a.m. was token renewal. Our initial configuration relied on the default 32-day token TTL (root tokens, by contrast, do not expire by default and should not be handed to services); the client libraries kept the token in memory and never renewed it, so once the TTL ran out every service started throwing authentication errors. The fix was to run a sidecar that periodically calls /auth/token/renew and to set a shorter TTL, say 24 hours, with a proactive renewal window of 6 hours. We also had to enable Vault's integrated storage replication across three data centers; the extra hop added about 15 ms of latency but removed a single point of failure that had caused a 12-minute outage during a network partition.
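The renewal logic described above can be sketched roughly as follows. This is a minimal illustration, not the sidecar we actually ran: TokenLease, needs_renewal, renew_if_needed, and run_sidecar are hypothetical names, and renew stands in for whatever function issues the real renewal request to Vault and returns the new TTL in seconds.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class TokenLease:
    """Tracks when the current token expires, in seconds on the caller's clock."""
    expires_at: float


def needs_renewal(lease: TokenLease, now: float, window_s: float) -> bool:
    """True once the token is inside the proactive renewal window."""
    return lease.expires_at - now <= window_s


def renew_if_needed(
    lease: TokenLease,
    renew: Callable[[], float],
    now: float,
    window_s: float = 6 * 3600.0,  # 6-hour window, as in the text
) -> bool:
    """Renew when inside the window; returns True if a renewal happened.

    `renew` is a hypothetical callable that performs the Vault renewal
    request and returns the new TTL in seconds.
    """
    if not needs_renewal(lease, now, window_s):
        return False
    new_ttl = renew()
    lease.expires_at = now + new_ttl
    return True


def run_sidecar(
    lease: TokenLease,
    renew: Callable[[], float],
    poll_s: float = 300.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Poll forever; a failed renewal is retried on the next tick."""
    while True:
        try:
            renew_if_needed(lease, renew, clock())
        except Exception as exc:  # keep the loop alive and surface the error
            print(f"token renewal failed: {exc}")
        sleep(poll_s)
```

Because the window (6 hours) is much larger than the poll interval (5 minutes here, an arbitrary choice), a renewal that fails a few times in a row still has many chances to succeed before the token actually expires.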

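Rotation is a necessity, not a nicety. The sooner you detect a compromised secret and rotate it, the less damage it can cause, and regular rotation limits how long an undetected leak stays useful. Automated rotation is safer than manual rotation because it does not depend on someone remembering to do it, so there's no excuse not to rotate regularly.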
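As a rough sketch of what age-based automated rotation looks like, assuming a simple in-memory record per secret: SecretRecord, is_due, rotate, and rotate_due are hypothetical names, and the 30-day maximum age in the usage below is an arbitrary example rather than a recommendation.

```python
import secrets
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import List


@dataclass
class SecretRecord:
    name: str
    value: str
    rotated_at: datetime


def is_due(record: SecretRecord, max_age: timedelta, now: datetime) -> bool:
    """A secret is due once it is at least `max_age` old."""
    return now - record.rotated_at >= max_age


def rotate(record: SecretRecord, now: datetime) -> SecretRecord:
    """Return a new record with a fresh random value.

    In a real system the new value also has to be pushed to whatever
    verifies it (database, API provider) before the old one is revoked.
    """
    return SecretRecord(record.name, secrets.token_urlsafe(32), now)


def rotate_due(
    records: List[SecretRecord], max_age: timedelta, now: datetime
) -> List[SecretRecord]:
    """Rotate only the records that are due; leave the rest untouched."""
    return [rotate(r, now) if is_due(r, max_age, now) else r for r in records]


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    records = [SecretRecord("db-password", "old-value", now - timedelta(days=45))]
    for r in rotate_due(records, timedelta(days=30), now):
        print(r.name, "rotated at", r.rotated_at.isoformat())
```

Run on a schedule (a cron job or a managed rotation feature in your secrets store), a loop like this removes the human step that makes manual rotation unreliable.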

Different secrets require different handling. API keys from your own infrastructure have different storage and access requirements than customer credentials or service-to-service tokens.

In Kubernetes we tried to rely on native Secrets for everything, but the base64‑encoded blobs are stored in etcd unencrypted by default. The moment we started scaling to 500 pods per service, the load on the API server spiked because each pod was pulling its secret on every restart. We switched to the Vault Agent injector, which writes a short‑lived file into the container and caches the secret for the pod's lifetime. The trade‑off is that you now have to manage the injector's health and make sure the cached file is cleared on pod termination, otherwise you end up with stale credentials lingering for hours.
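On the application side, the pattern of reading an injected file, caching it, and clearing it on termination might look something like the sketch below. FileSecret and clear_on_sigterm are hypothetical helpers, the file path is whatever your injector is configured to render, and the mtime-based refresh is one possible way to avoid serving a stale cached value, not something the injector does for you.

```python
import signal
from pathlib import Path
from typing import Optional


class FileSecret:
    """Reads a secret from a file rendered into the container and caches it.

    The in-memory copy is refreshed whenever the file's modification
    time changes, so a re-rendered file is picked up on the next get().
    """

    def __init__(self, path: str) -> None:
        self._path = Path(path)
        self._mtime_ns: Optional[int] = None
        self._value: Optional[str] = None

    def get(self) -> str:
        mtime_ns = self._path.stat().st_mtime_ns
        if mtime_ns != self._mtime_ns:
            self._value = self._path.read_text().strip()
            self._mtime_ns = mtime_ns
        return self._value

    def clear(self) -> None:
        """Drop the cached value and delete the file, if it still exists."""
        self._value = None
        self._mtime_ns = None
        self._path.unlink(missing_ok=True)


def clear_on_sigterm(secret: FileSecret) -> None:
    """Clear the secret when the process receives SIGTERM.

    Kubernetes sends SIGTERM to containers when a pod is terminated.
    Must be called from the main thread.
    """
    def handler(signum, frame):
        secret.clear()
        raise SystemExit(0)

    signal.signal(signal.SIGTERM, handler)
```

Calling get() on each use is cheap because the file is only re-read when it changes, and wiring clear() to SIGTERM gives the cleanup a chance to run before the container is killed.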

While storing secrets securely is crucial, it's equally important to maintain an audit trail. You need to know who accessed what secret, when, and from where – that history is your investigation tool in case something goes wrong.

Audit logs can quickly become a data dump. In one deployment we were shipping every read and write event to CloudWatch Logs, which grew to 2 GB per day and cost us $300 a month. We introduced a filter that only logs write operations and failed reads, and we ship the filtered stream to an ELK cluster where we index by secret name and requestor identity. This gave us the ability to run a query like “show all failed read attempts for the production DB password in the last 24 hours” and spot anomalous patterns without drowning in noise.
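The filter-then-index idea can be illustrated with a small in-memory sketch. The AuditEvent fields below are hypothetical and do not match the audit log schema of Vault or any other product, and a dictionary stands in for the ELK index.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class AuditEvent:
    timestamp: datetime
    secret_name: str
    requestor: str
    operation: str  # "read" or "write"
    success: bool


def should_keep(event: AuditEvent) -> bool:
    """Keep every write, and only those reads that failed."""
    if event.operation == "write":
        return True
    return event.operation == "read" and not event.success


Index = Dict[Tuple[str, str], List[AuditEvent]]


def build_index(events: Iterable[AuditEvent]) -> Index:
    """Index the filtered events by (secret name, requestor identity)."""
    index: Index = defaultdict(list)
    for event in events:
        if should_keep(event):
            index[(event.secret_name, event.requestor)].append(event)
    return index


def failed_reads(index: Index, secret_name: str, since: datetime) -> List[AuditEvent]:
    """All failed read attempts for one secret at or after `since`."""
    return [
        event
        for (name, _requestor), events in index.items()
        if name == secret_name
        for event in events
        if event.operation == "read" and event.timestamp >= since
    ]


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    events = [
        AuditEvent(now - timedelta(hours=1), "prod/db-password", "svc-a", "read", True),
        AuditEvent(now - timedelta(hours=2), "prod/db-password", "svc-b", "read", False),
    ]
    for e in failed_reads(build_index(events), "prod/db-password", now - timedelta(hours=24)):
        print(e.requestor, e.timestamp.isoformat())
```

Note the cost of this filter: successful reads are dropped before indexing, so they cannot be queried later. If you need them for an investigation, keep the unfiltered stream in cheaper cold storage.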

A secrets management system without audit logging is like a house without an alarm: the question is not only when it will be broken into, but whether you'll notice and how quickly you'll be able to respond.

The consequences of poor secrets management are severe, ranging from data breaches to compliance failures. It's not a problem you can afford to ignore.

In reality, there is no one-size-fits-all solution to secrets management. You need a system that can adapt to different secrets, different teams, and different environments, and building one takes careful planning and execution.