Running IaC in production has shed light on the operational challenges that remain in 2020. State management, the hard part of Infrastructure as Code, is where most of these challenges arise.
The state file is where Terraform maps your configuration to real-world resources, and it is where things go wrong: corruption, concurrent applies, and drift between configuration and reality. To guard against these, use remote state with locking, never edit state files by hand, and run `terraform plan` on a schedule to detect drift.
I've seen teams store the state file in an S3 bucket with a DynamoDB table for lock management. The first time the lock table was mis-provisioned, the apply hung for thirty minutes and blocked the entire pipeline. The fix was to enable TTL on the lock items and to set a reasonable lock timeout (Terraform exposes this as the `-lock-timeout` flag on plan and apply rather than as a backend setting). In practice you also want versioning on the bucket so you can roll back a corrupted state file; we once restored state from a previous version after a power outage caused a partial write, which saved weeks of re-creation work.
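The setup described above can be sketched roughly as follows. This is a minimal illustration, not the configuration from the incident: the bucket name, table name, state key, and region are hypothetical, and the versioning resource assumes AWS provider 4.x or later (earlier provider versions used an inline `versioning` block on the bucket). The state bucket and lock table are usually created by a separate bootstrap configuration, since a backend cannot create the resources it stores itself in.

```hcl
# Bootstrap configuration: creates the state bucket and lock table.
resource "aws_s3_bucket" "state" {
  bucket = "example-terraform-state" # hypothetical name
}

# Versioning lets you restore an earlier copy of a corrupted state file.
resource "aws_s3_bucket_versioning" "state" {
  bucket = aws_s3_bucket.state.id

  versioning_configuration {
    status = "Enabled"
  }
}

# The S3 backend expects a string hash key named exactly "LockID".
resource "aws_dynamodb_table" "locks" {
  name         = "example-terraform-locks" # hypothetical name
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID"

  attribute {
    name = "LockID"
    type = "S"
  }
}
```

```hcl
# Consuming configuration: points Terraform at the remote state and lock table.
terraform {
  backend "s3" {
    bucket         = "example-terraform-state"        # hypothetical name
    key            = "prod/network/terraform.tfstate" # hypothetical key
    region         = "us-east-1"                      # assumption
    dynamodb_table = "example-terraform-locks"
    encrypt        = true
  }
}
```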
Immutable infrastructure, rather than mutable configuration management, has emerged as the consensus in 2020. Building a new AMI or container image for every change and replacing old instances with new ones gives you consistency by construction. Packer for image building and blue-green deployment for rollout are the usual tool and technique behind this approach.
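One way to express replace-rather-than-mutate in Terraform is sketched below. It assumes AWS, an AMI ID produced by an earlier image build and passed in through a hypothetical `ami_id` variable, and a hypothetical `subnet_ids` variable. Note that this is a rolling replacement of an Auto Scaling group, a simpler relative of a full blue-green rollout, not a blue-green implementation itself.

```hcl
variable "ami_id" {
  type        = string
  description = "AMI produced by the image build, for example a Packer run"
}

variable "subnet_ids" {
  type        = list(string)
  description = "Subnets the instances are launched into"
}

# A new AMI ID forces a new launch configuration with a fresh generated name.
resource "aws_launch_configuration" "app" {
  name_prefix   = "app-"
  image_id      = var.ami_id
  instance_type = "t3.micro" # assumption

  lifecycle {
    create_before_destroy = true
  }
}

# The group name embeds the launch configuration name, so a new image
# produces a new group; the old group is destroyed only after the new
# one has been created.
resource "aws_autoscaling_group" "app" {
  name                 = "app-${aws_launch_configuration.app.name}"
  launch_configuration = aws_launch_configuration.app.name
  min_size             = 2
  max_size             = 4
  vpc_zone_identifier  = var.subnet_ids

  lifecycle {
    create_before_destroy = true
  }
}
```

Launch configurations were the common choice in 2020; AWS now recommends launch templates, and the same lifecycle pattern applies to them.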
Terraform code requires the same testing as application code. For integration testing, Terratest and kitchen-terraform deploy real infrastructure, run assertions against it, and tear it down. Static analysis tools such as Checkov and tfsec scan for security misconfigurations without deploying anything.
In our CI we run Terratest suites on every pull request, but we learned that spinning up a full VPC for each run quickly exhausted our account quota. The compromise was to use a shared test VPC and to tag resources with the PR number, then tear them down with a single `terraform destroy` at the end of the job. This added about five minutes to the pipeline but prevented the occasional 429 errors from the provider API.
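The shared-VPC compromise might look something like the sketch below. The variable name `pr_number`, the `Name` tag value used to find the shared VPC, and the security group standing in for the resources under test are all hypothetical.

```hcl
variable "pr_number" {
  type        = string
  description = "Pull request number supplied by CI (hypothetical variable)"
}

# Look up the long-lived shared test VPC instead of creating one per run.
data "aws_vpc" "shared_test" {
  tags = {
    Name = "shared-test" # hypothetical tag value
  }
}

locals {
  # Applied to every resource the test run creates, so leftovers can be
  # traced back to the pull request that made them.
  test_tags = {
    Environment = "ci"
    PullRequest = var.pr_number
  }
}

# Placeholder for the resources a test suite would actually exercise.
resource "aws_security_group" "under_test" {
  name   = "pr-${var.pr_number}-under-test"
  vpc_id = data.aws_vpc.shared_test.id
  tags   = local.test_tags
}
```

Because the VPC is read through a data source and not managed as a resource, `terraform destroy` at the end of the job removes only the per-PR resources and leaves the shared VPC in place.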
Without governance, IaC can lead to infrastructure sprawl: hundreds of development environments that never get cleaned up, redundant modules with slight variations, and configuration drift between supposedly identical environments. Preventing sprawl takes organisational discipline similar to cloud cost management, with tagging standards, automated cost reporting, and regular environment cleanup processes.
Unpinned versions turned out to be a silent source of drift. A module we pulled from the public registry upgraded from Terraform 0.11 to 0.12, and the implicit upgrade changed the naming of a security group, causing a cascade of failures in downstream environments. By adding a `required_version` constraint and using the `~>` operator for module versions, we locked the code to a known good release and avoided surprise breakages during a sprint.
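A minimal sketch of that pinning, assuming a Terraform 0.12 codebase; the module source and version numbers are hypothetical placeholders, not the module from the incident.

```hcl
terraform {
  # Allows 0.12.x patch releases but refuses to run under 0.13 or later.
  required_version = "~> 0.12.0"
}

module "network" {
  source = "example-org/network/aws" # hypothetical registry module

  # Allows 2.x minor and patch releases but never a new major version.
  version = "~> 2.1"
}
```

The `~>` operator lets only the rightmost version component increase, so `~> 0.12.0` accepts 0.12.29 but not 0.13.0, and `~> 2.1` accepts 2.9 but not 3.0.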
The lessons from running IaC in production point to the need for a governance framework that covers state management discipline, immutable infrastructure, testing, and sprawl management. The discipline involved resembles cloud cost management: tagging standards, automated cost reporting, and regular environment cleanup. That framework is what keeps IaC reliable and efficient in production.
IaC adoption in 2020 calls for a shift from initial enthusiasm to a more mature approach. That maturity is a double-edged sword: running IaC at scale brings clearer lessons, but it also exposes the operational challenges that remain, and addressing them requires a deep understanding of the tools and techniques involved.