Cloud Infrastructure Pipelines

I've seen many teams struggle with the idea of applying CI/CD principles to infrastructure code. They think it's a luxury they can't afford, or that it's too complex. But the truth is, it's a necessity for cloud-native operations. Without it, you're leaving your infrastructure open to manual errors, misconfigurations, and drift.

A good infrastructure pipeline should start with linting and validation. This means running tools like terraform validate, tflint, and checkov on every pull request. The same pull request should also run terraform plan against the staging environment and post the plan as a comment for review. Then, on merge to main, apply the changes to staging and run automated smoke tests. Finally, add a manual approval gate before applying the changes to production.
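The pull request stage above can be sketched as a small Python wrapper. This is a minimal sketch, not a definitive implementation: the names PR_CHECKS and run_checks are made up for illustration, the terraform init step is an assumption added because terraform validate needs an initialised working directory, and the three tools are assumed to be on the PATH of the CI runner.

```python
import subprocess
from typing import Callable, List, Sequence

# Commands run on every pull request, in order. Each one is executed from
# inside the Terraform working directory.
PR_CHECKS: List[List[str]] = [
    # Assumption: initialise providers and modules without touching the
    # remote backend, so that validate has what it needs.
    ["terraform", "init", "-backend=false", "-input=false"],
    ["terraform", "validate"],
    ["tflint"],
    ["checkov", "-d", "."],
]


def run_checks(
    workdir: str,
    commands: Sequence[List[str]] = PR_CHECKS,
    runner: Callable = subprocess.run,
) -> List[str]:
    """Run every check in workdir and return the commands that failed.

    All checks run even if an earlier one fails, so the pull request
    gets the full list of problems in a single pass.
    """
    failed = []
    for cmd in commands:
        result = runner(cmd, cwd=workdir)
        if result.returncode != 0:
            failed.append(" ".join(cmd))
    return failed
```

A CI job would call run_checks("infra") (the directory name is hypothetical) and fail the build if the returned list is not empty. The runner parameter exists only so the logic can be exercised without the real tools installed.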

For instance, at one of my previous companies, we had a Terraform module for deploying a Kubernetes cluster. The module was used across multiple environments, and we used Terratest to validate the module's behavior. We also used environment-specific variable files to customize the deployment for each environment, which made our infrastructure code easier to manage and maintain.

Environment promotion is another critical aspect of infrastructure CI/CD. Just like application code, infrastructure changes should flow through environments in a fixed order: dev, staging, production. The same Terraform module should be deployed to each environment, with only the environment-specific variable file changing between them. And promotion from one environment to the next should be controlled by the pipeline, not by someone running commands from a laptop.
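To make the "same module, different variable file" idea concrete, here is a rough sketch of how a pipeline step might build the plan command for each environment. The envs/ directory layout, the file naming, and the function names are assumptions for illustration; the -var-file and -out flags are standard terraform plan options.

```python
from typing import List, Optional

# Promotion order described above; a change never skips an environment.
PROMOTION_ORDER = ["dev", "staging", "production"]


def plan_command(environment: str, var_dir: str = "envs") -> List[str]:
    """Build the terraform plan command for one environment.

    Assumes one variable file per environment, for example
    envs/staging.tfvars (a hypothetical layout).
    """
    if environment not in PROMOTION_ORDER:
        raise ValueError(f"unknown environment: {environment}")
    return [
        "terraform",
        "plan",
        "-input=false",
        f"-var-file={var_dir}/{environment}.tfvars",
        f"-out={environment}.tfplan",
    ]


def next_environment(environment: str) -> Optional[str]:
    """Return the environment a change is promoted to next, if any."""
    index = PROMOTION_ORDER.index(environment)
    if index + 1 < len(PROMOTION_ORDER):
        return PROMOTION_ORDER[index + 1]
    return None
```

Because the module code is identical everywhere, the only thing that differs between the dev and production plans is the variable file, which keeps the diff between environments small and reviewable.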

I've seen teams use tools like Jenkins or GitLab CI/CD to manage their infrastructure pipelines, with mixed results. One of the biggest challenges is managing the trade-offs between pipeline complexity and maintainability. For example, a complex pipeline with many stages and gates can be difficult to debug and maintain, but a simple pipeline may not provide enough validation and testing. In my experience, a good rule of thumb is to aim for a pipeline with 5-7 stages, including linting, validation, staging deployment, smoke testing, and production deployment.

Promotion to production works the same way: a successful staging deployment with passing automated smoke tests should queue the production deployment, which then waits for manual approval. This ensures that only reviewed and tested changes make it to production. We used to see around 20-30 deployments per week to our staging environment, and about 5-10 deployments per week to production. This allowed us to move quickly and respond to changing business needs, while still maintaining a high level of quality and reliability.
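The gate itself is simple enough to write down. The following is only a sketch of the decision logic, with made-up names (StagingResult, production_decision); in a real pipeline the CI platform's own approval feature would normally supply the approver.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class StagingResult:
    """Outcome of the staging stage for one change."""
    apply_succeeded: bool
    smoke_tests_passed: bool


def production_decision(staging: StagingResult, approved_by: Optional[str]) -> str:
    """Decide what the pipeline should do about the production apply."""
    if not staging.apply_succeeded:
        return "blocked: staging apply failed"
    if not staging.smoke_tests_passed:
        return "blocked: smoke tests failed"
    if not approved_by:
        # Staging is green, but a human still has to sign off.
        return "waiting: manual approval required"
    return f"apply: approved by {approved_by}"
```

The ordering matters: approval is only requested once staging is known to be healthy, so reviewers are never asked to approve a change that has already failed.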

But what about drift detection? Infrastructure drift occurs when manual changes are made to cloud resources outside the IaC pipeline. To detect this, you should run terraform plan against production environments on a schedule and alert if the plan shows changes. You can also use AWS CloudTrail or the Azure Activity Log to track manual API changes to cloud resources. In one case, we detected drift in our production environment due to a manual change made by an operator, and we were able to reconcile it by updating our Terraform configuration and state to match reality.
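A scheduled drift check can lean on terraform plan's -detailed-exitcode flag, which returns 0 when there is nothing to change, 1 on error, and 2 when the plan contains changes. The sketch below assumes the job runs from the commit that was last applied to production, so any pending change is drift rather than unreleased code; the function and enum names are illustrative.

```python
import subprocess
from enum import Enum
from typing import Callable


class PlanOutcome(Enum):
    IN_SYNC = 0  # plan succeeded, no changes
    ERROR = 1    # plan itself failed
    DRIFT = 2    # plan succeeded, changes are pending


def classify(exit_code: int) -> PlanOutcome:
    """Map a terraform plan -detailed-exitcode status to an outcome."""
    try:
        return PlanOutcome(exit_code)
    except ValueError:
        # Anything unexpected (for example, killed by a signal) is an error.
        return PlanOutcome.ERROR


def check_drift(
    workdir: str,
    var_file: str,
    runner: Callable = subprocess.run,
) -> PlanOutcome:
    """Run a read-only plan against an environment and classify the result."""
    cmd = [
        "terraform",
        "plan",
        "-detailed-exitcode",
        "-input=false",
        f"-var-file={var_file}",
    ]
    result = runner(cmd, cwd=workdir)
    return classify(result.returncode)
```

A scheduler would call check_drift on a timer and page or post to chat when the result is DRIFT. Treating ERROR separately is worth the extra branch, since a broken plan otherwise looks the same as a clean one.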

Reconciling drift requires one of two things: either codify the manual change so that the Terraform configuration and state match reality, or revert it by re-applying the existing configuration. This is where a well-designed pipeline earns its keep, because it catches drift before it becomes a problem and keeps your infrastructure consistent with the code. For example, we used to run terraform plan against our production environment every two hours and alert the team if any changes were detected. This allowed us to respond quickly to drift and prevent it from becoming a major issue.
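An alert is more useful when it names the resources involved. One way to get that, sketched here under the assumption that the scheduled job saves its plan and converts it with terraform show -json, is to read the resource_changes list from the JSON plan. The function names and the message format are illustrative only.

```python
import json
from typing import Dict, List

# Actions that do not modify real infrastructure.
IGNORED_ACTIONS = {"no-op", "read"}


def changed_resources(plan_json: str) -> Dict[str, List[str]]:
    """Return {resource address: actions} for resources the plan would change.

    Expects the output of `terraform show -json <planfile>`.
    """
    plan = json.loads(plan_json)
    changes: Dict[str, List[str]] = {}
    for resource in plan.get("resource_changes", []):
        actions = resource.get("change", {}).get("actions", [])
        meaningful = [a for a in actions if a not in IGNORED_ACTIONS]
        if meaningful:
            changes[resource["address"]] = meaningful
    return changes


def alert_message(changes: Dict[str, List[str]]) -> str:
    """Format a short, human-readable drift alert."""
    if not changes:
        return "No drift detected."
    lines = [f"Drift detected in {len(changes)} resource(s):"]
    for address, actions in sorted(changes.items()):
        lines.append(f"  {address}: {'/'.join(actions)}")
    return "\n".join(lines)
```

With the affected addresses in the alert, whoever picks it up can decide quickly between the two options above: codify the change or revert it.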

One final consideration is secrets in the pipeline. CI/CD pipelines for infrastructure need cloud credentials to deploy, and those credentials are a security risk if not managed properly. The pattern I've seen work well is to use federated credentials (OIDC) instead of long-lived service principal secrets or access keys. With this approach, the CI/CD platform presents a short-lived, signed identity token, and Azure or AWS exchanges it for temporary credentials, so there is no stored secret that can leak from the pipeline configuration.

In practice, this means using something like GitHub Actions with Azure workload identity federation, or GitLab with AWS OIDC. Both give the pipeline a way to authenticate to cloud resources without storing sensitive credentials. Combined with the validation, promotion, and drift detection steps above, this keeps the pipeline secure as well as reliable.
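As a rough illustration of the GitLab with AWS case, the sketch below hands a job's OIDC token to the AWS SDK's web identity mechanism through the AWS_ROLE_ARN, AWS_WEB_IDENTITY_TOKEN_FILE, and AWS_ROLE_SESSION_NAME environment variables, which the AWS SDKs (and therefore the Terraform AWS provider) read. It assumes the job declares an ID token with GitLab's id_tokens keyword and exposes it as AWS_ID_TOKEN; that variable name, the function name, and the session name are choices made for this example.

```python
import os
from pathlib import Path
from typing import Dict, Mapping


def web_identity_env(
    role_arn: str,
    token_path: str,
    environ: Mapping[str, str] = os.environ,
    token_var: str = "AWS_ID_TOKEN",  # assumed name, set via id_tokens
    session_name: str = "infra-pipeline",
) -> Dict[str, str]:
    """Write the job's OIDC token to a file and return the AWS settings.

    No long-lived key is involved: the SDK exchanges the token for
    temporary credentials by assuming role_arn.
    """
    token = environ.get(token_var)
    if not token:
        raise RuntimeError(f"{token_var} is not set; is id_tokens configured?")
    path = Path(token_path)
    path.write_text(token)
    path.chmod(0o600)  # readable by the job user only
    return {
        "AWS_ROLE_ARN": role_arn,
        "AWS_WEB_IDENTITY_TOKEN_FILE": str(path),
        "AWS_ROLE_SESSION_NAME": session_name,
    }
```

The returned mapping would be merged into the environment of the terraform process. The IAM role's trust policy still has to be restricted to the right project and branch, which is where most of the real security of this setup comes from.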