Cloud Cost Management on Azure and AWS

I've seen cloud cost blowouts in projects that scale quickly, but they aren't inevitable. The organisations that manage cloud costs effectively treat cost with the same consistent engineering discipline they apply to performance and reliability.

Reserved instances and savings plans are the biggest levers for reducing cloud compute costs. By committing to a specific VM SKU or an hourly spend level for a one-year or three-year term, you can typically get 40-70% off on-demand pricing. To make this work, you need to forecast workloads accurately, right-size before committing, and have a process for converting on-demand usage to commitments as new workloads stabilise.
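
To make the forecasting point concrete, here is a minimal sketch of the break-even arithmetic behind a commitment. It assumes a simple model in which a commitment is billed for every hour of the month at a discounted rate, whether or not the instance runs. The function names, the 730-hour month and the rates in the example are my own illustrative assumptions, not provider pricing.

```python
HOURS_PER_MONTH = 730  # common approximation for an average month


def monthly_cost_on_demand(rate: float, hours_used: float) -> float:
    """On-demand is billed only for the hours the instance actually runs."""
    return rate * hours_used


def monthly_cost_committed(rate: float, discount: float) -> float:
    """A commitment is billed for the whole month at the discounted rate."""
    return rate * (1 - discount) * HOURS_PER_MONTH


def break_even_hours(discount: float) -> float:
    """Hours per month above which the commitment is cheaper than on-demand."""
    if not 0 <= discount < 1:
        raise ValueError("discount must be in [0, 1)")
    return (1 - discount) * HOURS_PER_MONTH


def should_commit(hours_used: float, discount: float) -> bool:
    """True when forecast usage exceeds the break-even point."""
    return hours_used > break_even_hours(discount)


if __name__ == "__main__":
    # Hypothetical: a 40% discount and a workload forecast to run 300 h/month.
    print(break_even_hours(0.40))
    print(should_commit(300, 0.40))
```

The takeaway is that the discount only pays off above a utilisation threshold of (1 - discount), which is why right-sizing and a stable forecast need to come before the commitment rather than after it.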

For example, I worked on a project where we used AWS Cost Explorer to monitor our usage and identify candidates for reserved instance conversion. We converted around 60% of our on-demand usage to reserved instances, resulting in a 55% decrease in our monthly cloud costs. We also used Amazon CloudWatch to monitor instance utilisation and adjusted our reserved instance commitments accordingly. This approach let us maintain a high level of availability while minimising our cloud costs.

Another key aspect of cloud cost management is monitoring and optimising storage costs. Storage can be a significant portion of overall cloud spend, especially for data-intensive workloads. Tools like Storage insights in Azure Monitor and Amazon S3 Storage Lens can help identify areas for cost optimisation, such as deleting unused blobs or transitioning data to lower-cost storage tiers. In one case, I saw a project reduce its storage costs by 30% simply by implementing a data lifecycle management policy that automatically transitioned old data to archive storage.
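
A lifecycle policy of that kind boils down to an age-based rule, which can be sketched in a few lines. The tier names, age thresholds and per-GB prices below are hypothetical placeholders, and the model ignores the retrieval fees and minimum storage durations that archive tiers carry, so treat it as a rough estimator rather than a pricing calculator.

```python
from datetime import date

# Hypothetical policy: (minimum age in days, tier name, price per GB-month).
# Ordered from oldest threshold to newest so the first match wins.
TIERS = [
    (365, "archive", 0.002),
    (90, "cool", 0.010),
    (0, "hot", 0.020),
]


def tier_for(last_modified: date, today: date) -> tuple[str, float]:
    """Return the (tier name, price) an object of this age would land in."""
    age_days = (today - last_modified).days
    for min_age, name, price in TIERS:
        if age_days >= min_age:
            return name, price
    # Objects dated in the future fall through; keep them in the hot tier.
    return TIERS[-1][1], TIERS[-1][2]


def monthly_cost(objects: list[tuple[float, date]], today: date,
                 tiered: bool = True) -> float:
    """Estimate monthly storage cost for (size_gb, last_modified) pairs."""
    hot_price = TIERS[-1][2]
    total = 0.0
    for size_gb, last_modified in objects:
        price = tier_for(last_modified, today)[1] if tiered else hot_price
        total += size_gb * price
    return total


if __name__ == "__main__":
    today = date(2024, 1, 1)
    inventory = [(500.0, date(2022, 1, 1)), (200.0, date(2023, 6, 1)),
                 (100.0, date(2023, 12, 1))]
    print(monthly_cost(inventory, today, tiered=False))
    print(monthly_cost(inventory, today, tiered=True))
```

Running the two estimates side by side against a real storage inventory is a quick way to decide whether a lifecycle policy is worth the effort before writing one.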

Cloud waste surveys consistently find that 25-35% of cloud spend goes to unused or over-provisioned resources. To right-size, use cloud provider recommendations (AWS Compute Optimizer, Azure Advisor) to find over-provisioned VMs, clean up unused resources (unattached disks, idle load balancers, old snapshots), and size database instances on actual utilisation rather than theoretical peak capacity.
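
The provider tools do this analysis for you, but the underlying logic is simple enough to sketch. The example below assumes you have exported a 95th-percentile CPU figure per VM. The size ladder, the 60% target and the VM names are hypothetical, and a real review would also look at memory, disk and network before resizing anything.

```python
from dataclasses import dataclass

SIZES = (2, 4, 8, 16, 32, 64)  # hypothetical vCPU sizes within one VM family


@dataclass
class VmUsage:
    name: str
    vcpus: int
    p95_cpu_percent: float  # 95th-percentile CPU over the review window


def recommend_vcpus(vm: VmUsage, target_percent: float = 60.0) -> int:
    """Smallest size that would keep p95 CPU at or below the target."""
    needed = vm.vcpus * vm.p95_cpu_percent / target_percent
    for size in SIZES:
        if size >= needed:
            return size
    return vm.vcpus  # nothing on the ladder fits; leave the VM as it is


def over_provisioned(vms: list[VmUsage],
                     target_percent: float = 60.0) -> list[tuple[str, int, int]]:
    """Return (name, current vCPUs, recommended vCPUs) for VMs that can shrink."""
    findings = []
    for vm in vms:
        recommended = recommend_vcpus(vm, target_percent)
        if recommended < vm.vcpus:
            findings.append((vm.name, vm.vcpus, recommended))
    return findings


if __name__ == "__main__":
    fleet = [VmUsage("api-1", 16, 12.0), VmUsage("db-1", 8, 55.0)]
    for name, current, recommended in over_provisioned(fleet):
        print(f"{name}: {current} -> {recommended} vCPUs")
```

Sizing against a high percentile rather than the average is the design choice that matters here: it keeps headroom for real peaks while still rejecting capacity that exists only for a theoretical one.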

Additionally, implementing a tagging strategy helps with cost allocation and tracking. A consistent set of tags across all resources enables cost allocation by team, department, or project, making it easier to identify areas for cost optimisation. I've seen teams use AWS Tag Editor, or Azure Policy on the Azure side, to streamline the tagging process and enforce consistency across all resources.
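
As a rough illustration of what a tagging strategy buys you, the sketch below checks resources against a required tag set and rolls costs up by a tag key. The required tags, the line-item shape and the "untagged" bucket are assumptions of mine; real billing exports have their own schemas.

```python
from collections import defaultdict

REQUIRED_TAGS = {"team", "project", "environment"}  # hypothetical tag policy


def missing_tags(resource_tags: dict[str, str]) -> set[str]:
    """Required tag keys that are absent or have an empty value."""
    present = {key for key, value in resource_tags.items() if value.strip()}
    return REQUIRED_TAGS - present


def cost_by_tag(line_items: list[dict], tag_key: str) -> dict[str, float]:
    """Sum cost per tag value; resources without the tag go to 'untagged'."""
    totals: dict[str, float] = defaultdict(float)
    for item in line_items:
        owner = item["tags"].get(tag_key) or "untagged"
        totals[owner] += item["cost"]
    return dict(totals)


if __name__ == "__main__":
    items = [
        {"cost": 100.0, "tags": {"team": "payments", "project": "checkout"}},
        {"cost": 50.0, "tags": {"team": "search"}},
        {"cost": 25.0, "tags": {}},
    ]
    print(cost_by_tag(items, "team"))
    print(missing_tags(items[1]["tags"]))
```

The size of the "untagged" bucket is a useful metric in its own right: until it is small, any per-team cost figure is an underestimate.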

Azure Spot VMs and AWS Spot Instances offer significant discounts (50-90% off on-demand) for workloads that can tolerate interruption. I've seen success running stateless, fault-tolerant workloads (batch jobs, CI/CD agents, development clusters) on spot capacity, with fallback to on-demand or reserved instances when spot capacity is reclaimed. Kubernetes clusters with mixed node pools (spot nodes for user workloads, reserved nodes for system components) automate much of this, because pods evicted from a reclaimed spot node are rescheduled onto the remaining capacity.
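
Whether spot with fallback is worth it depends on how often capacity is reclaimed and how much work gets redone after an interruption. Here is a back-of-the-envelope model, with every parameter (availability, rework fraction, rates) an assumption you would need to measure for your own workload.

```python
def blended_hourly_cost(on_demand_rate: float, spot_discount: float,
                        spot_availability: float,
                        rework_fraction: float = 0.0) -> float:
    """Expected hourly cost of running on spot with on-demand fallback.

    spot_availability -- fraction of hours the workload holds spot capacity
    rework_fraction   -- extra spot hours spent redoing interrupted work
    """
    if not 0 <= spot_availability <= 1:
        raise ValueError("spot_availability must be in [0, 1]")
    if not 0 <= spot_discount < 1:
        raise ValueError("spot_discount must be in [0, 1)")
    spot_rate = on_demand_rate * (1 - spot_discount)
    spot_part = spot_availability * spot_rate * (1 + rework_fraction)
    fallback_part = (1 - spot_availability) * on_demand_rate
    return spot_part + fallback_part


def saving_vs_on_demand(on_demand_rate: float, spot_discount: float,
                        spot_availability: float,
                        rework_fraction: float = 0.0) -> float:
    """Fractional saving compared with running everything on-demand."""
    blended = blended_hourly_cost(on_demand_rate, spot_discount,
                                  spot_availability, rework_fraction)
    return 1 - blended / on_demand_rate


if __name__ == "__main__":
    # Hypothetical: 70% spot discount, spot held 90% of the time, 5% rework.
    print(round(saving_vs_on_demand(1.0, 0.70, 0.90, 0.05), 4))
```

The model makes the trade-off visible: the headline discount is diluted both by the hours spent on fallback capacity and by rework, which is why spot suits workloads that checkpoint or restart cheaply.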

Cost optimisation can't be outsourced to finance. The engineering team that owns a workload should own its cost. FinOps is the cultural and organisational practice of bringing financial accountability to cloud spending: cost allocation by team via tags, per-service cost metrics in team dashboards, engineers participating in monthly cost reviews. The FinOps Foundation was established in 2019 and saw significant adoption through 2021 as cloud bills scaled.