Pandemic Cloud Rush

I've been thinking about the COVID-19 pandemic's impact on cloud adoption, and it's clear that 2020 and 2021 were a period of rapid growth. Two years later, the architectures built in that rush are running in production, and it's interesting to see which patterns have held up.

Three things accelerated most during this period: remote collaboration infrastructure, the migration of on-premises applications to SaaS, and video conferencing capacity. These were projects that organisations had planned for years but executed in weeks out of necessity, which suggests that much of the earlier slowness in digital transformation was organisational rather than technical.

Building applications and infrastructure at that speed often meant taking shortcuts: hardcoded configuration, insufficient error handling, minimal observability, and security design deferred until later. Post-pandemic engineering work has involved refactoring these rapid deployments into maintainable architectures, adding the monitoring that was skipped, and closing security gaps.
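To make the hardcoded-configuration point concrete, here's a minimal sketch of the kind of refactor I mean: settings read from the environment, with the service refusing to start if something required is missing. The variable names (`APP_API_BASE_URL` and friends) and the `Settings` fields are hypothetical and only there for illustration, not taken from any particular project.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    # Hypothetical settings for a small service; names are illustrative only.
    api_base_url: str
    request_timeout_seconds: float
    max_participants: int


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Build Settings from environment variables, failing fast on bad input."""
    env = os.environ if env is None else env

    # Required value: refuse to start rather than fall back to a hardcoded URL.
    try:
        api_base_url = env["APP_API_BASE_URL"]
    except KeyError:
        raise RuntimeError("APP_API_BASE_URL must be set") from None

    # Optional values: the defaults are explicit and live in one place.
    try:
        timeout = float(env.get("APP_REQUEST_TIMEOUT_SECONDS", "5"))
        max_participants = int(env.get("APP_MAX_PARTICIPANTS", "100"))
    except ValueError as exc:
        raise RuntimeError(f"invalid numeric setting: {exc}") from exc

    if timeout <= 0 or max_participants <= 0:
        raise RuntimeError("timeout and participant limit must be positive")

    return Settings(api_base_url, timeout, max_participants)
```

Passing the environment in as a mapping keeps the function easy to test, and failing at startup tends to be cheaper than discovering a bad value under load.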

For example, I recall working with a client who had quickly deployed a video conferencing solution using Amazon Chime and Zoom, but had not set up monitoring with a tool such as New Relic or Datadog. As a result, they experienced significant performance issues when usage spiked, and had to scramble to add monitoring and alerting to prevent downtime. This experience is not unique: many organisations have had to go back and add proper monitoring and observability to their rapid deployments.

Serverless platforms like AWS Lambda and Azure Functions have also required organisations to rethink their approach to error handling and debugging. With serverless architectures, it's not always easy to reproduce errors or debug issues locally, so organisations have had to invest in tools like AWS X-Ray or Azure Monitor to get visibility into their serverless applications. This has been a significant change for many engineering teams, who are used to having more control over their infrastructure.
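Tracing tools aside, a lot of the debugging pain goes away once every failure leaves behind one searchable log line. Here's a rough sketch of that idea as a decorator around a handler with the usual `(event, context)` signature. It assumes the context object may expose `aws_request_id`, as Lambda's Python context does; the `join_meeting` handler and the log field names are hypothetical.

```python
import functools
import json
import logging
import traceback

logger = logging.getLogger("handler")


def logged_handler(func):
    """Wrap a handler(event, context) so each failure emits one JSON log line."""

    @functools.wraps(func)
    def wrapper(event, context):
        # Lambda's context exposes aws_request_id; fall back for other runtimes.
        request_id = getattr(context, "aws_request_id", "unknown")
        try:
            return func(event, context)
        except Exception as exc:
            logger.error(json.dumps({
                "request_id": request_id,
                "handler": func.__name__,
                "error_type": type(exc).__name__,
                "error": str(exc),
                # Log only the key names, not the values, to avoid leaking payloads.
                "event_keys": sorted(map(str, event)) if isinstance(event, dict) else None,
                "traceback": traceback.format_exc(),
            }))
            raise  # re-raise so the platform still records the invocation as failed

    return wrapper


@logged_handler
def join_meeting(event, context):
    # Hypothetical business logic for illustration.
    return {"status": "joined", "meeting_id": event["meeting_id"]}
```

Re-raising matters here: swallowing the exception would make the invocation look successful and hide the failure from whatever retry or alerting the platform provides.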

The permanent shift to remote or hybrid work has changed the tooling that engineering teams use. Many development environments have moved from local machines to cloud development environments like GitHub Codespaces and Gitpod. Documentation practices have improved because in-person knowledge transfer can no longer be relied on, and code review has become more thorough now that it's the primary communication channel between distributed team members.

Some organisations have also started to use tools like Terraform or CloudFormation to manage their cloud infrastructure, which has helped to improve consistency and reduce errors. However, this has also introduced new challenges, such as managing state and dependencies between different cloud services. I've seen organisations struggle with this, particularly when they have complex architectures that involve multiple cloud providers.

The trade-off between using cloud-based services and managing infrastructure in-house is a classic one, and the pandemic has forced many organisations to re-evaluate their approach. On the one hand, cloud-based services offer scalability and convenience, but on the other hand, they can be expensive and inflexible. I've seen organisations save significant amounts of money by moving from always-on instances to auto-scaling, but this requires careful planning and monitoring to ensure that the scaling is done correctly.
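The always-on versus auto-scaling saving is easy to estimate on the back of an envelope. Here's a small sketch of that arithmetic; the demand profile, per-instance capacity, and hourly price are all made-up numbers, and it ignores scaling lag and minimum billing periods, so treat it as an illustration rather than a pricing model.

```python
import math


def always_on_cost(peak_instances: int, hourly_rate: float, hours: int) -> float:
    """Cost of running enough instances for peak load at all times."""
    return peak_instances * hourly_rate * hours


def autoscaled_cost(hourly_demand, capacity_per_instance: float,
                    hourly_rate: float, min_instances: int = 1) -> float:
    """Cost when the fleet is resized each hour to fit that hour's demand."""
    total = 0.0
    for demand in hourly_demand:
        needed = max(min_instances, math.ceil(demand / capacity_per_instance))
        total += needed * hourly_rate
    return total


# Hypothetical day: 16 quiet hours at 100 req/s, 8 busy hours at 900 req/s.
demand = [100] * 16 + [900] * 8
capacity = 100.0   # assumed req/s one instance can serve
rate = 0.10        # assumed price per instance-hour

peak = math.ceil(max(demand) / capacity)
fixed = always_on_cost(peak, rate, len(demand))
scaled = autoscaled_cost(demand, capacity, rate)
print(f"always-on: {fixed:.2f}, auto-scaled: {scaled:.2f}")
```

With these assumed numbers, the always-on fleet of 9 instances costs 21.60 for the day against 8.80 when scaled hourly. The gap depends entirely on how spiky the demand is, which is why the monitoring has to come first.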

The cloud capacity provisioned for pandemic demand was sized for peak load, so as usage patterns normalised, organisations were left with cloud bills that still reflected pandemic-era capacity.

This has made post-pandemic cloud cost optimisation projects a significant line item for CIOs. The engineering work includes right-sizing compute instances, removing unused resources, moving from always-on to auto-scaling, and purchasing reserved instances for predictable baseline workloads.
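Right-sizing usually starts with a simple question: which instances never get busy? Here's a minimal sketch of that first pass over utilisation data. The thresholds (20% average, 50% peak) and the instance names are assumptions for illustration; in practice the samples would come from whatever monitoring is in place, and memory and I/O matter as much as CPU.

```python
from statistics import mean


def rightsizing_candidates(cpu_samples, avg_threshold=20.0, peak_threshold=50.0):
    """Return names of instances whose average and peak CPU both stay low.

    cpu_samples maps an instance name to a list of CPU utilisation percentages.
    """
    candidates = []
    for name, samples in cpu_samples.items():
        if not samples:
            continue  # no data: do not guess
        if mean(samples) < avg_threshold and max(samples) < peak_threshold:
            candidates.append(name)
    return sorted(candidates)


# Hypothetical utilisation data for illustration only.
samples = {
    "web-1": [12.0, 9.5, 15.0, 11.0],
    "web-2": [55.0, 70.0, 62.0, 48.0],
    "batch-1": [5.0, 4.0, 60.0, 6.0],   # low average, but a real peak
}
print(rightsizing_candidates(samples))  # → ['web-1']
```

Checking the peak as well as the average is the important design choice: `batch-1` looks idle on average but would struggle on a smaller instance when its job runs.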

For instance, one organisation I worked with reduced its cloud costs by 30% by right-sizing its compute instances and implementing auto-scaling. This required significant engineering effort, but the cost savings were well worth it. Another organisation reduced its costs by 25% by purchasing reserved instances for its baseline workloads.
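The reserved-instance decision comes down to how much of the fleet is genuinely baseline. As a rough sketch, you can try every reservation level against an hourly usage profile and pick the cheapest. The usage numbers and both rates below are hypothetical, and real reservation pricing has terms and upfront options this ignores.

```python
def cost_with_reservations(hourly_instances, reserved, on_demand_rate, reserved_rate):
    """Total cost when `reserved` instances are paid for every hour, used or not."""
    total = 0.0
    for needed in hourly_instances:
        overflow = max(0, needed - reserved)  # instances beyond the reservation
        total += reserved * reserved_rate + overflow * on_demand_rate
    return total


def best_reservation(hourly_instances, on_demand_rate, reserved_rate):
    """Try every reservation level from 0 to peak and return the cheapest."""
    peak = max(hourly_instances)
    return min(
        range(peak + 1),
        key=lambda r: cost_with_reservations(
            hourly_instances, r, on_demand_rate, reserved_rate
        ),
    )


# Hypothetical day: 2 instances for 16 hours, 9 instances for 8 hours.
usage = [2] * 16 + [9] * 8
best = best_reservation(usage, on_demand_rate=0.10, reserved_rate=0.06)
print(best)  # → 2
```

With these assumed rates, reserving the 2 baseline instances is cheapest: a reservation that would only be used during the 8 busy hours costs more over the day than paying on demand for those hours.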

I've seen organisations tackle these challenges in different ways, but the common thread is a focus on optimising cloud costs while maintaining the agility and flexibility that the pandemic forced them to adopt. It's been a challenging but beneficial process for many organisations.

The pandemic may be over, but its impact on cloud architecture will be felt for years to come. The changes it forced have become the new normal for cloud adoption and engineering practices, and organisations are still working through the consequences.