AWS, Azure, and Google Cloud all reported strong Q1 2024 earnings driven by AI workloads. Beneath the revenue lines is a genuine infrastructure build-out that is reshaping what data centres look like.

The GPU capacity crunch

Every major hyperscaler spent 2023 and early 2024 in a GPU capacity crunch, with Nvidia's H100 backlog stretching to 6-12 months for some customers. AWS, Azure, and Google absorbed the majority of production capacity to stock their own data centres. That is why each built customer-facing GPU cluster products, such as AWS EC2 Capacity Blocks, Azure's ND-series VMs, and Google Cloud's A3 instances: to give developers access to hardware they could not buy themselves.

Custom silicon is the long game

All three clouds have custom AI accelerators in production. Google's TPUs have been in use since 2016 and power Gemini inference at scale. AWS Trainium and Inferentia are designed for training and inference respectively, with pricing structured to undercut Nvidia-based instances for sustained workloads. Azure's Maia 100 chip and its Cobalt ARM CPU are now in production. Custom silicon gives hyperscalers control over their cost per token, which is the metric that determines AI product economics.
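Cost per token falls directly out of two numbers: what the instance costs per hour and how many tokens it sustains per second. A minimal sketch of that arithmetic, where every figure is a hypothetical placeholder rather than published pricing or a measured benchmark:

```python
# Back-of-envelope cost-per-token model. All numbers below are
# illustrative assumptions, not real cloud pricing or throughput.

def cost_per_million_tokens(hourly_rate_usd, tokens_per_second, utilization=1.0):
    """Cost in USD to serve one million tokens on an hourly-billed instance."""
    effective_tps = tokens_per_second * utilization
    tokens_per_hour = effective_tps * 3600
    return hourly_rate_usd / tokens_per_hour * 1_000_000

# Hypothetical: a $40/hr 8-GPU instance sustaining 8,000 tokens/s at 70% utilization.
print(round(cost_per_million_tokens(40.0, 8000, 0.7), 2))
```

The utilization term is where custom silicon earns its keep: a cheaper chip that sits idle can cost more per token than an expensive one kept busy, which is why hyperscalers optimise the whole serving stack, not just the hardware price.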

Networking is the actual bottleneck

Training large models requires moving hundreds of terabytes between GPUs thousands of times per run, so the networking fabric connecting them matters as much as the GPUs themselves. Google's TPU pods use custom high-bandwidth inter-chip interconnects. AWS uses the Elastic Fabric Adapter and custom network switching. Azure's Eagle supercomputer and InfiniBand clusters are purpose-built for distributed training. This is not commodity data centre networking; it is a custom infrastructure layer that took years to build and is very hard to replicate.
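To see why the fabric dominates, consider the gradient synchronisation that data-parallel training performs every step. A rough lower-bound timing model for a ring all-reduce, with all figures (parameter count, GPU count, link speeds) chosen as illustrative assumptions rather than measurements of any provider's network:

```python
# Rough ring all-reduce timing: how long one full gradient sync takes
# as a function of per-GPU link bandwidth. Numbers are illustrative only.

def allreduce_seconds(param_count, bytes_per_param, n_gpus, link_gbytes_per_s):
    """Bandwidth-bound lower limit for one ring all-reduce of the gradient."""
    grad_bytes = param_count * bytes_per_param
    # In a ring all-reduce each GPU sends and receives 2*(N-1)/N of the buffer.
    traffic_per_gpu = 2 * (n_gpus - 1) / n_gpus * grad_bytes
    return traffic_per_gpu / (link_gbytes_per_s * 1e9)

# Hypothetical 70B-parameter model, fp16 gradients, 1,024 GPUs:
slow = allreduce_seconds(70e9, 2, 1024, 25)   # ~25 GB/s, commodity-class links
fast = allreduce_seconds(70e9, 2, 1024, 400)  # ~400 GB/s, purpose-built fabric
print(f"{slow:.1f}s vs {fast:.2f}s per gradient sync")
```

Under these assumptions the slower fabric spends over ten seconds per synchronisation where the faster one spends a fraction of a second, and that gap is paid thousands of times per training run. That is the arithmetic behind treating the interconnect as the bottleneck.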

What this means for cloud strategy

The infrastructure gap between hyperscalers and on-premises data centres is widening. The cost and complexity of building the GPU clusters, custom networking, and custom silicon that competitive AI workloads require are beyond most enterprises. Renting this infrastructure from a hyperscaler becomes more compelling, not less, as AI workloads grow. The strategic question is no longer whether to use cloud for AI but which cloud's specific AI infrastructure investments align with your stack.