Jensen Huang introduced the Blackwell GPU architecture at GTC 2024 on March 18th, showcasing B100, B200, and the GB200 NVL72 rack-scale system with specs that warrant a closer look.
The B200 GPU delivers a claimed 20 petaFLOPS of FP4 AI performance, against roughly 4 petaFLOPS of FP8 on the H100. Part of that 5x gap comes from halving the numeric precision, so the two figures are not a like-for-like comparison. Nvidia's headline claim of up to 30 times the LLM inference performance applies to the GB200 NVL72 measured against the same number of H100 GPUs, not to a single B200. The GB200 NVL72, which combines 36 Grace CPUs and 72 Blackwell GPUs in a single rack, delivers 1.4 exaFLOPS of FP4 AI compute. Figures on this scale were theoretical projections just two years ago.
Memory bandwidth has been a major constraint on LLM inference performance. Every generated token requires streaming model weights from HBM to the compute units, and that traffic often takes longer than the arithmetic itself. The B200 addresses this with 192GB of HBM3e memory and, per Nvidia, 8 terabytes per second of memory bandwidth. In the NVL72 configuration, the NVLink interconnect lets all 72 GPUs share a 13.5 terabyte pool of memory, with Nvidia quoting 1.8 terabytes per second of NVLink bandwidth per GPU. Models that would not fit in a single GPU's memory can therefore reside in the shared NVLink pool, served with hardware-level memory coherency.
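The capacity figures above can be checked with simple arithmetic. The sketch below is illustrative only: the function names and the 1,000-billion-parameter model are hypothetical, and the estimate counts weights alone, ignoring KV cache, activations, and framework overhead, so real deployments need more headroom.

```python
import math

HBM_PER_GPU_GB = 192   # per-GPU HBM3e capacity quoted in the article
GPUS_PER_RACK = 72     # GPUs in one GB200 NVL72 rack


def pool_capacity_gb(gpus: int = GPUS_PER_RACK, per_gpu_gb: int = HBM_PER_GPU_GB) -> int:
    """Total HBM across the rack, in GB."""
    return gpus * per_gpu_gb


def weight_footprint_gb(params_billions: float, bytes_per_param: float) -> float:
    """Memory needed for the weights alone, in GB (1e9 params * bytes / 1e9)."""
    return params_billions * bytes_per_param


def min_gpus_for_weights(params_billions: float, bytes_per_param: float,
                         per_gpu_gb: int = HBM_PER_GPU_GB) -> int:
    """Smallest GPU count whose combined HBM holds the weights."""
    return math.ceil(weight_footprint_gb(params_billions, bytes_per_param) / per_gpu_gb)


if __name__ == "__main__":
    total = pool_capacity_gb()
    # 72 * 192 = 13,824 GB, which is 13.5 TB when counting 1,024 GB per TB.
    print(total, total / 1024)

    # Hypothetical 1,000B-parameter model at three precisions.
    for label, nbytes in (("FP16", 2.0), ("FP8", 1.0), ("FP4", 0.5)):
        print(label, min_gpus_for_weights(1000, nbytes))
```

Under these assumptions the hypothetical model needs 11 GPUs at FP16, 6 at FP8, and 3 at FP4 for weights alone, which illustrates why lower precision and a large shared pool both matter.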
In my experience with large-scale deployments, memory bandwidth bottlenecks are common. On one project, data transfer between GPUs and memory consumed more than 30% of our overall processing time, and we had to put real effort into optimizing it. The B200's higher memory bandwidth could cut that overhead substantially. In an NVL72 configuration, for instance, the 13.5 terabyte pool could hold the model weights directly and reduce the time spent moving them.
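How much a faster memory system helps in a case like this can be bounded with Amdahl's law: only the transfer-bound share of the runtime shrinks. The sketch below assumes the 30% figure from the anecdote and a hypothetical 2x improvement in transfer speed; neither the function name nor the 2x factor comes from a measurement.

```python
def overall_speedup(transfer_fraction: float, transfer_speedup: float) -> float:
    """Amdahl's law: end-to-end speedup when only the transfer share gets faster.

    transfer_fraction: share of runtime spent on data transfer (0.0 to 1.0).
    transfer_speedup:  how many times faster the transfer portion becomes.
    """
    if not 0.0 <= transfer_fraction <= 1.0:
        raise ValueError("transfer_fraction must be between 0 and 1")
    if transfer_speedup <= 0:
        raise ValueError("transfer_speedup must be positive")
    remaining = (1.0 - transfer_fraction) + transfer_fraction / transfer_speedup
    return 1.0 / remaining


if __name__ == "__main__":
    # 30% of runtime in transfers, transfers made 2x faster (hypothetical).
    print(round(overall_speedup(0.30, 2.0), 3))   # 1.176
    # Upper bound if transfers became free: 1 / 0.7.
    print(round(1.0 / (1.0 - 0.30), 3))           # 1.429
```

Even an infinitely fast memory system would cap the gain at about 1.43x for that workload, so the remaining 70% of the runtime still needs the extra compute to speed up.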
AWS, Google Cloud, and Azure have all announced partnerships with Nvidia for Blackwell. The first cloud instances became available in late 2024. While B200 instances are priced substantially higher than H100 instances, Nvidia's inference performance claims suggest a better cost per token for suitable workloads. The economics of LLM inference infrastructure hinge on throughput per dollar, not on raw GPU price.
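Throughput per dollar can be made concrete as a cost per million tokens. The following is a minimal sketch: the hourly prices and tokens-per-second figures are made-up placeholders, not quotes from any cloud provider or benchmark, and the function name is hypothetical.

```python
def cost_per_million_tokens(hourly_price_usd: float, tokens_per_second: float) -> float:
    """Instance cost to generate one million tokens at a sustained throughput."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    tokens_per_hour = tokens_per_second * 3600
    return hourly_price_usd / tokens_per_hour * 1_000_000


if __name__ == "__main__":
    # Placeholder numbers for illustration only.
    older = cost_per_million_tokens(hourly_price_usd=4.0, tokens_per_second=1000)
    newer = cost_per_million_tokens(hourly_price_usd=8.0, tokens_per_second=3000)
    print(round(older, 2))  # 1.11
    print(round(newer, 2))  # 0.74
```

In this made-up case the instance that costs twice as much per hour is still about a third cheaper per token, because its throughput is three times higher.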
Whether Blackwell instances are cost-effective depends on the specific use case and model requirements. In a recent project, we found that the H100 was sufficient for smaller models, but as model size grew, the B200's higher inference performance made it the more cost-effective option. We estimated that for large language models, the B200 could reduce inference costs by up to 20% compared to the H100.
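A savings target like this implies a minimum throughput gain for any given price premium. The sketch below works that out; the 2x price ratio is a hypothetical input, not an observed price, and the function name is illustrative.

```python
def required_throughput_ratio(price_ratio: float, target_cost_reduction: float) -> float:
    """Throughput multiple the new GPU must reach to hit a cost-per-token target.

    price_ratio:           new hourly price divided by old hourly price.
    target_cost_reduction: desired drop in cost per token (0.20 means 20%).

    Cost per token scales as price / throughput, so the new cost relative to
    the old one is price_ratio / throughput_ratio. Setting that equal to
    (1 - target_cost_reduction) and solving gives the result.
    """
    if price_ratio <= 0:
        raise ValueError("price_ratio must be positive")
    if not 0.0 <= target_cost_reduction < 1.0:
        raise ValueError("target_cost_reduction must be in [0, 1)")
    return price_ratio / (1.0 - target_cost_reduction)


if __name__ == "__main__":
    # Hypothetical: the new instance costs 2x per hour and the goal is 20% savings.
    print(round(required_throughput_ratio(2.0, 0.20), 2))  # 2.5
    # Break-even (0% savings) needs throughput to match the price premium.
    print(round(required_throughput_ratio(2.0, 0.0), 2))   # 2.0
```

Read this way, a 20% saving at double the hourly price requires 2.5 times the sustained throughput on the actual workload, which is the number worth benchmarking before committing.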
Nvidia has since announced Rubin, the architecture after Blackwell, at Computex in June 2024, where Jensen Huang also committed to a cadence of one new architecture per year. The pace of hardware advancement is a strategic variable in itself. Enterprises investing in GPU infrastructure are also buying into a depreciation schedule that the software ecosystem may or may not keep pace with.
Blackwell and the annual roadmap behind it raise the ceiling on what AI infrastructure can do, and they shorten the window in which any one generation is the best available. Enterprises will need to weigh GPU purchases against that cadence and against their own AI strategies.
As models grow in size and complexity, demand for high-performance infrastructure will keep rising. Blackwell and its successors are aimed at that demand, but it remains to be seen whether frameworks, tooling, and deployment practices evolve fast enough to exploit them.
Blackwell's impact on the AI infrastructure landscape will depend on adoption rates, real-world performance gains, and cost-effectiveness. On paper, at least, Nvidia's latest architecture sets a new bar for AI infrastructure, and competitors will need to keep pace.