Jensen Huang unveiled the Blackwell GPU architecture at GTC 2024 on March 18th: the B100, the B200, and the GB200 NVL72 rack-scale system. The numbers are large enough to require a second read.
The raw specs
The B200 GPU delivers 20 petaFLOPS of FP4 AI performance, against the H100's 4 petaFLOPS of FP8. Note that the comparison crosses precisions: FP4 runs at roughly twice the FP8 rate, so part of the headline gain comes from halving the bit width rather than from the silicon. Nvidia claims up to 30x the inference performance of the H100 for large language models. The GB200 NVL72, which combines 36 Grace CPUs and 72 B200 GPUs in a single rack, delivers 1.4 exaFLOPS of AI compute. These are numbers that were theoretical projections two years ago.
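As a sanity check, the rack-level figure follows directly from the per-GPU spec. The short sketch below also separates the precision-drop portion of the headline gain from the architectural portion; the 2x-per-bit-width-halving split is an approximation, not an Nvidia figure.

```python
# Back-of-envelope check of the rack-level figure from the per-GPU spec.
B200_FP4_PFLOPS = 20   # per-GPU FP4 figure quoted above
H100_FP8_PFLOPS = 4    # H100 FP8 figure quoted above
GPUS_PER_NVL72 = 72

rack_pflops = B200_FP4_PFLOPS * GPUS_PER_NVL72
print(f"NVL72 aggregate: {rack_pflops / 1000:.2f} exaFLOPS")  # ~1.44 EF, matching the 1.4 EF claim

# The 20 vs 4 petaFLOPS comparison crosses precisions (FP4 vs FP8).
# Halving bit width roughly doubles tensor-core throughput, so of the
# 5x headline, ~2x is the precision drop and ~2.5x is the architecture.
print(f"Headline ratio: {B200_FP4_PFLOPS / H100_FP8_PFLOPS:.1f}x")
print(f"Iso-precision (approx): {B200_FP4_PFLOPS / 2 / H100_FP8_PFLOPS:.1f}x")
```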
The memory problem solved
One of the main constraints on LLM inference performance is memory bandwidth: moving model weights from HBM to the compute units dominates the time per generated token during decode. The B200 carries 192GB of HBM3e, and the NVLink interconnect in the NVL72 configuration means all 72 GPUs share a 13.5-terabyte pool of HBM3e, with fifth-generation NVLink providing 1.8 terabytes per second of bandwidth per GPU. A model that would not fit in a single GPU's memory can now live in the shared NVLink pool and be served with hardware-level memory coherency.
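To see why bandwidth is the binding constraint, here is a simplified roofline sketch. It assumes batch-size-1 decode that reads every weight once per token, a hypothetical 70B-parameter model at FP4, and the B200's announced ~8 TB/s of HBM3e bandwidth; real serving stacks batch requests and land well above this floor.

```python
# Roofline-style lower bound on decode latency for a bandwidth-bound LLM.
# Assumption: every generated token streams all weights once from HBM
# (batch size 1, no caching effects). Illustrative figures, not benchmarks.

PARAMS = 70e9            # hypothetical 70B-parameter model
BYTES_PER_PARAM = 0.5    # FP4 = 4 bits per weight
HBM_BANDWIDTH = 8e12     # B200 HBM3e, bytes/s (announced spec, treated as an assumption)

weight_bytes = PARAMS * BYTES_PER_PARAM
t_per_token = weight_bytes / HBM_BANDWIDTH
print(f"Weights: {weight_bytes / 1e9:.0f} GB")
print(f"Bandwidth-bound floor: {t_per_token * 1e3:.1f} ms/token "
      f"(~{1 / t_per_token:.0f} tokens/s at batch 1)")
```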
Cloud availability and pricing
AWS, Google Cloud, and Azure all announced Blackwell partnerships at GTC, and the first cloud instances became available in late 2024. Pricing for B200 instances will be substantially higher than for H100, but Nvidia's inference performance claims, if they hold up under production workloads, suggest a better cost-per-token ratio for the right use cases. The economics of LLM inference infrastructure are not about raw GPU price but about throughput per dollar.
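Throughput per dollar is simple to compute once you have measured numbers. A minimal sketch, with placeholder prices and throughputs (not quoted figures) standing in for real instance pricing and benchmarked tokens per second:

```python
# Cost-per-token comparison sketch. The prices and throughputs below are
# placeholders, not quoted figures: plug in real instance pricing and
# measured tokens/s for your own workload.

def usd_per_million_tokens(hourly_price_usd: float, tokens_per_sec: float) -> float:
    tokens_per_hour = tokens_per_sec * 3600
    return hourly_price_usd / tokens_per_hour * 1e6

# Hypothetical: a pricier B200 instance can still win on throughput/dollar.
h100 = usd_per_million_tokens(hourly_price_usd=4.0, tokens_per_sec=1_000)
b200 = usd_per_million_tokens(hourly_price_usd=10.0, tokens_per_sec=5_000)
print(f"H100-class: ${h100:.2f} per 1M tokens")
print(f"B200-class: ${b200:.2f} per 1M tokens")
```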
What the roadmap says
Nvidia has already named Rubin as the architecture after Blackwell, and one new architecture per year is the cadence Jensen committed to. The pace of the hardware capability curve is itself a strategic variable: every enterprise buying into GPU infrastructure is buying into a depreciation schedule that the software ecosystem may or may not keep pace with.
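To make the depreciation point concrete, a toy calculation: a straight-line write-off set against an annual architecture cadence. Every number here is an illustrative assumption, not an Nvidia or cloud-provider figure.

```python
# Toy depreciation math: how an annual architecture cadence compresses the
# window in which a GPU fleet is state of the art. All numbers illustrative.

CAPEX = 3_000_000          # hypothetical cluster cost, USD
STRAIGHT_LINE_YEARS = 5    # a common accounting schedule for servers
CADENCE_YEARS = 1          # one new architecture per year

annual_depreciation = CAPEX / STRAIGHT_LINE_YEARS
generations_before_writeoff = STRAIGHT_LINE_YEARS // CADENCE_YEARS
print(f"Annual depreciation: ${annual_depreciation:,.0f}")
print(f"Architectures released before full write-off: {generations_before_writeoff}")
```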