Google launched Gemini on December 6th, 2023, with benchmark results showing Gemini Ultra surpassing GPT-4 on MMLU and other evaluations. The launch came with controversy about how those benchmarks were measured, but the underlying capability is real.

The three-tier family

Gemini comes in three sizes: Ultra, Pro, and Nano. Ultra is the frontier model intended for data centre use. Pro powers Bard (since renamed Google Gemini) and the Google Cloud API. Nano runs on-device on the Pixel 8 Pro. The tiering mirrors what other labs have done: a small model for edge deployment, a mid-tier model for most API use, and a frontier model for benchmarking and complex tasks.

The benchmark controversy

The controversy had two parts. First, the headline MMLU comparison used different prompting methods for each model: Gemini Ultra's 90.0% was measured with chain-of-thought prompting over 32 samples (CoT@32), while GPT-4's 86.4% was a standard 5-shot result, making the numbers not directly comparable. Second, Google's demo video appeared to show Gemini responding in real time to live video and audio, creating the impression of a continuous multimodal agent; it was in fact assembled from still images and selected text outputs. This followed a familiar pattern of AI demos that show capabilities in their best light rather than representative performance. The Gemini Pro API that actually launched was notably less impressive than the demo suggested, roughly on par with GPT-3.5 Turbo rather than the implied GPT-4 competitor.

Native multimodality

Gemini was designed from the ground up as a multimodal model, trained on text, images, audio, and video simultaneously, unlike GPT-4, which added vision capabilities to a text model. Whether native multimodality produces meaningfully better performance on mixed-modality tasks than post-hoc vision addition remains an open research question. The architecture is cleaner, but capability is what matters in production.

What it means for Google Cloud

Google Cloud customers now have access to Gemini Pro through Vertex AI, putting Google Cloud in direct competition with Azure's OpenAI Service on managed model access. Google's advantage is integration with its own services: Search, Workspace, YouTube. For enterprises already on Google Cloud, access to Gemini through Vertex AI without leaving the GCP compliance boundary is significant.
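For readers evaluating the Vertex AI route, a minimal sketch of calling Gemini Pro through the Vertex AI Python SDK looks like the following. This assumes the `google-cloud-aiplatform` package is installed, Application Default Credentials are configured, and a GCP project with the Vertex AI API enabled; the project ID and region below are placeholders, and the `preview` module path reflects the SDK as it shipped at launch and may change.

```python
# Hedged sketch: querying Gemini Pro via Vertex AI (launch-era SDK surface).
# "your-gcp-project" is a placeholder; credentials come from Application
# Default Credentials (e.g. `gcloud auth application-default login`).
import vertexai
from vertexai.preview.generative_models import GenerativeModel

# Bind the SDK to a project and region where Gemini Pro is available.
vertexai.init(project="your-gcp-project", location="us-central1")

# Instantiate the hosted model by name and send a single text prompt.
model = GenerativeModel("gemini-pro")
response = model.generate_content(
    "Summarize the Gemini model family in one sentence."
)
print(response.text)
```

Because the request stays inside Vertex AI, it is governed by the same IAM roles, logging, and data-residency controls as the rest of a GCP project, which is the compliance-boundary point above.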