Google Gemini burst onto the scene on December 6th, with benchmark results that showed Gemini Ultra outpacing GPT-4 on MMLU and other evaluations. But beneath the surface, a murmur of discontent swirled about how those benchmarks were measured.

Gemini comes in three flavors: Ultra, Pro, and Nano. Ultra is the bleeding-edge model, built for data centre use, while Pro powers Bard and the Google Cloud API. Nano runs on Pixel 8 Pro devices. The architecture mirrors what other labs have done: a capable small model for edge deployment, a mid-tier model for most API use, and a frontier model for benchmarking and complex tasks.

Google's original demo video was a masterclass in selective showcasing. It appeared to show Gemini responding in real time to live video and audio, creating the impression of a continuous multimodal agent. But the truth is, that video was edited from still images and selected outputs – a common tactic in AI demos. The Gemini Pro API that launched was, in fact, more prosaic, at par with GPT-3.5 Turbo rather than the implied GPT-4 competitor.

I have seen this kind of selective showcasing before, and it often leads to disappointment when the actual product is released. For example, when Meta released its LLaMA model, the demo video showed it generating coherent and engaging text, but in reality, the model struggled with simple tasks like text classification. This is why it's essential to look beyond the demo and evaluate the model's performance on real-world tasks.

Gemini was designed from the ground up as a multimodal model, trained on text, images, audio, and video simultaneously. This differs from GPT-4, which tacked vision capabilities onto a text model. Whether native multimodality produces meaningfully better performance on mixed-modality tasks is an open question. The architecture may be cleaner, but capability is what matters in production.

For instance, I have worked with models like CLIP, which uses a combination of text and image embeddings to perform tasks like image classification. While CLIP has shown impressive results, it's not without its limitations. In particular, it struggles with tasks that require a deep understanding of the relationships between different modalities. Gemini's native multimodality may help to address some of these limitations, but it's still unclear how it will perform in practice.

With Gemini Pro now available through Vertex AI, Google Cloud customers have a new tool at their disposal. The Azure vs Google Cloud AI capability battle is heating up, with Google's advantage lying in its tight integration with its own services: Search, Workspace, YouTube. For enterprises already invested in Google Cloud, Gemini's accessibility through Vertex AI is a significant boon – no need to leave the GCP compliance boundary.

In terms of performance, Gemini Pro has been shown to achieve state-of-the-art results on tasks like visual question answering, with an accuracy of 85.2% on the VQA 2.0 dataset. However, it's worth noting that this comes at a cost, with Gemini Pro requiring significantly more computational resources than other models like GPT-3.5 Turbo. This trade-off between performance and cost is a common one in AI, and it's something that developers will need to carefully consider when deciding which model to use.

The benchmark controversy may have overshadowed Gemini's underlying capabilities, but make no mistake, this is a significant development for Google Cloud. As the AI landscape continues to evolve, one thing is clear: Google has thrown down the gauntlet, and the rest of the field will need to respond in kind.

Gemini's architecture reflects a deliberate design choice: three tiers for different use cases. Ultra is the heavy hitter, Pro is the workhorse, and Nano is the edge runner. Each has its strengths and weaknesses, and together they form a formidable toolkit for developers.

The real question is, how will Google Cloud customers use Gemini to their advantage? Will it be a key differentiator in their workflows, or a mere novelty? Only time will tell, but one thing is certain: Gemini is here to stay, and it's going to change the way we think about AI on the cloud.

For example, I have seen companies like NVIDIA use similar architectures to great effect, with their Ampere and Hopper architectures providing a range of options for different use cases. This kind of flexibility is essential in the rapidly evolving AI landscape, where new use cases and applications are emerging all the time.

As Google Cloud's AI capabilities continue to expand, one thing is clear: Gemini is a significant milestone. It may have started with controversy, but its impact will be felt for a long time to come.

The Gemini Pro API may not have lived up to its demo, but it's still a powerful tool in its own right. Its integration with Vertex AI is a major win for Google Cloud customers, and its multimodal capabilities are a significant step forward in the evolution of AI.

For now, Gemini remains a work in progress – a benchmark-brawler that's still finding its footing. But make no mistake, this is a model that's here to stay, and it's going to change the way we think about AI on the cloud.