Google announced Gemini 1.5 Pro on February 15th, with a one million token context window available in a limited research preview. One million tokens is roughly 750,000 words. The practical implications are only starting to be worked through.
What one million tokens unlocks
Previous practical context limits were 8K to 32K tokens for most models, with Claude 2 at 100K and some models reaching 200K. One million tokens is an order of magnitude beyond even that 100K ceiling. At this scale you can load entire codebases, full transcripts of long meetings, complete legal case files, or thousands of documents simultaneously. Retrieval-augmented generation, the pattern where you retrieve relevant chunks and stuff them into the prompt, becomes less necessary for many applications once the entire corpus fits in context.
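One way to think about the shift: before reaching for a retrieval pipeline, you can first check whether the whole corpus simply fits. A minimal sketch, using a crude 4-characters-per-token heuristic (an assumption for illustration, not Gemini's actual tokenizer):

```python
# Rough sketch: decide whether a document corpus can be sent as a
# single prompt instead of going through a retrieval pipeline.
# CHARS_PER_TOKEN is a crude English-text heuristic, not a real tokenizer.

CONTEXT_LIMIT = 1_000_000   # Gemini 1.5 Pro research preview limit
CHARS_PER_TOKEN = 4         # rough heuristic for English text

def estimate_tokens(text: str) -> int:
    return len(text) // CHARS_PER_TOKEN

def fits_in_context(documents: list[str], reserve: int = 8_192) -> bool:
    """True if all documents fit in one prompt, leaving `reserve`
    tokens of headroom for instructions and the model's response."""
    total = sum(estimate_tokens(d) for d in documents)
    return total + reserve <= CONTEXT_LIMIT

corpus = ["word " * 10_000] * 40   # 40 documents, ~50k chars each
print(fits_in_context(corpus))     # True: ~500k estimated tokens
```

In production you would count tokens with the provider's own tokenizer rather than a character heuristic, but the decision logic stays this simple.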
The needle-in-a-haystack result
Google's needle-in-a-haystack evaluation inserts a specific sentence into a very long document and asks the model to find it. Gemini 1.5 Pro scored 99% accuracy across contexts up to 1 million tokens. That is a marked contrast with previous long-context models, which showed significant degradation above 32K tokens. Whether this holds in production applications is exactly what research previews are for, but the benchmark result is credible.
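The mechanics of the evaluation are simple enough to sketch. Here is a minimal harness that builds a haystack with a needle at a given depth; `ask_model` is a stand-in, since in a real test it would call the model API with the haystack plus the question:

```python
# Minimal sketch of a needle-in-a-haystack evaluation harness.
import random

def build_haystack(needle: str, n_sentences: int, depth: float,
                   seed: int = 0) -> str:
    """Insert `needle` at a fractional `depth` (0.0 = start, 1.0 = end)
    inside `n_sentences` of filler text."""
    rng = random.Random(seed)
    filler = [f"Filler sentence number {rng.randint(0, 10**6)}."
              for _ in range(n_sentences)]
    filler.insert(int(depth * n_sentences), needle)
    return " ".join(filler)

def ask_model(context: str, question: str) -> str:
    # Placeholder: a real harness would send context + question to the
    # model and score whether the answer recovers the needle.
    raise NotImplementedError

needle = "The secret ingredient in the recipe is cardamom."
for depth in (0.0, 0.25, 0.5, 0.75, 1.0):
    haystack = build_haystack(needle, n_sentences=5_000, depth=depth)
    assert needle in haystack  # sanity check before scoring the model
```

The published evaluation sweeps both context length and needle depth; degradation at particular depths (often the middle of the context) is what the 99% figure says Gemini 1.5 Pro avoids.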
The mixture-of-experts architecture
Gemini 1.5 uses a mixture-of-experts architecture, the same approach as Mixtral: only a subset of the model's parameters is active for any given input. This is how Google can offer a million-token context window at a latency and cost that are not prohibitive. The model's full set of parameters is not activated for every token in that million-token window; routing logic determines which experts handle which parts of the input.
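A toy sketch of the routing idea: a gating function scores each expert for a token, only the top-k experts actually run, and their outputs are combined weighted by the renormalized gate scores. This illustrates the general technique only; Gemini 1.5's actual routing is not public.

```python
# Toy mixture-of-experts routing: only the top-k experts run per token.
import math

def softmax(scores):
    exps = [math.exp(s - max(scores)) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route(token: float, experts, gate_weights, k: int = 2) -> float:
    """Run only the top-k experts for this token and blend their outputs."""
    scores = [w * token for w in gate_weights]     # toy gating network
    probs = softmax(scores)
    top_k = sorted(range(len(experts)), key=lambda i: probs[i])[-k:]
    norm = sum(probs[i] for i in top_k)
    # Experts outside top_k are never evaluated: that is the compute saving.
    return sum(probs[i] / norm * experts[i](token) for i in top_k)

experts = [lambda x: x + 1, lambda x: 2 * x, lambda x: x * x, lambda x: -x]
print(route(3.0, experts, gate_weights=[0.1, 0.9, 0.4, 0.2]))
```

The cost saving is in the experts that never execute: four experts are defined, but each token pays for only two of them.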
What developers should test
Access to the 1M-token research preview is limited, and latency is higher than with production APIs. But the right thing to do with early access is test your actual use case, not the published benchmark. If you have a retrieval-intensive application, test whether a long-context model eliminates your retrieval complexity. If you have a document processing workflow, test whether loading the full document changes accuracy on your specific task. The production answer is always in your data.
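That comparison can be as simple as running both pipelines over a handful of labeled question-answer pairs from your own data. A hedged sketch, where `fake_rag` and `fake_full_context` are hypothetical stand-ins for your retrieval pipeline and a single long-context call:

```python
# Sketch of comparing two pipelines on your own labeled data.

def accuracy(answer_fn, qa_pairs) -> float:
    """Fraction of questions where the expected answer appears in the
    response (a deliberately crude string-match metric)."""
    hits = sum(expected.lower() in answer_fn(q).lower()
               for q, expected in qa_pairs)
    return hits / len(qa_pairs)

qa = [("Who signed the contract?", "Acme Corp"),
      ("What is the termination clause?", "30 days")]

def fake_rag(question):       # placeholder for your retrieval pipeline
    return "Acme Corp signed."

def fake_full_context(q):     # placeholder for one long-context call
    return "Acme Corp signed; termination requires 30 days notice."

print(accuracy(fake_rag, qa))           # 0.5
print(accuracy(fake_full_context, qa))  # 1.0
```

Swap in the real calls and your real documents; if the long-context column wins on your data, that is worth more than any published needle result.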