A year ago, the big news was that a model could hold a million tokens in context. That's roughly 750,000 words, or about ten average novels sitting in a single prompt. The announcements came with impressive demos, researchers posted benchmarks, and for a few weeks it genuinely felt like a threshold had been crossed. Then most teams went back to their actual work and found out pretty quickly that having a massive context window and knowing what to do with it are two completely different problems.

I want to be clear that I'm not dismissing the capability. A million tokens in context is a real technical achievement, and it does unlock things that were not possible before. But there's a version of the conversation happening right now that treats window size as the finish line, and I think that's worth pushing back on a little.

The pattern I keep seeing plays out the same way. A team gets access to a long-context model, loads in a large document, a codebase, or a pile of meeting transcripts, sends a query, and gets back results that are okay. Sometimes genuinely good. Often frustrating in ways that are hard to diagnose. The model technically saw everything in the prompt. Whether it actually used the right parts of it is a different question entirely, and that distinction matters a lot in production.

There's a known phenomenon here that researchers have called "lost in the middle." Models tend to pay disproportionate attention to content at the very beginning and the very end of a context window. Material buried in the middle gets underweighted, sometimes significantly. So if you're feeding in a 200-page document and the critical detail is on page 94, you might not get the answer you're looking for, even though the information was technically present. Scaling the window up doesn't change the underlying attention mechanics. You've given the model more to look at, but not necessarily a better mechanism for finding the right thing to look at.
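The effect is easy to probe with a needle-in-a-haystack harness: plant one fact at varying depths in a filler document and check whether the model recovers it. Here's a minimal sketch of that harness; `query_model` is a hypothetical placeholder for whatever API you're testing, and the filler text is obviously synthetic.

```python
# Minimal needle-in-a-haystack probe for position bias. The model call is a
# hypothetical stub (`query_model`); the harness shape is the point.
FILLER = "The quick brown fox jumps over the lazy dog. " * 10
NEEDLE = "The secret deployment code is MAGENTA-7."

def build_prompt(needle: str, depth: float, n_paragraphs: int = 50) -> str:
    """Insert `needle` at a relative depth (0.0 = start, 1.0 = end)."""
    paragraphs = [FILLER] * n_paragraphs
    paragraphs.insert(int(depth * n_paragraphs), needle)
    return "\n\n".join(paragraphs) + "\n\nWhat is the secret deployment code?"

def probe(depths=(0.0, 0.25, 0.5, 0.75, 1.0)):
    results = {}
    for depth in depths:
        prompt = build_prompt(NEEDLE, depth)
        # answer = query_model(prompt)            # hypothetical model call
        # results[depth] = "MAGENTA-7" in answer  # did it find the needle?
        results[depth] = prompt.index(NEEDLE) / len(prompt)  # sanity: needle position
    return results
```

In a real run you'd uncomment the model call and plot recovery rate against depth; the "lost in the middle" finding predicts a dip around the 0.5 mark.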

This is partly why retrieval-augmented generation hasn't gone away, even as context windows have grown. The basic argument for RAG still holds up in a lot of real-world scenarios: you're better off identifying and retrieving the relevant chunks before they hit the model than dumping everything in and hoping. Targeted retrieval gives you more control over what the model actually works with, and that control tends to produce more consistent results.
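The retrieval step itself is conceptually small. The sketch below uses a toy bag-of-words similarity so it runs standalone; in practice you'd swap `embed` for a real embedding model and the list for a vector store, but the flow — embed the chunks once, embed the query, take the top-k by similarity — is the part that carries over.

```python
# Toy retrieval step: bag-of-words cosine similarity stands in for real
# embeddings. The flow, not the scoring function, is what matters here.
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Stand-in 'embedding': lowercase word counts."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks most similar to the query."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

chunks = [
    "Invoices are due within 30 days of receipt.",
    "The office is closed on public holidays.",
    "Late invoices accrue interest at 2 percent per month.",
]
top = retrieve("when are invoices due", chunks)
# Only `top` goes into the prompt, not the whole corpus.
```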

The counterargument is that RAG introduces its own set of problems. You need a chunking strategy that doesn't split concepts across boundaries in unfortunate ways. You need an embedding model and a vector store and a retrieval pipeline, all of which have to be maintained. You need to tune how many chunks you retrieve and how you rank them. When retrieval goes wrong, and it does go wrong, it fails silently in the sense that the model just answers based on whatever it got back, which may not be what the user actually needed. Long context was supposed to be a way around all of that complexity. The honest answer is that it reduces some of it but doesn't eliminate it.
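On the chunking point specifically, the most common mitigation for concepts split across boundaries is overlapping fixed-size chunks. A minimal word-based version looks like this; the sizes are illustrative defaults, not recommendations, and production systems often chunk on semantic boundaries instead.

```python
# Fixed-size chunking with overlap, so content near a boundary appears in two
# adjacent chunks. Sizes are illustrative, not tuned recommendations.
def chunk_text(text: str, size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into word-based chunks of ~`size` words, repeating
    `overlap` words across adjacent chunk boundaries."""
    words = text.split()
    chunks = []
    start = 0
    while start < len(words):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break
        start += size - overlap
    return chunks
```

The overlap trades storage and retrieval noise for a lower chance that the one sentence you needed got cut in half, which is exactly the kind of tuning knob the paragraph above is complaining about.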

What long context does seem to handle well is a specific class of task where the signal isn't concentrated in one place and the relationships between parts of the document genuinely matter. Reviewing code across an entire repository where you need to understand how a change in one file ripples through ten others. Analyzing a contract where a clause on page four modifies a term defined on page thirty-one. Summarizing a long research transcript where the useful insight only becomes visible when you see how themes accumulate across the whole thing. These are cases where breadth is doing real work, not just sitting there.

The cost dimension of this is also worth talking about honestly, because it tends to get glossed over in the capability announcements. Long-context inference is expensive: at 500k or a million input tokens per request, the math gets uncomfortable quickly at any kind of scale. A lot of teams I've spoken to have gone through a phase of enthusiasm about long context, done the cost modeling, and quietly landed back on retrieval-based approaches as more economical and more predictable, even accounting for the upfront engineering investment.
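The back-of-envelope version of that cost modeling fits in a few lines. The per-token price below is a made-up illustrative number, not any provider's actual pricing; plug in your own rates and volumes.

```python
# Back-of-envelope input-cost comparison. The price is an assumed
# illustrative figure, not real provider pricing.
PRICE_PER_M_INPUT = 3.00  # dollars per million input tokens (assumed)

def monthly_input_cost(tokens_per_request: int, requests_per_day: int) -> float:
    """Input-token spend per 30-day month, ignoring output tokens."""
    return tokens_per_request / 1e6 * PRICE_PER_M_INPUT * requests_per_day * 30

long_context = monthly_input_cost(500_000, 1_000)  # stuff everything into context
rag_style = monthly_input_cost(8_000, 1_000)       # retrieve ~8k tokens instead
```

Under these assumed numbers the long-context path costs over sixty times more per month on input tokens alone, which is the kind of ratio that sends teams back to retrieval. Prompt caching changes the picture for repeated contexts, but not for workloads where the context varies per request.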

There's also a latency component. Long prompts take longer to process, and for anything interactive, that adds friction. For batch workflows it matters less, but it's another variable that doesn't appear in the benchmark numbers.

None of this means the capability isn't moving forward. Context windows will keep growing, and more importantly, the models are getting better at actually using what's in them. There's active research on improving mid-context attention and on helping models navigate long inputs more reliably. The trajectory is real.

But right now, in March 2026, there's a meaningful gap between the number on the spec sheet and what you can actually depend on in production. The teams doing the most interesting work aren't treating a large context window as a solved input problem. They're being deliberate about what goes in, building evaluation pipelines specifically for long-context failure modes, and in some cases combining retrieval and long context in ways that play to the strengths of each.
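One hypothetical shape for that combination is a simple router: corpora small enough to fit a token budget go through long context wholesale, and anything larger falls back to retrieval. The threshold and the word-to-token heuristic below are both assumptions; a real system would tune them against measured cost and quality.

```python
# Hypothetical routing sketch: long context for small corpora, retrieval for
# large ones. Threshold and token heuristic are illustrative assumptions.
def approx_tokens(text: str) -> int:
    """Rough heuristic: roughly 0.75 words per token in English prose."""
    return int(len(text.split()) / 0.75)

def choose_strategy(documents: list[str], budget_tokens: int = 100_000) -> str:
    """Pick 'long_context' if everything fits the budget, else 'retrieve'."""
    total = sum(approx_tokens(d) for d in documents)
    return "long_context" if total <= budget_tokens else "retrieve"
```

This is deliberately crude; the point is that the decision is made per request against an explicit budget, rather than defaulting to one strategy for every workload.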

The capability is genuinely useful. Using it well still requires actual thinking about your specific use case, and that part hasn't been automated away yet.