I recall the excitement a year ago when models could first hold a million tokens in context. That's about 750,000 words, or roughly ten average novels, sitting in a single prompt. The demos were impressive, and researchers posted benchmarks, but teams soon realized that having a massive context window and knowing what to do with it are two different problems.
I'm not dismissing the capability; a million tokens in context is a real technical achievement. However, I think there's a version of the conversation happening right now that treats window size as the finish line, and that's worth pushing back on.
The pattern I've seen play out is that a team gets access to a long-context model, loads in a large document or codebase, sends a query, and gets back results that are okay, sometimes good, but frustratingly hard to diagnose when they miss. The model technically saw everything in the prompt, but whether it used the right parts is a different question entirely.
Researchers have identified a phenomenon called 'lost in the middle,' where models tend to pay disproportionate attention to content at the beginning and end of a context window, underweighting material in the middle. So if you're feeding in a 200-page document and the critical detail is on page 94, you might not get the answer you're looking for.
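One way to check whether this affects your own setup is to plant a known fact at different depths of a long filler document and see where the model stops finding it. The sketch below is a minimal version of that idea, not a standard benchmark. `ask_model` is a hypothetical callable standing in for whatever client you use, and the substring match is a deliberately crude scoring rule.

```python
def build_probe(filler_sentences, needle, depth):
    """Insert `needle` at a relative depth (0.0 = start, 1.0 = end) of the filler text."""
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0.0 and 1.0")
    position = round(depth * len(filler_sentences))
    sentences = filler_sentences[:position] + [needle] + filler_sentences[position:]
    return " ".join(sentences)


def accuracy_by_depth(ask_model, filler_sentences, needle, question, expected, depths):
    """Return {depth: bool}, True when the model's answer contains `expected`.

    `ask_model` is an assumption: any function that takes a prompt string
    and returns the model's answer as a string.
    """
    results = {}
    for depth in depths:
        document = build_probe(filler_sentences, needle, depth)
        prompt = document + "\n\nQuestion: " + question
        answer = ask_model(prompt)
        # Crude scoring: case-insensitive substring match on the expected answer.
        results[depth] = expected.lower() in answer.lower()
    return results
```

Running this over a handful of depths (say 0.0, 0.25, 0.5, 0.75, 1.0) with several different needles gives a rough picture of where retrieval from context degrades for a given model and prompt length.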
This is why retrieval-augmented generation hasn't gone away, even as context windows have grown. Targeted retrieval gives you more control over what the model works with, which tends to produce more consistent results. However, RAG introduces its own set of problems: you have to build and maintain a chunking strategy, an embedding model, a vector store, and a retrieval pipeline.
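To make the moving parts concrete, here is a toy sketch of the chunk-then-retrieve step. It is an illustration under loud assumptions: plain term overlap stands in for the embedding model and vector store, and the chunk size, overlap, and function names are made up for the example. A real pipeline would swap in learned embeddings and a proper index.

```python
import re
from collections import Counter


def chunk_words(text, size=50, overlap=10):
    """Split text into windows of `size` words, each sharing `overlap` words with the next."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    words = text.split()
    if not words:
        return []
    step = size - overlap
    last_start = max(len(words) - overlap, 1)
    return [" ".join(words[i:i + size]) for i in range(0, last_start, step)]


def tokenize(text):
    """Lowercase and split into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def top_chunks(query, chunks, k=3):
    """Return up to `k` chunks ranked by how often they contain the query's terms.

    Term overlap is a stand-in for embedding similarity, used here only
    to keep the sketch dependency-free.
    """
    query_terms = set(tokenize(query))
    scored = []
    for index, chunk in enumerate(chunks):
        counts = Counter(tokenize(chunk))
        score = sum(counts[term] for term in query_terms)
        scored.append((score, index, chunk))
    # Highest score first; ties keep document order.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [chunk for score, _, chunk in scored[:k] if score > 0]
```

Only the chunks returned by `top_chunks` would go into the prompt, which is where the extra control comes from. It is also where the extra failure modes come from: a bad chunk boundary or a weak scoring function means the model never sees the relevant passage at all.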
Long context does handle one class of tasks well: those where the signal isn't concentrated in one place and the relationships between parts of the document matter. Examples include reviewing code across an entire repository, analyzing a contract, or summarizing a long research transcript.
The cost dimension is also worth discussing honestly. Long-context inference is expensive, with input tokens adding up fast. A lot of teams have gone through a phase of enthusiasm about long context, done the cost modeling, and quietly landed back on retrieval-based approaches as more economical and predictable.
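The cost modeling itself is simple arithmetic, which is partly why the conclusion tends to land so hard. Here is a back-of-the-envelope sketch. Every number in it is hypothetical: the price per million tokens, the request volume, and the prompt sizes are placeholders to replace with your provider's actual pricing and your own traffic. It also ignores output tokens and any prompt-caching discounts.

```python
def monthly_input_cost(tokens_per_request, requests_per_day,
                       price_per_million_tokens, days=30):
    """Estimate monthly spend on input tokens only, in the price's currency."""
    cost_per_request = tokens_per_request / 1_000_000 * price_per_million_tokens
    return cost_per_request * requests_per_day * days


# All values below are illustrative assumptions, not real pricing.
PRICE_PER_MILLION = 3.00   # hypothetical price per million input tokens
REQUESTS_PER_DAY = 1_000   # hypothetical traffic

# Stuffing most of a large corpus into every prompt.
long_context = monthly_input_cost(800_000, REQUESTS_PER_DAY, PRICE_PER_MILLION)

# Sending only a few retrieved chunks per prompt.
retrieval = monthly_input_cost(8_000, REQUESTS_PER_DAY, PRICE_PER_MILLION)

cost_ratio = long_context / retrieval
```

With these made-up inputs the long-context path comes to about 72,000 a month against about 720 for retrieval. The ratio simply tracks the ratio of prompt sizes, so the gap scales directly with how much smaller your retrieved context is than the full document.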
There's also a latency component; long prompts take longer to process, adding friction for interactive applications. For batch workflows, it matters less, but it's another variable that doesn't appear in benchmark numbers.
The capability is moving forward, with context windows growing and models getting better at using what's in them. There's active research on improving mid-context attention and helping models navigate long inputs more reliably.
Right now, there's a meaningful gap between the spec sheet and what you can depend on in production. Teams doing the most interesting work aren't treating large context windows as a solved input problem; they're being deliberate about what goes in and building evaluation pipelines for long-context failure modes.