The race to build the biggest model has quietly given way to a different competition: who can build the smallest model that is still actually useful. Google Gemini 1.5 Flash, Meta Llama 3.1 8B, and Microsoft Phi-3 Mini are all winning real production deployments right now, and the reasons come down to simple engineering economics.

The cost problem with large models

GPT-4 class models are extraordinarily capable. They are also expensive to run at scale. If you are building a feature that handles millions of requests per day, even small differences in per-token cost compound into serious infrastructure spend. When companies ran the numbers on deploying GPT-4 for every user interaction, many of them came back with a question: does this actually need GPT-4, or does it need a model that is good enough and costs a tenth as much?
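The compounding is easy to see with back-of-the-envelope arithmetic. The traffic numbers and per-token prices below are hypothetical placeholders, not quoted rates; substitute your provider's current pricing and your own request volumes.

```python
# Back-of-the-envelope cost comparison for a high-volume feature.
# All figures below are ASSUMED for illustration, not real pricing.

REQUESTS_PER_DAY = 2_000_000
TOKENS_PER_REQUEST = 1_500  # prompt + completion, assumed average

# Hypothetical blended $ per 1M tokens: frontier model vs. small model
FRONTIER_PER_M = 10.00
SMALL_PER_M = 1.00

def daily_cost(price_per_m_tokens: float) -> float:
    """Total daily spend in dollars for the assumed traffic."""
    total_tokens = REQUESTS_PER_DAY * TOKENS_PER_REQUEST
    return total_tokens / 1_000_000 * price_per_m_tokens

frontier = daily_cost(FRONTIER_PER_M)
small = daily_cost(SMALL_PER_M)
print(f"frontier: ${frontier:,.0f}/day, small: ${small:,.0f}/day, "
      f"difference: ${(frontier - small) * 365:,.0f}/year")
```

At a 10x price ratio, the gap between the two tiers is measured in millions of dollars per year at this traffic level, which is why the "does it really need the big model?" question gets asked.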

For most tasks, good enough wins. Classifying support tickets, summarising short documents, generating structured output from a template, answering FAQ-style questions from a knowledge base: a well-tuned 8 billion parameter model handles these reliably. You only need GPT-4 scale reasoning for the genuinely hard problems.
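In practice this becomes a routing decision in front of the model call. Here is a minimal sketch; the model names and the task taxonomy are illustrative stand-ins, not real endpoints, and a production router might classify the task itself rather than rely on a hand-maintained set.

```python
# Minimal sketch of tiered model routing. Names are hypothetical.

SMALL_MODEL = "small-8b"      # cheap, fast tier
LARGE_MODEL = "frontier-xl"   # expensive, maximum-capability tier

# Task types an 8B-class model handles reliably (per the article's list)
SMALL_MODEL_TASKS = {
    "classify_ticket",
    "summarise_short_doc",
    "fill_template",
    "faq_answer",
}

def pick_model(task_type: str) -> str:
    """Route routine tasks to the cheap model, everything else up a tier."""
    return SMALL_MODEL if task_type in SMALL_MODEL_TASKS else LARGE_MODEL

# e.g. pick_model("classify_ticket") routes to the small tier,
# while an unrecognised, harder task falls through to the large one.
```

The design choice worth noting: defaulting unknown task types to the large model trades some cost for safety, since misrouting a hard task to a small model fails visibly while the reverse only wastes money.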

Gemini 1.5 Flash

Google released Gemini 1.5 Flash in May 2024 and it has become one of the most interesting models in production use. It is optimised for speed and cost rather than maximum capability, but it inherits the long context window from the Pro variant. You can throw 1 million tokens at Flash for a fraction of what Pro costs. For document processing, code review on large codebases, and long-form summarisation, that context window plus lower cost is a genuinely attractive combination.
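One practical benefit of a window that large: many documents no longer need a chunking pipeline at all. A quick pre-flight check like the sketch below decides whether to send the whole document or fall back to splitting; the 4-characters-per-token estimate is a rough heuristic assumption, so use the actual tokenizer when precision matters.

```python
# Sketch: decide whether a document fits a long-context window before
# building a chunking pipeline. The chars-per-token ratio is a rough
# assumption for English-like text, not a tokenizer.

CONTEXT_WINDOW = 1_000_000   # tokens, Gemini 1.5 Flash class
SAFETY_MARGIN = 0.9          # leave headroom for instructions and output

def estimate_tokens(text: str) -> int:
    """Crude token estimate: ~4 characters per token."""
    return len(text) // 4

def fits_in_context(text: str) -> bool:
    """True if the document fits within the window, with headroom."""
    return estimate_tokens(text) <= CONTEXT_WINDOW * SAFETY_MARGIN
```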

Phi-3 Mini from Microsoft

Microsoft's Phi-3 Mini is 3.8 billion parameters. It runs comfortably on a laptop CPU or a mobile device. The model was trained with a focus on reasoning capability relative to its size, using a "textbook quality" data curation approach rather than simply scaling up. For on-device AI features, edge deployments, or any scenario where you cannot make a network call, Phi-3 class models are worth evaluating seriously.
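The claim that a 3.8B-parameter model runs on a laptop follows from simple footprint arithmetic. The sketch below counts weights only and ignores KV cache and runtime overhead, so treat it as a lower bound.

```python
# Rough memory-footprint arithmetic for a 3.8B-parameter model,
# weights only (KV cache and runtime overhead are ignored).

PARAMS = 3.8e9

def weights_gib(bits_per_param: float) -> float:
    """Size of the model weights in GiB at a given quantisation level."""
    return PARAMS * bits_per_param / 8 / 2**30

for bits, label in [(16, "fp16"), (8, "int8"), (4, "4-bit")]:
    print(f"{label:>5}: {weights_gib(bits):.1f} GiB")
```

At 4-bit quantisation the weights come in under 2 GiB, which is what makes CPU-only laptops and mobile devices plausible targets.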

The RAG angle

Retrieval-augmented generation changes the small model calculation significantly. If you give a smaller model good context through retrieval, it does not need to hold as much world knowledge in its weights. A 7B model with good RAG plumbing can outperform a 70B model answering from memory alone on domain-specific tasks. That insight is driving a lot of the architecture decisions I am seeing in production systems right now.
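The pattern itself is compact. The sketch below uses a toy word-overlap scorer where a production system would use embeddings and a vector index, and `call_small_model` is a hypothetical stand-in for whatever LLM client you use.

```python
# Minimal sketch of the RAG pattern: retrieve relevant snippets, then
# hand them to a small model as context. The scorer is a toy; real
# systems use embeddings and a vector index.

def score(query: str, doc: str) -> int:
    """Count words shared between query and document (toy relevance)."""
    return len(set(query.lower().split()) & set(doc.lower().split()))

def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    """Return the k documents with the highest overlap score."""
    return sorted(docs, key=lambda d: score(query, d), reverse=True)[:k]

def build_prompt(query: str, docs: list[str]) -> str:
    """Stuff retrieved context into a grounded prompt."""
    context = "\n\n".join(retrieve(query, docs))
    return (f"Answer using only the context below.\n\n"
            f"Context:\n{context}\n\nQuestion: {query}")

# answer = call_small_model(build_prompt(query, knowledge_base))
```

The point of the pattern is exactly the one above: the domain knowledge lives in the retrieved context, not in the model's weights, so the model only needs to be good at reading and synthesis.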

The trend is clear. Use the largest model where the complexity justifies it. Use smaller, faster, cheaper models everywhere else. The teams building the most cost-effective AI systems are the ones who applied this discipline early.