The AI model landscape has shifted. The focus is no longer only on building the largest model, but on building smaller models that are capable enough for the job. Google's Gemini 1.5 Flash, Meta's Llama 3.1 8B, and Microsoft's Phi-3 Mini are increasingly the models chosen for real production deployments. The reason comes down to simple engineering economics.
Large models like GPT-4 are capable but expensive to run at scale. For features handling millions of requests daily, small differences in per-token cost add up to serious infrastructure expenses. Companies have found that deploying GPT-4 for every user interaction may not be necessary. A model that is good enough and costs less can be a better choice.
Many production tasks fall into that category. Classifying support tickets, summarizing short documents, generating structured output from a template, and answering FAQ-style questions from a knowledge base can all be handled reliably by a well-tuned 8-billion-parameter model. Only the genuinely hard problems require GPT-4-scale reasoning.
In practice, running a large model can cost 5 to 10 times more than running a smaller one, depending on the use case and the efficiency of the deployment. For example, a large retailer such as Amazon might need to process tens of millions of product reviews daily. At that volume, using a smaller model like Meta's Llama 3.1 8B instead of a larger model like GPT-4 could plausibly save millions of dollars in infrastructure costs per year. Many companies are willing to trade some capability for savings of that size.
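The arithmetic behind that claim can be sketched in a few lines. The prices, request volume, and token counts below are hypothetical placeholders chosen only to illustrate a 10x per-token price gap; they are not published rates for any real model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelPricing:
    name: str
    # Hypothetical blended (input + output) price, NOT a real vendor rate.
    usd_per_million_tokens: float


def annual_cost_usd(pricing: ModelPricing, requests_per_day: int,
                    tokens_per_request: int, days: int = 365) -> float:
    """Estimate yearly spend from request volume and per-token price."""
    tokens_per_year = requests_per_day * tokens_per_request * days
    return tokens_per_year / 1_000_000 * pricing.usd_per_million_tokens


# Assumed workload: 10 million requests per day, 500 tokens each.
large = ModelPricing("large-model", usd_per_million_tokens=10.0)
small = ModelPricing("small-model", usd_per_million_tokens=1.0)

large_cost = annual_cost_usd(large, 10_000_000, 500)
small_cost = annual_cost_usd(small, 10_000_000, 500)
savings = large_cost - small_cost

print(f"large: ${large_cost:,.0f}  small: ${small_cost:,.0f}  saved: ${savings:,.0f}")
```

Under these assumed numbers the gap is about $16 million per year. The useful part is the shape of the calculation: at high volume, the per-token price difference dominates everything else, so plugging in real prices and traffic is worth doing before choosing a model.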
Google's Gemini 1.5 Flash, released in May 2024, has become a notable model in production use. It is optimized for speed and cost rather than maximum capability, but it inherits the long context window of the Pro variant. This makes it suitable for document processing, code review on large codebases, and long-form summarization. It is a good example of how a smaller model can be designed around production constraints such as latency and cost, rather than around maximum capability like GPT-4.
Microsoft's Phi-3 Mini has 3.8 billion parameters and, when quantized, can run on a laptop CPU or a mobile device. The model was trained with a focus on reasoning capability relative to its size, using a 'textbook quality' data curation approach. For on-device AI features, edge deployments, or scenarios where network calls are not possible, Phi-3 class models are worth evaluating. Tools like TensorFlow Lite and OpenVINO have also made it easier to deploy small models on a range of hardware, from smartphones to smart home devices.
Smaller models also change how companies approach training and deployment. Because small models are cheaper to fine-tune and serve, a company can run several of them side by side, each optimized for a specific task or use case. For example, a company like Uber might use a smaller model for routine tasks like language translation and reserve a larger model for more complex tasks like conversational dialogue systems. This approach requires careful planning and management, but it can yield significant cost savings and better overall system performance.
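One way to picture using different models for different tasks is a small routing function that sends each request to the cheapest tier able to handle it. The sketch below is illustrative only: the task names, the `Tier` enum, and the 8,000-token limit are assumptions made for the example, not part of any real product or API.

```python
from enum import Enum


class Tier(Enum):
    SMALL = "small"
    LARGE = "large"


# Hypothetical task labels for work a small model handles well.
SIMPLE_TASKS = frozenset({
    "classify_ticket",
    "summarize_short",
    "fill_template",
    "faq_answer",
})


def choose_tier(task_type: str, input_tokens: int,
                small_context_limit: int = 8_000) -> Tier:
    """Pick the small tier for known-simple tasks that fit its context.

    Anything unrecognized or too long falls through to the large tier,
    so the default is the safe (more capable, more expensive) choice.
    """
    if task_type in SIMPLE_TASKS and input_tokens <= small_context_limit:
        return Tier.SMALL
    return Tier.LARGE


print(choose_tier("classify_ticket", 300))     # Tier.SMALL
print(choose_tier("open_ended_dialogue", 300))  # Tier.LARGE
print(choose_tier("faq_answer", 20_000))        # Tier.LARGE
```

Defaulting unknown tasks to the large tier keeps quality safe while the list of simple tasks grows; a real router would likely also use evaluation results or a confidence signal rather than a static set of labels.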
Retrieval-augmented generation (RAG) changes the calculation for small models. If a smaller model is given good context through retrieval, it doesn't need to hold as much world knowledge in its weights. A 7B model with good RAG plumbing can outperform a 70B model answering from memory alone on domain-specific tasks. This is because retrieval supplies the relevant facts at inference time, so the model's main job becomes reading and synthesizing the provided context rather than recalling facts from its parameters. Tools like Faiss and Hugging Face's Transformers library have made it easier to implement retrieval-augmented generation in production systems.
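The retrieval step can be illustrated with a deliberately simplified sketch. It ranks documents by bag-of-words cosine similarity using only the standard library, which is a stand-in for the embedding model and vector index (for example Faiss) that a production system would more likely use. The documents, function names, and prompt wording are all hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase and split into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def retrieve(query: str, documents: list[str], k: int = 2) -> list[str]:
    """Return the k documents most similar to the query."""
    q = Counter(tokenize(query))
    ranked = sorted(documents,
                    key=lambda d: cosine(q, Counter(tokenize(d))),
                    reverse=True)
    return ranked[:k]


def build_prompt(query: str, documents: list[str], k: int = 2) -> str:
    """Assemble a prompt that grounds the model in retrieved context."""
    context = "\n".join(f"- {d}" for d in retrieve(query, documents, k))
    return ("Answer using only the context below.\n\n"
            f"Context:\n{context}\n\nQuestion: {query}")


docs = [
    "Refunds are processed within 5 business days of approval.",
    "Our office is closed on public holidays.",
    "Password resets require access to the registered email address.",
]
prompt = build_prompt("How long do refunds take?", docs, k=1)
print(prompt)
```

The resulting prompt would then be passed to whichever small model is in use. The facts the model needs arrive in the prompt, which is why its parameter count matters less for this kind of domain-specific question.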
The trend is clear. Use the largest model where complexity justifies it, and smaller, faster, cheaper models everywhere else. Teams building the most cost-effective AI systems are those that applied this discipline early.