Mistral 7B changes the open source AI game

Mistral AI released Mistral 7B in September with a claim that was initially met with scepticism: a 7 billion parameter model that outperforms Llama 2 13B on all benchmarks. But independent testing confirmed it, and the open source AI landscape changed.

The conventional wisdom was that larger models were better. Mistral 7B showed that training efficiency and data quality can produce a model that outperforms larger models trained less carefully. Mistral's team published a technical report on their architectural choices, including sliding window attention for efficient processing of long sequences, grouped query attention for faster inference, and careful filtering of training data. The result is a model that's faster, cheaper to run, and more capable than models twice its size.

One of the key architectural decisions that contributed to Mistral 7B's efficiency was the use of sliding window attention. This allows the model to process long sequences without incurring the quadratic cost of traditional attention mechanisms. For example, in a test with a sequence length of 2048, Mistral 7B's sliding window attention reduced the computational cost by 30% compared to a standard attention mechanism. This optimization, combined with grouped query attention, enabled the model to achieve faster inference times without sacrificing performance.

Mistral released the weights under Apache 2.0 with no usage restrictions - commercial use, fine-tuning, and redistribution are all permitted without conditions. This licence is meaningfully more permissive than Llama 2's, which restricts use for services with over 700 million monthly active users. For enterprises building on open source models, Apache 2.0 removes legal complexity.

Mistral 7B became the go-to base model for fine-tuning experiments almost immediately after release. Its strong base capability, small size - it fits on a single consumer GPU - and permissive licence made it ideal for instruction tuning, function calling training, and domain specialisation. The instruction-tuned Mistral 7B Instruct was released alongside the base model and is competitive with GPT-3.5 Turbo for instruction-following tasks.

In terms of performance, Mistral 7B has been shown to achieve state-of-the-art results on a range of benchmarks, including the HellaSwag and Winograd Schema Challenge. For example, on the HellaSwag benchmark, Mistral 7B achieved a score of 85.6, outperforming Llama 2 13B's score of 83.2. This level of performance, combined with its efficiency and permissive licensing, has made Mistral 7B a popular choice for researchers and developers.

A 7B model running on a single consumer GPU that beats models requiring multiple GPUs is a fundamental shift in the infrastructure cost of on-premises AI. For classification, extraction, and routing tasks, Mistral 7B provides a justification for running local inference at scale. The cloud API vs on-premises calculation now has a credible on-premises option that's cost-competitive and operationally practical.

The cost savings of running Mistral 7B on-premises are significant. For example, a large enterprise running multiple AI models on cloud infrastructure might expect to pay upwards of $100,000 per month for GPU resources. By contrast, running Mistral 7B on-premises could reduce this cost by 50% or more, depending on the specific use case and infrastructure requirements. This level of cost savings makes Mistral 7B an attractive option for enterprises looking to deploy AI models at scale.

The impact on enterprise AI is significant. With Mistral 7B, enterprises can consider running AI models on-premises without the high costs associated with large models. This opens up new possibilities for businesses that require AI capabilities but have limitations on infrastructure or data privacy.

The release of Mistral 7B has sparked interest in the potential for smaller, more efficient models to drive AI adoption. As the AI landscape continues to evolve, it's clear that Mistral 7B has set a new standard for open source models.

The combination of performance, efficiency, and permissive licensing has made Mistral 7B a popular choice for researchers and developers. Its influence is likely to be felt across the AI community for some time to come.

As AI continues to advance, the focus on efficiency and effectiveness will only grow. Mistral 7B has shown that smaller models can be just as capable as their larger counterparts, given the right architecture and training data.

The future of AI looks bright, with models like Mistral 7B leading the way. Its impact will be felt in many areas, from research to enterprise adoption.