Mistral released Mixtral 8x7B on December 11th as a torrent with no announcement and no blog post, just a magnet link posted on X. Benchmarks showed it outperforming Llama 2 70B at roughly one-third the inference cost.

How mixture-of-experts works

A standard dense transformer activates all of its parameters for every input token. A mixture-of-experts model instead routes each token through a subset of specialised sub-networks called experts. In Mixtral, each layer's feed-forward block is replaced by 8 experts, and a small gating network routes each token to 2 of them; the attention layers are shared across experts, which is why the total comes to 46.7B parameters rather than 8 × 7B. Only 12.9B parameters are active for any given token, so you get the knowledge capacity of a 46.7B model at roughly the inference cost of a 12.9B model.
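The routing step can be sketched in a few lines. This is an illustrative toy, not Mixtral's actual implementation: the gating weights, dimensions, and stand-in "experts" below are all made up, and a real expert is a full feed-forward block rather than a single matrix.

```python
import numpy as np

def top2_route(x, gate_w, experts):
    """Route one token vector through the top-2 of 8 experts
    (Mixtral-style top-k gating; names and shapes are illustrative)."""
    logits = gate_w @ x                       # one score per expert
    top2 = np.argsort(logits)[-2:]            # indices of the 2 highest-scoring experts
    weights = np.exp(logits[top2])
    weights /= weights.sum()                  # softmax over just the chosen pair
    # Only 2 of the 8 expert networks execute for this token.
    return sum(w * experts[i](x) for i, w in zip(top2, weights))

rng = np.random.default_rng(0)
d = 16
x = rng.normal(size=d)                        # one token's hidden state
gate_w = rng.normal(size=(8, d))              # gating network: 8 expert scores
# Each "expert" here is a single matrix standing in for a feed-forward block.
experts = [(lambda W: (lambda v: W @ v))(rng.normal(size=(d, d)))
           for _ in range(8)]
y = top2_route(x, gate_w, experts)
print(y.shape)  # (16,)
```

The key point is that the other six experts contribute nothing to this token's forward pass, which is where the compute saving comes from.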

The benchmark results

Mixtral 8x7B matches or beats Llama 2 70B on most evaluation benchmarks while running inference roughly 6 times faster. On coding benchmarks it significantly outperforms Llama 2 70B. Given that Llama 2 70B was already the leading open-source model for most tasks, Mixtral 8x7B represents a step change in open-source capability per compute dollar.

The fully open release

Mistral released Mixtral under the Apache 2.0 licence: completely free for commercial use, no restrictions. The release method, a torrent link with no blog post, was a deliberate statement. Mistral is positioning itself as the open alternative to both proprietary API providers and the more restricted open source releases.

Enterprise implications

Mixtral 8x7B runs on two A100 80GB GPUs with 8-bit quantisation. That is data centre hardware many organisations already have available. The performance per dollar for classification, summarisation, and extraction tasks is now firmly in the range where building on open source and running on your own infrastructure outcompetes API pricing for sustained workloads. The economics of the open-source-versus-proprietary-API decision changed in December 2023.
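A back-of-the-envelope check shows why two A100 80GB cards suffice. This counts weights only; the KV cache, activations, and framework overhead come on top, which is why you want the headroom of 160 GB rather than a single 80 GB card.

```python
# Rough memory arithmetic for Mixtral 8x7B under 8-bit quantisation
# (weights only; KV cache and activations add overhead on top).
total_params = 46.7e9           # all experts are resident in memory
bytes_per_param = 1             # 8-bit quantisation: 1 byte per parameter
weights_gb = total_params * bytes_per_param / 1e9

print(f"weights: {weights_gb:.1f} GB")      # ~46.7 GB
print(f"fits on 2x A100 80GB: {weights_gb < 2 * 80}")
```

Note that although only 12.9B parameters are active per token, all 46.7B must be loaded: routing decisions differ per token, so every expert has to be resident.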