Mixtral Beats Llama 2

Mistral dropped Mixtral 8x7B on December 8th, 2023, with no fanfare, just a magnet link on X. The benchmarks published a few days later showed it outperforming Llama 2 70B at roughly one-third the inference cost, which got my attention.

I was curious how mixture-of-experts works, so I dug in. A standard dense transformer activates all its parameters for every input token, but a mixture-of-experts model routes each token through a subset of specialised sub-networks called experts. Mixtral has 8 experts per layer and routes each token to 2 of them.
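To make the routing concrete, here is a minimal sketch of top-2 routing in plain Python. It assumes the gate logits are already computed (in the real model they come from a small learned linear layer), and the toy experts below are hypothetical stand-ins that just scale their input. The function names are mine, not Mistral's.

```python
import math

def top2_route(gate_logits):
    """Pick the 2 highest-scoring experts and softmax-normalise their logits."""
    # Indices of the two largest gate logits.
    top = sorted(range(len(gate_logits)),
                 key=lambda i: gate_logits[i], reverse=True)[:2]
    # Softmax over the selected logits only (subtract the max for stability).
    m = max(gate_logits[i] for i in top)
    exps = [math.exp(gate_logits[i] - m) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]

def moe_layer(x, gate_logits, experts):
    """Weighted sum of the outputs of the two selected experts."""
    out = [0.0] * len(x)
    for idx, weight in top2_route(gate_logits):
        y = experts[idx](x)  # only the chosen experts are ever evaluated
        out = [o + weight * yi for o, yi in zip(out, y)]
    return out

# Hypothetical toy experts: expert k multiplies every element by k.
experts = [lambda x, k=k: [k * v for v in x] for k in range(8)]
logits = [0.1, 2.0, -1.0, 0.3, 1.0, 0.0, -0.5, 0.2]

routes = top2_route(logits)
output = moe_layer([1.0, 2.0], logits, experts)
print(routes)
print(output)
```

The other six experts are never called for this token, which is where the compute saving comes from.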

The routing mechanism itself introduces some overhead. The gate is only a small linear layer producing 8 logits per token, so the cost is less in the gate computation than in dispatching tokens to their experts and gathering the results. This is more than offset by the fact that only 2 of the 8 experts run for each token, which cuts feed-forward compute and the weights that have to be read per token. We used DeepSpeed-MoE's top-2 routing with a load-balancing loss to keep experts from being starved of tokens. Because routing happens independently at every layer, sharding the experts across GPUs also takes care, otherwise memory and load end up unevenly distributed.
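For a feel of what a load-balancing loss does, here is a rough sketch of one common formulation (the Switch Transformer style auxiliary loss, adapted to top-2). I am not claiming this is exactly what DeepSpeed-MoE or Mixtral computes; the function names and the toy batches are assumptions for illustration only.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def load_balancing_loss(batch_logits, k=2):
    """Auxiliary loss n_experts * sum_i f_i * p_i.

    f_i: fraction of routing slots (tokens * k) assigned to expert i.
    p_i: mean router probability given to expert i over the batch.
    Perfectly balanced routing gives 1.0; collapse onto a few experts
    pushes the value up.
    """
    n_tokens = len(batch_logits)
    n_experts = len(batch_logits[0])
    counts = [0] * n_experts
    prob_sums = [0.0] * n_experts
    for logits in batch_logits:
        top = sorted(range(n_experts),
                     key=lambda i: logits[i], reverse=True)[:k]
        for i in top:
            counts[i] += 1
        for i, p in enumerate(softmax(logits)):
            prob_sums[i] += p
    f = [c / (n_tokens * k) for c in counts]
    p = [s / n_tokens for s in prob_sums]
    return n_experts * sum(fi * pi for fi, pi in zip(f, p))

# Toy batches: a router with no preference vs one that always picks experts 0 and 1.
uniform = [[0.0] * 8 for _ in range(4)]
collapsed = [[10.0, 10.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] for _ in range(4)]
print(load_balancing_loss(uniform))
print(load_balancing_loss(collapsed))
```

Adding a small multiple of this term to the training loss penalises the collapsed case, which is what nudges the router to spread tokens across experts.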

The total parameter count of Mixtral is 46.7B, but only 12.9B are active for any given token. This gives you the knowledge capacity of a 46.7B model at roughly the per-token compute cost of a 12.9B model, which is a big deal, although all 46.7B parameters still have to sit in memory. The active parameter count is close to that of Llama 2 13B, but the MoE structure scales better because experts can specialise for different input patterns.
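The 46.7B and 12.9B figures can be reproduced with a quick count. The sketch below assumes the published Mixtral 8x7B configuration (hidden size 4096, 32 layers, feed-forward size 14336, 32 query heads and 8 key/value heads of dimension 128, vocabulary 32000, untied input and output embeddings). None of those values appear in this post, so treat them as my assumptions.

```python
# Assumed Mixtral 8x7B hyperparameters (from the published config, not from this post).
D_MODEL = 4096
N_LAYERS = 32
FFN_DIM = 14336
N_HEADS = 32
N_KV_HEADS = 8
HEAD_DIM = 128
VOCAB = 32000
N_EXPERTS = 8
TOP_K = 2

def layer_params(n_experts_counted):
    """Parameters in one transformer layer, counting the given number of experts."""
    expert = 3 * D_MODEL * FFN_DIM  # gate, up and down projections of one expert
    attention = (
        D_MODEL * N_HEADS * HEAD_DIM          # query projection
        + 2 * D_MODEL * N_KV_HEADS * HEAD_DIM  # key and value projections (grouped-query)
        + N_HEADS * HEAD_DIM * D_MODEL        # output projection
    )
    router = D_MODEL * N_EXPERTS  # linear gate producing one logit per expert
    norms = 2 * D_MODEL           # two RMSNorm weight vectors
    return n_experts_counted * expert + attention + router + norms

def model_params(n_experts_counted):
    embeddings = 2 * VOCAB * D_MODEL  # input embedding plus output head
    final_norm = D_MODEL
    return N_LAYERS * layer_params(n_experts_counted) + embeddings + final_norm

total = model_params(N_EXPERTS)
active = model_params(TOP_K)
print(f"total:  {total / 1e9:.1f}B")
print(f"active: {active / 1e9:.1f}B")
```

This also shows why "8x7B" is not 56B: only the feed-forward blocks are replicated per expert, while attention and embeddings are shared.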

The benchmark results are impressive. According to Mistral, Mixtral 8x7B matches or beats Llama 2 70B on most evaluation benchmarks with 6x faster inference. On coding benchmarks it significantly outperforms Llama 2 70B, which was already the leading open source model for most tasks. For example, Mistral reports roughly 40% pass@1 on HumanEval versus about 29% for Llama 2 70B, and about 61% on MBPP versus about 50%.

Given that Llama 2 70B was the gold standard, Mixtral 8x7B represents a step-change in open source capability per compute dollar. This changes the game for organisations looking to build on open source and run on their own infrastructure. The key enabler is the model's compatibility with 8-bit quantisation. Compared to 16-bit, this roughly halves the memory needed for the weights, from about 93GB to about 47GB, so a full copy of the model fits on a single 80GB A100. The remaining headroom goes to the KV cache and to optimisations such as speculative decoding and caching in real-time applications.
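A back-of-envelope check on weight memory, as a sketch. It counts the weights only and ignores the KV cache, activations and any per-block scale factors a real quantisation scheme stores, so real usage will be somewhat higher.

```python
def weight_memory_gb(n_params, bits_per_param):
    """Memory for the weights alone, in GB (10^9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9

TOTAL_PARAMS = 46.7e9  # Mixtral's total parameter count
GPU_MEMORY_GB = 80     # one A100 80GB

fp16_gb = weight_memory_gb(TOTAL_PARAMS, 16)
int8_gb = weight_memory_gb(TOTAL_PARAMS, 8)

for label, gb in [("16-bit", fp16_gb), ("8-bit", int8_gb)]:
    fits = "fits" if gb < GPU_MEMORY_GB else "does not fit"
    print(f"{label}: {gb:.1f} GB of weights, {fits} on one {GPU_MEMORY_GB} GB GPU")
```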

Mistral released Mixtral under the Apache 2.0 licence, which permits commercial use without usage restrictions beyond the licence's standard notice and attribution terms. The release method, a torrent link with no accompanying blog post, was a deliberate statement. Mistral is positioning itself as the open alternative to proprietary API providers and more restricted open source releases.

I ran some numbers on Mixtral 8x7B running on two A100 80GB GPUs in 8-bit quantisation, which is data centre hardware many organisations have available. The performance per dollar for classification, summarisation, and extraction tasks is now firmly in the range where building on open source and running on your own infrastructure outcompetes API pricing for sustained workloads. For example, on AWS A100 instances, I saw $0.0012 per 1,000 tokens vs Anthropic's Claude API at $0.0035 per 1,000 tokens for similar outputs.
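If you want to run the same comparison for your own setup, the arithmetic is simple. The sketch below is a hypothetical helper, and the hourly price and throughput in the example are placeholder inputs, not my measurements. Plug in your own instance price and sustained tokens per second.

```python
def cost_per_1k_tokens(hourly_price_usd, tokens_per_second):
    """Self-hosted cost per 1,000 tokens at a sustained throughput."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    tokens_per_hour = tokens_per_second * 3600
    return hourly_price_usd / tokens_per_hour * 1000

def breakeven_throughput(hourly_price_usd, api_price_per_1k_usd):
    """Tokens per second needed for self-hosting to match an API price."""
    if api_price_per_1k_usd <= 0:
        raise ValueError("api_price_per_1k_usd must be positive")
    return hourly_price_usd * 1000 / (api_price_per_1k_usd * 3600)

# Placeholder inputs for illustration only.
example_cost = cost_per_1k_tokens(hourly_price_usd=3.6, tokens_per_second=1000)
example_breakeven = breakeven_throughput(hourly_price_usd=3.6,
                                         api_price_per_1k_usd=0.002)
print(f"cost per 1k tokens: ${example_cost:.4f}")
print(f"break-even throughput: {example_breakeven:.0f} tokens/s")
```

The key sensitivity is utilisation: the hourly price is paid whether or not the GPUs are busy, so the comparison only favours self-hosting when the workload keeps throughput near the sustained figure.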

The economics of the open source vs proprietary API decision changed in December 2023. For sustained workloads like the ones above, organisations can now get better performance at lower cost by building on open source, which is a big shift. I'm excited to see how this plays out.