Meta released Code Llama on August 24th, 2023: a suite of code-specialised models fine-tuned from Llama 2. On coding benchmarks, the largest model, Code Llama 34B, outperforms GPT-3.5 and approaches GPT-4 on some tasks.

What Code Llama brings

Code Llama is available in three sizes: 7B, 13B, and 34B. Each size comes in a base version, an instruction-tuned version, and a Python-specialised version. The instruction-tuned versions accept natural-language descriptions and handle tasks like code generation, explanation, and debugging. The base versions are designed for integration into coding tools: they handle plain code completion, and the 7B and 13B base models additionally support fill-in-the-middle infilling, which is what editor integrations rely on.
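The infilling workflow amounts to a prompt-construction step around the cursor position. The sentinel-token layout below follows the published fill-in-the-middle convention for the Code Llama base models; the exact spacing is an assumption to verify against your tokenizer, which can also insert these tokens for you.

```python
def infill_prompt(prefix: str, suffix: str) -> str:
    """Build a fill-in-the-middle prompt for a Code Llama base model.

    Uses the <PRE>/<SUF>/<MID> sentinel layout (prefix-suffix-middle
    ordering): the model generates the missing middle after <MID>.
    Exact spacing around the sentinels is an assumption here.
    """
    return f"<PRE> {prefix} <SUF>{suffix} <MID>"


# Ask the model to fill in a function body between the code
# before and after the cursor:
prompt = infill_prompt(
    "def remove_non_ascii(s: str) -> str:\n    ",
    "\n    return result\n",
)
```

An editor integration would send this prompt to the model and splice the generated middle back between the prefix and suffix.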

The benchmark position

Code Llama 34B scores 53.7% pass@1 on HumanEval, a benchmark of 164 hand-written Python programming problems. GPT-3.5 scores 48.1% and GPT-4 scores 67%. Code Llama 34B is not GPT-4 quality, but it is clearly ahead of GPT-3.5, which was the most widely deployed model in coding tools before GPT-4 Turbo.
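HumanEval scores are pass@k figures: the probability that at least one of k generated samples passes the problem's unit tests. The standard unbiased estimator can be sketched as:

```python
import math


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: probability that at least one of k
    samples passes, given n total samples of which c passed.

    Equals 1 - C(n-c, k) / C(n, k), computed as a product of ratios
    for numerical stability.
    """
    if n - c < k:
        return 1.0  # every size-k subset must contain a passing sample
    return 1.0 - math.prod((n - c - i) / (n - i) for i in range(k))


# 4 samples, 2 correct: pass@2 = 1 - C(2,2)/C(4,2) = 5/6
print(pass_at_k(4, 2, 2))
```

pass@1 with one sample per problem reduces to the plain fraction of problems solved, which is what the headline numbers above report.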

Enterprise and local deployment

Code Llama 34B runs in 4-bit quantisation on 24GB of VRAM, the amount on a single NVIDIA A10G (the GPU behind AWS g5 instances) or a consumer RTX 3090/4090. For organisations whose data classification policies forbid sending proprietary code to an external API, Code Llama provides a coding assistant that runs entirely on-premises. This unblocks a significant category of enterprise use cases that was previously inaccessible.
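The 24GB figure follows from simple arithmetic: 34 billion weights at 4 bits each is roughly 17GB, leaving headroom for the KV cache, activations, and runtime buffers. A back-of-envelope sketch (the 4GB overhead default is an assumption; it grows with context length and batch size):

```python
def quantized_vram_gb(params_billion: float, bits_per_weight: int,
                      overhead_gb: float = 4.0) -> float:
    """Rough VRAM estimate for an n-bit quantised model.

    Weights take params * bits/8 bytes; overhead_gb is a placeholder
    for KV cache, activations, and runtime buffers (assumed, not
    measured). Uses decimal GB as a coarse approximation.
    """
    weights_gb = params_billion * bits_per_weight / 8
    return weights_gb + overhead_gb


# Code Llama 34B at 4 bits: 17GB of weights plus ~4GB overhead
print(quantized_vram_gb(34, 4))
```

The same arithmetic shows why 16-bit inference does not fit: 34B weights at 16 bits is 68GB before any overhead, well beyond a single 24GB card.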

The developer tooling ecosystem response

Ollama, LM Studio, and oobabooga's text-generation-webui all supported Code Llama within days of release. IDE integrations via Continue and Tabby (the open-source GitHub Copilot alternative) followed. The open-source coding assistant ecosystem is now credible for teams that need local or on-premises deployment.