Meta released LLaMA 2 on 18 July 2023 in partnership with Microsoft. The 70B model and its instruction-tuned variants represent the first genuinely enterprise-viable open source large language model.
What makes LLaMA 2 different
LLaMA 1's weights were distributed under a non-commercial research licence, which ruled out commercial use. LLaMA 2 ships with an explicit licence that permits commercial use for organisations with fewer than 700 million monthly active users. The instruction-tuned variants (LLaMA 2-Chat) are trained on human preference data using reinforcement learning from human feedback, the same technique that makes ChatGPT useful for conversation. The result is a model that follows instructions and maintains conversation context.
The 70B model in context
LLaMA 2 70B's weights occupy roughly 70GB in 8-bit quantisation, spanning two 40GB A100 GPUs; in 4-bit quantisation they drop to roughly 35GB, fitting on a single 40GB A100 (the KV cache and activations need additional headroom on top of the weights). Cloud instances for running this model are widely available, and the cost for a sustained inference workload is significantly below GPT-4 API pricing. For tasks where GPT-3.5 quality suffices, LLaMA 2 70B provides a self-hosted option that is cost-competitive for sustained use.
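The memory arithmetic above is simple enough to sketch directly. A minimal back-of-the-envelope helper (weights only; real deployments need extra headroom for the KV cache, activations, and framework overhead):

```python
def weight_memory_gb(params_billion: float, bits_per_param: int) -> float:
    """Approximate memory footprint of the model weights alone, in GB.

    Excludes KV cache, activations, and runtime overhead, which can
    add several GB depending on batch size and context length.
    """
    total_bytes = params_billion * 1e9 * bits_per_param / 8
    return total_bytes / 1e9

# LLaMA 2 70B:
print(weight_memory_gb(70, 8))   # 70.0 GB in 8-bit
print(weight_memory_gb(70, 4))   # 35.0 GB in 4-bit
```

The same helper makes clear why the 7B and 13B variants run comfortably on single consumer GPUs even in 8-bit.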
The fine-tuning ecosystem
The open source fine-tuning ecosystem around LLaMA 2 is substantial. QLoRA, a parameter-efficient fine-tuning technique, lets you adapt LLaMA 2 70B on a single 48GB GPU with a dataset of a few thousand examples. Creating a domain-specialised LLM from LLaMA 2 is now within reach for organisations that previously would have needed a cloud AI vendor contract.
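To see why LoRA-style techniques are so cheap, it helps to count the trainable parameters. A low-rank adapter on a d_in × d_out weight matrix trains only two small matrices, A (d_in × r) and B (r × d_out). The sketch below is purely illustrative arithmetic; the layer count and hidden size match LLaMA 2 70B's published architecture, but which projections you adapt and the rank are configuration choices, not fixed by the model:

```python
def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters added by one LoRA adapter:
    A is (d_in x rank), B is (rank x d_out)."""
    return d_in * rank + rank * d_out

# Illustrative configuration: adapt the four attention projections
# in each of LLaMA 2 70B's 80 layers (hidden size 8192) at rank 16.
hidden, layers, rank = 8192, 80, 16
trainable = 4 * layers * lora_params(hidden, hidden, rank)

print(f"{trainable:,} trainable parameters")          # ~84 million
print(f"{100 * trainable / 70e9:.3f}% of 70B total")  # a small fraction
```

Training ~0.1% of the parameters (with the frozen base model held in 4-bit, as QLoRA does) is what brings the hardware requirement down to a single 48GB GPU.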
Enterprise adoption pattern
The emerging enterprise pattern for LLaMA 2 is a two-tier architecture: LLaMA 2 7B or 13B for high-volume classification and routing tasks where cost per request matters, and GPT-4 or Claude 2 for complex reasoning and generation tasks. The combination reduces API costs significantly while maintaining quality for the tasks that require frontier models.
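The two-tier pattern amounts to a routing layer in front of two model backends. A minimal sketch, assuming hypothetical client functions `call_llama_13b` and `call_gpt4` (stubs here; in practice these would wrap your self-hosted inference server and a frontier-model API):

```python
# Task types cheap enough to route to the self-hosted model.
# The set membership is an assumption for illustration; real systems
# often use a small classifier or heuristics to decide the tier.
CHEAP_TASKS = {"classify", "route", "extract_label"}

def call_llama_13b(prompt: str) -> tuple[str, str]:
    # Stub standing in for a self-hosted LLaMA 2 13B endpoint.
    return ("llama-2-13b", prompt)

def call_gpt4(prompt: str) -> tuple[str, str]:
    # Stub standing in for a frontier-model API call.
    return ("gpt-4", prompt)

def dispatch(task_type: str, prompt: str) -> tuple[str, str]:
    """Send high-volume simple tasks to the cheap self-hosted tier,
    everything else to the frontier model."""
    if task_type in CHEAP_TASKS:
        return call_llama_13b(prompt)
    return call_gpt4(prompt)

print(dispatch("classify", "Is this ticket billing or tech support?"))
print(dispatch("reasoning", "Draft a migration plan for our schema."))
```

The cost win comes from volume: if most requests are classification or routing, only the small residue of hard tasks pays frontier-model prices.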