I've been following Meta's LLaMA releases, and the latest one is significant. On July 18th, 2023, Meta released LLaMA 2 in partnership with Microsoft, marking a major milestone in open source large language models. In my view, the 70B model and its instruction-tuned variants are the first openly available models to be genuinely viable for enterprise use.
So what makes LLaMA 2 different from its predecessor? For starters, LLaMA 1 was released under a non-commercial research licence, which ruled out commercial use. In contrast, LLaMA 2 has a clear licence that permits commercial use; only organisations with more than 700 million monthly active users need to request a separate licence from Meta.
The instruction-tuned variants, known as LLaMA 2-Chat, are first fine-tuned on instruction data and then trained on human preference data using reinforcement learning from human feedback (RLHF). This is the same technique that makes ChatGPT useful for conversation. As a result, LLaMA 2-Chat follows instructions and maintains conversation context, making it a much more useful tool for enterprises than a raw base model.
I was curious about the hardware requirements for running LLaMA 2, so I dug into the numbers. At 8-bit quantisation, the weights of the 70B model take roughly 70GB of VRAM, which fits across two 40GB A100 GPUs with little headroom, or on a single 80GB A100. At 4-bit quantisation the weights shrink to roughly 35GB, which can fit on a single 40GB A100, although the KV cache and activations also need room. This makes the model much more accessible for organisations that want to run it on cloud GPUs.
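A quick way to sanity-check memory figures like these is to multiply the parameter count by the bytes stored per parameter. The sketch below does only that: it assumes a round 70 billion parameters and decimal gigabytes, the function name is my own, and it counts weights alone, so real deployments need extra headroom for the KV cache and activations.

```python
def weight_memory_gb(n_params: float, bits_per_param: int) -> float:
    """Approximate memory for the model weights alone, in decimal gigabytes."""
    bytes_total = n_params * bits_per_param / 8
    return bytes_total / 1e9


N_PARAMS = 70e9  # assumption: a round 70 billion parameters

for bits in (16, 8, 4):
    print(f"{bits}-bit weights: ~{weight_memory_gb(N_PARAMS, bits):.0f} GB")
```

Under these assumptions the weights come to about 140GB at 16-bit, 70GB at 8-bit and 35GB at 4-bit.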
For example, I worked with a team that used LLaMA 2 70B for a chatbot application. We deployed it on a cloud instance with two A100 GPUs, which cut costs significantly compared with using a cloud AI vendor. We also fine-tuned the model with the Hugging Face Transformers library, which made the process much simpler.
One of the key trade-offs to consider when deploying LLaMA 2 is the choice between 8-bit and 4-bit quantisation. 4-bit quantisation roughly halves the VRAM requirement relative to 8-bit, but it also increases the risk of degraded output quality. In our experience, 8-bit quantisation provides a good balance between memory footprint and accuracy, but this may vary depending on the specific use case and requirements.
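To get a feel for why fewer bits means more rounding error, here is a toy experiment. It uses the simplest scheme I know of (one symmetric scale for the whole tensor, round to nearest) on randomly generated stand-in weights. The function names and the weight distribution are my own assumptions. Production 4-bit methods use per-block scales and non-uniform grids, so they typically lose less than this naive version does.

```python
import random


def quantise_absmax(values, bits):
    """Symmetric round-to-nearest quantisation with one shared scale, then dequantise."""
    levels = 2 ** (bits - 1) - 1  # 127 for 8-bit, 7 for 4-bit
    scale = max(abs(v) for v in values) / levels
    return [round(v / scale) * scale for v in values]


def mean_abs_error(original, restored):
    """Average absolute difference between original and dequantised values."""
    return sum(abs(a - b) for a, b in zip(original, restored)) / len(original)


random.seed(0)
# Hypothetical stand-in for a weight tensor: small, roughly Gaussian values.
weights = [random.gauss(0.0, 0.02) for _ in range(10_000)]

err8 = mean_abs_error(weights, quantise_absmax(weights, 8))
err4 = mean_abs_error(weights, quantise_absmax(weights, 4))
print(f"8-bit mean abs error: {err8:.6f}")
print(f"4-bit mean abs error: {err4:.6f}")
```

The 4-bit error comes out well above the 8-bit error because the same value range is covered by 15 grid points instead of 255. How much that matters for output quality has to be measured on your own task.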
Another important consideration is the choice of fine-tuning technique. QLoRA, a parameter-efficient fine-tuning technique, trains small low-rank adapters on top of a frozen 4-bit quantised base model, which allows you to adapt LLaMA 2 70B on a single 48GB GPU with a dataset of just a few thousand examples. This makes it possible for organisations to create domain-specialised LLMs without breaking the bank. However, QLoRA may not always provide the best results, and other approaches, such as the adapter modules available through the AdapterHub framework, may be more suitable for certain use cases.
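The reason this is so cheap is that only the low-rank adapter matrices are trained. A rough count makes the point. The dimensions below are assumptions based on my understanding of the 70B architecture (hidden size 8192, 80 layers, grouped-query attention with a 1024-wide value projection). The rank of 16 and the choice to adapt only the query and value projections are hypothetical, since setups vary.

```python
def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters in one LoRA adapter: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


# Assumed dimensions for illustration; check them against the model config you use.
HIDDEN = 8192
KV_DIM = 1024   # assumed width of the value projection under grouped-query attention
N_LAYERS = 80
RANK = 16       # hypothetical LoRA rank

per_layer = lora_params(HIDDEN, HIDDEN, RANK) + lora_params(HIDDEN, KV_DIM, RANK)
total = per_layer * N_LAYERS

print(f"trainable adapter parameters: {total:,}")
print(f"share of a 70e9-parameter base: {total / 70e9:.4%}")
```

Under these assumptions the adapters add about 33 million trainable parameters, well under 0.1% of the base model, which is why the optimiser state fits alongside a 4-bit copy of the weights.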
One of the key benefits of LLaMA 2 is its cost competitiveness. For tasks where GPT-3.5-level quality is sufficient, LLaMA 2 70B provides a self-hosted option that, at sustained high volume, can be significantly cheaper than sending the same traffic to the GPT-4 API. This is a big deal for organisations that need to run large language models at scale.
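Whether self-hosting actually wins depends on volume, so it is worth working out a break-even point before committing. The sketch below compares a flat monthly GPU bill with a per-token API price. Every number in it is a placeholder I made up for illustration, not a real quote, and the function names are my own. It also ignores engineering time and whether the GPUs can serve the required throughput.

```python
def api_cost_per_month(tokens_per_month: float, price_per_1k_tokens: float) -> float:
    """Monthly API bill for a given token volume."""
    return tokens_per_month / 1000 * price_per_1k_tokens


def self_hosted_cost_per_month(gpu_hourly_rate: float, n_gpus: int, hours: float = 730) -> float:
    """Monthly bill for GPUs that run around the clock, regardless of traffic."""
    return gpu_hourly_rate * n_gpus * hours


def break_even_tokens(gpu_hourly_rate: float, n_gpus: int,
                      price_per_1k_tokens: float, hours: float = 730) -> float:
    """Monthly token volume at which self-hosting costs the same as the API."""
    monthly_gpu_bill = self_hosted_cost_per_month(gpu_hourly_rate, n_gpus, hours)
    return monthly_gpu_bill / price_per_1k_tokens * 1000


# Placeholder inputs, NOT real prices: substitute your own quotes.
GPU_RATE = 2.00    # hypothetical cost per GPU-hour
N_GPUS = 2
API_PRICE = 0.03   # hypothetical cost per 1,000 tokens

volume = break_even_tokens(GPU_RATE, N_GPUS, API_PRICE)
print(f"break-even volume: {volume:,.0f} tokens per month")
```

Below the break-even volume the API is cheaper because the GPUs sit idle. Above it, the flat GPU bill starts to pay off.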
The open source fine-tuning ecosystem around LLaMA 2 is also impressive, and QLoRA is the clearest example of it.
In terms of specific numbers, we've seen LLaMA 2 70B fine-tuned with QLoRA on a dataset of 10,000 examples in just a few hours, producing a model that performed well on its target domain. That compares favourably with approaches such as full fine-tuning, which can require larger datasets and far more computational resources.
As I've been talking to enterprises about LLaMA 2, I've noticed a common pattern emerging. Many organisations are adopting a two-tier architecture, using LLaMA 2 7B or 13B for high-volume classification and routing tasks, and GPT-4 or Claude 2 for complex reasoning and generation tasks. This combination reduces API costs significantly while maintaining quality for the tasks that require frontier models.
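The two-tier pattern can be sketched in a few lines. This is an illustration of the routing idea only, not anyone's production design. The class name, the keyword-based classifier and the stubbed model calls are all hypothetical stand-ins. In a real system the `classify` step would itself be a call to the small self-hosted model, and the two model callables would wrap actual inference endpoints.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class TwoTierRouter:
    small_model: Callable[[str], str]     # stand-in for a self-hosted 7B or 13B model
    frontier_model: Callable[[str], str]  # stand-in for a hosted frontier-model API
    classify: Callable[[str], str]        # returns "simple" or "complex"

    def handle(self, prompt: str) -> Tuple[str, str]:
        """Send simple prompts to the small model and everything else to the frontier model."""
        if self.classify(prompt) == "simple":
            return "small", self.small_model(prompt)
        return "frontier", self.frontier_model(prompt)


def keyword_classifier(prompt: str) -> str:
    """Hypothetical placeholder; a real router would ask the small model to classify."""
    complex_markers = ("explain why", "draft", "analyse")
    text = prompt.lower()
    return "complex" if any(marker in text for marker in complex_markers) else "simple"


router = TwoTierRouter(
    small_model=lambda p: f"[small] label for: {p}",
    frontier_model=lambda p: f"[frontier] answer to: {p}",
    classify=keyword_classifier,
)

print(router.handle("Which department handles refunds?"))
print(router.handle("Draft a reply explaining why the invoice changed."))
```

The saving comes from the fact that most traffic in these deployments is the simple kind, so only a small share of requests ever reaches the paid API.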
The implications of this are significant. With LLaMA 2, organisations can now adapt and host their own large language models rather than relying entirely on cloud AI vendors. This opens up new possibilities for customisation and cost savings, and I'm excited to see how enterprises will use this technology to drive innovation and growth.