Meta’s Llama 3 Surpasses Previous Iterations with Improved Training Data and Techniques
Meta’s latest release, Llama 3, has made significant strides in the field of large language models (LLMs), outperforming other open LLMs and rivaling closed models from OpenAI and Anthropic. The release blog post, however, was sparse on detail and left many questions unanswered. Reading between the lines, here is a summary of the enhancements in Llama 3 compared to its predecessor, Llama 2.
Enhanced Pretraining and Data Quality
Llama 3 was pretrained on more than 7x the data of its predecessor, 15T tokens versus Llama 2’s 2T, on sequences of 8,192 tokens. The scale-up was accompanied by improved data quality, achieved through new filtering methods: heuristic filters, NSFW filters, semantic deduplication approaches, and text classifiers that predict data quality. Llama 2 itself was used to generate synthetic training data for those text-quality classifiers, and extensive experiments were run to find the optimal mix of data sources.
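The post names these filters but says nothing about how they work. As a minimal sketch of what such a pipeline could look like, here is a hypothetical heuristic pass followed by greedy embedding-based semantic deduplication; the thresholds, word-count rules, and embedding source are all illustrative assumptions, not Meta’s actual method:

```python
import numpy as np

def heuristic_filter(doc: str) -> bool:
    """Cheap rule-based checks; thresholds are illustrative, not Meta's."""
    words = doc.split()
    if len(words) < 50:                       # too short to carry signal
        return False
    if len(set(words)) / len(words) < 0.3:    # highly repetitive text
        return False
    return True

def semantic_dedup(embeddings: np.ndarray, threshold: float = 0.95) -> list[int]:
    """Greedy dedup: keep a document only if its embedding has cosine
    similarity below `threshold` with every document kept so far."""
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    kept: list[int] = []
    for i, vec in enumerate(normed):
        if all(vec @ normed[j] < threshold for j in kept):
            kept.append(i)
    return kept

corpus = [" ".join(f"word{i}" for i in range(100)),  # varied: passes both checks
          "spam " * 60]                              # repetitive: filtered out
docs = [d for d in corpus if heuristic_filter(d)]
# in practice the embeddings would come from an encoder model;
# random vectors stand in here so the sketch runs end to end
embeddings = np.random.randn(len(docs), 384)
docs = [docs[i] for i in semantic_dedup(embeddings)]
```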
Model Improvements and Adjustments
Several changes were made to the model itself. An attention mask now prevents self-attention from crossing document boundaries within a packed training sequence, a feature absent from Llama 2 and OpenAI’s GPT-3. The input sequence length was doubled from 4,096 to 8,192 tokens, and a new tokenizer with a 128K-token vocabulary encodes text with up to 15% fewer tokens than Llama 2’s. All model sizes now use grouped query attention (GQA), which Llama 2 applied only to its larger models.
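The mechanics of the document-boundary mask are not spelled out in the post. A minimal sketch of the idea, assuming documents are packed into one training sequence and each token may attend only to earlier tokens of its own document:

```python
import numpy as np

def packed_causal_mask(doc_ids: np.ndarray) -> np.ndarray:
    """True where attention is allowed: token i may attend to token j
    only if j <= i (causal) and both tokens come from the same document."""
    seq_len = doc_ids.shape[0]
    causal = np.tril(np.ones((seq_len, seq_len), dtype=bool))
    same_doc = doc_ids[:, None] == doc_ids[None, :]
    return causal & same_doc

# two documents of lengths 3 and 2 packed into a single 5-token sequence
mask = packed_causal_mask(np.array([0, 0, 0, 1, 1]))
# disallowed positions get -inf added to the attention scores before softmax
```

GQA itself is a standard technique: groups of query heads share a single key/value head, shrinking the KV cache at inference time. A bare NumPy sketch (masking omitted for brevity; the mask above would be applied to `scores`):

```python
import numpy as np

def grouped_query_attention(q, k, v):
    """q: (n_q_heads, seq, dim); k, v: (n_kv_heads, seq, dim).
    Each group of n_q_heads // n_kv_heads query heads shares one KV head."""
    group = q.shape[0] // k.shape[0]
    k = np.repeat(k, group, axis=0)          # broadcast shared KV heads
    v = np.repeat(v, group, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over keys
    return weights @ v

out = grouped_query_attention(
    np.random.randn(8, 5, 64),   # 8 query heads
    np.random.randn(2, 5, 64),   # 2 shared key heads
    np.random.randn(2, 5, 64),   # 2 shared value heads
)
```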
Fine-Tuning and Reward Model
Llama 3 was fine-tuned with a combination of supervised fine-tuning (SFT), rejection sampling (RS), proximal policy optimization (PPO), and direct preference optimization (DPO). Training on preference rankings improved the model’s ability to select correct answers on reasoning tasks.
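The post names these methods without showing the math. As a worked illustration of the DPO objective on a single preference pair, here is a minimal sketch assuming summed token log-probabilities are already available; the numbers and the beta value are illustrative only:

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO on one preference pair: reward the policy for widening the gap
    between chosen and rejected completions relative to the frozen reference.
    Loss is -log(sigmoid(beta * margin))."""
    margin = (logp_chosen - ref_chosen) - (logp_rejected - ref_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# summed token log-probabilities (illustrative numbers, not real model output)
print(dpo_loss(logp_chosen=-12.0, logp_rejected=-20.0,
               ref_chosen=-14.0, ref_rejected=-18.0))
# ≈ 0.51, below log 2 ≈ 0.69, the loss at zero margin
```

Unlike PPO, this formulation needs no separate reward model at training time: the preference signal is baked directly into the loss, which is one reason DPO has become a popular complement to RLHF-style pipelines.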