Physics of Language Models: Part 3.3, Knowledge Capacity Scaling Laws

Zeyuan Allen-Zhu (zeyuanallenzhu@meta.com), Meta / FAIR Labs
Yuanzhi Li (Yuanzhi.Li@mbzuai.ac.ae), Mohamed bin Zayed University of AI

April 7, 2024 (version 1)∗

Abstract

Scaling laws describe the relationship between the size of language models and their capabilities. Unlike prior studies that evaluate a model’s capability via loss or benchmarks, we estimate the number of knowledge bits a model stores. We focus on factual knowledge represented as tuples, such as (USA, capital, Washington D.C.) from a Wikipedia page. Through multiple controlled datasets, we establish that language models can and only can store 2 bits of knowledge per parameter, even when quantized to int8, and such knowledge can be flexibly extracted for downstream applications. Consequently, a 7B model can store 14B bits of knowledge, surpassing the English Wikipedia and textbooks combined, based on our estimation.

More broadly, we present 12 results on how (1) training duration, (2) model architecture, (3) quantization, (4) sparsity constraints such as MoE, and (5) data signal-to-noise ratio affect a model’s knowledge storage capacity. Notable insights include:

• The GPT-2 architecture, with rotary embedding, matches or even surpasses LLaMA/Mistral architectures in knowledge storage, particularly over shorter training durations. This arises because LLaMA/Mistral uses GatedMLP, which is less stable and harder to train.

• Prepending training data with domain names (e.g., wikipedia.org) significantly increases a model’s knowledge capacity. Language models can autonomously identify and prioritize domains rich in knowledge, optimizing their storage capacity.

∗Submitted for Meta internal review on March 14, 2024. We would like to thank Lin Xiao and Yuchen Zhang for many helpful conversations. We would like to extend special thanks to Ian Clark, Gourab De, Anmol Mann, and Max Pfeifer from W&B, as well as Lucca Bertoncini, Liao Hu, Caleb Ho, Apostolos Kokolis, and Shubho Sengupta from Meta FAIR NextSys; Henry Estela, Wil Johnson, Rizwan Hashmi, and Lucas Noah from Meta Cloud Foundation; without their invaluable support, the extensive experiments in this paper would not have been possible.
1 Introduction

The scaling laws of large language models remain a pivotal area of research, enabling predictions about the performance of extremely large models through experiments with smaller ones. On the training-time side, established scaling laws [1, 13, 14, 16, 21] discuss the optimal training flops versus model size. However, recent studies [12, 24, 25] challenge these laws, demonstrating that training smaller models with significantly more flops can yield superior results.

While these laws describe how much time/data is needed to train a model of a certain size, another fundamental question is: what is the ultimate performance a model can achieve, assuming sufficient training? Despite the known emergent behaviors in large models [8, 34], there is a lack of a principled, quantitative analysis of how model size impacts its capacity when adequately trained.¹ Traditional theory on overparameterization suggests that scaling up model size in sufficiently trained models can enhance memorization of training data [6], improve generalization error [15, 27, 28], and better fit complex target functions [5, 23]. However, these results often overlook large constant or polynomial factors, leading to a significant discrepancy from practical outcomes.

In this paper, we introduce a principled framework to examine highly accurate scaling laws concerning model size versus its knowledge storage capacity. It is intuitive that larger language models can store more knowledge, but does the total knowledge scale linearly with the model’s size? What is the exact constant of this scaling? Understanding this constant is crucial for assessing the efficiency of transformer models in knowledge storage and how various factors (e.g., architecture, quantization, training duration) influence this capacity.

Knowledge is a, if not the, pivotal component of human intelligence, accumulated over our extensive history. Large language models like GPT-4 are celebrated not just for their sophisticated logic but also for their superior knowledge base. Despite rumors of GPT-4 having over 1T parameters, is it necessary to store all human knowledge? Could a 10B model, if trained sufficiently with high-quality data, match GPT-4’s knowledge capacity? Our paper seeks to address these questions.

Knowledge Pieces. Defining “one piece of human knowledge” precisely is challenging. This paper aims to make progress by focusing on a restricted, yet sufficiently interesting domain. We define a piece of knowledge as a (name, attribute, value) tuple, e.g., (Anya Forger, birthday, 10/2/1996); much of the data in world knowledge benchmarks can be broken down into pieces like this.²

We generate synthetic knowledge-only datasets by generating (name, attribute, value) tuples uniformly at random from a knowledge base and converting them into English descriptions. We pretrain language models (e.g., GPT-2, LLaMA, Mistral) on these texts using a standard auto-regressive objective from random initialization, and “estimate” the learned knowledge. By varying the number of knowledge pieces and model sizes, we outline a knowledge capacity scaling law. Our idealized setting, free from irrelevant data, allows for more accurate scaling law computations — we also discuss how “junk” data affects capacity later in Section 10.
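To make the synthetic setup concrete, below is a minimal sketch of how such a tuple corpus could be generated and rendered into English. The name pools, attribute list, and sentence templates are illustrative placeholders, not the paper’s actual data-generation pipeline.

    import random

    # Illustrative pools; the paper's actual vocabularies and sizes differ.
    FIRST_NAMES = ["Anya", "Carlos", "Mei", "Ivan", "Fatima"]
    LAST_NAMES = ["Forger", "Santos", "Chen", "Petrov", "Khan"]
    ATTRIBUTES = {
        "birthday": lambda rng: f"{rng.randint(1, 12)}/{rng.randint(1, 28)}/{rng.randint(1900, 2000)}",
        "birth city": lambda rng: rng.choice(["Princeton", "Lagos", "Osaka", "Lyon"]),
        "employer": lambda rng: rng.choice(["Meta", "UN", "CERN", "Toyota"]),
    }
    TEMPLATES = [
        "{name}'s {attr} is {value}.",
        "The {attr} of {name} is {value}.",
    ]

    def generate_tuples(num_people: int, rng: random.Random):
        """Sample (name, attribute, value) tuples uniformly at random.
        (A real setup would ensure the sampled names are unique.)"""
        tuples = []
        for _ in range(num_people):
            name = f"{rng.choice(FIRST_NAMES)} {rng.choice(LAST_NAMES)}"
            for attr, sampler in ATTRIBUTES.items():
                tuples.append((name, attr, sampler(rng)))
        return tuples

    def render(tuples, rng: random.Random):
        """Convert each tuple into an English sentence for auto-regressive pretraining."""
        return [rng.choice(TEMPLATES).format(name=n, attr=a, value=v) for n, a, v in tuples]

    rng = random.Random(0)
    corpus = render(generate_tuples(num_people=3, rng=rng), rng)
    print("\n".join(corpus))

In this sketch each sentence encodes exactly one sampled tuple, so the corpus contains no knowledge beyond the tuples themselves.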
In contrast, it is difficult to quantify real-life knowledge; for instance, if LLaMA-70B outperforms LLaMA-7B by 30% on a benchmark, it does not necessarily mean that a tenfold model scaling only boosts capacity by 30% (see Footnote 1). The synthetic setting also lets us adjust various hyperparameters, like

¹There is a rich literature comparing how pretrained models perform on benchmark tasks. Most comparisons are for different model families trained over different data: if LLaMA-70B is better than Mistral-7B, does the gain come from its choice of pretraining data, the architecture difference, or really the size of the model? Some comparisons are among the same architecture, such as LLaMA-70B scoring 63.6% on the world knowledge benchmark while LLaMA-7B scores only 48.9% [33]; does this mean increasing model size by 10x increases its capacity only to 130% = 63.6/48.9? Thus, it is highly important to use a more principled framework to study scaling laws in a controlled setting.

²Examples include (Africa, largest country, Sudan) and (It Happened One Night, director, Frank Capra) in TriviaQA [20], or (Teton Dam, collapse date, 06/05/1976) and (USA, capital, Washington D.C.) in NaturalQuestions [22].
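As a back-of-the-envelope illustration of the numbers quoted above: the headline 2 bits per parameter, the 7B-model figure from the abstract, and the benchmark-score ratio from Footnote 1 can be reproduced as follows. The pool size V, tuple count N, and the simple N·log2(V) count are assumptions for intuition only, not the paper’s formal bit-complexity measure.

    import math

    # Simplified sketch of the arithmetic in the text, not the paper's estimator.
    BITS_PER_PARAM = 2                  # the paper's headline capacity estimate
    params_7b = 7e9
    print(f"7B-model capacity ~ {BITS_PER_PARAM * params_7b:.1e} bits")  # ~1.4e10, i.e., 14B bits

    # Footnote 1: a benchmark-score ratio is a poor proxy for capacity scaling.
    llama70b, llama7b = 0.636, 0.489
    print(f"benchmark score ratio = {llama70b / llama7b:.2f}")           # ~1.30, despite ~10x the parameters

    # If N values are each drawn uniformly at random from a pool of size V, the
    # tuples jointly carry about N * log2(V) bits (ignoring name/attribute
    # entropy; a crude count for intuition only).
    N, V = 10_000_000, 2**20
    print(f"{N:,} uniform values from a pool of {V:,} ~ {N * math.log2(V):.1e} bits")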