Physics of Language Models: Part 3.3, Knowledge Capacity Scaling Laws
Zeyuan Allen-Zhu
zeyuanallenzhu@meta.com
Meta / FAIR Labs
Yuanzhi Li
Yuanzhi.Li@mbzuai.ac.ae
Mohamed bin Zayed University of AI
April 7, 2024
(version 1)
Abstract
Scaling laws describe the relationship between the size of language models and their ca-
pabilities. Unlike prior studies that evaluate a model’s capability via loss or benchmarks, we
estimate the number of knowledge bits a model stores. We focus on factual knowledge repre-
sented as tuples, such as (USA, capital, Washington D.C.) from a Wikipedia page. Through
multiple controlled datasets, we establish that language models can store 2 bits of knowledge
per parameter, and no more, even when quantized to int8; such knowledge can be flexibly
extracted for downstream applications. Consequently, a 7B model can store 14B bits of
knowledge, which by our estimation surpasses English Wikipedia and textbooks combined.
More broadly, we present 12 results on how (1) training duration, (2) model architec-
ture, (3) quantization, (4) sparsity constraints such as MoE, and (5) data signal-to-noise ratio
affect a model’s knowledge storage capacity. Notable insights include:
• The GPT-2 architecture, with rotary embedding, matches or even surpasses LLaMA/Mistral
  architectures in knowledge storage, particularly over shorter training durations. This arises
  because LLaMA/Mistral uses GatedMLP, which is less stable and harder to train.
• Prepending training data with domain names (e.g., wikipedia.org) significantly increases
  a model's knowledge capacity. Language models can autonomously identify and prioritize
  domains rich in knowledge, optimizing their storage capacity.
Submitted for Meta internal review on March 14, 2024. We would like to thank Lin Xiao and Yuchen Zhang for
many helpful conversations. We would like to extend special thanks to Ian Clark, Gourab De, Anmol Mann, and Max
Pfeifer from W&B, as well as Lucca Bertoncini, Liao Hu, Caleb Ho, Apostolos Kokolis, and Shubho Sengupta from
Meta FAIR NextSys; Henry Estela, Wil Johnson, Rizwan Hashmi, and Lucas Noah from Meta Cloud Foundation;
without their invaluable support, the extensive experiments in this paper would not have been possible.
1 Introduction
The scaling laws of large language models remain a pivotal area of research, enabling predictions
about the performance of extremely large models through experiments with smaller ones. On the
training-time aspect, established scaling laws [1, 13, 14, 16, 21] characterize the optimal training FLOPs
as a function of model size. However, recent studies [12, 24, 25] challenge these laws, demonstrating that
training smaller models with significantly more FLOPs can yield superior results. While these laws
describe how much time or data is needed to train a model of a certain size, another fundamental
question is: what is the ultimate performance a model can achieve, assuming sufficient training?
Despite the known emergent behaviors of large models [8, 34], there is a lack of principled,
quantitative analysis of how model size affects a model's capacity when it is adequately trained.1
Traditional theory on overparameterization suggests that scaling up model size in sufficiently
trained models can enhance memorization of training data [6], improve generalization error [15,
27, 28], and better fit complex target functions [5, 23]. However, these results often overlook large
constant or polynomial factors, leading to a significant discrepancy from practical outcomes.
In this paper, we introduce a principled framework to examine highly accurate scaling laws
concerning model size versus its knowledge storage capacity. It is intuitive that larger language
models can store more knowledge, but does the total knowledge scale linearly with the model’s
size? What is the exact constant of this scaling? Understanding this constant is crucial for
assessing the efficiency of transformer models in knowledge storage and how various factors (e.g.,
architecture, quantization, training duration, etc.) influence this capacity.
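To make the question concrete, the estimate quoted in the abstract can be written as a worked equation;
the numbers below simply restate the 2 bits/parameter figure established later in the paper, not a new result:
\[
\mathrm{capacity}(N) \;\approx\; 2N \ \text{bits},
\qquad\text{e.g., } N = 7\times 10^{9} \ \Rightarrow\ \mathrm{capacity} \approx 1.4\times 10^{10} \ \text{bits (14B bits)}.
\]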
Knowledge is a, if not the, pivotal component of human intelligence, accumulated over our
extensive history. Large language models like GPT-4 are celebrated not just for their sophisticated
logic but also for their superior knowledge base. GPT-4 is rumored to have over 1T parameters,
but is such a size necessary to store all human knowledge? Could a 10B model, if trained sufficiently on
high-quality data, match GPT-4’s knowledge capacity? Our paper seeks to address these questions.
Knowledge Pieces. Defining “one piece of human knowledge” precisely is challenging. This paper
aims to make progress by focusing on a restricted, yet sufficiently interesting domain. We define a
piece of knowledge as a (name, attribute, value) tuple, e.g., (Anya Forger, birthday, 10/2/1996);
much of the data in world-knowledge benchmarks can be broken down into pieces like this.2
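As a concrete illustration, a piece of knowledge in this restricted sense can be represented as a plain
tuple and rendered into an English sentence. The snippet below is a minimal sketch only; the field names
and templates are ours for illustration, not the paper's actual pipeline.
```python
from collections import namedtuple

# One piece of knowledge in the restricted sense used here: a (name, attribute, value) tuple.
KnowledgePiece = namedtuple("KnowledgePiece", ["name", "attribute", "value"])

# Hypothetical English templates; the paper's actual sentence templates are not reproduced here.
TEMPLATES = {
    "capital": "{name}'s capital is {value}.",
    "birthday": "{name} was born on {value}.",
    "director": "{name} was directed by {value}.",
}

def to_sentence(piece: KnowledgePiece) -> str:
    """Render one knowledge tuple as a plain-English description."""
    return TEMPLATES[piece.attribute].format(name=piece.name, value=piece.value)

print(to_sentence(KnowledgePiece("USA", "capital", "Washington D.C.")))
print(to_sentence(KnowledgePiece("Anya Forger", "birthday", "10/2/1996")))
```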
We generate synthetic knowledge-only datasets by sampling (name, attribute, value) tuples
uniformly at random from a knowledge base and converting them into English descriptions. We
pretrain language models (e.g., GPT-2, LLaMA, Mistral) on these texts from random initialization
with a standard auto-regressive objective, and then estimate the learned knowledge. By varying
the number of knowledge pieces and model sizes, we outline a knowledge capacity scaling law.
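A minimal sketch of this generation step, assuming toy name/attribute/value pools and a single
hypothetical sentence template (the paper's actual vocabularies, templates, and diversity controls are richer):
```python
import random

# Hypothetical pools; the paper's knowledge bases are far larger and more carefully controlled.
NAMES = [f"Person {i}" for i in range(1_000)]
VALUES = {
    "birth city": ["Paris", "Tokyo", "Cairo", "Lima", "Oslo"],
    "employer": ["Acme Corp", "Globex", "Initech", "Umbrella"],
    "birthday": [f"{m:02d}/{d:02d}/{1950 + y}" for m in range(1, 13) for d in range(1, 29) for y in range(50)],
}

def sample_pieces(num_pieces: int, seed: int = 0):
    """Draw (name, attribute, value) tuples uniformly at random from the pools above."""
    rng = random.Random(seed)
    for _ in range(num_pieces):
        attribute = rng.choice(list(VALUES))
        yield rng.choice(NAMES), attribute, rng.choice(VALUES[attribute])

def to_text(name: str, attribute: str, value: str) -> str:
    """Convert one tuple into an English description for auto-regressive pretraining."""
    return f"The {attribute} of {name} is {value}."

# A small knowledge-only corpus; in our setting such text is the model's entire pretraining data.
corpus = [to_text(*piece) for piece in sample_pieces(10_000)]
print(corpus[:3])
```
Because the corpus contains nothing but these tuples, the amount of knowledge a trained model retains
can be measured directly against the generated tuples.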
Our idealized setting, free from irrelevant data, allows for more accurate scaling law compu-
tations — we also discuss how “junk” data affects capacity later in Section 10. In contrast, it is
difficult to quantify real-life knowledge; for instance, if LLaMA-70B outperforms LLaMA-7B by
30% on a benchmark, it doesn’t necessarily mean a tenfold model scaling only boosts capacity
by 30% (see Footnote 1). The synthetic setting also lets us adjust various hyperparameters, like
1There is a rich literature comparing how pretrained models perform on benchmark tasks. Most comparisons are
for different model families trained over different data: if LLaMA-70B is better than Mistral-7B, does the gain come
from its choice of pretrain data, or the architecture difference, or really the size of the model? Some comparisons are
within the same architecture family, e.g., LLaMA-70B scores 63.6% on the world-knowledge benchmark while LLaMA-7B
scores only 48.9% [33]; does this mean increasing model size by 10x increases its capacity only to 130% (= 63.6/48.9)?
Thus, it is highly important to use a more principled framework to study scaling laws in a controlled setting.
2Examples include (Africa, largest country, Sudan) and (It Happened One Night, director, Frank Capra) in Trivi-
aQA [20], or (Teton Dam, collapse date, 06/05/1976) and (USA, Capital, Washington D.C.) in NaturalQuestions [22].