Physics of Language Models: Part 3.3, Knowledge Capacity Scaling Laws
Zeyuan Allen-Zhu
zeyuanallenzhu@meta.com
Meta / FAIR Labs
Yuanzhi Li
Yuanzhi.Li@mbzuai.ac.ae
Mohamed bin Zayed University of AI
April 7, 2024
(version 1)
Abstract
Scaling laws describe the relationship between the size of language models and their ca-
pabilities. Unlike prior studies that evaluate a model’s capability via loss or benchmarks, we
estimate the number of knowledge bits a model stores. We focus on factual knowledge repre-
sented as tuples, such as (USA, capital, Washington D.C.) from a Wikipedia page. Through
multiple controlled datasets, we establish that language models can store 2 bits of knowledge
per parameter, and no more, even when quantized to int8; such knowledge can be flexibly
extracted for downstream applications. Consequently, a 7B model can store 14B bits of
knowledge, which by our estimation surpasses English Wikipedia and textbooks combined.
More broadly, we present 12 results on how (1) training duration, (2) model architec-
ture, (3) quantization, (4) sparsity constraints such as MoE, and (5) data signal-to-noise ratio
affect a model’s knowledge storage capacity. Notable insights include:
• The GPT-2 architecture, with rotary embedding, matches or even surpasses LLaMA/Mistral
  architectures in knowledge storage, particularly over shorter training durations. This arises
  because LLaMA/Mistral uses GatedMLP, which is less stable and harder to train.
• Prepending training data with domain names (e.g., wikipedia.org) significantly increases
  a model's knowledge capacity. Language models can autonomously identify and prioritize
  domains rich in knowledge, optimizing their storage capacity.
Submitted for Meta internal review on March 14, 2024. We would like to thank Lin Xiao and Yuchen Zhang for
many helpful conversations. We would like to extend special thanks to Ian Clark, Gourab De, Anmol Mann, and Max
Pfeifer from W&B, as well as Lucca Bertoncini, Liao Hu, Caleb Ho, Apostolos Kokolis, and Shubho Sengupta from
Meta FAIR NextSys; Henry Estela, Wil Johnson, Rizwan Hashmi, and Lucas Noah from Meta Cloud Foundation;
without their invaluable support, the extensive experiments in this paper would not have been possible.
1 Introduction
The scaling laws of large language models remain a pivotal area of research, enabling predictions
about the performance of extremely large models through experiments with smaller ones. On the
training-time aspect, established scaling laws [1, 13, 14, 16, 21] characterize the optimal training FLOPs
as a function of model size. However, recent studies [12, 24, 25] challenge these laws, demonstrating that
training smaller models with significantly more FLOPs can yield superior results. While these laws
describe how much time or data is needed to train a model of a certain size, another fundamental
question is: what is the ultimate performance a model can achieve, assuming sufficient training?
Despite the known emergent behaviors of large models [8, 34], there is a lack of principled,
quantitative analysis of how model size affects a model's capacity when it is adequately trained.1
Traditional theory on overparameterization suggests that scaling up model size in sufficiently
trained models can enhance memorization of training data [6], improve generalization error [15,
27, 28], and better fit complex target functions [5, 23]. However, these results often overlook large
constant or polynomial factors, leading to a significant discrepancy from practical outcomes.
In this paper, we introduce a principled framework to examine highly accurate scaling laws
concerning model size versus its knowledge storage capacity. It is intuitive that larger language
models can store more knowledge, but does the total knowledge scale linearly with the model’s
size? What is the exact constant of this scaling? Understanding this constant is crucial for
assessing the efficiency of transformer models in knowledge storage and how various factors (e.g.,
architecture, quantization, training duration, etc.) influence this capacity.
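To make the question concrete, the estimate quoted in the abstract can be written as a worked equation;
the numbers below simply restate the 2 bits/parameter figure established later in the paper, not a new result:
\[
\mathrm{capacity}(N) \;\approx\; 2N \ \text{bits},
\qquad\text{e.g., } N = 7\times 10^{9} \ \Rightarrow\ \mathrm{capacity} \approx 1.4\times 10^{10} \ \text{bits (14B bits)}.
\]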
Knowledge is a, if not the, pivotal component of human intelligence, accumulated over our
extensive history. Large language models like GPT-4 are celebrated not just for their sophisticated
logic but also for their superior knowledge base. GPT-4 is rumored to have over 1T parameters,
but is such a size necessary to store all human knowledge? Could a 10B model, if trained sufficiently on
high-quality data, match GPT-4’s knowledge capacity? Our paper seeks to address these questions.
Knowledge Pieces. Defining “one piece of human knowledge” precisely is challenging. This paper
aims to make progress by focusing on a restricted, yet sufficiently interesting domain. We define a
piece of knowledge as a (name, attribute, value) tuple, e.g., (Anya Forger, birthday, 10/2/1996);
much of the data in world-knowledge benchmarks can be broken down into pieces like this.2
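As a concrete illustration, a piece of knowledge in this restricted sense can be represented as a plain
tuple and rendered into an English sentence. The snippet below is a minimal sketch only; the field names
and templates are ours for illustration, not the paper's actual pipeline.
```python
from collections import namedtuple

# One piece of knowledge in the restricted sense used here: a (name, attribute, value) tuple.
KnowledgePiece = namedtuple("KnowledgePiece", ["name", "attribute", "value"])

# Hypothetical English templates; the paper's actual sentence templates are not reproduced here.
TEMPLATES = {
    "capital": "{name}'s capital is {value}.",
    "birthday": "{name} was born on {value}.",
    "director": "{name} was directed by {value}.",
}

def to_sentence(piece: KnowledgePiece) -> str:
    """Render one knowledge tuple as a plain-English description."""
    return TEMPLATES[piece.attribute].format(name=piece.name, value=piece.value)

print(to_sentence(KnowledgePiece("USA", "capital", "Washington D.C.")))
print(to_sentence(KnowledgePiece("Anya Forger", "birthday", "10/2/1996")))
```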
We generate synthetic knowledge-only datasets by sampling (name, attribute, value) tuples
uniformly at random from a knowledge base and converting them into English descriptions. We
pretrain language models (e.g., GPT-2, LLaMA, Mistral) on these texts from random initialization
with a standard auto-regressive objective, and then estimate the learned knowledge. By varying
the number of knowledge pieces and model sizes, we outline a knowledge capacity scaling law.
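A minimal sketch of this generation step, assuming toy name/attribute/value pools and a single
hypothetical sentence template (the paper's actual vocabularies, templates, and diversity controls are richer):
```python
import random

# Hypothetical pools; the paper's knowledge bases are far larger and more carefully controlled.
NAMES = [f"Person {i}" for i in range(1_000)]
VALUES = {
    "birth city": ["Paris", "Tokyo", "Cairo", "Lima", "Oslo"],
    "employer": ["Acme Corp", "Globex", "Initech", "Umbrella"],
    "birthday": [f"{m:02d}/{d:02d}/{1950 + y}" for m in range(1, 13) for d in range(1, 29) for y in range(50)],
}

def sample_pieces(num_pieces: int, seed: int = 0):
    """Draw (name, attribute, value) tuples uniformly at random from the pools above."""
    rng = random.Random(seed)
    for _ in range(num_pieces):
        attribute = rng.choice(list(VALUES))
        yield rng.choice(NAMES), attribute, rng.choice(VALUES[attribute])

def to_text(name: str, attribute: str, value: str) -> str:
    """Convert one tuple into an English description for auto-regressive pretraining."""
    return f"The {attribute} of {name} is {value}."

# A small knowledge-only corpus; in our setting such text is the model's entire pretraining data.
corpus = [to_text(*piece) for piece in sample_pieces(10_000)]
print(corpus[:3])
```
Because the corpus contains nothing but these tuples, the amount of knowledge a trained model retains
can be measured directly against the generated tuples.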
Our idealized setting, free from irrelevant data, allows for more accurate scaling law compu-
tations — we also discuss how “junk” data affects capacity later in Section 10. In contrast, it is
difficult to quantify real-life knowledge; for instance, if LLaMA-70B outperforms LLaMA-7B by
30% on a benchmark, it doesn’t necessarily mean a tenfold model scaling only boosts capacity
by 30% (see Footnote 1). The synthetic setting also lets us adjust various hyperparameters, like
1There is a rich literature comparing how pretrained models perform on benchmark tasks. Most comparisons are
for different model families trained over different data: if LLaMA-70B is better than Mistral-7B, does the gain come
from its choice of pretrain data, or the architecture difference, or really the size of the model? Some comparisons are
within the same architecture family, e.g., LLaMA-70B scores 63.6% on the world-knowledge benchmark while LLaMA-7B
scores only 48.9% [33]; does this mean increasing model size by 10x increases its capacity only to 130% (= 63.6/48.9)?
Thus, it is highly important to use a more principled framework to study scaling laws in a controlled setting.
2Examples include (Africa, largest country, Sudan) and (It Happened One Night, director, Frank Capra) in Trivi-
aQA [20], or (Teton Dam, collapse date, 06/05/1976) and (USA, Capital, Washington D.C.) in NaturalQuestions [22].