LLM in a flash:
Efficient Large Language Model Inference with Limited Memory
Keivan Alizadeh, Iman Mirzadeh, Dmitry Belenko, Karen Khatamifard,
Minsik Cho, Carlo C Del Mundo, Mohammad Rastegari, Mehrdad Farajtabar§
Apple
Abstract
Large language models (LLMs) are central to modern natural language processing, delivering exceptional performance in various tasks. However, their intensive computational and memory requirements present challenges, especially for devices with limited DRAM capacity. This paper tackles the challenge of efficiently running LLMs that exceed the available DRAM capacity by storing the model parameters on flash memory and bringing them to DRAM on demand. Our method involves constructing an inference cost model that harmonizes with flash memory behavior, guiding us to optimize in two critical areas: reducing the volume of data transferred from flash and reading data in larger, more contiguous chunks. Within this flash-memory-informed framework, we introduce two principal techniques. First, “windowing” strategically reduces data transfer by reusing previously activated neurons, and second, “row-column bundling”, tailored to the sequential data access strengths of flash memory, increases the size of data chunks read from flash memory. These methods collectively enable running models up to twice the size of the available DRAM, with a 4-5x increase in inference speed on CPU and a 20-25x increase on GPU compared to naive loading approaches. Our integration of sparsity awareness, context-adaptive loading, and a hardware-oriented design paves the way for effective inference of LLMs on devices with limited memory.
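As a rough illustration of how these two ideas fit together, the minimal Python sketch below is not the authors' implementation; the flash_reader callable, the neuron-level storage granularity, and the default window of five tokens are assumptions made for illustration only. Recently used FFN neurons stay resident in DRAM, only newly required neurons are read from flash, and each neuron's up-projection column and down-projection row are stored as one contiguous bundle so a single read retrieves both.

# Hypothetical sketch of windowing + row-column bundling (illustration only).
# `flash_reader(neuron_id)` is an assumed callable returning the bundled
# (up-projection column, down-projection row) of one FFN neuron in a single
# contiguous flash read.

class WindowedFFNCache:
    def __init__(self, flash_reader, window=5):
        self.flash_reader = flash_reader
        self.window = window     # number of recent tokens whose neurons stay resident
        self.history = []        # per-token sets of active neuron ids
        self.dram = {}           # neuron id -> bundled weights kept in DRAM

    def load_for_token(self, active_neurons):
        """Return weights for the neurons predicted active for the current token."""
        resident = set(self.dram)

        # Windowing: only neurons not already resident require a flash read.
        for n in set(active_neurons) - resident:
            # Row-column bundling: one sequential read yields both the
            # up-projection column and the down-projection row of neuron n.
            self.dram[n] = self.flash_reader(n)

        # Slide the window and evict neurons unused by the last `window` tokens.
        self.history.append(set(active_neurons))
        if len(self.history) > self.window:
            self.history.pop(0)
        still_needed = set().union(*self.history)
        for n in resident - still_needed:
            del self.dram[n]

        return {n: self.dram[n] for n in active_neurons}

In the paper, which neurons are active is predicted per token and transfer sizes are chosen with the flash-aware cost model; the sketch only captures the caching pattern that makes repeated activations free of additional flash traffic.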
Primary Author: kalizadehvahid@apple.com
Major Contribution: imirzadeh@apple.com
Major Contribution: d_belenko@apple.com
§Senior Author: farajtabar@apple.com

[Figure 1: Inference latency of 1 token when half the memory of the model is available. Bar chart comparing Naive vs. Ours for Falcon 7B (CPU), OPT 6.7B (CPU), and OPT 6.7B (GPU); y-axis: Inference Latency (ms); bars broken down into Compute, Load From Flash, and Memory Management.]

1 Introduction

In recent years, large language models (LLMs), such as GPT-3 (Brown et al., 2020), OPT (Zhang et al., 2022b), and PaLM (Chowdhery et al., 2022), have demonstrated strong performance across a wide range of natural language tasks.
However, the unprecedented capabilities of these models come with substantial computational and memory requirements for inference. LLMs can contain hundreds of billions or even trillions of parameters, making it challenging to load and run them efficiently, especially on resource-constrained devices. Currently, the standard approach is to load the entire model into DRAM for inference (Rajbhandari et al., 2021; Aminabadi et al., 2022). However, this severely limits the maximum model size that can be run. For example, a 7 billion parameter model requires over 14 GB of memory just to load the parameters in half-precision floating-point format, exceeding the capabilities of most edge devices.
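The 14 GB figure follows directly from the parameter count and the 2-byte half-precision format,

$7 \times 10^{9}\ \text{parameters} \times 2\ \text{bytes/parameter} = 1.4 \times 10^{10}\ \text{bytes} = 14\ \text{GB},$

before counting activations, the KV cache, or any runtime overhead.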
To address this limitation, we propose to store the model parameters on flash memory, which is at least an order of magnitude larger than DRAM. Then, during inference, we directly and cleverly load the required parameters from the flash memory, avoiding the need to fit the entire model in DRAM. Our methodology is built on top of recent works that have shown LLMs exhibit a high degree of sparsity in the FeedForward Network (FFN) layers, with models like OPT (Zhang et al.,