LLM in a flash: Efficient Large Language Model Inference with Limited Memory

Keivan Alizadeh∗, Iman Mirzadeh†, Dmitry Belenko‡, Karen Khatamifard, Minsik Cho, Carlo C Del Mundo, Mohammad Rastegari, Mehrdad Farajtabar§

Apple

∗ Primary Author: kalizadehvahid@apple.com
† Major Contribution: imirzadeh@apple.com
‡ Major Contribution: d_belenko@apple.com
§ Senior Author: farajtabar@apple.com

Abstract

Large language models (LLMs) are central to modern natural language processing, delivering exceptional performance in various tasks. However, their intensive computational and memory requirements present challenges, especially for devices with limited DRAM capacity. This paper tackles the challenge of efficiently running LLMs that exceed the available DRAM capacity by storing the model parameters on flash memory but bringing them on demand to DRAM. Our method involves constructing an inference cost model that harmonizes with the flash memory behavior, guiding us to optimize in two critical areas: reducing the volume of data transferred from flash and reading data in larger, more contiguous chunks. Within this flash memory-informed framework, we introduce two principal techniques. First, “windowing” strategically reduces data transfer by reusing previously activated neurons, and second, “row-column bundling”, tailored to the sequential data access strengths of flash memory, increases the size of data chunks read from flash memory. These methods collectively enable running models up to twice the size of the available DRAM, with a 4-5x and 20-25x increase in inference speed compared to naive loading approaches on CPU and GPU, respectively. Our integration of sparsity awareness, context-adaptive loading, and a hardware-oriented design paves the way for effective inference of LLMs on devices with limited memory.

[Figure 1: Inference latency of 1 token when half the memory of the model is available. Bar chart comparing naive loading with our method for Falcon 7B (CPU), OPT 6.7B (CPU), and OPT 6.7B (GPU); latency (ms) is broken down into compute, load from flash, and memory management.]

1 Introduction

In recent years, large language models (LLMs), such as GPT-3 (Brown et al., 2020), OPT (Zhang et al., 2022b), and PaLM (Chowdhery et al., 2022), have demonstrated strong performance across a wide range of natural language tasks. However, the unprecedented capabilities of these models come with substantial computational and memory requirements for inference. LLMs can contain hundreds of billions or even trillions of parameters, making it challenging to load and run them efficiently, especially on resource-constrained devices. Currently, the standard approach is to load the entire model into DRAM for inference (Rajbhandari et al., 2021; Aminabadi et al., 2022). However, this severely limits the maximum model size that can be run. For example, a 7 billion parameter model requires over 14GB of memory just to load the parameters in half-precision floating point format (2 bytes per parameter), exceeding the capabilities of most edge devices.

To address this limitation, we propose to store the model parameters on flash memory, which is at least an order of magnitude larger than DRAM. Then, during inference, we directly and cleverly load the required parameters from the flash memory, avoiding the need to fit the entire model in DRAM. Our methodology is built on top of recent works that have shown LLMs exhibit a high degree of sparsity in the FeedForward Network (FFN) layers, with models like OPT (Zhang et al.,
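To make the abstract's two techniques concrete, the sketch below illustrates the general idea in plain Python. It is a minimal illustration under assumed simplifications, not the implementation evaluated in the paper: the FFN sizes, the NumPy array standing in for flash storage, the Python dict standing in for the DRAM cache, and the function names (read_bundle, step) are all invented for this example.

import numpy as np

# Toy sizes; real models are far larger. All names and sizes here are illustrative.
D_MODEL, D_FF, WINDOW = 64, 256, 4

rng = np.random.default_rng(0)
w_up = rng.standard_normal((D_MODEL, D_FF), dtype=np.float32)    # column i feeds neuron i
w_down = rng.standard_normal((D_FF, D_MODEL), dtype=np.float32)  # row i reads neuron i

# "Row-column bundling": store the i-th up-projection column and the i-th
# down-projection row back to back, so one contiguous read per neuron suffices.
# A NumPy array stands in for flash storage in this sketch.
flash = np.concatenate([w_up.T, w_down], axis=1)  # shape (D_FF, 2 * D_MODEL)

def read_bundle(neuron_id: int) -> np.ndarray:
    """One contiguous 'flash' read returning both halves of one neuron's weights."""
    return flash[neuron_id].copy()

dram_cache: dict[int, np.ndarray] = {}  # neuron id -> bundle currently resident in DRAM
recent_active: list[set[int]] = []      # active neuron sets of the last WINDOW tokens

def step(predicted_active: set[int]) -> None:
    """'Windowing': per token, load only missing neurons and evict stale ones."""
    missing = predicted_active.difference(dram_cache)  # neurons not yet resident in DRAM
    for nid in missing:
        dram_cache[nid] = read_bundle(nid)

    recent_active.append(predicted_active)             # slide the window forward
    if len(recent_active) > WINDOW:
        recent_active.pop(0)
    keep = set().union(*recent_active)
    for nid in list(dram_cache):                       # evict neurons outside the window
        if nid not in keep:
            del dram_cache[nid]

    print(f"token: loaded {len(missing):3d} bundles from flash, {len(dram_cache):3d} resident")

# Consecutive tokens tend to activate overlapping neuron sets; simulate that overlap.
for _ in range(8):
    active = set(rng.choice(D_FF, size=32, replace=False).tolist()) | set(range(8))
    step(active)

Because the two halves of each neuron sit adjacent on flash, a cache miss costs one sequential read of 2 × D_MODEL values rather than two scattered reads, matching the abstract's emphasis on larger, more contiguous chunks; the per-token output shows how reuse across the sliding window shrinks the number of bundles actually fetched.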