
How to Run LLMs Larger than RAM

Machine-Learning Large-Language-Models Linux
Ryan Gibson

It’s easier than ever to run large language models (LLMs) like those behind ChatGPT on your local machine, without relying on third-party services. It’s free and keeps everything private and confidential!

Thanks to regular open-source releases from tech giants like Meta, Google, Microsoft, and Alibaba, powerful models are now widely available for public deployment and use.

In fact, there are impressively competent models that are lightweight enough to run on practically any smartphone. Notably, the recent “on-device” Llama 3.2 releases from Meta are only 1-2 GB in size. However, the larger models require increasingly expensive hardware to run properly.

For example, if I try to run a moderately-sized1 LLM on my budget $300 dev laptop with 8 GB of RAM2, I might get this.

$ ollama run qwen2.5:14b
Error: model requires more system memory (10.9 GiB) than is available (5.9 GiB)

As the message says, all we need to do is add more system memory. But this doesn’t necessarily mean that we have to go out and upgrade our hardware.

“Adding” RAM from your disk: swap space

If your disk is reasonably fast, you can generally offload some memory onto it, a process known as “swap” on Linux (the Windows equivalent is increasing the size of the page file). Swap space lets the kernel temporarily move inactive pages of memory to disk, freeing up RAM for other uses.

For example, these commands will temporarily add 8 GB of usable memory from the disk.3

sudo fallocate -l 8G /extra_swapfile  # create an 8 GB file
sudo chmod 600 /extra_swapfile  # make sure the file is only accessible by the root user
sudo mkswap /extra_swapfile  # initialize the swap file
sudo swapon /extra_swapfile  # activate and start using the swap file
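
Once these commands finish, you can sanity-check that the new swap area is actually in use. Both utilities below ship with essentially every Linux distribution:

swapon --show  # list active swap areas; /extra_swapfile should appear here
free -h  # the "Swap:" total should now be roughly 8 GB larger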

Warning: This is honestly an abuse of swap and should not be regularly relied upon unless you are willing to drastically shorten your disk’s lifespan.
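
If you do go this route and are worried about wear, most SSDs expose usage counters via S.M.A.R.T. that you can compare before and after heavy swapping. Here is a quick sketch using smartmontools; the device path /dev/nvme0 is just an example, and the exact field names depend on your drive:

sudo smartctl -a /dev/nvme0  # dump S.M.A.R.T. data (from the smartmontools package)
# for NVMe drives, fields like "Percentage Used" and "Data Units Written"
# give a rough sense of cumulative wear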

Afterward, you should be able to load the larger models without much trouble. Even on my incredibly cheap laptop, I am able to get 2-3 tokens/s (~100-150 words per minute) on 9B and 14B parameter models.4

This is extremely impressive for a machine that typically only has 5-6 GB of spare RAM!

Sample LLM responses to the prompt “Hi! Please give a brief overview of what LLMs are and how they work.” from Google’s Gemma2 (left) and Alibaba’s Qwen2.5 (right), demonstrating performance at ~0.5-0.7 tokens/s during prompt evaluation and ~2.5-3.2 tokens/s during inference.
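
If you’d like to measure throughput on your own machine, ollama can report per-response timing statistics when run with its --verbose flag (the exact fields printed may vary between versions):

ollama run qwen2.5:14b --verbose
# after each response, ollama prints statistics such as "prompt eval rate"
# and "eval rate", both measured in tokens/s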

Obviously, this comes with the downside of worse performance since the system is thrashing. Each forward pass of the neural network involves shuffling data from disk to RAM and back.5

Simply put, if you need to effectively run an extra 1 GB of file transfers for every word the model generates, that takes a long time compared to having the model fit entirely in RAM.
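
As a rough back-of-the-envelope estimate, assuming (purely for illustration) an SSD that sustains around 500 MB/s of reads:

1 GB of paging per token ÷ 500 MB/s ≈ 2 seconds per token ≈ 0.5 tokens/s

That is in the same ballpark as the prompt-evaluation rates shown above.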

CPU usage during LLM inference with memory thrashing across 8 cores, showing relatively low overall usage (~25% normal usage in green) but high kernel activity (~25% in red) due to frequent paging between RAM and disk.
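
You can watch this paging happen in real time by leaving a monitoring tool running in a second terminal while the model generates. For instance, with vmstat (part of the procps package on most distributions):

vmstat 1  # print memory and paging statistics every second
# while the model is generating, the "si" (swap in) and "so" (swap out)
# columns stay large, confirming data is constantly moving between disk and RAM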

As such, this can be useful in a pinch for experimentation, but you can only really scale this up as much as your patience would allow. Running significant portions of the model off disk necessarily means running the whole process several orders of magnitude slower than usual.

See also and references

  • Mozilla’s llamafile is probably the easiest way to get started with local LLMs. It packages all dependencies and model weights into a single executable, so it just takes a single click to run everything.
  • Ollama is a very popular tool that provides more flexibility and customization for users comfortable with a small amount of extra setup.
  • /r/LocalLLaMA, a Reddit community focused on local LLM usage, news, and experiences related to running LLMs on personal hardware.
  • The Wikipedia article on thrashing, which is the term for this situation where we are constantly paging memory back and forth to disk. This is part of a broader memory management technique called virtual memory.
  • A more in-depth experiment of mine from university on analyzing page faults through a custom Linux kernel module.

  1. While 14 billion parameters might seem absurdly massive for most applications, this is indeed “moderately-sized” in the world of consumer-grade LLMs. Many users work with models in the 70B parameter range, which require at least 40 GB of spare memory across system RAM and GPU VRAM. ↩︎

  2. Technically, this machine has 7 GB of usable system RAM since 1 GB is reserved for the APU, but it doesn’t matter much here. The key part is that this is a very low-end, secondary computer. ↩︎

  3. To remove the swap file, you can simply run sudo swapoff /extra_swapfile and delete it. To make the swap file persistent across boots, you’d add an entry to /etc/fstab. ↩︎

  4. Here, I’m using a slightly quantized version of Qwen2.5:14b, meaning the model’s size has been reduced through lossy compression of its weights. However, I was still able to run the default version at a much slower rate. ↩︎

  5. CPUs generally cannot operate directly on data stored in swap. So, when the program needs data that is currently on disk, the kernel first has to move some other data from RAM to disk to make room, and then load the required data from disk back into RAM. ↩︎
