
How to Run LLMs Larger than RAM

Machine-Learning Large-Language-Models Linux
Ryan Gibson

It’s easier than ever to run large language models (LLMs) like those behind ChatGPT on your local machine, without relying on third-party services. It’s free and keeps everything private and confidential!

Thanks to regular open-source releases from tech giants like Meta, Google, Microsoft, and Alibaba, powerful models are now widely available for public deployment and use.

In fact, there are impressively competent models that are lightweight enough to run on practically any smartphone. Notably, the recent “on-device” Llama 3.2 releases from Meta are only 1-2 GB in size. However, the larger models require increasingly expensive hardware to run properly.

For example, if I try to run a moderately-sized1 LLM on my budget $300 dev laptop with 8 GB of RAM2, I might get this.

$ ollama run qwen2.5:14b
Error: model requires more system memory (10.9 GiB) than is available (5.9 GiB)

As the message says, all we need to do is add more system memory. But this doesn’t necessarily mean that we have to go out and upgrade our hardware.

“Adding” RAM from your disk: swap space

If your disk is reasonably fast, you can generally offload some memory onto it, a process known as “swap” on Linux (the Windows equivalent is increasing the size of the page file). Swap space lets the kernel temporarily move inactive pages of memory to disk, freeing up RAM for other uses.

For example, these commands will temporarily add 8 GB of usable memory from the disk.3

sudo fallocate -l 8G /extra_swapfile  # create an 8 GB file
sudo chmod 600 /extra_swapfile  # make sure the file is only accessible by the root user
sudo mkswap /extra_swapfile  # initialize the swap file
sudo swapon /extra_swapfile  # activate and start using the swap file
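
Once these commands finish, you can sanity-check that the new swap area is actually in use. Both utilities below ship with essentially every Linux distribution:

swapon --show  # list active swap areas; /extra_swapfile should appear here
free -h  # the "Swap:" total should now be roughly 8 GB larger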

Warning: This is honestly an abuse of swap and should not be regularly relied upon unless you are willing to drastically shorten your disk’s lifespan.
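
If you do go this route and are worried about wear, most SSDs expose usage counters via S.M.A.R.T. that you can compare before and after heavy swapping. Here is a quick sketch using smartmontools; the device path /dev/nvme0 is just an example, and the exact field names depend on your drive:

sudo smartctl -a /dev/nvme0  # dump S.M.A.R.T. data (from the smartmontools package)
# for NVMe drives, fields like "Percentage Used" and "Data Units Written"
# give a rough sense of cumulative wear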

Afterward, you should be able to load the larger models without much trouble. Even on my incredibly cheap laptop, I am able to get 2-3 tokens/s (~100-150 words per minute) on 9B and 14B parameter models.4

This is extremely impressive for a machine that typically only has 5-6 GB of spare RAM!

Sample LLM responses to the prompt “Hi! Please give a brief overview of what LLMs are and how they work.” from Google’s Gemma2 (left) and Alibaba’s Qwen2.5 (right), demonstrating performance at ~0.5-0.7 tokens/s during prompt evaluation and ~2.5-3.2 tokens/s during inference.
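
If you’d like to measure throughput on your own machine, ollama can report per-response timing statistics when run with its --verbose flag (the exact fields printed may vary between versions):

ollama run qwen2.5:14b --verbose
# after each response, ollama prints statistics such as "prompt eval rate"
# and "eval rate", both measured in tokens/s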

Obviously, this comes with the downside of worse performance since the system is thrashing. Each forward pass of the neural network involves shuffling data from disk to RAM and back.5

Simply put, if you need to effectively run an extra 1 GB of file transfers for every word the model generates, that takes a long time compared to having the model fit entirely in RAM.
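
As a rough back-of-the-envelope estimate, assuming (purely for illustration) an SSD that sustains around 500 MB/s of reads:

1 GB of paging per token ÷ 500 MB/s ≈ 2 seconds per token ≈ 0.5 tokens/s

That is in the same ballpark as the prompt-evaluation rates shown above.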

CPU usage during LLM inference with memory thrashing across 8 cores, showing relatively low overall usage (~25% normal usage in green) but high kernel activity (~25% in red) due to frequent paging between RAM and disk.
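
You can watch this paging happen in real time by leaving a monitoring tool running in a second terminal while the model generates. For instance, with vmstat (part of the procps package on most distributions):

vmstat 1  # print memory and paging statistics every second
# while the model is generating, the "si" (swap in) and "so" (swap out)
# columns stay large, confirming data is constantly moving between disk and RAM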

As such, this can be useful in a pinch for experimentation, but you can only really scale this up as much as your patience would allow. Running significant portions of the model off disk necessarily means running the whole process several orders of magnitude slower than usual.

See also and references

  • Mozilla’s llamafile is probably the easiest way to get started with local LLMs. It packages all dependencies and model weights into a single executable, so it just takes a single click to run everything.
  • Ollama is a very popular tool that provides more flexibility and customization for users comfortable with a small amount of extra setup.
  • /r/LocalLLaMA, a Reddit community focused on local LLM usage, news, and experiences related to running LLMs on personal hardware.
  • The Wikipedia article on thrashing, which is the term for this situation where we are constantly paging memory back and forth to disk. This is part of a broader memory management technique called virtual memory.
  • A more in-depth experiment of mine from university on analyzing page faults through a custom Linux kernel module.

  1. While 14 billion parameters might seem absurdly massive for most applications, this is indeed “moderately-sized” in the world of consumer-grade LLMs. Many users work with models in the 70B parameter range, which require at least 40 GB of spare memory across system RAM and GPU VRAM. ↩︎

  2. Technically, this machine has 7 GB of usable system RAM since 1 GB is reserved for the APU, but it doesn’t matter much here. The key part is that this is a very low-end, secondary computer. ↩︎

  3. To remove the swap file, you can simply run sudo swapoff /extra_swapfile and delete it. To make the swap file persistent across boots, you’d add an entry to /etc/fstab. ↩︎

  4. Here, I’m using a slightly quantized version of Qwen2.5:14b, meaning the model’s size has been reduced through lossy compression of its weights. However, I was still able to run the default version at a much slower rate. ↩︎

  5. CPUs generally cannot operate directly on data stored in swap. So, when the program needs data that is currently on disk, the kernel first has to move some other data from RAM to disk to make room, and then load the required data from disk back into RAM. ↩︎
