Table of Contents
- 1. The bottom line: it's almost all about VRAM
- 2. Understand quantization first—it changes everything
- 3. VRAM needed by model size (quick table)
- 4. The context-length / KV-cache trap
- 5. GPUs and Macs in practice (speed guide)
- 6. What you need besides VRAM
- 7. Recommended builds by budget (3 tiers)
- 8. How to tell which model you can run
- Summary
- FAQ
When you want to start with a local LLM, the first worry is usually: "Will it even run on my PC?" The short answer: 90% of the required spec comes down to VRAM (your GPU's memory). Nail that, and you can instantly tell what will run and what won't.
This article lays out a quick VRAM table by model size, a simple formula, the memory trap that grows with context length, realistic speeds per GPU/Mac, and finally recommended builds by budget. Jargon is kept to a minimum so that even a first-timer can figure out "which one should I buy."
It's almost all about VRAM
— It comes down to whether the model fits in memory
VRAM 8–12 GB
7B–14B class. Everyday chat, summarizing, light code. The easiest starting point.
VRAM 24 GB
Up to 32B class. The practical line with a great balance of quality and speed.
40–64 GB+
70B class. Quality approaching mid-tier cloud. Costs rise too.
1. The bottom line: it's almost all about VRAM
PC shopping involves many parts—CPU, GPU, memory—but for local LLMs the single most important thing is VRAM (video memory, the memory on the GPU). The reason is simple: if the whole model fits in VRAM it runs fast and smoothly; if it doesn't, it becomes painfully slow or won't run at all.
💡 In a nutshell: choosing specs for a local LLM goes in this order: "the size of the model you want to run" → "the VRAM it needs" → "a GPU/Mac that meets it." CPU and RAM capacity are secondary.
Apple's M-series chips (Mac) are a special case: thanks to "unified memory," the installed RAM can be used directly as VRAM. So a Mac with lots of memory can run large models even without a dedicated GPU—more on that later.
2. Understand quantization first—it changes everything
Before talking about required VRAM, there's no avoiding quantization. It's a technique that compresses the model to make it lighter, and how much you compress changes the memory need several-fold.
FP16 (uncompressed)
~2 bytes per parameter. Top quality, but eats the most memory. Individuals rarely use it.
Q8 (8-bit)
~1 byte per parameter. About half of FP16. Quality loss is tiny—the "quality-leaning" choice.
Q4 (4-bit)
~0.5–0.7 bytes per parameter. About 1/4 of FP16. A great balance of quality and lightness—the go-to for personal use.
🔑 Rough formula: required VRAM ≈ number of parameters (B) × bytes per parameter. Example: to run a 7B model at Q4, 7 × ~0.6 ≈ ~4–5 GB. Add +10–20% for the KV cache (context, covered next) to be safe.
3. VRAM needed by model size (quick table)
Assuming the most practical Q4 quantization, here are rough VRAM targets by size (including headroom for context). Compare against "your GPU's VRAM" and you'll instantly see your upper limit.
7B–8B class
VRAM ~6–8 GB
Ideal for entry. Chat, summarizing, translation, light code. Achievable on many laptops.
13B–14B class
VRAM ~8–12 GB
A bit smarter answers. The "sweet spot" for mid-range GPUs like the RTX 3060 (12 GB).
32B class
VRAM ~20–24 GB
The upper practical line. The classic single-card target for an RTX 4090 (24 GB).
70B class
VRAM ~40–48 GB+
Serious tier. A high-memory Mac or multiple GPUs is realistic.
Going higher to 100B+ (very large models) needs 128 GB or more—beyond the personal range. Conversely, a tiny 1–3B model runs in around 4 GB, so even a modest PC can get started.
4. The context-length / KV-cache trap
Easy to overlook: memory grows with context length. An LLM keeps the history of the conversation and input in VRAM as a KV cache. The longer the text you handle, the more memory it uses on top of the model itself.
4k
~+0.3 GB on a 7B. Negligible for short questions.
32k
~+2.5 GB on a 7B. Starts to matter for long summaries and chats.
128k
~+10 GB on a 7B. Can exceed the model itself. A caution zone.
📌 Practical tip: "it ran right at the VRAM limit, then crashed when I fed it a long document"—this is why. Estimate your need at the context length you actually use. If you don't handle long documents, just setting a smaller context length frees up memory.
5. GPUs and Macs in practice (speed guide)
Even for the same model, hardware greatly changes speed (tokens generated per second = tok/s). Here are the main options with a rough feel (numbers are guideposts that vary by setup and model).
RTX 3060 (12 GB)
Easy to find used—the entry classic. 7B–14B run comfortably. If cost is the priority, start here.
RTX 4090 (24 GB)
Up to 32B class on a single card. A 7B can exceed 100 tokens/sec. The go-to personal high-end. 70B needs to offload part to the CPU and slows down a lot.
RTX 5090 (32 GB)
More VRAM lets you run 32B at Q8, or a 70B at aggressive quantization on one card. Speed is top-class too.
Apple Mac (M4/M5 Max)
With 64 GB unified memory, even 70B class is possible (speed is modest—around 20–30 tokens/sec on a 70B). Quiet and power-efficient.
CPU only (no GPU)
Small models do run, but slowly. Fine for "just trying it." Daily use really wants a GPU/Mac.
6. What you need besides VRAM
VRAM is the lead, but the supporting cast matters too. Three things to cover at minimum.
System RAM
The catch-all for what doesn't fit in VRAM. 16 GB or more, ideally 32 GB. On a Mac, unified memory counts directly.
Storage (SSD)
A single model is several to tens of GB. If you'll try multiple, keep plenty of free SSD space. NVMe recommended.
Power & cooling
High-end GPUs draw a lot of power and run hot. Leave headroom in power supply and cooling.
7. Recommended builds by budget (3 tiers)
Three patterns that answer "so what should I actually buy?" Pick by use case and budget.
Just trying it: VRAM 8–12 GB
An RTX 3060 (12 GB) class card, or a Mac with 16–24 GB unified memory. 7B–14B class runs, plenty for everyday use. A used GPU is the cheapest way to start.
Using it seriously: VRAM 24 GB
An RTX 4090 (24 GB), or a Mac with 32–48 GB unified memory. 32B class is comfortable, with the best balance of quality and speed. The "just right" choice.
Going for the biggest: 40–64 GB+
An RTX 5090 or multiple GPUs, or a high-end Mac with 64 GB+ unified memory. 70B class approaches mid-tier cloud. Be ready for the cost and power draw.
8. How to tell which model you can run
Not sure which model to pick? See the best local LLM models compared for choices by use, size, and origin.
Check in three steps before you buy or download, and you won't go wrong.
- Check your VRAM (or your Mac's unified memory). This is your ceiling.
- Estimate the rough need with model size (B) × ~0.6 (Q4). Add +10–20% for context.
- Confirm the total fits within your VRAM. If not, pick "one size smaller" or "stronger quantization (Q4 → even lower-bit)."
💡 When unsure, start small: with Ollama or LM Studio, you just pick a model and download. Try a 7B class first, and step up if it feels lacking—that order is safe and reliable.
Summary
The spec you need for a local LLM comes down to three points.
- VRAM is the lead: whether the model fits in memory is everything. A Mac can target large memory via unified memory.
- Quantization and context move the number: at Q4, "size (B) × ~0.6" plus context (+10–20%) is the guide. 7B ≈ 6–8 GB, 32B ≈ 24 GB, 70B ≈ 40 GB+.
- Three tiers by budget: entry (8–12 GB) / standard (24 GB) / serious (40–64 GB+). When unsure, start small and step up gradually.
Once you know the specs, a local LLM gets a lot more approachable. Next, weighing the differences from the cloud, run one on your own machine. The setup steps are covered in how to run a local LLM.
FAQ
Q. Can a regular laptop (no GPU) run a local LLM?
A. Small models (1–3B, or a lightweight 7B) will run, but slowly. It's fine for "trying it out," but for comfortable daily use, a GPU with 8 GB+ VRAM or a Mac with ample unified memory is realistic.
Q. My VRAM is a little short. How can I still run it?
A. Three options: ① choose stronger quantization (a lower-bit build), ② drop to one size smaller model, ③ set a shorter context length. Usually that's enough to fit. You can also offload part to the CPU, but speed drops.
Q. GeForce or Mac—which is better?
A. For speed and expandability, GeForce (NVIDIA GPU). For quiet, power-efficient operation that leverages large memory to run big models, a Mac (unified memory). If you want to handle a 70B class on one machine, a 64 GB+ Mac is a strong option.
Q. How much system RAM do I need?
A. 16 GB or more for system RAM, ideally 32 GB. Note that on a Mac, unified memory doubles as VRAM, so memory capacity directly determines the model size you can run.
Q. So what's a good first machine?
A. For value, a used RTX 3060 (12 GB) for 7B–14B. If budget allows, an RTX 4090 (24 GB) handles up to 32B class on one card and lasts a long time. For Apple fans, a Mac with ample unified memory is the easy route. Start small and step up as needed—that's how you avoid mistakes.