"A huge 70B (70-billion-parameter) model runs on a single gaming PC at home, not a rack of data-center GPUs." What makes this possible is quantization — a technique that lowers the numerical precision of a model's weights to dramatically shrink its size and memory needs.
Whereas last time's model distillation "moved knowledge into a separate, smaller model," quantization "makes the same model lighter." This article explains it with a photo-compression analogy, covers how much lighter it gets (the memory numbers), the accuracy trade-off, the main methods (GPTQ / AWQ / GGUF / QLoRA), and how to run it locally — all for beginners.
[Figure: Lower the bit-width and VRAM drops sharply. Example: memory needed for a 70B model.]
* Memory estimates and figures in this article are quoted from public materials (as of June 2026). Actual needs vary by model, format, and context length — read them as directional.
1. What is quantization? Like compressing a photo
Quantization means lowering the numerical precision of a model's weights (parameters). AI weights are usually stored as FP16 or FP32 (16- or 32-bit floating-point numbers), and quantization replaces them with low-bit integers such as INT8 (8-bit) or INT4 (4-bit). Each weight then takes less space, and the whole model gets much lighter.
Think of it as "compressing a high-resolution photo": the original RAW photo (FP16) is beautiful but huge. Compress it to JPEG (INT8/INT4) and the file shrinks to a fraction of the size while looking almost identical. Quantization is the same — sacrifice a little precision for a big reduction in weight. The surprise isn't that it works, but how little you give up.
The number and role of the weights don't change — the vessel (model) stays the same; only the fineness of the representation is made coarser. So knowing the model's structure helps (see how LLM weights work).
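To make "a coarser representation" concrete, here is a minimal sketch of one simple scheme, symmetric (absmax) quantization, applied to a handful of made-up weights. It is an illustration only: GPTQ, AWQ, and GGUF use more elaborate schemes (per-group scales, calibration data), and the function names `quantize` and `dequantize` are hypothetical.

```python
def quantize(weights, bits):
    """Symmetric (absmax) quantization: map floats to signed integers."""
    qmax = 2 ** (bits - 1) - 1                  # 127 for INT8, 7 for INT4
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round each weight to the nearest integer step and clamp to the range.
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate float weights from the stored integers."""
    return [v * scale for v in q]


# Made-up example weights, for illustration only.
weights = [0.42, -1.27, 0.05, 0.88, -0.31]

for bits in (8, 4):
    q, scale = quantize(weights, bits)
    restored = dequantize(q, scale)
    max_err = max(abs(a - b) for a, b in zip(weights, restored))
    print(f"INT{bits}: ints={q} scale={scale:.4f} max error={max_err:.4f}")
```

The list has the same length before and after: the number of weights is unchanged, and only the grid they are rounded to gets coarser. The rounding error is larger at 4 bits than at 8 bits, which is the precision being traded away.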
2. How much lighter? (the memory numbers)
The effect is obvious in numbers. Per weight: FP32 = 4 bytes, FP16 = 2 bytes, INT8 = 1 byte, INT4 = 0.5 bytes. So going to 4-bit uses about one-quarter the memory of FP16, and one-eighth that of FP32.
| Precision | Per weight | 70B model (approx.) | 8B model (approx.) |
|---|---|---|---|
| FP16 (no quantization) | 2 bytes | ~140GB | ~16GB |
| INT8 | 1 byte | ~70GB | ~8GB |
| INT4 | 0.5 bytes | ~35GB | ~4.5-5GB |
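* Estimates. Actual values vary with format, overhead, and context length. The 8B INT4 figure sits above the raw 8B × 0.5 bytes = 4GB, presumably because it already includes some of that overhead.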
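The table follows from one multiplication: parameter count times bytes per weight. The sketch below reproduces that arithmetic. It counts the weights only and assumes 1GB = 10^9 bytes; activation memory, context-length overhead, and format metadata are not modeled, and the name `weight_memory_gb` is hypothetical.

```python
# Bytes needed to store one weight at each precision.
BYTES_PER_WEIGHT = {"FP32": 4.0, "FP16": 2.0, "INT8": 1.0, "INT4": 0.5}


def weight_memory_gb(num_params, precision):
    """Memory for the weights alone, in GB (assuming 1 GB = 10**9 bytes)."""
    return num_params * BYTES_PER_WEIGHT[precision] / 1e9


for name, params in (("70B", 70e9), ("8B", 8e9)):
    row = ", ".join(
        f"{p}: {weight_memory_gb(params, p):.0f}GB"
        for p in ("FP16", "INT8", "INT4")
    )
    print(f"{name} -> {row}")
```

Treat the result as a lower bound: real files and runtime usage come out somewhat higher.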
The impact is huge. At 140GB, a 70B model needs several A100-class data-center GPUs; at 35GB, it comes within reach of a setup an individual can realistically assemble. Quantize an 8B model to 4-bit and it's about 5GB, which fits comfortably in a midrange GPU (8GB VRAM), so you can run it locally on your own PC. This is why quantization is often credited with the "democratization of LLMs."
3. How much accuracy is lost?
The worry is: "won't it get dumber once it's lighter?" The answer is "less than you'd think — but it depends on the bit-width and the task."
🟢 INT8: nearly lossless
For most LLMs, the performance drop is minimal. A safe choice when you want to halve memory while keeping quality.
🟡 INT4: practical with smart methods
For general Q&A and commonsense tasks, degradation is reportedly under 4%. But for math, code generation, and hard reasoning, the loss is more noticeable, so take care.
The accuracy loss shows up technically as "a small rise in perplexity." The key is to "pick the bit-width that fits the task" — INT4 is often plenty for chat or summarization, but for code generation or exact math, consider INT8 or no quantization. Ultimately, evaluate on your own task to confirm it's within tolerance.
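For readers unfamiliar with perplexity: it is the exponential of the average negative log-probability the model assigns to each correct next token, so lower is better. The sketch below computes it for two hypothetical runs. The log-probabilities are invented purely to show the calculation and are not measurements of any real model.

```python
import math


def perplexity(token_log_probs):
    """Perplexity = exp(average negative natural-log probability per token)."""
    avg_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_nll)


# Invented natural-log probabilities, for illustration only (not real results).
fp16_log_probs = [-1.10, -0.40, -2.05, -0.75]
int4_log_probs = [-1.15, -0.45, -2.10, -0.80]

ppl_fp16 = perplexity(fp16_log_probs)
ppl_int4 = perplexity(int4_log_probs)

print(f"FP16 perplexity: {ppl_fp16:.3f}")
print(f"INT4 perplexity: {ppl_int4:.3f}")
print(f"Relative rise:   {(ppl_int4 / ppl_fp16 - 1) * 100:.1f}%")
```

Perplexity is a convenient first check, but it does not replace evaluating the quantized model on your own task, since a small rise in perplexity can hide a larger drop on math or code.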
4. Main methods: GPTQ / AWQ / GGUF / QLoRA
There are several representative quantization methods and formats. Knowing the names helps you choose models and tools without confusion.
| Method / format | Traits | Best for |
|---|---|---|
| GPTQ | The pioneer that achieved 4-bit compression while keeping accuracy. | GPU inference |
| AWQ | Identifies and protects the ~1% most important weights. Often 1-2% more accurate and faster than GPTQ. | Fast, efficient production inference |
| GGUF | The llama.cpp / Ollama format. Choose levels Q2_K-Q8_0; supports CPU+GPU hybrid. | Running locally on your PC |
| QLoRA | Combines a 4-bit base model with LoRA, enabling fine-tuning on a consumer GPU. | Low-cost fine-tuning |
For a beginner trying it locally, using a GGUF model with Ollama is the easiest path. To optimize production GPU inference, AWQ is a strong choice. To fine-tune a big model cheaply, use QLoRA. Remembering just these three pairings is enough.
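As a sketch of what the Ollama path looks like from code, the example below sends one prompt to Ollama's local HTTP endpoint using only the Python standard library. It assumes Ollama is installed and serving on its default port (11434) and that the model has already been pulled. The tag `llama3.1:8b` is just an example; substitute whichever quantized model you have. The helper names `build_request` and `generate` are hypothetical.

```python
import json
import urllib.request

# Ollama's default local endpoint (assumes a default installation).
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(model, prompt):
    """Build a non-streaming generate request for a locally served model."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model, prompt, timeout=120):
    """Send the prompt and return the generated text."""
    with urllib.request.urlopen(build_request(model, prompt), timeout=timeout) as resp:
        return json.loads(resp.read())["response"]


if __name__ == "__main__":
    try:
        # Example tag only; replace with a model you have pulled.
        print(generate("llama3.1:8b", "Explain quantization in one sentence."))
    except OSError as exc:
        print(f"Could not reach Ollama at {OLLAMA_URL}: {exc}")
```

If no server is running, the script prints a short message instead of failing, which makes it safe to try before Ollama is set up.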
5. Quantization vs. distillation and fine-tuning
Quantization is a "model efficiency/optimization" technique alongside distillation and fine-tuning. They're easy to confuse, so note the difference in goals.
⚖️ Quantization
Make the same model's weights lighter. Same model inside, just a coarser representation.
🧑‍🏫 Distillation
Move knowledge into a separate, smaller model. Rebuild the vessel smaller.
🎯 Fine-tuning
Train the model further for a specific use. Roughly the same size; adds domain knowledge.
The three aren't exclusive — they're usually combined. For example, "quantize a student model that was distilled smaller, to fit it on a phone," or, as with QLoRA, "fine-tune on a quantized base." They stack.
6. How to start and pick the bit-width
No tricky implementation needed. Many already-quantized models are distributed, so you can just download and use them. When unsure, pick by this guide.
To try locally first, use GGUF (Ollama)
Run a quantized model with Ollama in one command. Just touching it is the fastest way to learn.
Pick the bit-width by your VRAM
Tight on VRAM? INT4 (Q4). Have room and want quality? INT8 (Q8). General use is often fine on Q4.
Judge precision by the use case
For code generation or exact math, avoid INT4 and use INT8+. For chat and summarization, INT4 is comfortable.
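The guide above can be sketched as a small decision function. This is a rule of thumb under stated assumptions, not a definitive sizing tool: it estimates weights at about 1 byte (Q8) or 0.5 bytes (Q4) each, and the `headroom_gb` default of 1.5GB for context and runtime overhead is an assumed placeholder, as is the function name `pick_quant_level`.

```python
def pick_quant_level(vram_gb, params_billion, precision_sensitive=False,
                     headroom_gb=1.5):
    """Return "Q8", "Q4", or None, following the rule of thumb in the guide.

    precision_sensitive: True for code generation or exact math, where the
    guide recommends avoiding 4-bit.
    headroom_gb: assumed allowance for context and runtime overhead.
    """
    q8_need = params_billion * 1.0 + headroom_gb   # ~1 byte per weight
    q4_need = params_billion * 0.5 + headroom_gb   # ~0.5 bytes per weight
    if vram_gb >= q8_need:
        return "Q8"                                # room to spare: take quality
    if vram_gb >= q4_need and not precision_sensitive:
        return "Q4"                                # tight VRAM, general use
    return None                                    # does not fit under this rule


print(pick_quant_level(vram_gb=8, params_billion=8))
print(pick_quant_level(vram_gb=8, params_billion=8, precision_sensitive=True))
print(pick_quant_level(vram_gb=24, params_billion=8))
```

A result of `None` means the model does not fit under these assumptions; the usual options are a smaller model or CPU+GPU hybrid execution with GGUF.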
Summary
Quantization is the key enabler that turns a giant AI into something light enough to run on your own machine. Let's recap.
Key takeaways
- ⚖️ Lower weight precision to shrink (FP16→INT8→INT4). Same idea as photo compression.
- 📉 About one-quarter the memory of FP16 at 4-bit. 70B from 140GB→35GB; 8B ~5GB on a consumer GPU.
- 🎯 Small accuracy loss. INT8 nearly lossless; INT4 under 4% for general use (mind math/code).
- 🛠️ Methods: GPTQ / AWQ / GGUF (Ollama) / QLoRA. GGUF is easiest locally.
- 🔀 Different from distillation/FT: lighten the same vessel / move to a smaller vessel / add domain knowledge.
"Keep the smartness, drop only the weight." Quantization is one of the most practical single steps for making AI accessible. Start by running a Q4 model locally. For a related technique, see model distillation; for the foundation, LLM weights.
FAQ
Q. Does quantization make the model dumber?
A. Less than you'd think. INT8 is nearly lossless, and even INT4 reportedly degrades under 4% on general Q&A and commonsense tasks. But the gap is more noticeable for math, code generation, and hard reasoning, so pick the bit-width to match the use case.
Q. What are Q4 / Q8, and which should I choose?
A. They're GGUF quantization levels. The number is roughly the bits used per weight, so smaller numbers are lighter (coarser). Tight on VRAM, pick Q4; for quality with room to spare, Q8. For general use like chat or summarization, Q4 is often comfortable.
Q. Should I use quantization or distillation?
A. Different goals. To lighten a model you already have, quantize it; to create a brand-new smaller dedicated model, use distillation. They're often combined — quantizing a distilled small model further is common.
Q. Do I need to quantize models myself?
A. Usually not. Major models are already distributed in quantized form and can be downloaded and used right away via tools like Ollama. Quantizing yourself is only for custom models or special requirements.