What Is Quantization? Shrinking AI Models to Run Them on Your Own Machine

Contents

1. What is quantization? Like compressing a photo
2. How much lighter? (the memory numbers)
3. How much accuracy is lost?
4. Main methods: GPTQ / AWQ / GGUF / QLoRA
5. vs distillation and fine-tuning
6. How to start and pick the bit-width
Summary
FAQ

"A huge 70B (70-billion-parameter) model runs on a single gaming PC at home, not a rack of data-center GPUs." What makes this possible is quantization — a technique that lowers the numerical precision of a model's weights to dramatically shrink its size and memory needs.

Whereas last time's model distillation "moved knowledge into a separate, smaller model," quantization "makes the same model lighter." This article explains it with a photo-compression analogy, covers how much lighter it gets (the memory numbers), the accuracy trade-off, the main methods (GPTQ / AWQ / GGUF / QLoRA), and how to run it locally — all for beginners.

Figure: Quantization shrinks a model by lowering precision. Lower the bit-width, and VRAM drops sharply. Example (memory needed for a 70B model): FP16 ~140GB, INT8 ~70GB, INT4 ~35GB. Roughly 4x less memory at 4-bit, runs on a consumer GPU, with a small accuracy drop.

* Memory estimates and figures in this article are quoted from public materials (as of June 2026). Actual needs vary by model, format, and context length — read them as directional.

1. What is quantization? Like compressing a photo

Quantization means lowering the numerical precision of a model's weights (parameters). AI weights are usually stored as FP16/FP32 (16/32-bit floating-point numbers), and quantization replaces them with low-bit integers like INT8 (8-bit) or INT4 (4-bit). Each weight then takes less space, and the whole model gets much lighter.
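To make the idea concrete, here is a minimal sketch of the simplest possible scheme: round-to-nearest with one shared scale per list of weights (often called absmax quantization). The function names and the sample weights are made up for illustration, and real methods such as GPTQ or AWQ are considerably more elaborate than this.

```python
def quantize_absmax(weights, bits=8):
    """Map floats to signed integers in [-qmax, qmax] using one shared scale."""
    qmax = 2 ** (bits - 1) - 1                  # 127 for INT8, 7 for INT4
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    quantized = [round(w / scale) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate floats from the stored integers."""
    return [q * scale for q in quantized]


# Made-up example weights (an assumption, not taken from any real model)
weights = [0.12, -0.5, 0.33, 0.9, -0.07]

for bits in (8, 4):
    q, scale = quantize_absmax(weights, bits)
    restored = dequantize(q, scale)
    worst = max(abs(w - r) for w, r in zip(weights, restored))
    print(f"INT{bits}: ints={q} scale={scale:.4f} worst error={worst:.4f}")
```

Only the small integers and the single scale need to be stored. The rounding error is at most half a step (scale / 2), which is why fewer bits means a coarser, lossier representation.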

Think of it as "compressing a high-resolution photo": the original RAW photo (FP16) is beautiful but huge. Compress it to JPEG (INT8/INT4) and the file shrinks to a fraction of the size while looking almost identical. Quantization is the same — sacrifice a little precision for a big reduction in weight. The surprise isn't that it works, but how little you give up.

The number and role of the weights don't change — the vessel (model) stays the same; only the fineness of the representation is made coarser. So knowing the model's structure helps (see how LLM weights work).

2. How much lighter? (the memory numbers)

The effect is obvious in numbers. Per weight: FP32 = 4 bytes, FP16 = 2 bytes, INT8 = 1 byte, INT4 = 0.5 bytes. So going 4-bit uses about one-quarter the memory of FP16 (0.5 bytes instead of 2).

Precision	Per weight	70B model (approx.)	8B model (approx.)
FP16 (no quantization)	2 bytes	~140GB	~16GB
INT8	1 byte	~70GB	~8GB
INT4	0.5 bytes	~35GB	~4.5-5GB

* Estimates. Actual values vary with format, overhead, and context length.
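The table is just multiplication: number of parameters times bytes per weight. A small sketch of that arithmetic, assuming 1 GB = 10^9 bytes and counting the weights only (the helper name is hypothetical):

```python
# Bytes needed to store one weight at each precision
BYTES_PER_WEIGHT = {"FP32": 4.0, "FP16": 2.0, "INT8": 1.0, "INT4": 0.5}


def weight_memory_gb(num_params, precision):
    """Memory for the weights alone, in GB (assuming 1 GB = 10**9 bytes)."""
    return num_params * BYTES_PER_WEIGHT[precision] / 1e9


for name, params in (("70B", 70e9), ("8B", 8e9)):
    for precision in ("FP16", "INT8", "INT4"):
        gb = weight_memory_gb(params, precision)
        print(f"{name} {precision}: ~{gb:.0f} GB")
```

For the 8B model at INT4 this raw calculation gives about 4 GB. The table's ~4.5-5GB is higher because real files and runtimes carry extra overhead on top of the weights.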

The impact is huge. If a 70B model goes from 140GB to 35GB, it no longer needs several data-center GPUs such as A100s and comes within reach of a realistic setup. Quantize an 8B model to 4-bit and it's about 5GB, fitting comfortably in a midrange GPU (8GB VRAM), so you can run it locally on your own PC. This is why quantization is often credited with the "democratization of LLMs."

3. How much accuracy is lost?

The worry is: "won't it get dumber once it's lighter?" The answer is "less than you'd think — but it depends on the bit-width and the task."

🟢 INT8: nearly lossless

For most LLMs, the performance drop is minimal. A safe choice when you want to halve memory while keeping quality.

🟡 INT4: practical with smart methods

For general Q&A and commonsense tasks, degradation is reportedly under 4%. But for math, code generation, and hard reasoning, the loss is more noticeable, so take care.

The accuracy loss shows up technically as "a small rise in perplexity." The key is to "pick the bit-width that fits the task" — INT4 is often plenty for chat or summarization, but for code generation or exact math, consider INT8 or no quantization. Ultimately, evaluate on your own task to confirm it's within tolerance.
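Perplexity is the exponential of the average negative log-probability the model assigns to each token, so a lower value means the text surprised the model less. The sketch below compares two models on the same text. The log-probabilities are invented for illustration, not measured results, and the function names are hypothetical.

```python
import math


def perplexity(token_logprobs):
    """exp of the average negative natural-log probability per token."""
    avg_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg_nll)


def relative_increase_pct(baseline, quantized):
    """How much higher the quantized value is, as a percentage of the baseline."""
    return (quantized - baseline) / baseline * 100.0


# Hypothetical per-token log-probabilities for the same text (not real data)
fp16_logprobs = [-1.20, -0.80, -2.10, -0.50, -1.40]
int4_logprobs = [-1.22, -0.81, -2.14, -0.51, -1.42]

ppl_fp16 = perplexity(fp16_logprobs)
ppl_int4 = perplexity(int4_logprobs)
print(f"FP16 perplexity: {ppl_fp16:.3f}")
print(f"INT4 perplexity: {ppl_int4:.3f}")
print(f"Relative rise:   {relative_increase_pct(ppl_fp16, ppl_int4):.2f}%")
```

Perplexity is only a proxy. A small rise here does not guarantee the same small drop on your task, which is why the final check should be an evaluation on your own data.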

4. Main methods: GPTQ / AWQ / GGUF / QLoRA

There are several representative quantization methods and formats. Knowing the names helps you choose models and tools without confusion.

Method / format	Traits	Best for
GPTQ	The pioneer that achieved 4-bit compression while keeping accuracy.	GPU inference
AWQ	Identifies and protects the ~1% most important weights. Often 1-2% more accurate and faster than GPTQ.	Fast, efficient production inference
GGUF	The llama.cpp / Ollama format. Choose levels Q2_K-Q8_0; supports CPU+GPU hybrid.	Running locally on your PC
QLoRA	Combines a 4-bit base model with LoRA, enabling fine-tuning on a consumer GPU.	Low-cost fine-tuning

For a beginner trying it locally, using a GGUF model with Ollama is the easiest path. To optimize production GPU inference, AWQ is a strong choice. To fine-tune a big model cheaply, use QLoRA. Remembering just these three pairings is enough.
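The QLoRA row can be pictured as "frozen low-bit base plus a small trainable add-on." The sketch below is a toy illustration under several assumptions: plain absmax rounding stands in for the 4-bit format QLoRA actually uses, LoRA's scaling factor is omitted, and all names and numbers are hypothetical.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def quantize_matrix(matrix, bits=4):
    """Round every entry to a signed integer using one shared scale."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(w) for row in matrix for w in row) or 1.0
    scale = max_abs / qmax
    return [[round(w / scale) for w in row] for row in matrix], scale


def qlora_forward(q_base, scale, lora_a, lora_b, x):
    """y = dequant(W_q) x + B (A x). Only A and B would be trained."""
    base_w = [[q * scale for q in row] for row in q_base]   # frozen 4-bit base
    base_out = matvec(base_w, x)
    adapter_out = matvec(lora_b, matvec(lora_a, x))         # low-rank update
    return [b + a for b, a in zip(base_out, adapter_out)]


# Toy 2x3 base weight, rank-1 adapter (A is 1x3, B is 2x1)
base = [[0.7, -0.2, 0.1], [0.3, 0.5, -0.6]]
q_base, scale = quantize_matrix(base, bits=4)
lora_a = [[0.1, -0.3, 0.2]]
lora_b = [[0.0], [0.0]]       # B starts at zero, so the adapter is a no-op at first
x = [1.0, 2.0, 3.0]

print(qlora_forward(q_base, scale, lora_a, lora_b, x))
```

The large base matrix stays in its compact quantized form and is never updated. Training touches only the two small matrices, which is what keeps the memory cost low enough for a consumer GPU.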

5. vs distillation and fine-tuning

Quantization is a "model efficiency/optimization" technique alongside distillation and fine-tuning. They're easy to confuse, so note the difference in goals.

⚖️ Quantization

Make the same model's weights lighter. Same model inside, just a coarser representation.

🧑‍🏫 Distillation

Move knowledge into a separate, smaller model. Rebuild the vessel smaller.

🎯 Fine-tuning

Further-train for a specific use. Roughly the same size; adds domain knowledge.

The three aren't exclusive — they're usually combined. For example, "quantize a student model that was distilled smaller, to fit it on a phone," or, as with QLoRA, "fine-tune on a quantized base." They stack.

6. How to start and pick the bit-width

No tricky implementation needed. Many already-quantized models are distributed, so you can just download and use them. When unsure, pick by this guide.

To try locally first, use GGUF (Ollama)

Run a quantized model with Ollama in one command. Just touching it is the fastest way to learn.
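Once Ollama is installed and running, you can also talk to it from a script through its local HTTP API. The sketch below assumes the server is on its default port (11434) and that a model has already been downloaded. The model tag in the usage comment is only a placeholder, so substitute whatever you actually pulled.

```python
import json
import urllib.request

# Ollama's default local endpoint (assumes an unmodified local install)
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(model, prompt):
    """Build a non-streaming generate request for a local Ollama server."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model, prompt, timeout=120):
    """Send the prompt and return the generated text."""
    request = build_request(model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())["response"]


# Usage (needs a running Ollama server; the model tag is a placeholder):
# print(generate("llama3.1:8b", "Explain quantization in one sentence."))
```

Nothing here quantizes anything. The model you download is already in a quantized GGUF form, so the script only sends a prompt and reads the reply.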

Pick the bit-width by your VRAM

Tight on VRAM? INT4 (Q4). Have room and want quality? INT8 (Q8). General use is often fine on Q4.

Judge precision by the use case

For code generation or exact math, avoid INT4 and use INT8 or higher. For chat and summarization, INT4 is comfortable.
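The guide above can be written down as a small decision helper. This is a rough sketch, not a sizing tool: the 20% overhead margin, the task labels, and the function name are all assumptions chosen for illustration.

```python
BYTES_PER_WEIGHT = {"INT8": 1.0, "INT4": 0.5}
PRECISION_SENSITIVE = {"code", "math", "reasoning"}   # assumed task labels


def pick_precision(num_params, vram_gb, task, overhead=1.2):
    """Return (precision, note). overhead is an assumed 20% runtime margin."""

    def fits(precision):
        needed_gb = num_params * BYTES_PER_WEIGHT[precision] * overhead / 1e9
        return needed_gb <= vram_gb

    if fits("INT8"):
        return "INT8", "room to spare, so keep the quality"
    if fits("INT4"):
        if task in PRECISION_SENSITIVE:
            return "INT4", "fits, but INT8 or higher is advised for this task"
        return "INT4", "comfortable for general use"
    return None, "does not fit at INT4 or INT8"


print(pick_precision(8e9, 8, "chat"))     # 8B model, 8GB VRAM
print(pick_precision(8e9, 24, "code"))    # 8B model, 24GB VRAM
print(pick_precision(8e9, 8, "code"))     # tight VRAM, precision-sensitive task
print(pick_precision(70e9, 24, "chat"))   # 70B model, 24GB VRAM
```

Treat the result as a starting point. Context length and the specific format change the real footprint, so confirm with an actual run.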

Summary

Quantization is the key enabler that turns a giant AI into something light enough to run on your own machine. Let's recap.

Key takeaways

⚖️ Lower weight precision to shrink (FP16→INT8→INT4). Same idea as photo compression.
📉 ~4x less memory at 4-bit. 70B from 140GB→35GB; 8B ~5GB on a consumer GPU.
🎯 Small accuracy loss. INT8 nearly lossless; INT4 under 4% for general use (mind math/code).
🛠️ Methods: GPTQ / AWQ / GGUF (Ollama) / QLoRA. GGUF is easiest locally.
🔀 Different from distillation/FT: lighten the same vessel / move to a smaller vessel / add domain knowledge.

"Keep the smartness, drop only the weight." Quantization is the most practical single step for making AI accessible. Start by running a Q4 model locally on your own machine. For a related technique, see model distillation; for the foundation, LLM weights.

FAQ

Q. Does quantization make the model dumber?

A. Less than you'd think. INT8 is nearly lossless, and even INT4 reportedly degrades under 4% on general Q&A and commonsense tasks. But the gap is more noticeable for math, code generation, and hard reasoning, so pick the bit-width to match the use case.

Q. What are Q4 / Q8, and which should I choose?

A. They're GGUF quantization levels — smaller numbers are lighter (coarser). Tight on VRAM, pick Q4; for quality with room to spare, Q8. For general use like chat or summarization, Q4 is often comfortable.

Q. Should I use quantization or distillation?

A. Different goals. To lighten a model you already have, quantize it; to create a brand-new smaller dedicated model, use distillation. They're often combined — quantizing a distilled small model further is common.

Q. Do I need to quantize models myself?

A. Usually not. Major models are already distributed in quantized form and can be downloaded and used right away via tools like Ollama. Quantizing yourself is only for custom models or special requirements.
