Section 8.1: Model Quantization

Quantization is the fine art of convincing a 70-billion-parameter model that it never really needed all those decimal places. Surprisingly, it usually agrees.

A Compressed But Undiminished Neural Network

★ Big Picture

Why quantize? A 70B-parameter model stored in FP16 requires approximately 140 GB of GPU memory just for the weights. That exceeds the capacity of a single 80 GB data-center GPU such as the A100 or H100. Quantization compresses weights from 16-bit or 32-bit floating point down to 8-bit, 4-bit, or even lower-precision representations. A 4-bit quantized 70B model fits in roughly 35 GB, making it servable on a single GPU. The key challenge is performing this compression without destroying the model's capabilities. This section covers the mathematics of quantization, the major algorithms (GPTQ, AWQ, bitsandbytes), and practical techniques for evaluating the quality tradeoff.

⚙ Prerequisites

This section assumes understanding of floating-point number representation and PyTorch tensor operations from Section 0.2. The matrix multiplication concepts from Module 04 (attention computations) are essential for understanding where quantization is applied.

ⓘ Note

Intuition: Quantization is like reducing the color depth of an image. A photo in 24-bit color uses 16.7 million distinct colors. Reduce it to 8-bit (256 colors) and the image is 3x smaller with barely visible quality loss. Reduce further to 4-bit (16 colors) and you start to see artifacts, but the image remains recognizable. Model quantization works the same way: reducing the precision of each weight from 16-bit to 4-bit shrinks the model 4x, with a small and often acceptable quality tradeoff.

1. Why Inference Is Expensive

During autoregressive generation, the model produces one token at a time. Each token requires a full forward pass through every layer, reading all model weights from GPU memory. For a 70B model in FP16, this means transferring 140 GB of data per token through the memory bus. On an A100 with 2 TB/s memory bandwidth, merely reading the weights takes about 70 milliseconds. The actual computation (matrix multiplications) takes far less time. This makes LLM inference memory-bandwidth-bound, not compute-bound.

Quantization helps in two complementary ways. First, smaller weights mean less data to transfer from memory, directly improving throughput. Second, smaller weights mean the entire model may fit on fewer (or smaller) GPUs, reducing hardware costs. A model quantized to 4-bit occupies one quarter of the original memory, so weight transfer is roughly 4x faster.
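The arithmetic above can be checked with a few lines of Python. This is a back-of-the-envelope sketch rather than a profiler: the function names are hypothetical, the 2 TB/s figure is the A100 bandwidth quoted in the text, and the estimate ignores the KV cache, activations, and kernel overhead.

```python
def weight_gigabytes(n_params: float, bits_per_weight: float) -> float:
    """Size of the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


def weight_read_ms(n_params: float, bits_per_weight: float, bandwidth_tb_s: float) -> float:
    """Lower bound on per-token latency from streaming every weight once."""
    gigabytes = weight_gigabytes(n_params, bits_per_weight)
    seconds = gigabytes / (bandwidth_tb_s * 1000)  # TB/s -> GB/s
    return seconds * 1000


for bits in (16, 8, 4):
    gb = weight_gigabytes(70e9, bits)
    ms = weight_read_ms(70e9, bits, bandwidth_tb_s=2.0)
    print(f"{bits:>2}-bit: {gb:6.1f} GB of weights, at least {ms:5.1f} ms per token")
```

Because the bound scales linearly with bits per weight, halving the precision halves the minimum time spent reading weights, which is the first of the two benefits described above.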

2. Quantization Mathematics

2.1 Absmax (Symmetric) Quantization

The simplest quantization scheme maps a floating-point tensor to integers using only a scale factor. For an n-bit signed integer representation with range [−2ⁿ⁻¹, 2ⁿ⁻¹−1], the quantization formula is:

$$\text{scale} = \frac{\max(|X|)}{2^{n-1} - 1}$$

$$X_q = \operatorname{round}\!\left(\frac{X}{\text{scale}}\right)$$

$$\hat{X} = X_q \times \text{scale}$$

Here, X is the original floating-point tensor, X_q is the quantized integer tensor, and X̂ is the dequantized approximation. The zero point in the floating-point space always maps to integer zero, which is why this scheme is called symmetric. It works well when values are roughly centered around zero, which is typically true for neural network weights.
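The three formulas translate almost line for line into NumPy. The following is a minimal sketch, assuming an 8-bit signed target and a tensor that is not all zeros; the function names are illustrative and do not come from any library.

```python
import numpy as np


def absmax_quantize(x: np.ndarray, n_bits: int = 8):
    """Symmetric quantization: return integer codes and the scale."""
    qmax = 2 ** (n_bits - 1) - 1             # 127 for 8-bit
    scale = float(np.max(np.abs(x))) / qmax  # assumes x is not all zeros
    # int8 storage is enough for any n_bits <= 8
    x_q = np.clip(np.round(x / scale), -qmax, qmax).astype(np.int8)
    return x_q, scale


def absmax_dequantize(x_q: np.ndarray, scale: float) -> np.ndarray:
    """Map integer codes back to approximate floating-point values."""
    return x_q.astype(np.float32) * scale


x = np.array([-0.9, -0.31, 0.0, 0.12, 0.45], dtype=np.float32)
x_q, scale = absmax_quantize(x)
x_hat = absmax_dequantize(x_q, scale)
print("codes:", x_q)
print("max abs error:", float(np.abs(x - x_hat).max()), "vs scale/2 =", scale / 2)
```

Floating-point zero maps to integer zero exactly, and the rounding error of every element is bounded by half the scale.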

2.2 Zero-Point (Asymmetric) Quantization

When the tensor values are not symmetric around zero (common for activations, which often have a positive bias), asymmetric quantization adds a zero-point offset:

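$$\text{scale} = \frac{\max(X) - \min(X)}{2^{n} - 1}$$

$$\text{zero\_point} = \operatorname{round}\!\left(\frac{-\min(X)}{\text{scale}}\right)$$

$$X_q = \operatorname{round}\!\left(\frac{X}{\text{scale}}\right) + \text{zero\_point}$$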

This maps the full range [min(X), max(X)] onto the unsigned integer range [0, 2ⁿ−1]. Dequantization reverses the process: X̂ = (X_q − zero_point) × scale. The extra zero-point parameter adds slight overhead but significantly reduces quantization error for skewed distributions.
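A short NumPy sketch of the asymmetric scheme, under the assumption of an 8-bit unsigned target and a tensor whose minimum and maximum differ. The function names and the sample values are made up for illustration.

```python
import numpy as np


def zeropoint_quantize(x: np.ndarray, n_bits: int = 8):
    """Asymmetric quantization onto the unsigned range [0, 2^n - 1]."""
    qmax = 2 ** n_bits - 1
    scale = float(x.max() - x.min()) / qmax      # assumes max(x) > min(x)
    zero_point = int(np.round(-float(x.min()) / scale))
    x_q = np.clip(np.round(x / scale) + zero_point, 0, qmax).astype(np.uint8)
    return x_q, scale, zero_point


def zeropoint_dequantize(x_q: np.ndarray, scale: float, zero_point: int) -> np.ndarray:
    """Invert the mapping: subtract the offset, then rescale."""
    return (x_q.astype(np.float32) - zero_point) * scale


# A skewed, mostly positive tensor, as is common for activations.
x = np.array([-0.5, 0.0, 0.2, 0.7, 1.5, 3.0], dtype=np.float32)
x_q, scale, zero_point = zeropoint_quantize(x)
x_hat = zeropoint_dequantize(x_q, scale, zero_point)

symmetric_scale = float(np.abs(x).max()) / 127   # step size absmax would use
print("codes:", x_q, "zero_point:", zero_point)
print("asymmetric step:", scale, "symmetric step:", symmetric_scale)
```

For this skewed input the asymmetric step is smaller than the symmetric one, because no integer codes are spent on the unused range below the minimum.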

2.3 Granularity: Per-Tensor, Per-Channel, Per-Group

The scale (and zero-point) can be computed at different granularities:

Per-tensor: One scale for the entire weight matrix. Simplest and fastest, but the largest outlier in the tensor dominates the scale for all values.
Per-channel: One scale per output channel (row of the weight matrix). Common for INT8 quantization. Each row gets its own range, reducing the impact of outliers.
Per-group: One scale per group of g consecutive values (typically g = 128). This is the standard for 4-bit quantization (GPTQ, AWQ, bitsandbytes). The overhead of storing extra scales is small (one FP16 scale per 128 INT4 values adds only 0.125 bits per value), but the accuracy improvement is substantial.
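The effect of granularity is easy to see on synthetic data. The sketch below assumes Gaussian weights with one injected outlier and symmetric 4-bit rounding; the helper name and the matrix dimensions are arbitrary choices for illustration.

```python
import numpy as np


def quantize_dequantize(w: np.ndarray, n_bits: int = 4) -> np.ndarray:
    """Absmax round trip with one scale per row of `w`."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = np.abs(w).max(axis=-1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)     # guard against all-zero rows
    return np.clip(np.round(w / scale), -qmax, qmax) * scale


rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=(64, 1024))
w[3, 17] = 1.0                                   # a single large outlier

# Reshaping controls how many values share one scale.
reconstructions = {
    "per-tensor": quantize_dequantize(w.reshape(1, -1)).reshape(w.shape),
    "per-channel": quantize_dequantize(w),
    "per-group": quantize_dequantize(w.reshape(-1, 128)).reshape(w.shape),
}
mse = {name: float(np.mean((w - w_hat) ** 2)) for name, w_hat in reconstructions.items()}
for name, value in mse.items():
    print(f"{name:<12} MSE = {value:.2e}")
```

With per-tensor scaling the outlier stretches the grid for all 65,536 values; with groups of 128 it only damages its own group.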

Figure 8.1: Quantization granularity levels. Per-group quantization (bottom) provides the best accuracy for 4-bit models.

3. Data Types for Quantization

Data Type	Bits	Range	Use Case
FP32	32	±3.4 × 10³⁸	Training (master weights)
FP16 / BF16	16	±65504 / ±3.4 × 10³⁸	Standard inference, mixed-precision training
FP8 (E4M3)	8	±448	Hopper GPU inference, training forward pass
FP8 (E5M2)	8	±57344	Training backward pass (wider range)
INT8	8	−128 to 127	Weight + activation quantization
INT4	4	−8 to 7	Weight-only quantization (GPTQ, AWQ)
NF4	4	16 quantile levels	bitsandbytes / QLoRA

3.1 NF4: Normal Float 4-bit

NF4 is a special 4-bit data type designed by Tim Dettmers for use in QLoRA. The key insight is that neural network weights are approximately normally distributed. Instead of using uniformly spaced quantization levels (as standard INT4 does), NF4 places its 16 quantization levels at the quantiles of the standard normal distribution. This means each of the 16 bins captures approximately the same probability mass, making NF4 information-theoretically optimal for normally distributed data.

Key Insight

Standard INT4 spaces its levels uniformly, so it spends too many levels on the low-density tails and too few on the high-density center. NF4 fixes this by spacing levels at normal quantiles. The 16 NF4 values are precomputed (shown here rounded to four decimals): {−1.0, −0.6962, −0.5251, −0.3949, −0.2844, −0.1848, −0.0911, 0.0, 0.0796, 0.1609, 0.2461, 0.3379, 0.4407, 0.5626, 0.7230, 1.0}.
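Quantizing to NF4 amounts to normalizing a block by its absolute maximum and snapping each value to the nearest of the 16 levels. The sketch below is a simplified illustration of that lookup, assuming a block size of 64 and the rounded level values listed above; it is not the bitsandbytes kernel, which also packs two 4-bit codes into each byte.

```python
import numpy as np

# The 16 NF4 levels from the text, rounded to four decimals.
NF4_LEVELS = np.array([
    -1.0, -0.6962, -0.5251, -0.3949, -0.2844, -0.1848, -0.0911, 0.0,
    0.0796, 0.1609, 0.2461, 0.3379, 0.4407, 0.5626, 0.7230, 1.0,
], dtype=np.float32)


def nf4_quantize(block: np.ndarray):
    """Return 4-bit codes (stored as uint8) and the block's absmax."""
    absmax = float(np.abs(block).max())          # assumes a non-zero block
    normalized = block / absmax                  # values now lie in [-1, 1]
    distances = np.abs(normalized[:, None] - NF4_LEVELS[None, :])
    codes = distances.argmin(axis=1).astype(np.uint8)
    return codes, absmax


def nf4_dequantize(codes: np.ndarray, absmax: float) -> np.ndarray:
    """Look up each code's level and undo the normalization."""
    return NF4_LEVELS[codes] * absmax


rng = np.random.default_rng(0)
block = rng.normal(0.0, 0.02, size=64).astype(np.float32)
block[0] = 0.0                                   # exact zero must survive
codes, absmax = nf4_quantize(block)
block_hat = nf4_dequantize(codes, absmax)
print("codes:", codes[:8])
print("max abs error:", float(np.abs(block - block_hat).max()))
```

Because 0.0 is one of the levels, exact zeros are reproduced without error, and the value with the largest magnitude always maps to −1.0 or 1.0.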

4. Post-Training Quantization Algorithms

4.1 GPTQ: Hessian-Based Optimal Rounding

GPTQ (Frantar et al., 2023) quantizes weights one layer at a time, using second-order (Hessian) information to minimize the output error of each layer. The algorithm processes columns of the weight matrix sequentially. For each column, it rounds weights to the nearest quantization level, then compensates for the rounding error by adjusting not-yet-quantized columns using the inverse Hessian. This compensation step is what makes GPTQ significantly better than naive round-to-nearest quantization.

The core update rule for quantizing column j is:

$$\delta_j = \frac{w_j - \operatorname{quant}(w_j)}{[H^{-1}]_{jj}}$$

$$w_k \leftarrow w_k - \delta_j \cdot [H^{-1}]_{jk} \quad \text{for } k > j$$

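Here, H is the Hessian of the layer's squared output error with respect to the weights, which equals X^TX (up to a constant factor), where X is the matrix of calibration activations feeding the layer. After column j is quantized, it is removed from the problem, so H⁻¹ always refers to the inverse Hessian of the columns that remain. GPTQ requires a small calibration dataset (typically 128 samples from C4 or similar). Runtime on a single GPU ranges from minutes for small models to a few hours for the largest ones; the GPTQ paper reports roughly four GPU hours for 175B-parameter models.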
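The update rule can be tried out on a toy layer. The following is a simplified sketch of the idea, not the reference GPTQ implementation: it inverts the Hessian directly and updates it by Gaussian elimination instead of using the blocked Cholesky formulation of the paper, and the layer sizes, the damping constant, and the synthetic correlated inputs are all assumptions chosen for illustration.

```python
import numpy as np


def make_rounder(W: np.ndarray, n_bits: int = 4):
    """Fixed per-row 4-bit grid, derived once from the original weights."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = np.abs(W).max(axis=1) / qmax         # one scale per output row
    return lambda col: np.clip(np.round(col / scale), -qmax, qmax) * scale


def gptq_like(W: np.ndarray, X: np.ndarray, quant, damp: float = 0.01) -> np.ndarray:
    """Column-by-column rounding with inverse-Hessian error compensation."""
    d = W.shape[1]
    H = X.T @ X
    H += damp * np.mean(np.diag(H)) * np.eye(d)  # damping for numerical stability
    Hinv = np.linalg.inv(H)
    work = W.astype(np.float64).copy()
    Q = np.zeros_like(work)
    for j in range(d):
        Q[:, j] = quant(work[:, j])
        delta = (work[:, j] - Q[:, j]) / Hinv[j, j]
        # Push the rounding error onto the not-yet-quantized columns k > j.
        work[:, j + 1:] -= np.outer(delta, Hinv[j, j + 1:])
        # Remove column j from the inverse Hessian of the remaining columns.
        Hinv[j + 1:, j + 1:] -= np.outer(Hinv[j + 1:, j], Hinv[j, j + 1:]) / Hinv[j, j]
    return Q


rng = np.random.default_rng(0)
latent = rng.normal(size=(256, 8))               # correlated calibration inputs
X = latent @ rng.normal(size=(8, 64)) + 0.3 * rng.normal(size=(256, 64))
W = rng.normal(0.0, 0.02, size=(32, 64))         # (out_features, in_features)

quant = make_rounder(W)
Q_rtn = np.column_stack([quant(W[:, j]) for j in range(W.shape[1])])
Q_gptq = gptq_like(W, X, quant)

err_rtn = float(np.sum((X @ (W - Q_rtn).T) ** 2))
err_gptq = float(np.sum((X @ (W - Q_gptq).T) ** 2))
print(f"round-to-nearest output error: {err_rtn:.4f}")
print(f"compensated output error:      {err_gptq:.4f}")
```

Both results live on the same 4-bit grid; the only difference is that the compensated version lets later columns absorb the rounding error of earlier ones, which pays off most when the input features are correlated.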

4.2 AWQ: Activation-Aware Weight Quantization

AWQ (Lin et al., 2024) takes a different approach. Instead of adjusting rounding decisions per column, AWQ identifies which weight channels are most important by looking at activation magnitudes. Channels that consistently produce large activations are "salient" and should be quantized more carefully. AWQ applies a per-channel scaling factor s to the weights before quantization:

$$\hat{W} = \operatorname{quant}\big(W \cdot \operatorname{diag}(s)\big) \cdot \operatorname{diag}(s)^{-1}$$

The scaling factor s is chosen to minimize the quantization error weighted by the typical activation magnitude for each channel. Salient channels get a larger scale, giving them more of the available quantization range. This is simple to implement, fast to run, and produces quality comparable to GPTQ.
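A toy version of the scale search makes the idea concrete. This sketch assumes the scale takes the form s = (mean activation magnitude)^α with α picked by grid search, which follows the general recipe described in the AWQ paper. It omits group-wise quantization, clipping search, and fusing the scales into the preceding layer, and every name and size in it is illustrative.

```python
import numpy as np


def fake_quant_rows(W: np.ndarray, n_bits: int = 4) -> np.ndarray:
    """Symmetric round trip with one scale per output row."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = np.abs(W).max(axis=1, keepdims=True) / qmax
    return np.clip(np.round(W / scale), -qmax, qmax) * scale


def awq_like_search(W: np.ndarray, X: np.ndarray, n_grid: int = 20):
    """Grid-search alpha in s = act_mag**alpha to minimize output error."""
    act_mag = np.abs(X).mean(axis=0)             # one value per input channel
    reference = X @ W.T
    best_err, best_alpha, best_W = np.inf, None, None
    for alpha in np.linspace(0.0, 1.0, n_grid, endpoint=False):
        s = act_mag ** alpha                     # alpha = 0 is plain rounding
        s = s / np.sqrt(s.max() * s.min())       # keep scales centered around 1
        W_hat = fake_quant_rows(W * s) / s       # quant(W diag(s)) diag(s)^-1
        err = float(np.mean((reference - X @ W_hat.T) ** 2))
        if err < best_err:
            best_err, best_alpha, best_W = err, float(alpha), W_hat
    return best_W, best_alpha, best_err


rng = np.random.default_rng(0)
X = rng.normal(size=(512, 64))
X[:, [3, 20, 41]] *= 20.0                        # three salient input channels
W = rng.normal(0.0, 0.02, size=(32, 64))         # (out_features, in_features)

rtn_err = float(np.mean((X @ W.T - X @ fake_quant_rows(W).T) ** 2))
W_awq, alpha, awq_err = awq_like_search(W, X)
print(f"plain rounding error: {rtn_err:.6f}")
print(f"best alpha = {alpha:.2f}, scaled error: {awq_err:.6f}")
```

Since α = 0 is part of the grid, the search can never do worse than plain rounding on the calibration data; any improvement comes from giving the salient channels finer effective resolution.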

Figure 8.2: GPTQ compensates for rounding errors across columns using the Hessian. AWQ protects salient channels by scaling them before quantization.

5. Quantization in Practice with bitsandbytes

The bitsandbytes library by Tim Dettmers provides the simplest path to quantized inference. It integrates directly with Hugging Face Transformers and supports both 8-bit (LLM.int8()) and 4-bit (NF4/FP4) loading. No calibration dataset is required; quantization happens on the fly during model loading.

# Example 1: Loading a model in 4-bit NF4 with bitsandbytes
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

# Configure 4-bit quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",           # NF4 data type
    bnb_4bit_compute_dtype=torch.bfloat16, # Compute in BF16
    bnb_4bit_use_double_quant=True,       # Double quantization
)

model_name = "meta-llama/Llama-3.1-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
)

# Check memory usage
mem_bytes = model.get_memory_footprint()
print(f"Model memory: {mem_bytes / 1e9:.2f} GB")
print(f"Parameters: {sum(p.numel() for p in model.parameters()) / 1e9:.1f}B")

# Generate text (do_sample=True is required for temperature to take effect)
inputs = tokenizer("The key advantage of quantization is", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Model memory: 5.51 GB
Parameters: 8.0B
The key advantage of quantization is that it significantly reduces the memory footprint and computational requirements of large language models while maintaining most of their original performance. By representing weights with fewer bits, models can run on consumer hardware that would otherwise be insufficient for full-precision inference.

Note: Double Quantization

When bnb_4bit_use_double_quant=True, bitsandbytes applies a second round of quantization to the quantization constants themselves. In the QLoRA setup, each block of 64 weights produces one FP32 scale value (4 bytes), an overhead of 32/64 = 0.5 bits per parameter. Double quantization further quantizes these scales to 8-bit floats with a block size of 256, reducing the overhead to approximately 0.127 bits per parameter, a saving of about 0.37 bits per parameter. For a 70B model, this saves roughly 3 GB of memory.
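The overhead figures can be reproduced with simple arithmetic. This sketch uses the block sizes reported in the QLoRA paper (64 weights per first-level block, 256 scales per second-level block, 8-bit first-level scales, 32-bit second-level constants); these values are taken from that paper as assumptions, not read from the bitsandbytes source, and the function names are hypothetical.

```python
def scale_overhead_bits(block_size: int, scale_bits: int) -> float:
    """Extra bits per parameter when each block stores one scale."""
    return scale_bits / block_size


def double_quant_overhead_bits(block_size: int = 64, first_bits: int = 8,
                               second_block: int = 256, second_bits: int = 32) -> float:
    """Overhead when the scales are themselves quantized in blocks."""
    first_level = first_bits / block_size                     # 8-bit scale per block
    second_level = second_bits / (block_size * second_block)  # scale of the scales
    return first_level + second_level


plain = scale_overhead_bits(block_size=64, scale_bits=32)
double = double_quant_overhead_bits()
saved_gb = (plain - double) * 70e9 / 8 / 1e9

print(f"single-level overhead: {plain:.3f} bits/param")
print(f"double-quant overhead: {double:.3f} bits/param")
print(f"saving for 70B params: {saved_gb:.2f} GB")
```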

6. GPTQ Quantization with AutoGPTQ

GPTQ (Frantar et al., 2023) uses Hessian-based error compensation to decide how to quantize each weight. The Hessian of each layer's reconstruction error with respect to its weights captures second-order sensitivity: directions in weight space where the Hessian has large eigenvalues are "sensitive" (small perturbations cause large increases in output error), while directions with small eigenvalues are "insensitive." GPTQ processes weights one column at a time, using the Hessian information to (1) round each weight to the nearest quantized value, and (2) distribute the rounding error across not-yet-quantized columns to minimize the total increase in error. This column-by-column error compensation is what makes GPTQ so effective: it achieves near-optimal rounding decisions in a single pass through the weight matrix, taking minutes for small models (and a few hours for the largest) rather than the far longer runtimes of iterative methods. The example below uses the GPTQConfig integration in Hugging Face Transformers, which requires the optimum library and a GPTQ backend such as AutoGPTQ to be installed.

# Example 2: Quantizing a model with GPTQ
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_name = "meta-llama/Llama-3.1-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Configure GPTQ quantization
gptq_config = GPTQConfig(
    bits=4,                  # 4-bit quantization
    group_size=128,          # Per-group granularity
    desc_act=True,           # Activation order (better quality)
    dataset="c4",            # Calibration dataset
    tokenizer=tokenizer,
)

# Load and quantize the model
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=gptq_config,
    device_map="auto",
)

# Save the quantized model
model.save_pretrained("./llama-8b-gptq-4bit")
tokenizer.save_pretrained("./llama-8b-gptq-4bit")
print("Quantized model saved successfully")

Quantizing model.layers: 100%|####| 32/32 [12:34<00:00, 23.56s/layer]
Quantized model saved successfully

7. AWQ Quantization

# Example 3: Quantizing with AWQ using the autoawq library
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_name = "meta-llama/Llama-3.1-8B-Instruct"
quant_path = "./llama-8b-awq-4bit"

# Load model for quantization
model = AutoAWQForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Configure AWQ
quant_config = {
    "zero_point": True,      # Use asymmetric quantization
    "q_group_size": 128,     # Group size
    "w_bit": 4,              # 4-bit weights
    "version": "GEMM",       # Optimized GEMM kernels
}

# Quantize (uses calibration data internally)
model.quantize(tokenizer, quant_config=quant_config)
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)

print(f"AWQ model saved to {quant_path}")
print("Original size: ~16 GB (FP16)")   # Approximate figures, not measured here
print("Quantized size: ~4.5 GB (INT4)")

AWQ: 100%|####| 32/32 [08:15<00:00, 15.47s/layer]
AWQ model saved to ./llama-8b-awq-4bit
Original size: ~16 GB (FP16)
Quantized size: ~4.5 GB (INT4)

8. Calibration Strategies

Both GPTQ and AWQ require a calibration dataset to compute their respective statistics. The calibration data does not need to match the final use case. Commonly used datasets include:

C4 (Colossal Clean Crawled Corpus): The most common default. General web text that captures broad language patterns. 128 samples of 2048 tokens is standard.
WikiText-2: Clean Wikipedia text. Slightly less diverse than C4 but more consistent.
Task-specific data: If you know the deployment domain (code, medical text, legal), using domain-specific calibration can improve quality for that domain.

The calibration strategies for choosing quantization parameters vary in sophistication:

Min/Max: Use the minimum and maximum observed values. Simple but sensitive to outliers.
Percentile: Use the 99.99th percentile instead of the absolute max, clipping extreme outliers. Reduces error for the majority of values at the cost of clipping a few.
MSE-minimizing: Search for the scale that minimizes mean squared error between original and dequantized values. More expensive but more accurate.
Cross-entropy-minimizing: Choose parameters that minimize the cross-entropy loss on the calibration data. This directly optimizes the metric we care about (language modeling quality) but is the most expensive approach.
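The first three strategies can be compared directly on a synthetic tensor. The sketch below assumes symmetric 4-bit quantization and a standard-normal tensor with five injected outliers; the helper names and the candidate grid are arbitrary, and the cross-entropy strategy is left out because it needs a model and calibration text.

```python
import numpy as np


def fake_quant(x: np.ndarray, clip_val: float, n_bits: int = 4) -> np.ndarray:
    """Symmetric round trip with the grid stretched to +/- clip_val."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = clip_val / qmax
    return np.clip(np.round(x / scale), -qmax, qmax) * scale


def quant_mse(x: np.ndarray, clip_val: float) -> float:
    """Mean squared error between x and its quantized reconstruction."""
    return float(np.mean((x - fake_quant(x, clip_val)) ** 2))


rng = np.random.default_rng(0)
x = rng.normal(size=100_000)
x[:5] = [20.0, -20.0, 20.0, -20.0, 20.0]         # a handful of extreme outliers

clip_minmax = float(np.abs(x).max())                       # min/max strategy
clip_pct = float(np.percentile(np.abs(x), 99.99))          # percentile strategy
# MSE strategy: search a grid that also contains the two candidates above.
candidates = np.append(np.linspace(0.05, 1.0, 96) * clip_minmax, clip_pct)
clip_mse = float(min(candidates, key=lambda c: quant_mse(x, float(c))))

for name, clip_val in [("min/max", clip_minmax), ("percentile", clip_pct), ("MSE search", clip_mse)]:
    print(f"{name:<11} clip = {clip_val:6.2f}   MSE = {quant_mse(x, clip_val):.4f}")
```

At 4 bits the outliers make the min/max grid so coarse that most values round to zero, so clipping them is the better trade here. At 8 bits the balance can tip the other way, which is why a search over the clipping value is the more reliable option.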

9. Quality Degradation Analysis

Quantization always introduces some quality loss. The key question is whether this loss is acceptable for your application. The standard metric is perplexity on a held-out evaluation set (typically WikiText-2 or a domain-specific corpus).

Warning: Outlier Features

Some transformer models contain "outlier features": a small number of hidden dimensions with activation magnitudes 10x to 100x larger than the rest. These outliers appear starting at around the 6B parameter scale and become more prominent in larger models. Naive quantization of layers containing these outliers causes catastrophic quality degradation. The LLM.int8() algorithm in bitsandbytes handles this by keeping outlier dimensions in FP16 while quantizing the rest to INT8. GPTQ and AWQ also have mechanisms to protect salient channels.
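The mixed-precision decomposition behind LLM.int8() can be imitated in NumPy. This is a conceptual sketch only: it uses float64 where the real kernel uses FP16, it simulates the INT8 product with int32 arrays, and the threshold of 6.0 mirrors the default commonly used in bitsandbytes. The function names and tensor sizes are made up.

```python
import numpy as np


def int8_matmul(X: np.ndarray, W: np.ndarray) -> np.ndarray:
    """Vector-wise INT8: one scale per row of X and per column of W."""
    sx = np.abs(X).max(axis=1, keepdims=True) / 127
    sw = np.abs(W).max(axis=0, keepdims=True) / 127
    Xq = np.round(X / sx).astype(np.int32)
    Wq = np.round(W / sw).astype(np.int32)
    return (Xq @ Wq) * sx * sw                   # integer product, then rescale


def mixed_matmul(X: np.ndarray, W: np.ndarray, threshold: float = 6.0):
    """Keep outlier feature dimensions in high precision, quantize the rest."""
    outlier = np.abs(X).max(axis=0) > threshold  # boolean mask over hidden dims
    high = X[:, outlier] @ W[outlier, :]
    low = int8_matmul(X[:, ~outlier], W[~outlier, :])
    return high + low, outlier


rng = np.random.default_rng(0)
X = rng.normal(size=(16, 256))                   # (tokens, hidden)
X[:, [5, 100]] *= 50.0                           # two outlier feature dimensions
W = rng.normal(0.0, 0.02, size=(256, 64))        # (hidden, out)

exact = X @ W
naive = int8_matmul(X, W)
mixed, outlier = mixed_matmul(X, W)
err_naive = float(np.mean((exact - naive) ** 2))
err_mixed = float(np.mean((exact - mixed) ** 2))
print("outlier dims kept in high precision:", np.flatnonzero(outlier))
print(f"naive INT8 error: {err_naive:.2e}   mixed error: {err_mixed:.2e}")
```

Removing just two dimensions from the quantized path shrinks the per-row scale by more than an order of magnitude, which is where the error reduction comes from.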

# Example 4: Benchmarking quantization quality
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
import time

def measure_perplexity(model, tokenizer, text, max_length=2048, stride=512):
    """Calculate sliding-window perplexity on a text sample."""
    encodings = tokenizer(text, return_tensors="pt")
    seq_len = encodings.input_ids.size(1)

    nll_sum, n_tokens, prev_end = 0.0, 0, 0
    for begin in range(0, seq_len, stride):
        end = min(begin + max_length, seq_len)
        trg_len = end - prev_end  # Tokens not yet scored by an earlier window
        input_ids = encodings.input_ids[:, begin:end].to(model.device)
        target_ids = input_ids.clone()
        target_ids[:, :-trg_len] = -100  # Earlier tokens serve as context only

        with torch.no_grad():
            outputs = model(input_ids, labels=target_ids)

        # outputs.loss is a mean over scored tokens; convert it back to a sum
        n_scored = (target_ids[:, 1:] != -100).sum().item()
        nll_sum += outputs.loss.item() * n_scored
        n_tokens += n_scored
        prev_end = end
        if end == seq_len:
            break

    return torch.exp(torch.tensor(nll_sum / n_tokens)).item()

def benchmark_generation(model, tokenizer, prompt, n_tokens=100):
    """Measure generation speed in tokens per second."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    torch.cuda.synchronize()
    start = time.perf_counter()
    outputs = model.generate(**inputs, max_new_tokens=n_tokens, do_sample=False)
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    # Count tokens actually produced; generation may stop early at EOS
    n_generated = outputs.shape[1] - inputs["input_ids"].shape[1]
    return n_generated / elapsed

# Results table (pre-computed for Llama 3.1 8B on A100)
results = {
    "FP16":       {"ppl": 6.14, "tps": 42.3, "mem_gb": 16.1},
    "INT8":       {"ppl": 6.17, "tps": 68.1, "mem_gb": 8.5},
    "GPTQ-4bit":  {"ppl": 6.41, "tps": 95.7, "mem_gb": 4.8},
    "AWQ-4bit":   {"ppl": 6.38, "tps": 102.3, "mem_gb": 4.5},
    "NF4 (bnb)":  {"ppl": 6.45, "tps": 78.4, "mem_gb": 5.5},
}

print(f"{'Method':<15} {'Perplexity':>12} {'Tokens/sec':>12} {'Memory (GB)':>12}")
print("-" * 55)
for method, r in results.items():
    print(f"{method:<15} {r['ppl']:>12.2f} {r['tps']:>12.1f} {r['mem_gb']:>12.1f}")

Method            Perplexity   Tokens/sec  Memory (GB)
-------------------------------------------------------
FP16                    6.14         42.3         16.1
INT8                    6.17         68.1          8.5
GPTQ-4bit               6.41         95.7          4.8
AWQ-4bit                6.38        102.3          4.5
NF4 (bnb)               6.45         78.4          5.5

Key Insight

The perplexity increase from FP16 to 4-bit quantization is about 5% or less for 8B+ models (roughly 4 to 5% in the table above). Meanwhile, memory usage drops by roughly 3x to 3.6x and inference speed roughly doubles (1.9x to 2.4x in the table). For most practical applications, 4-bit quantization is the sweet spot for serving LLMs on limited hardware. The quality gap typically narrows further as model size increases: 70B models often lose less than 2% perplexity at 4-bit.

10. The GGUF Format and Local Inference

For local deployment, the GGUF (GPT-Generated Unified Format) file format has become the dominant standard. Created for the llama.cpp project and used by Ollama, GGUF stores quantized model weights in a single, self-contained file with embedded metadata (tokenizer, architecture parameters, quantization scheme).

GGUF supports a rich set of quantization methods called k-quants (Q2_K through Q6_K) that use mixed precision within each tensor. Instead of applying a uniform 4-bit quantization to every weight, k-quants assign different bit widths to different parts of the weight matrix based on sensitivity analysis. The most important attention and output layers receive higher precision (5 or 6 bits), while less sensitive feed-forward layers use 3 or 4 bits. This mixed-precision approach typically produces better quality than uniform quantization at the same average bits-per-weight.

GGUF Quant	Bits/Weight	Model Size (7B)	Quality
Q2_K	~2.5	~2.8 GB	Significant degradation
Q4_K_M	~4.8	~4.6 GB	Good; recommended minimum
Q5_K_M	~5.7	~5.3 GB	Very good; near-FP16
Q6_K	~6.6	~5.9 GB	Excellent; minimal loss
Q8_0	~8.5	~7.2 GB	Near-lossless

✍ Modify and Observe

Download a GGUF model from Hugging Face (search for "TheBloke" or "bartowski" for curated quantizations). Try running it with Ollama, for example ollama run llama3.1:8b-q4_K_M and then ollama run llama3.1:8b-q8_0 (exact tag names vary by model, so check the Ollama model library). Ask the same factual and reasoning questions to both. Can you detect quality differences? Try math problems, code generation, and factual recall to see where lower quantization hurts most.

11. Quantization-Aware Training

Post-training quantization (PTQ) methods like GPTQ and AWQ compress an already-trained model. An alternative is quantization-aware training (QAT), where the model is trained (or fine-tuned) with simulated quantization in the forward pass. During training, weights are quantized and dequantized before each matrix multiplication. The backward pass uses the straight-through estimator (STE): gradients flow through the quantization operation as if it were the identity function, since the true gradient of rounding is zero almost everywhere.

QAT typically produces higher quality than PTQ at the same bit width, because the model learns to compensate for quantization noise during training. However, it requires access to training data and compute, making it impractical for many scenarios where PTQ is the only option.
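The straight-through estimator can be demonstrated without an autograd framework by writing the gradient out by hand. The sketch below trains a toy linear model with 4-bit fake quantization in the forward pass. The data, the fixed quantization grid, the learning-rate rule, and the practice of tracking the best loss seen are all assumptions made for illustration, not a recipe for LLM-scale QAT.

```python
import numpy as np


def fake_quant(w: np.ndarray, scale: float, n_bits: int = 4) -> np.ndarray:
    """Quantize then dequantize, as done in the QAT forward pass."""
    qmax = 2 ** (n_bits - 1) - 1
    return np.clip(np.round(w / scale), -qmax, qmax) * scale


def loss_and_grad(w_q: np.ndarray, X: np.ndarray, y: np.ndarray):
    """Mean squared error and its gradient with respect to w_q."""
    resid = X @ w_q - y
    return float(np.mean(resid ** 2)), 2.0 * X.T @ resid / len(y)


rng = np.random.default_rng(0)
latent = rng.normal(size=(512, 4))               # correlated input features
X = latent @ rng.normal(size=(4, 16)) + 0.3 * rng.normal(size=(512, 16))
w_true = rng.normal(size=16)
y = X @ w_true

scale = float(np.abs(w_true).max()) / 7          # fixed 4-bit grid
w = w_true.copy()                                # latent full-precision weights
ptq_loss, _ = loss_and_grad(fake_quant(w, scale), X, y)  # post-training baseline

lr = 0.5 / np.linalg.eigvalsh(2.0 * X.T @ X / len(y)).max()
best_loss = ptq_loss
for step in range(500):
    w_q = fake_quant(w, scale)                   # forward pass sees quantized weights
    loss, grad_wq = loss_and_grad(w_q, X, y)
    best_loss = min(best_loss, loss)
    w -= lr * grad_wq                            # STE: treat d(w_q)/d(w) as identity

print(f"post-training rounding loss: {ptq_loss:.5f}")
print(f"best loss seen during QAT:   {best_loss:.5f}")
```

In PyTorch the same trick is commonly written as `w + (fake_quant(w) - w).detach()`, which evaluates to the quantized weights in the forward pass while letting gradients reach `w` unchanged.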

Check Your Understanding

1. Why is per-group quantization preferred over per-tensor for 4-bit models?


Per-tensor quantization uses a single scale factor for the entire weight matrix. If even one extreme outlier value exists, the scale must accommodate it, leaving most of the quantization range underutilized. Per-group quantization (typically groups of 128) computes a separate scale for each group. Outliers only affect their local group, while other groups retain fine-grained resolution. The storage overhead is minimal: one FP16 scale per 128 INT4 values adds only 0.125 bits per parameter.

2. What is the fundamental difference between how GPTQ and AWQ handle quantization error?


GPTQ processes weight matrix columns sequentially and uses the inverse Hessian to redistribute the rounding error from each quantized column to the remaining unquantized columns. This is a direct error compensation approach. AWQ, by contrast, identifies salient weight channels (those with large activation magnitudes) and applies a per-channel scaling before quantization. Scaling up salient channels gives them more of the available quantization range, protecting the most important weights. AWQ is simpler and faster to run, while GPTQ can achieve slightly better perplexity in some cases.

3. Why does NF4 use non-uniform quantization levels, and why is it better than standard INT4 for neural network weights?


Neural network weights are approximately normally distributed, with most values clustered near zero and few values in the tails. Standard INT4 uses uniformly spaced levels, which wastes resolution in the sparse tails while under-resolving the dense center. NF4 places its 16 quantization levels at the quantiles of the standard normal distribution. Each of the 16 bins captures approximately 1/16th of the probability mass, meaning every bin is equally likely to be used. This is information-theoretically optimal for normally distributed data, maximizing the effective information captured per bit.

4. A model has 70 billion parameters in FP16. How much memory do the weights require, and approximately how much at INT4 with per-group quantization (group size 128)?


In FP16, each parameter is 2 bytes, so 70B parameters require 140 GB. In INT4, each parameter is 0.5 bytes, giving 35 GB for the weights. Per-group quantization with group size 128 adds one FP16 scale (2 bytes) per 128 values, which is 2/128 = 0.015625 bytes per parameter, or about 1.09 GB for 70B parameters. Total INT4 weight memory is approximately 36.1 GB, which fits on a single 48 GB GPU (such as an RTX A6000) or on an 80 GB A100 or H100, with the remainder available for activations and the KV cache.

Key Takeaways

Quantization compresses weights from 16-bit to 8-bit or 4-bit, reducing memory by 2x to 4x and improving memory-bound inference throughput, typically by somewhat less than the compression ratio.
Per-group granularity (group size 128) is the standard for 4-bit quantization, balancing accuracy against minimal storage overhead.
NF4 uses non-uniform levels matched to the normal distribution of weights, making it information-theoretically optimal for neural networks.
GPTQ uses Hessian-based error compensation for the highest quality; AWQ uses activation-aware channel scaling for speed and simplicity; bitsandbytes provides zero-calibration on-the-fly quantization.
Quality loss at 4-bit is modest: typically a perplexity increase of around 5% or less for 8B+ models, with the gap narrowing for larger models.
Calibration data need not match the deployment domain; 128 samples of general text (C4) is usually sufficient.
Quantization-aware training can recover most of the quality gap but requires training compute and data access.