Module 08 · Section 8.1

Model Quantization

Shrinking models from 16-bit floats to 4-bit integers while preserving quality

Quantization is the fine art of convincing a 70-billion-parameter model that it never really needed all those decimal places. Surprisingly, it usually agrees.

★ Big Picture

Why quantize? A 70B-parameter model stored in FP16 requires approximately 140 GB of GPU memory just for the weights. That exceeds the capacity of even the largest single GPUs (the A100 and H100 each top out at 80 GB). Quantization compresses weights from 16-bit or 32-bit floating point down to 8-bit, 4-bit, or even lower precision integers. A 4-bit quantized 70B model fits in roughly 35 GB, making it servable on a single GPU. The key challenge is performing this compression without destroying the model's capabilities. This section covers the mathematics of quantization, the major algorithms (GPTQ, AWQ, bitsandbytes), and practical techniques for evaluating the quality tradeoff.

⚙ Prerequisites

This section assumes understanding of floating-point number representation and PyTorch tensor operations from Section 0.2. The matrix multiplication concepts from Module 04 (attention computations) are essential for understanding where quantization is applied.

ⓘ Note

Intuition: Quantization is like reducing the color depth of an image. A photo in 24-bit color uses 16.7 million distinct colors. Reduce it to 8-bit (256 colors) and the image is 3x smaller with barely visible quality loss. Reduce further to 4-bit (16 colors) and you start to see artifacts, but the image remains recognizable. Model quantization works the same way: reducing the precision of each weight from 16-bit to 4-bit shrinks the model 4x, with a small and often acceptable quality tradeoff.

1. Why Inference Is Expensive

During autoregressive generation, the model produces one token at a time. Each token requires a full forward pass through every layer, reading all model weights from GPU memory. For a 70B model in FP16, this means transferring 140 GB of data per token through the memory bus. On an A100 with 2 TB/s memory bandwidth, merely reading the weights takes about 70 milliseconds. The actual computation (matrix multiplications) takes far less time. This makes LLM inference memory-bandwidth-bound, not compute-bound.

Quantization helps in two complementary ways. First, smaller weights mean less data to transfer from memory, directly improving throughput. Second, smaller weights mean the entire model may fit on fewer (or smaller) GPUs, reducing hardware costs. A model quantized to 4-bit occupies one quarter of the original memory, so weight transfer is roughly 4x faster.
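The bandwidth argument above is easy to check with back-of-envelope arithmetic. The sketch below estimates per-token weight-streaming time at different precisions; the helper `weight_read_time_ms` and the specific numbers (70B parameters, A100-class 2 TB/s bandwidth) are illustrative assumptions, not measurements:

```python
# Back-of-envelope per-token latency from weight traffic alone.
def weight_read_time_ms(n_params: float, bytes_per_param: float,
                        bandwidth_gb_s: float) -> float:
    """Time to stream all weights from HBM once, in milliseconds."""
    total_gb = n_params * bytes_per_param / 1e9
    return total_gb / bandwidth_gb_s * 1e3

N = 70e9    # 70B parameters
BW = 2000   # A100-class bandwidth, GB/s

fp16_ms = weight_read_time_ms(N, 2.0, BW)   # 16-bit weights
int4_ms = weight_read_time_ms(N, 0.5, BW)   # 4-bit weights

print(f"FP16: {fp16_ms:.0f} ms/token")
print(f"INT4: {int4_ms:.1f} ms/token")
```

The FP16 figure reproduces the 70 ms estimate from the text; 4-bit weights cut the streaming time to a quarter.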

2. Quantization Mathematics

2.1 Absmax (Symmetric) Quantization

The simplest quantization scheme maps a floating-point tensor to integers using only a scale factor. For an n-bit signed integer representation with range [−2^(n−1), 2^(n−1) − 1], the quantization formula is:

scale = max(|X|) / (2^(n−1) − 1)
Xq = round(X / scale)
X̂ = Xq × scale

Here, X is the original floating-point tensor, Xq is the quantized integer tensor, and X̂ is the dequantized approximation. The zero point in the floating-point space always maps to integer zero, which is why this scheme is called symmetric. It works well when values are roughly centered around zero, which is typically true for neural network weights.
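As a concrete illustration, here is a minimal absmax quantizer in PyTorch — a toy sketch of the formulas above, not a production kernel:

```python
import torch

def absmax_quantize(x: torch.Tensor, n_bits: int = 8):
    """Symmetric quantization: scale = max|x| / (2^(n-1) - 1)."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = x.abs().max() / qmax
    x_q = torch.clamp(torch.round(x / scale), -qmax - 1, qmax)
    return x_q, scale

torch.manual_seed(0)
w = torch.randn(6, 6)
w_q, scale = absmax_quantize(w)
w_hat = w_q * scale                       # dequantize
max_err = (w - w_hat).abs().max().item()
print(f"scale = {scale.item():.5f}, max abs error = {max_err:.5f}")
```

Round-to-nearest keeps the error of every in-range value below half a quantization step, scale/2.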

2.2 Zero-Point (Asymmetric) Quantization

When the tensor values are not symmetric around zero (common for activations, which often have a positive bias), asymmetric quantization adds a zero-point offset:

scale = (max(X) − min(X)) / (2^n − 1)
zero_point = round(−min(X) / scale)
Xq = round(X / scale) + zero_point

This maps the full range [min(X), max(X)] onto the unsigned integer range [0, 2^n − 1]. Dequantization reverses the process: X̂ = (Xq − zero_point) × scale. The extra zero-point parameter adds slight overhead but significantly reduces quantization error for skewed distributions.
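A matching sketch for the asymmetric scheme, applied to a skewed, post-ReLU-like tensor (a toy illustration of the formulas above):

```python
import torch

def zeropoint_quantize(x: torch.Tensor, n_bits: int = 8):
    """Asymmetric quantization onto the unsigned range [0, 2^n - 1]."""
    qmax = 2 ** n_bits - 1
    scale = (x.max() - x.min()) / qmax
    zero_point = torch.round(-x.min() / scale)
    x_q = torch.clamp(torch.round(x / scale) + zero_point, 0, qmax)
    return x_q, scale, zero_point

def zeropoint_dequantize(x_q, scale, zero_point):
    return (x_q - zero_point) * scale

# Skewed tensor (like post-ReLU activations): all values positive.
torch.manual_seed(0)
a = torch.relu(torch.randn(1024)) + 0.1
a_q, scale, zp = zeropoint_quantize(a)
a_hat = zeropoint_dequantize(a_q, scale, zp)
print(f"scale = {scale.item():.5f}, zero_point = {int(zp)}")
print(f"max abs error = {(a - a_hat).abs().max().item():.5f}")
```

Because the full integer range covers exactly [min(X), max(X)], none of the 256 levels is wasted on values that never occur.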

2.3 Granularity: Per-Tensor, Per-Channel, Per-Group

The scale (and zero-point) can be computed at different granularities:

[Figure: three granularity levels. Per-tensor: one scale for the entire matrix. Per-channel: one scale per row. Per-group (g=128): one scale and zero-point per group of 128 values, so outliers only affect their own group; storage overhead is one FP16 scale per 128 INT4 values, about +0.125 bits/value.]
Figure 8.1: Quantization granularity levels. Per-group quantization (bottom) provides the best accuracy for 4-bit models.

3. Data Types for Quantization

Data Type     Bits   Range                    Use Case
FP32          32     ±3.4 × 10^38             Training (master weights)
FP16 / BF16   16     ±65504 / ±3.4 × 10^38    Standard inference, mixed-precision training
FP8 (E4M3)    8      ±448                     Hopper GPU inference, training forward pass
FP8 (E5M2)    8      ±57344                   Training backward pass (wider range)
INT8          8      −128 to 127              Weight + activation quantization
INT4          4      −8 to 7                  Weight-only quantization (GPTQ, AWQ)
NF4           4      16 quantile levels       bitsandbytes / QLoRA

3.1 NF4: Normal Float 4-bit

NF4 is a special 4-bit data type designed by Tim Dettmers for use in QLoRA. The key insight is that neural network weights are approximately normally distributed. Instead of using uniformly spaced quantization levels (as standard INT4 does), NF4 places its 16 quantization levels at the quantiles of the standard normal distribution. This means each of the 16 bins captures approximately the same probability mass, making NF4 information-theoretically optimal for normally distributed data.

Key Insight

Standard INT4 wastes quantization levels in low-density tails and crowds them in the high-density center. NF4 fixes this by spacing levels at normal quantiles. The 16 NF4 values are precomputed: {−1.0, −0.6962, −0.5251, −0.3949, −0.2844, −0.1848, −0.0911, 0.0, 0.0796, 0.1609, 0.2461, 0.3379, 0.4407, 0.5626, 0.7230, 1.0}.
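The non-uniform spacing is easy to see by comparing adjacent gaps between the NF4 levels listed above against the constant spacing of a uniform 4-bit grid (a quick inspection sketch):

```python
# The 16 NF4 levels from the QLoRA paper, normalized to [-1, 1].
NF4 = [-1.0, -0.6962, -0.5251, -0.3949, -0.2844, -0.1848, -0.0911, 0.0,
       0.0796, 0.1609, 0.2461, 0.3379, 0.4407, 0.5626, 0.7230, 1.0]

gaps = [b - a for a, b in zip(NF4, NF4[1:])]
uniform_gap = 2 / 15                     # 16 evenly spaced levels on [-1, 1]

print(f"NF4 gap near zero: {min(gaps):.4f}")   # dense where weights cluster
print(f"NF4 gap at tails:  {max(gaps):.4f}")   # sparse where weights are rare
print(f"uniform INT4 gap:  {uniform_gap:.4f}")
```

The level spacing near zero is roughly a quarter of the spacing at the tails, matching the density of a normal distribution.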

4. Post-Training Quantization Algorithms

4.1 GPTQ: Hessian-Based Optimal Rounding

GPTQ (Frantar et al., 2022) quantizes weights one layer at a time, using second-order (Hessian) information to minimize the output error of each layer. The algorithm processes columns of the weight matrix sequentially. For each column, it rounds weights to the nearest quantization level, then compensates for the rounding error by adjusting not-yet-quantized columns using the inverse Hessian. This compensation step is what makes GPTQ significantly better than naive round-to-nearest quantization.

The core update rule for quantizing column j is:

δⱼ = (wⱼ − quant(wⱼ)) / [H⁻¹]ⱼⱼ
wₖ ← wₖ − δⱼ · [H⁻¹]ⱼₖ    for k > j

Here, H is the Hessian of the layer's squared error with respect to the weights, which equals XᵀX where X contains the activations from a calibration dataset. GPTQ requires a small calibration dataset (typically 128 samples from C4 or similar) and takes about 4 hours to quantize a 70B model on a single GPU.
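To make the update rule concrete, here is a toy re-implementation of the column-by-column loop on a small random layer. This is a didactic sketch with a fixed grid and a damped Hessian; real GPTQ adds lazy batched updates, Cholesky factorization, and per-group scales:

```python
import torch

def rtn(w, scale):
    """Round-to-nearest onto a fixed signed 4-bit grid."""
    return torch.clamp(torch.round(w / scale), -8, 7) * scale

def gptq_toy(W, X, scale):
    """Quantize columns left to right, pushing each column's rounding
    error onto the not-yet-quantized columns via the inverse Hessian."""
    W = W.clone()
    cols = W.shape[1]
    H = X.T @ X + 1e-2 * torch.eye(cols)      # damped Hessian H = XᵀX
    Hinv = torch.linalg.inv(H)
    Q = torch.zeros_like(W)
    for j in range(cols):
        Q[:, j] = rtn(W[:, j], scale)
        delta = (W[:, j] - Q[:, j]) / Hinv[j, j]
        # w_k <- w_k - delta_j * [H^-1]_jk  for k > j
        W[:, j + 1:] -= delta.unsqueeze(1) * Hinv[j, j + 1:].unsqueeze(0)
    return Q

torch.manual_seed(0)
mix = torch.randn(16, 16)
X = torch.randn(256, 16) @ mix               # correlated calibration activations
W = torch.randn(8, 16) * 0.05
Q_gptq = gptq_toy(W, X, scale=0.02)
Q_rtn = rtn(W, 0.02)

err = lambda Q: ((X @ W.T) - (X @ Q.T)).pow(2).mean().item()
print(f"round-to-nearest output MSE: {err(Q_rtn):.6f}")
print(f"GPTQ-style output MSE:       {err(Q_gptq):.6f}")
```

The compensation step matters most when the calibration activations are correlated across channels, which is why the sketch mixes the input columns before computing H.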

4.2 AWQ: Activation-Aware Weight Quantization

AWQ (Lin et al., 2024) takes a different approach. Instead of adjusting rounding decisions per column, AWQ identifies which weight channels are most important by looking at activation magnitudes. Channels that consistently produce large activations are "salient" and should be quantized more carefully. AWQ applies a per-channel scaling factor s to the weights before quantization:

Ŵ = quant(W · diag(s)) · diag(s)⁻¹

The scaling factor s is chosen to minimize the quantization error weighted by the typical activation magnitude for each channel. Salient channels get a larger scale, giving them more of the available quantization range. This is simple to implement, fast to run, and produces quality comparable to GPTQ.
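The effect of the scaling trick can be demonstrated on a toy layer where one input channel has large activations but small weights. This is a hand-tuned sketch (the scale s[3] = 4 is picked by hand for the demo); real AWQ searches for s per channel from calibration statistics:

```python
import torch

def rtn(w, n_bits=4):
    """Per-tensor round-to-nearest quantization."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = w.abs().max() / qmax
    return torch.clamp(torch.round(w / scale), -qmax - 1, qmax) * scale

torch.manual_seed(0)
X = torch.randn(512, 8)
X[:, 3] *= 20.0                  # channel 3 is salient: huge activations
W = torch.randn(8, 8) * 0.05     # (out_features, in_features)
W[:, 3] *= 0.1                   # ...but its weights are small

s = torch.ones(8)
s[3] = 4.0                       # protect the salient channel
W_plain = rtn(W)
W_awq = rtn(W * s) / s           # quant(W·diag(s))·diag(s)^-1

err = lambda Wq: ((X @ W.T) - (X @ Wq.T)).pow(2).mean().item()
print(f"plain RTN output MSE: {err(W_plain):.6f}")
print(f"AWQ-style output MSE: {err(W_awq):.6f}")
```

Scaling the salient column up before quantization gives its small weights a finer effective grid after the inverse scale is folded back out, while the other columns (which set the absmax) are untouched.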

[Figure: GPTQ vs. AWQ side by side. GPTQ: layer-wise, column-by-column quantization with error compensation using the Hessian H = XᵀX; needs calibration data (128 samples); ~4 h for a 70B model; best perplexity at 4-bit. AWQ: channel-wise scaling where salient channels get larger scales derived from activation magnitudes; a simple per-channel transform; ~1 h for a 70B model; fastest quantization with good quality.]
Figure 8.2: GPTQ compensates for rounding errors across columns using the Hessian. AWQ protects salient channels by scaling them before quantization.

5. Quantization in Practice with bitsandbytes

The bitsandbytes library by Tim Dettmers provides the simplest path to quantized inference. It integrates directly with Hugging Face Transformers and supports both 8-bit (LLM.int8()) and 4-bit (NF4/FP4) loading. No calibration dataset is required; quantization happens on the fly during model loading.

# Example 1: Loading a model in 4-bit NF4 with bitsandbytes
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

# Configure 4-bit quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",           # NF4 data type
    bnb_4bit_compute_dtype=torch.bfloat16, # Compute in BF16
    bnb_4bit_use_double_quant=True,       # Double quantization
)

model_name = "meta-llama/Llama-3.1-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
)

# Check memory usage
mem_bytes = model.get_memory_footprint()
print(f"Model memory: {mem_bytes / 1e9:.2f} GB")
print(f"Parameters: {sum(p.numel() for p in model.parameters()) / 1e9:.1f}B")

# Generate text
inputs = tokenizer("The key advantage of quantization is", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=50, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
Model memory: 5.51 GB
Parameters: 8.0B
The key advantage of quantization is that it significantly reduces the memory footprint and computational requirements of large language models while maintaining most of their original performance. By representing weights with fewer bits, models can run on consumer hardware that would otherwise be insufficient for full-precision inference.
Note: Double Quantization

When bnb_4bit_use_double_quant=True, bitsandbytes applies a second round of quantization to the quantization constants themselves. Each block of 64 weights produces one FP32 scale value (4 bytes), an overhead of 0.5 bits per parameter. Double quantization further quantizes these scales to FP8 with a second-level block size of 256, cutting the overhead to roughly 0.127 bits per parameter. For a 70B model, this saves about 3 GB of memory.
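The overhead arithmetic can be sketched directly. The numbers below follow the QLoRA setup (one FP32 scale per 64-weight block; scales re-quantized to FP8, with one FP32 second-level scale per 256 blocks) and are assumptions of that scheme, not measurements:

```python
# Storage overhead per parameter for NF4 block scales.
block = 64
plain = 32 / block                       # one FP32 scale per 64 weights
double = 8 / block + 32 / (block * 256)  # FP8 scales + FP32 per 256 blocks

saved_gb_70b = (plain - double) * 70e9 / 8 / 1e9  # bits -> GB for 70B params
print(f"plain overhead:   {plain:.3f} bits/param")
print(f"double-quantized: {double:.3f} bits/param")
print(f"saved on a 70B model: {saved_gb_70b:.2f} GB")
```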

6. GPTQ Quantization with AutoGPTQ

GPTQ (Frantar et al., 2022) uses Hessian-based optimal rounding to decide how to quantize each weight. The Hessian of the loss with respect to the weights captures the second-order sensitivity: weights where the Hessian has large eigenvalues are "sensitive" (small perturbations cause large loss increases), while weights with small eigenvalues are "insensitive." GPTQ processes weights one column at a time, using the Hessian information to (1) round each weight to the nearest quantized value, and (2) distribute the rounding error across not-yet-quantized columns to minimize the total loss increase. This column-by-column error compensation is what makes GPTQ so effective: it achieves near-optimal rounding decisions in a single pass through the weight matrix, taking minutes rather than the hours required by iterative methods.

# Example 2: Quantizing a model with GPTQ
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_name = "meta-llama/Llama-3.1-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Configure GPTQ quantization
gptq_config = GPTQConfig(
    bits=4,                  # 4-bit quantization
    group_size=128,          # Per-group granularity
    desc_act=True,           # Activation order (better quality)
    dataset="c4",            # Calibration dataset
    tokenizer=tokenizer,
)

# Load and quantize the model
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=gptq_config,
    device_map="auto",
)

# Save the quantized model
model.save_pretrained("./llama-8b-gptq-4bit")
tokenizer.save_pretrained("./llama-8b-gptq-4bit")
print("Quantized model saved successfully")
Quantizing model.layers: 100%|####| 32/32 [12:34<00:00, 23.56s/layer]
Quantized model saved successfully

7. AWQ Quantization

# Example 3: Quantizing with AWQ using the autoawq library
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_name = "meta-llama/Llama-3.1-8B-Instruct"
quant_path = "./llama-8b-awq-4bit"

# Load model for quantization
model = AutoAWQForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Configure AWQ
quant_config = {
    "zero_point": True,      # Use asymmetric quantization
    "q_group_size": 128,     # Group size
    "w_bit": 4,              # 4-bit weights
    "version": "GEMM",       # Optimized GEMM kernels
}

# Quantize (uses calibration data internally)
model.quantize(tokenizer, quant_config=quant_config)
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)

print(f"AWQ model saved to {quant_path}")
print(f"Original size: ~16 GB (FP16)")
print(f"Quantized size: ~4.5 GB (INT4)")
AWQ: 100%|####| 32/32 [08:15<00:00, 15.47s/layer]
AWQ model saved to ./llama-8b-awq-4bit
Original size: ~16 GB (FP16)
Quantized size: ~4.5 GB (INT4)

8. Calibration Strategies

Both GPTQ and AWQ require a calibration dataset to compute their respective statistics. The calibration data does not need to match the final use case: general web-text corpora work well, with C4 and WikiText-2 being the most common choices. A few hundred sampled sequences (GPTQ typically uses 128) are sufficient, since only aggregate activation statistics are needed; what matters is that the calibration text is broad enough to be statistically representative.

9. Quality Degradation Analysis

Quantization always introduces some quality loss. The key question is whether this loss is acceptable for your application. The standard metric is perplexity on a held-out evaluation set (typically WikiText-2 or a domain-specific corpus).

Warning: Outlier Features

Some transformer models contain "outlier features": a small number of hidden dimensions with activation magnitudes 10x to 100x larger than the rest. These outliers appear starting at around the 6B parameter scale and become more prominent in larger models. Naive quantization of layers containing these outliers causes catastrophic quality degradation. The LLM.int8() algorithm in bitsandbytes handles this by keeping outlier dimensions in FP16 while quantizing the rest to INT8. GPTQ and AWQ also have mechanisms to protect salient channels.
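A tiny experiment shows why a single outlier is so damaging under per-tensor symmetric quantization (a didactic sketch, not the LLM.int8() algorithm itself):

```python
import torch

def max_quant_error(x, n_bits=8):
    """Max abs error of symmetric per-tensor round-to-nearest."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = x.abs().max() / qmax
    x_hat = torch.clamp(torch.round(x / scale), -qmax - 1, qmax) * scale
    return (x - x_hat).abs().max().item()

torch.manual_seed(0)
h = torch.randn(4096)            # typical hidden-state magnitudes
h_out = h.clone()
h_out[7] = 60.0                  # one outlier dimension, far larger than the rest

err_clean = max_quant_error(h)
err_outlier = max_quant_error(h_out)
print(f"no outlier:   max err = {err_clean:.4f}")
print(f"with outlier: max err = {err_outlier:.4f}")  # scale stretched for all
```

One extreme value stretches the shared scale, so every other dimension loses resolution — exactly the failure mode that mixed-precision decomposition and per-group scales are designed to avoid.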

# Example 4: Benchmarking quantization quality
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
import time

def measure_perplexity(model, tokenizer, text, stride=512):
    """Calculate perplexity on a text sample with a sliding window."""
    encodings = tokenizer(text, return_tensors="pt")
    max_length = model.config.max_position_embeddings
    seq_len = encodings.input_ids.size(1)

    nlls = []
    prev_end = 0
    for begin in range(0, seq_len, stride):
        end = min(begin + max_length, seq_len)
        input_ids = encodings.input_ids[:, begin:end].to(model.device)
        target_ids = input_ids.clone()
        trg_len = end - prev_end
        if trg_len < input_ids.size(1):
            target_ids[:, :-trg_len] = -100  # score only tokens not yet scored

        with torch.no_grad():
            outputs = model(input_ids, labels=target_ids)
            nlls.append(outputs.loss.item())

        prev_end = end
        if end == seq_len:
            break

    return torch.exp(torch.tensor(nlls).mean()).item()

def benchmark_generation(model, tokenizer, prompt, n_tokens=100):
    """Measure generation speed in tokens per second."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    torch.cuda.synchronize()
    start = time.perf_counter()
    outputs = model.generate(**inputs, max_new_tokens=n_tokens, do_sample=False)
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    # Count the tokens actually produced (generation may stop early at EOS)
    n_generated = outputs.shape[1] - inputs.input_ids.shape[1]
    return n_generated / elapsed

# Results table (pre-computed for Llama 3.1 8B on A100)
results = {
    "FP16":       {"ppl": 6.14, "tps": 42.3, "mem_gb": 16.1},
    "INT8":       {"ppl": 6.17, "tps": 68.1, "mem_gb": 8.5},
    "GPTQ-4bit":  {"ppl": 6.41, "tps": 95.7, "mem_gb": 4.8},
    "AWQ-4bit":   {"ppl": 6.38, "tps": 102.3, "mem_gb": 4.5},
    "NF4 (bnb)":  {"ppl": 6.45, "tps": 78.4, "mem_gb": 5.5},
}

print(f"{'Method':<15} {'Perplexity':>12} {'Tokens/sec':>12} {'Memory (GB)':>12}")
print("-" * 55)
for method, r in results.items():
    print(f"{method:<15} {r['ppl']:>12.2f} {r['tps']:>12.1f} {r['mem_gb']:>12.1f}")
Method            Perplexity   Tokens/sec  Memory (GB)
-------------------------------------------------------
FP16                    6.14         42.3         16.1
INT8                    6.17         68.1          8.5
GPTQ-4bit               6.41         95.7          4.8
AWQ-4bit                6.38        102.3          4.5
NF4 (bnb)               6.45         78.4          5.5
Key Insight

The perplexity increase from FP16 to 4-bit quantization is less than 5% for 8B+ models. Meanwhile, memory usage drops by 3x to 4x and inference speed roughly doubles. For most practical applications, 4-bit quantization is the sweet spot for serving LLMs on limited hardware. The quality gap narrows further as model size increases: 70B models lose less than 2% perplexity at 4-bit.

10. The GGUF Format and Local Inference

For local deployment, the GGUF (GPT-Generated Unified Format) file format has become the dominant standard. Created for the llama.cpp project and used by Ollama, GGUF stores quantized model weights in a single, self-contained file with embedded metadata (tokenizer, architecture parameters, quantization scheme).

GGUF supports a rich set of quantization methods called k-quants (Q2_K through Q6_K) that use mixed precision within each tensor. Instead of applying a uniform 4-bit quantization to every weight, k-quants assign different bit widths to different parts of the weight matrix based on sensitivity analysis. The most important attention and output layers receive higher precision (5 or 6 bits), while less sensitive feed-forward layers use 3 or 4 bits. This mixed-precision approach typically produces better quality than uniform quantization at the same average bits-per-weight.

GGUF Quant   Bits/Weight   Model Size (7B)   Quality
Q2_K         ~2.5          ~2.8 GB           Significant degradation
Q4_K_M       ~4.8          ~4.6 GB           Good; recommended minimum
Q5_K_M       ~5.7          ~5.3 GB           Very good; near-FP16
Q6_K         ~6.6          ~5.9 GB           Excellent; minimal loss
Q8_0         8.0           ~7.2 GB           Near-lossless
✍ Modify and Observe

Download a GGUF model from Hugging Face (search for "TheBloke" or "bartowski" for curated quantizations). Try running it with Ollama: ollama run llama3.1:8b-q4_K_M and then ollama run llama3.1:8b-q8_0. Ask the same factual and reasoning questions to both. Can you detect quality differences? Try math problems, code generation, and factual recall to see where lower quantization hurts most.

11. Quantization-Aware Training

Post-training quantization (PTQ) methods like GPTQ and AWQ compress an already-trained model. An alternative is quantization-aware training (QAT), where the model is trained (or fine-tuned) with simulated quantization in the forward pass. During training, weights are quantized and dequantized before each matrix multiplication. The backward pass uses the straight-through estimator (STE): gradients flow through the quantization operation as if it were the identity function, since the true gradient of rounding is zero almost everywhere.

QAT typically produces higher quality than PTQ at the same bit width, because the model learns to compensate for quantization noise during training. However, it requires access to training data and compute, making it impractical for many scenarios where PTQ is the only option.
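A minimal fake-quantization op with an STE backward pass might look like this — a sketch of the idea, not a full QAT recipe:

```python
import torch

class FakeQuant(torch.autograd.Function):
    """Quantize-dequantize in forward; identity gradient in backward."""
    @staticmethod
    def forward(ctx, w, n_bits):
        qmax = 2 ** (n_bits - 1) - 1
        scale = w.abs().max() / qmax
        return torch.clamp(torch.round(w / scale), -qmax - 1, qmax) * scale

    @staticmethod
    def backward(ctx, grad_out):
        # Straight-through estimator: round() has zero gradient almost
        # everywhere, so pretend the whole op was the identity.
        return grad_out, None

torch.manual_seed(0)
w = torch.randn(4, 4, requires_grad=True)
x = torch.randn(8, 4)
out = x @ FakeQuant.apply(w, 4).T    # forward pass sees quantized weights
loss = out.pow(2).mean()
loss.backward()
print("grad norm through quantizer:", w.grad.norm().item())
```

The forward pass computes with 4-bit weights, but the gradient reaches the full-precision master weights untouched, letting the optimizer adapt them to the quantization noise.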

Check Your Understanding

1. Why is per-group quantization preferred over per-tensor for 4-bit models?

Show Answer
Per-tensor quantization uses a single scale factor for the entire weight matrix. If even one extreme outlier value exists, the scale must accommodate it, leaving most of the quantization range underutilized. Per-group quantization (typically groups of 128) computes a separate scale for each group. Outliers only affect their local group, while other groups retain fine-grained resolution. The storage overhead is minimal: one FP16 scale per 128 INT4 values adds only 0.125 bits per parameter.

2. What is the fundamental difference between how GPTQ and AWQ handle quantization error?

Show Answer
GPTQ processes weight matrix columns sequentially and uses the inverse Hessian to redistribute the rounding error from each quantized column to the remaining unquantized columns. This is a direct error compensation approach. AWQ, by contrast, identifies salient weight channels (those with large activation magnitudes) and applies a per-channel scaling before quantization. Scaling up salient channels gives them more of the available quantization range, protecting the most important weights. AWQ is simpler and faster to run, while GPTQ can achieve slightly better perplexity in some cases.

3. Why does NF4 use non-uniform quantization levels, and why is it better than standard INT4 for neural network weights?

Show Answer
Neural network weights are approximately normally distributed, with most values clustered near zero and few values in the tails. Standard INT4 uses uniformly spaced levels, which wastes resolution in the sparse tails while under-resolving the dense center. NF4 places its 16 quantization levels at the quantiles of the standard normal distribution. Each of the 16 bins captures approximately 1/16th of the probability mass, meaning every bin is equally likely to be used. This is information-theoretically optimal for normally distributed data, maximizing the effective information captured per bit.

4. A model has 70 billion parameters in FP16. How much memory do the weights require, and approximately how much at INT4 with per-group quantization (group size 128)?

Show Answer
In FP16, each parameter is 2 bytes, so 70B parameters require 140 GB. In INT4, each parameter is 0.5 bytes, giving 35 GB for the weights. Per-group quantization with group size 128 adds one FP16 scale (2 bytes) per 128 values, which is 2/128 = 0.015625 bytes per parameter, or about 1.09 GB for 70B parameters. Total INT4 memory is approximately 36.1 GB, making it possible to fit on a single 48 GB GPU (such as an RTX A6000 or L40S).

Key Takeaways