Module 14 · Section 14.1

LoRA & QLoRA

Low-rank adaptation: training 0.1% of parameters while matching full fine-tuning quality
★ Big Picture

LoRA is the single most important technique for practical LLM fine-tuning. Instead of updating all model weights, LoRA freezes the pretrained model and injects small trainable low-rank matrices into each layer. This reduces trainable parameters by 100x or more, cuts GPU memory by 60-70%, and produces adapters that can be swapped at serving time without reloading the base model. QLoRA extends this further by quantizing the frozen weights to 4-bit, enabling fine-tuning of 70B models on a single 48GB GPU.

1. The Full Fine-Tuning Problem

When you fine-tune a model with full parameter updates, every weight needs a gradient plus optimizer state (momentum and variance for Adam). For a 7B-parameter model in FP16, that means 14 GB just for the weights, plus roughly 42 GB for gradients and optimizer states, totaling about 56 GB of GPU memory. Scaling to 13B or 70B models makes this prohibitively expensive.

The key insight behind parameter-efficient methods is that the weight changes during fine-tuning are effectively low-rank. Research has shown that when you compute the difference between a fine-tuned model and its pretrained base (the "task-specific delta"), this delta matrix has very low intrinsic dimensionality. Most of the information in the update can be captured by a much smaller matrix.

Model Size | Full FT Memory (FP16 + Adam) | LoRA Memory (r=16) | QLoRA Memory (NF4, r=16)
-----------|------------------------------|--------------------|-------------------------
7B         | ~56 GB                       | ~16 GB             | ~6 GB
13B        | ~104 GB                      | ~28 GB             | ~10 GB
70B        | ~560 GB                      | ~160 GB            | ~36 GB
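The full fine-tuning column above follows from simple accounting. This sketch uses the section's simplified model (FP16 weight, FP16 gradient, and two FP16 Adam moments, i.e. 8 bytes per trainable parameter); real runs also need memory for activations and framework overhead:

```python
def full_ft_memory_gb(n_params: float) -> float:
    """Simplified full fine-tuning memory: weights + gradients + Adam states."""
    # FP16 weight + FP16 gradient + FP16 Adam momentum + FP16 Adam variance
    bytes_per_param = 2 + 2 + 2 + 2
    return n_params * bytes_per_param / 1e9

for name, n in [("7B", 7e9), ("13B", 13e9), ("70B", 70e9)]:
    print(f"{name}: ~{full_ft_memory_gb(n):.0f} GB for full fine-tuning")
```

Running this reproduces the 56 / 104 / 560 GB figures in the table.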

2. LoRA Mathematics

2.1 The Core Decomposition

LoRA (Low-Rank Adaptation) works by expressing the weight update as a product of two small matrices. For a pretrained weight matrix W of dimension d × k, instead of computing a full update ΔW (also d × k), LoRA decomposes it as:

W' = W + ΔW = W + BA

where B is d × r and A is r × k, with the rank r being much smaller than both d and k. Typical values of r range from 4 to 64, while d and k are typically 4096 or larger. This means the number of trainable parameters drops from d × k (e.g., 16.7 million for a 4096 × 4096 matrix) to r × (d + k) (e.g., 131,072 for r=16).

[Figure: W (d × k, frozen) + B (d × r) · A (r × k, trainable) = W' (d × k, adapted). Parameter count example (d=4096, k=4096, r=16): full fine-tuning d × k = 16,777,216 parameters; LoRA r × (d + k) = 16 × 8,192 = 131,072 parameters (0.78%).]
Figure 1: LoRA decomposes the weight update into two small trainable matrices B and A, reducing parameters by ~128x.
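The parameter arithmetic in Figure 1 generalizes to any layer shape; a quick sketch:

```python
def lora_params(d: int, k: int, r: int) -> int:
    """Trainable parameters for a LoRA update BA on a d x k weight matrix."""
    return r * (d + k)  # B is d x r, A is r x k

d = k = 4096
full = d * k
for r in (4, 8, 16, 64):
    p = lora_params(d, k, r)
    print(f"r={r:2d}: {p:>9,} trainable params ({p / full:.2%} of full)")
```

At r=16 this gives 131,072 parameters, 0.78% of the 16,777,216 in the full matrix.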

2.2 Initialization and Scaling

LoRA uses a specific initialization strategy: matrix A is initialized with a random Gaussian distribution, and matrix B is initialized to all zeros. This means that at the start of training, BA = 0, and the model behaves exactly like the pretrained base. Training gradually moves the model away from this starting point.

The scaling factor α (alpha) controls the magnitude of the LoRA update. The actual update applied is:

W' = W + (α / r) · BA

The ratio α/r acts as a learning rate multiplier for the LoRA weights. A common convention is to set α = 2r (so the effective multiplier is 2), but the optimal value depends on the task. Increasing α relative to r makes the adaptation more aggressive; decreasing it keeps the model closer to the pretrained weights.

◆ Key Insight

When you double the rank r, you should also consider doubling α to maintain the same effective learning rate. Many practitioners set α = 2 × r as a starting point, then adjust based on validation performance. If training diverges, reduce α; if the model barely moves from the base, increase it.
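These initialization and scaling rules fit in a short sketch. This is a minimal illustrative wrapper (not the PEFT implementation): A gets a small Gaussian init, B starts at zero, and the α/r factor scales the low-rank path.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Minimal LoRA wrapper around a frozen linear layer (illustrative)."""
    def __init__(self, base: nn.Linear, r: int = 16, alpha: int = 32):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                      # freeze pretrained weights
        d, k = base.out_features, base.in_features
        self.A = nn.Parameter(torch.randn(r, k) * 0.01)  # Gaussian init
        self.B = nn.Parameter(torch.zeros(d, r))         # zeros, so BA = 0 at start
        self.scaling = alpha / r                         # the alpha/r multiplier

    def forward(self, x):
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scaling

base = nn.Linear(64, 64)
lora = LoRALinear(base, r=8, alpha=16)
x = torch.randn(2, 64)
# At initialization BA = 0, so the wrapped layer matches the base exactly
print("outputs match at init:", torch.allclose(lora(x), base(x)))
```

Because B starts at zero, the first forward pass is identical to the pretrained model; only training moves the output away from the base.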

2.3 Why Low-Rank Works

The effectiveness of low-rank adaptation rests on a remarkable empirical finding: the "intrinsic dimensionality" of fine-tuning updates is far lower than the full parameter count would suggest. When researchers analyzed the singular value decomposition of ΔW matrices from full fine-tuning runs, they found that a small number of singular values capture the vast majority of the update's information content. In many cases, ranks as low as 4 or 8 capture over 90% of the useful signal.

This makes intuitive sense. Fine-tuning typically adapts a model to a specific domain or task format. The knowledge required for this adaptation (new terminology, output format preferences, domain-specific reasoning patterns) is a small modification relative to the vast general knowledge encoded in the pretrained weights.
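You can see this spectral concentration directly with an SVD. The sketch below uses a synthetic delta (a genuinely low-rank update plus small noise) as a stand-in for a real W_finetuned − W_pretrained matrix, which the cited analyses report behaves similarly:

```python
import torch

torch.manual_seed(0)
d = 256
true_rank = 8
# Synthetic "fine-tuning delta": low-rank structure plus small dense noise
delta = torch.randn(d, true_rank) @ torch.randn(true_rank, d) \
        + 0.05 * torch.randn(d, d)

s = torch.linalg.svdvals(delta)
energy = (s ** 2).cumsum(0) / (s ** 2).sum()   # cumulative spectral energy
for r in (4, 8, 16):
    print(f"rank {r:2d} captures {energy[r - 1].item():.1%} of the update's energy")
```

The top few singular directions dominate; a rank-8 approximation of this delta retains essentially all of its energy.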

3. LoRA Hyperparameters in Practice

3.1 Rank (r) Selection

Rank | Trainable Params (7B model) | Best For | Risk
-----|-----------------------------|----------|-----
4    | ~2M  | Simple format adaptation, chat templates     | May underfit complex tasks
8    | ~4M  | Classification, simple instruction following | Good default for most tasks
16   | ~8M  | Domain adaptation, moderate complexity       | Slight increase in memory
32   | ~16M | Complex reasoning, code generation           | Diminishing returns begin
64   | ~33M | Very complex tasks, near full FT quality     | Memory approaches full FT

3.2 Target Module Selection

Not all weight matrices benefit equally from LoRA adaptation. The standard practice is to apply LoRA to the attention projection matrices: q_proj, k_proj, v_proj, and o_proj. Research and practice have converged on the recommendation to also include the MLP layers (gate_proj, up_proj, down_proj) for best results, though this increases trainable parameters.

from peft import LoraConfig, get_peft_model, TaskType

# Standard configuration: attention layers only
lora_config_basic = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,
    lora_alpha=32,           # alpha = 2r
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    bias="none",
)

# Recommended: attention + MLP layers for best quality
lora_config_full = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    bias="none",
)

# Apply LoRA to a model
model = get_peft_model(model, lora_config_full)
model.print_trainable_parameters()
# Example output:
# trainable params: 13,631,488 || all params: 6,751,219,712 || trainable%: 0.2019
ⓘ Note

The target_modules names vary by model architecture. LLaMA uses q_proj, k_proj, etc. Mistral uses the same convention. GPT-NeoX uses query_key_value. Falcon uses query_key_value and dense. You can use target_modules="all-linear" in the PEFT library to automatically target all linear layers.
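One way to check which names your architecture actually uses is to walk the module tree. The sketch below uses a toy stand-in block so it runs anywhere; on a real checkpoint you would pass the loaded model object instead:

```python
import torch.nn as nn

def find_linear_module_names(model: nn.Module) -> set:
    """Collect leaf names of all nn.Linear modules (what "all-linear" targets)."""
    names = set()
    for full_name, module in model.named_modules():
        if isinstance(module, nn.Linear):
            names.add(full_name.split(".")[-1])
    return names

# Toy stand-in for a transformer block (real names vary by architecture)
block = nn.ModuleDict({
    "q_proj": nn.Linear(64, 64),
    "k_proj": nn.Linear(64, 64),
    "v_proj": nn.Linear(64, 64),
    "o_proj": nn.Linear(64, 64),
    "gate_proj": nn.Linear(64, 256),
})
print(find_linear_module_names(block))
```

The resulting set is exactly what you would pass to target_modules for that architecture.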

4. Complete LoRA Fine-Tuning Pipeline

Here is a complete, production-ready pipeline for LoRA fine-tuning using the Hugging Face ecosystem. This example fine-tunes a Llama-3 8B model on instruction-following data.

import torch
from transformers import (
    AutoModelForCausalLM, AutoTokenizer,
    TrainingArguments, BitsAndBytesConfig
)
from peft import LoraConfig, get_peft_model, TaskType
from trl import SFTTrainer
from datasets import load_dataset

# 1. Load model and tokenizer
model_id = "meta-llama/Meta-Llama-3-8B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# 2. Configure LoRA
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

# 3. Load and format dataset
dataset = load_dataset("tatsu-lab/alpaca", split="train")

def format_instruction(example):
    if example["input"]:
        text = f"""### Instruction:
{example['instruction']}

### Input:
{example['input']}

### Response:
{example['output']}"""
    else:
        text = f"""### Instruction:
{example['instruction']}

### Response:
{example['output']}"""
    return {"text": text}

dataset = dataset.map(format_instruction)

# 4. Training arguments
training_args = TrainingArguments(
    output_dir="./lora-llama3-8b",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    warmup_ratio=0.05,
    bf16=True,
    logging_steps=10,
    save_strategy="steps",
    save_steps=200,
    optim="adamw_torch",
    max_grad_norm=1.0,
)

# 5. Train with SFTTrainer
trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    tokenizer=tokenizer,
    max_seq_length=2048,
    dataset_text_field="text",
)

trainer.train()
trainer.save_model("./lora-llama3-8b/final")

5. QLoRA: 4-bit Quantized LoRA

QLoRA (Quantized LoRA) combines three innovations to dramatically reduce memory requirements: NF4 (Normal Float 4-bit) quantization of the frozen base weights, double quantization to compress the quantization constants themselves, and paged optimizers that gracefully handle GPU memory spikes.

5.1 NF4 Quantization

NF4 is a data type specifically designed for normally distributed neural network weights. Unlike standard 4-bit integer quantization (which spaces values uniformly), NF4 places more quantization levels near zero where neural network weights concentrate. This yields significantly lower quantization error for the same bit budget.

[Figure: NF4 vs uniform INT4 quantization levels on the range −1.0 to +1.0. NF4 places more levels near zero, where most weights cluster.]
Figure 2: NF4 quantization levels are denser near zero, matching the normal distribution of neural network weights.
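The advantage in Figure 2 can be demonstrated numerically. The sketch below builds "NF4-like" levels from equally spaced quantiles of a standard normal, rescaled to [−1, 1]; the real NF4 codebook (defined in the QLoRA paper and bitsandbytes) is constructed similarly but is not identical to these values.

```python
import random
from statistics import NormalDist

def quantize(x, levels):
    """Round x to the nearest available quantization level."""
    return min(levels, key=lambda l: abs(l - x))

# Illustrative normal-quantile levels (NOT the exact NF4 codebook)
nd = NormalDist()
quantiles = [nd.inv_cdf((i + 0.5) / 16) for i in range(16)]
scale = max(abs(q) for q in quantiles)
nf_levels = [q / scale for q in quantiles]

# Uniform INT4-style levels on the same range
uniform_levels = [-1 + 2 * i / 15 for i in range(16)]

# Compare mean squared quantization error on Gaussian-distributed weights
random.seed(0)
weights = [random.gauss(0, 1) for _ in range(10_000)]
absmax = max(abs(w) for w in weights)
weights = [w / absmax for w in weights]   # absmax normalization, as in NF4

def mse(levels):
    return sum((w - quantize(w, levels)) ** 2 for w in weights) / len(weights)

print(f"uniform MSE: {mse(uniform_levels):.5f}  normal-quantile MSE: {mse(nf_levels):.5f}")
```

Because the quantile-based levels are denser where the Gaussian weights concentrate, their quantization error comes out lower than the uniform grid's at the same 4-bit budget.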

5.2 Double Quantization

Standard blockwise quantization stores a scale factor (the block's absmax) for each block of weights, typically 64 weights per block. These quantization constants are stored in FP32, consuming additional memory. Double quantization applies a second round of quantization to these constants, compressing them from FP32 to FP8 and saving roughly 0.4 bits per parameter across the entire model. For a 70B model, this translates to approximately 3 GB of additional savings.
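The "roughly 0.4 bits per parameter" figure follows from simple accounting, using the QLoRA paper's block sizes (64 weights per first-level block, 256 first-level constants per second-level block):

```python
BLOCK = 64     # weights per first-level quantization block
BLOCK2 = 256   # first-level constants per second-level block

# Without double quantization: one FP32 scale per 64-weight block
plain_bits = 32 / BLOCK                           # 0.5 bits/param

# With double quantization: FP8 scales plus FP32 second-level constants
double_bits = 8 / BLOCK + 32 / (BLOCK * BLOCK2)   # ~0.127 bits/param

saving = plain_bits - double_bits
print(f"saving: {saving:.3f} bits/param "
      f"(~{saving * 70e9 / 8 / 1e9:.1f} GB on a 70B model)")
```

This works out to about 0.373 bits per parameter, roughly 3.3 GB on a 70B model, matching the figures above.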

5.3 Paged Optimizers

During training, GPU memory usage can spike temporarily (for example, during gradient computation for a particularly long sequence). Paged optimizers use NVIDIA's unified memory feature to automatically page optimizer states between GPU and CPU memory when needed, preventing out-of-memory errors. The performance cost is minimal because these spikes are typically brief and infrequent.

5.4 Complete QLoRA Configuration

import torch
from transformers import (
    AutoModelForCausalLM, AutoTokenizer,
    BitsAndBytesConfig, TrainingArguments,
)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from trl import SFTTrainer

# QLoRA: 4-bit quantization config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",             # NF4 data type
    bnb_4bit_compute_dtype=torch.bfloat16,   # Compute in BF16
    bnb_4bit_use_double_quant=True,         # Double quantization
)

# Load model in 4-bit
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",
    quantization_config=bnb_config,
    device_map="auto",
)

# Prepare model for k-bit training (handles gradient checkpointing)
model = prepare_model_for_kbit_training(model)

# Apply LoRA on top of quantized model
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules="all-linear",    # Target all linear layers
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

# Training with paged optimizer
training_args = TrainingArguments(
    output_dir="./qlora-output",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    num_train_epochs=3,
    learning_rate=2e-4,
    bf16=True,
    optim="paged_adamw_8bit",       # Paged optimizer
    lr_scheduler_type="cosine",
    warmup_ratio=0.05,
    logging_steps=10,
    save_strategy="steps",
    save_steps=100,
    gradient_checkpointing=True,     # Save even more memory
    max_grad_norm=0.3,              # QLoRA paper recommendation
)

# The rest follows the same SFTTrainer pattern
trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    max_seq_length=2048,
)
trainer.train()
⚠ Warning

QLoRA training is slower than standard LoRA (roughly 30-40% slower per step) because of the overhead of dequantizing weights for each forward pass. The tradeoff is worthwhile when your GPU memory is the binding constraint, but if you have enough VRAM for regular LoRA in BF16, that will train faster.

6. Adapter Merging Strategies

After training, you have two options for deployment: keep the adapter separate (and load it dynamically) or merge it permanently into the base model weights. Each approach has tradeoffs.

[Figure: deployment options, separate adapters vs. merged weights.]
Option A, separate adapters: one base model serves multiple adapters (A, B, C). Pros: hot-swap at serving time; one base serves many tasks; small adapter files (~50 MB). Con: slight latency overhead.
Option B, merged weights: base + adapter saved as a single model. Pros: zero serving overhead; standard model format; compatible with all tools. Con: one model per task.
Figure 3: Separate adapters enable multi-task serving from one base; merged weights eliminate inference overhead.
from peft import PeftModel, AutoPeftModelForCausalLM
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# === Option A: Load adapter separately ===
base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
# Load different adapters dynamically
model = PeftModel.from_pretrained(base_model, "./lora-medical")
# Switch to another adapter
model.load_adapter("./lora-legal", adapter_name="legal")
model.set_adapter("legal")

# === Option B: Merge and save ===
model = AutoPeftModelForCausalLM.from_pretrained(
    "./lora-medical",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
merged_model = model.merge_and_unload()
merged_model.save_pretrained("./llama3-medical-merged")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
tokenizer.save_pretrained("./llama3-medical-merged")

# === Option C: Merge QLoRA (requires dequantization) ===
# Load the QLoRA model in full precision for merging
model = AutoPeftModelForCausalLM.from_pretrained(
    "./qlora-output",
    torch_dtype=torch.float16,
    device_map="auto",
    low_cpu_mem_usage=True,
)
merged = model.merge_and_unload()
merged.save_pretrained("./merged-model")

print("Merged model saved. Upload to HF Hub or serve with vLLM.")
◆ Key Insight

When merging QLoRA adapters, you must load the base model in higher precision (FP16 or BF16) first, then merge. This is because the merge operation (W' = W + (α/r) · BA) needs sufficient numerical precision. Merging in 4-bit would introduce unacceptable quantization noise. After merging, you can re-quantize the merged model to GGUF or AWQ for efficient serving.
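The merge itself is just the update from Section 2.2 folded into the weight once per layer. A minimal sketch with plain tensors (illustrative, not the PEFT merge code):

```python
import torch

d = k = 64
r, alpha = 8, 16

W = torch.randn(d, k)          # frozen base weight
A = torch.randn(r, k) * 0.01   # trained LoRA factors
B = torch.randn(d, r)          # nonzero after training

def adapter_forward(x):
    # Base path + scaled LoRA path, as computed during training/serving
    return x @ W.T + (x @ A.T @ B.T) * (alpha / r)

# Merging folds the update into one matrix: W' = W + (alpha/r) * B @ A
W_merged = W + (alpha / r) * (B @ A)

x = torch.randn(4, k)
assert torch.allclose(adapter_forward(x), x @ W_merged.T, atol=1e-5)
print("merged weight reproduces the adapter output")
```

After the merge, the adapter matrices can be discarded: the single W' matrix produces the same outputs with no extra matmuls at inference time.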

7. LoRA Hyperparameter Tuning Guide

Finding optimal LoRA hyperparameters requires systematic experimentation. Here is a practical guide based on what works across a wide range of tasks and model sizes.

Hyperparameter | Default | When to Increase | When to Decrease
---------------|---------|------------------|------------------
r (rank)       | 16    | Complex tasks, large datasets, reasoning | Simple format changes, small datasets
lora_alpha     | 2 × r | Model not adapting enough                | Training diverging, loss spiking
lora_dropout   | 0.05  | Overfitting (val loss rises)             | Large dataset, underfitting
learning_rate  | 2e-4  | Underfitting, slow convergence           | Divergence, loss oscillation
max_grad_norm  | 1.0   | Very stable training                     | Gradient spikes (try 0.3)
ⓘ Note

LoRA learning rates are typically 5-10x higher than full fine-tuning learning rates. This is because only a small fraction of parameters are being updated, so each update needs to have a larger effect. A learning rate of 2e-4 for LoRA corresponds roughly to 2e-5 for full fine-tuning in terms of per-step model change.

8. The PEFT Library Ecosystem

The Hugging Face peft library provides a unified interface for all parameter-efficient methods. Beyond basic LoRA, it supports loading adapters from the Hub, combining adapters, and quantized training workflows.

import torch
from peft import (
    PeftModel,
    PeftConfig,
    get_peft_model,
    LoraConfig,
    TaskType,
    AutoPeftModelForCausalLM,
)

# Load a LoRA adapter from Hugging Face Hub
model = AutoPeftModelForCausalLM.from_pretrained(
    "username/my-lora-adapter",   # Adapter repo on HF Hub
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

# Inspect adapter configuration
config = PeftConfig.from_pretrained("username/my-lora-adapter")
print(f"Base model: {config.base_model_name_or_path}")
print(f"Rank: {config.r}, Alpha: {config.lora_alpha}")
print(f"Target modules: {config.target_modules}")

# Push adapter to Hub (only saves the small adapter weights)
model.push_to_hub("username/my-lora-adapter")
# Adapter size: typically 50-200 MB vs 14+ GB for full model

Section 14.1 Quiz

1. In the LoRA decomposition W' = W + BA, what are the dimensions of matrices B and A for a weight matrix of size d×k with rank r?

Show Answer
B is d×r and A is r×k. The total trainable parameters are r×(d+k), which is much smaller than the d×k parameters in the original matrix when r is small. For example, with d=k=4096 and r=16, this is 131,072 versus 16,777,216 (a 128x reduction).

2. Why is matrix B initialized to zeros and A initialized with random values, rather than the reverse?

Show Answer
Initializing B to zeros ensures that BA = 0 at the start of training, so the model begins as an exact copy of the pretrained base. This provides a stable starting point. The choice of B=0 (rather than A=0) is a convention; either would work. The key requirement is that the product BA starts at zero so training begins from the pretrained behavior.

3. What is the role of the alpha/r scaling factor in LoRA, and how should you adjust alpha when changing rank?

Show Answer
The ratio alpha/r acts as a learning rate multiplier for the LoRA update. When you double r, you should generally double alpha to maintain the same effective scaling. A common starting point is alpha = 2r. If the model is not adapting enough, increase alpha; if training is unstable, decrease it.

4. What three innovations does QLoRA combine, and why is each one necessary?

Show Answer
QLoRA combines: (1) NF4 quantization, which reduces base model memory by 4x with minimal quality loss using a data type optimized for normally distributed weights; (2) double quantization, which compresses the quantization constants themselves, saving an additional ~0.4 bits per parameter; (3) paged optimizers, which use unified memory to gracefully handle GPU memory spikes during training without OOM errors.

5. When should you keep LoRA adapters separate versus merging them into the base model?

Show Answer
Keep adapters separate when you need to serve multiple tasks from a single base model (hot-swapping adapters per request), when storage is a concern (adapters are ~50MB vs. 14+ GB for a full model), or when you plan to update adapters independently. Merge when you need maximum inference speed (zero adapter overhead), when using inference engines that do not support adapter loading, or when deploying a single dedicated model for one task.

Key Takeaways