LoRA is the single most important technique for practical LLM fine-tuning. Instead of updating all model weights, LoRA freezes the pretrained model and injects small trainable low-rank matrices into each layer. This reduces trainable parameters by 100x or more, cuts GPU memory by 60-70%, and produces adapters that can be swapped at serving time without reloading the base model. QLoRA extends this further by quantizing the frozen weights to 4-bit, enabling fine-tuning of 70B models on a single 48GB GPU.
1. The Full Fine-Tuning Problem
When you fine-tune a model with full parameter updates, every weight in the model gets a gradient plus optimizer state (Adam tracks two moment estimates per parameter). For a 7B parameter model in FP16, that means 14 GB for the weights, another 14 GB for gradients, and roughly 28 GB for Adam's momentum and variance, totaling over 56 GB of GPU memory before counting activations. Scaling to 13B or 70B models makes this prohibitively expensive.
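This arithmetic is easy to reproduce. Below is a back-of-the-envelope sketch (illustrative only: it assumes FP16 for weights, gradients, and both Adam moments, and ignores activation memory) that matches the full fine-tuning column of the table below.
# Rough full fine-tuning memory: weights + gradients + Adam's two moments,
# all stored in FP16 (2 bytes each); activation memory is excluded.
def full_ft_memory_gb(n_params: float, bytes_per_value: int = 2) -> float:
    weights = n_params * bytes_per_value
    gradients = n_params * bytes_per_value
    adam_states = 2 * n_params * bytes_per_value  # momentum + variance
    return (weights + gradients + adam_states) / 1e9

for n in (7e9, 13e9, 70e9):
    print(f"{n / 1e9:.0f}B: ~{full_ft_memory_gb(n):.0f} GB")  # 56, 104, 560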
The key insight behind parameter-efficient methods is that the weight changes during fine-tuning are low-rank. Research has shown that when you compute the difference between a fine-tuned model and its pretrained base (the "task-specific delta"), this delta matrix has a very low intrinsic dimensionality. Most of the information in the update can be captured by a much smaller matrix.
| Model Size | Full FT Memory (FP16 + Adam) | LoRA Memory (r=16) | QLoRA Memory (NF4, r=16) |
|---|---|---|---|
| 7B | ~56 GB | ~16 GB | ~6 GB |
| 13B | ~104 GB | ~28 GB | ~10 GB |
| 70B | ~560 GB | ~160 GB | ~36 GB |
2. LoRA Mathematics
2.1 The Core Decomposition
LoRA (Low-Rank Adaptation) works by expressing the weight update as a product of two small matrices. For a pretrained weight matrix W of dimension d × k, instead of computing a full update ΔW (also d × k), LoRA decomposes it as:
W' = W + ΔW = W + BA
where B is d × r and A is r × k, with the rank r being much smaller than both d and k. Typical values of r range from 4 to 64, while d and k are typically 4096 or larger. This means the number of trainable parameters drops from d × k (e.g., 16.7 million for a 4096 × 4096 matrix) to r × (d + k) (e.g., 131,072 for r=16).
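To make the savings concrete, the arithmetic for a single 4096 × 4096 projection works out as follows:
# Trainable values for one 4096 x 4096 weight matrix at rank 16.
d, k, r = 4096, 4096, 16
full_update = d * k          # 16,777,216 values for a dense delta-W
lora_update = r * (d + k)    # 131,072 values for B (d x r) plus A (r x k)
print(full_update // lora_update)  # 128x fewer trainable parameters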
2.2 Initialization and Scaling
LoRA uses a specific initialization strategy: matrix A is initialized with a random Gaussian distribution, and matrix B is initialized to all zeros. This means that at the start of training, BA = 0, and the model behaves exactly like the pretrained base. Training gradually moves the model away from this starting point.
The scaling factor α (alpha) controls the magnitude of the LoRA update. The actual update applied is:
W' = W + (α / r) · BA
The ratio α/r acts as a fixed multiplier on the LoRA update, playing a role similar to a learning rate for the adapter weights. A common convention is to set α = 2r (an effective multiplier of 2), but the optimal value depends on the task. Increasing α relative to r makes the adaptation more aggressive; decreasing it keeps the model closer to the pretrained weights.
Because the scaling is α/r, doubling the rank halves the effective multiplier unless you also double α. Use α = 2 × r as a starting point, then adjust based on validation performance: if training diverges, reduce α; if the model barely moves from the base, increase it.
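The update rule and initialization fit in a few lines of PyTorch. Below is a minimal, self-contained sketch of a LoRA-wrapped linear layer; it is illustrative rather than the actual PEFT implementation (PEFT, for instance, uses a Kaiming-style init for A rather than the plain scaled Gaussian shown here).
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """y = Wx + (alpha / r) * B(Ax), with the pretrained W frozen."""
    def __init__(self, base: nn.Linear, r: int = 16, alpha: int = 32):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)  # freeze the pretrained weights
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)  # Gaussian init
        self.B = nn.Parameter(torch.zeros(base.out_features, r))        # zeros, so BA = 0
        self.scaling = alpha / r

    def forward(self, x):
        # At initialization this returns exactly the base layer's output.
        return self.base(x) + self.scaling * (x @ self.A.T @ self.B.T)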
2.3 Why Low-Rank Works
The effectiveness of low-rank adaptation rests on a remarkable empirical finding: the "intrinsic dimensionality" of fine-tuning updates is far lower than the full parameter count would suggest. When researchers analyzed the singular value decomposition of ΔW matrices from full fine-tuning runs, they found that a small number of singular values capture the vast majority of the update's information content. In many cases, ranks as low as 4 or 8 capture over 90% of the useful signal.
This makes intuitive sense. Fine-tuning typically adapts a model to a specific domain or task format. The knowledge required for this adaptation (new terminology, output format preferences, domain-specific reasoning patterns) is a small modification relative to the vast general knowledge encoded in the pretrained weights.
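This property is straightforward to check numerically: take the delta between fine-tuned and base weights, compute its singular values, and see how much energy the top-r directions capture. The sketch below uses a synthetic delta (a rank-64 product plus noise) purely for illustration.
import torch

delta = torch.randn(4096, 64) @ torch.randn(64, 4096)  # synthetic low-rank delta
delta += 0.05 * torch.randn(4096, 4096)                # small full-rank noise
S = torch.linalg.svdvals(delta)
energy = (S ** 2).cumsum(0) / (S ** 2).sum()
for r in (4, 8, 16, 64):
    print(f"rank {r}: {energy[r - 1]:.1%} of squared singular-value energy")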
3. LoRA Hyperparameters in Practice
3.1 Rank (r) Selection
| Rank | Trainable Params (7B, q_proj/v_proj only) | Best For | Risk |
|---|---|---|---|
| 4 | ~2M | Simple format adaptation, chat templates | May underfit complex tasks |
| 8 | ~4M | Classification, simple instruction following; a good default | May lack capacity for heavy domain shift |
| 16 | ~8M | Domain adaptation, moderate complexity | Slight increase in memory |
| 32 | ~16M | Complex reasoning, code generation | Diminishing returns begin |
| 64 | ~33M | Very complex tasks, near full-FT quality | Diminishing returns; overfitting on small datasets |
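The parameter counts in the table follow from the r × (d + k) formula. The sketch below reproduces them under the stated assumption of adapting only q_proj and v_proj (the original LoRA paper's default) across the 32 layers of 4096-wide projections in a 7B model.
# LoRA parameter count for q_proj and v_proj only (both 4096 x 4096)
# across 32 transformer layers, matching the table above.
def lora_param_count(r: int, n_layers: int = 32, d: int = 4096, mats: int = 2) -> int:
    return r * (d + d) * mats * n_layers

for r in (4, 8, 16, 32, 64):
    print(f"r={r}: {lora_param_count(r) / 1e6:.1f}M")  # 2.1M, 4.2M, ..., 33.6M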
3.2 Target Module Selection
Not all weight matrices benefit equally from LoRA adaptation. The standard practice is to apply LoRA to the attention projection matrices: q_proj, k_proj, v_proj, and o_proj. Research and practice have converged on the recommendation to also include the MLP layers (gate_proj, up_proj, down_proj) for best results, though this increases trainable parameters.
from peft import LoraConfig, get_peft_model, TaskType
# Standard configuration: attention layers only
lora_config_basic = LoraConfig(
task_type=TaskType.CAUSAL_LM,
r=16,
lora_alpha=32, # alpha = 2r
lora_dropout=0.05,
target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
bias="none",
)
# Recommended: attention + MLP layers for best quality
lora_config_full = LoraConfig(
task_type=TaskType.CAUSAL_LM,
r=16,
lora_alpha=32,
lora_dropout=0.05,
target_modules=[
"q_proj", "k_proj", "v_proj", "o_proj",
"gate_proj", "up_proj", "down_proj",
],
bias="none",
)
# Apply LoRA to a model
model = get_peft_model(model, lora_config_full)
model.print_trainable_parameters()
The target_modules names vary by model architecture. LLaMA and Mistral use q_proj, k_proj, and so on; GPT-NeoX uses query_key_value; Falcon uses query_key_value and dense. In recent PEFT versions you can pass target_modules="all-linear" to automatically target every linear layer except the output head.
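If you are unsure which module names an architecture uses, you can list them directly from a loaded model. A quick sketch (assuming model is any transformers causal LM already in memory):
import torch.nn as nn

# Collect the distinct leaf names of all linear submodules; these are
# the strings that go into target_modules.
linear_names = {
    name.split(".")[-1]
    for name, module in model.named_modules()
    if isinstance(module, nn.Linear)
}
print(sorted(linear_names))  # note: lm_head shows up too and is usually excluded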
4. Complete LoRA Fine-Tuning Pipeline
Here is a complete end-to-end pipeline for LoRA fine-tuning with the Hugging Face stack (transformers, peft, trl, and datasets). This example fine-tunes a Llama-3 8B model on instruction-following data.
import torch
from transformers import (
AutoModelForCausalLM, AutoTokenizer,
TrainingArguments, BitsAndBytesConfig
)
from peft import LoraConfig, get_peft_model, TaskType
from trl import SFTTrainer
from datasets import load_dataset
# 1. Load model and tokenizer
model_id = "meta-llama/Meta-Llama-3-8B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(
model_id,
torch_dtype=torch.bfloat16,
device_map="auto",
)
# 2. Configure LoRA
lora_config = LoraConfig(
task_type=TaskType.CAUSAL_LM,
r=16,
lora_alpha=32,
lora_dropout=0.05,
target_modules=[
"q_proj", "k_proj", "v_proj", "o_proj",
"gate_proj", "up_proj", "down_proj",
],
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# 3. Load and format dataset
dataset = load_dataset("tatsu-lab/alpaca", split="train")
def format_instruction(example):
if example["input"]:
text = f"""### Instruction:
{example['instruction']}
### Input:
{example['input']}
### Response:
{example['output']}"""
else:
text = f"""### Instruction:
{example['instruction']}
### Response:
{example['output']}"""
return {"text": text}
dataset = dataset.map(format_instruction)
# 4. Training arguments
training_args = TrainingArguments(
output_dir="./lora-llama3-8b",
num_train_epochs=3,
per_device_train_batch_size=4,
gradient_accumulation_steps=4,
learning_rate=2e-4,
lr_scheduler_type="cosine",
warmup_ratio=0.05,
bf16=True,
logging_steps=10,
save_strategy="steps",
save_steps=200,
optim="adamw_torch",
max_grad_norm=1.0,
)
# 5. Train with SFTTrainer
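# Note: the kwargs below (tokenizer, max_seq_length, dataset_text_field)
# match older trl releases; newer trl versions move max_seq_length and
# dataset_text_field into SFTConfig, so check your installed version.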
trainer = SFTTrainer(
model=model,
args=training_args,
train_dataset=dataset,
tokenizer=tokenizer,
max_seq_length=2048,
dataset_text_field="text",
)
trainer.train()
trainer.save_model("./lora-llama3-8b/final")
5. QLoRA: 4-bit Quantized LoRA
QLoRA (Quantized LoRA) combines three innovations to dramatically reduce memory requirements: NF4 (Normal Float 4-bit) quantization of the frozen base weights, double quantization to compress the quantization constants themselves, and paged optimizers that gracefully handle GPU memory spikes.
5.1 NF4 Quantization
NF4 is a data type specifically designed for normally distributed neural network weights. Unlike standard 4-bit integer quantization (which spaces values uniformly), NF4 places more quantization levels near zero where neural network weights concentrate. This yields significantly lower quantization error for the same bit budget.
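The effect is easy to visualize. The sketch below builds a quantile-based 4-bit codebook from a standard normal distribution; it is illustrative only, since the real NF4 codebook in bitsandbytes is constructed with extra care around zero and the tail endpoints.
import torch

# Place 16 levels at evenly spaced quantiles of N(0, 1), then rescale to [-1, 1].
probs = torch.linspace(0.01, 0.99, 16)
levels = torch.sqrt(torch.tensor(2.0)) * torch.erfinv(2 * probs - 1)  # inverse normal CDF
levels = levels / levels.abs().max()
print(levels)  # levels cluster near 0, where weights concentrate; tails are sparse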
5.2 Double Quantization
Standard quantization requires storing a scale factor and zero-point for each block of weights (typically blocks of 64). These quantization constants are stored in FP32, consuming additional memory. Double quantization applies a second round of quantization to these constants, compressing them from FP32 to FP8 and saving roughly 0.4 bits per parameter across the entire model. For a 70B model, this translates to approximately 3 GB of additional savings.
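The quoted savings follow from simple bit accounting, using the QLoRA paper's block sizes (64 weights per first-level block, 256 first-level constants per second-level block):
# Storage overhead of quantization constants, in bits per parameter.
before = 32 / 64                  # one FP32 scale per 64-weight block = 0.5 bits
after = 8 / 64 + 32 / (64 * 256)  # FP8 scale per block + FP32 scale per 256 blocks
print(before - after)                           # ~0.373 bits per parameter saved
print((before - after) * 70e9 / 8 / 1e9, "GB")  # ~3.3 GB on a 70B model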
5.3 Paged Optimizers
During training, GPU memory usage can spike temporarily (for example, during gradient computation for a particularly long sequence). Paged optimizers use NVIDIA's unified memory feature to automatically page optimizer states between GPU and CPU memory when needed, preventing out-of-memory errors. The performance cost is minimal because these spikes are typically brief and infrequent.
5.4 Complete QLoRA Configuration
import torch
from transformers import (
AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig, TrainingArguments,
)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from trl import SFTTrainer
# QLoRA: 4-bit quantization config
bnb_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type="nf4", # NF4 data type
bnb_4bit_compute_dtype=torch.bfloat16, # Compute in BF16
bnb_4bit_use_double_quant=True, # Double quantization
)
# Load model in 4-bit
model = AutoModelForCausalLM.from_pretrained(
"meta-llama/Meta-Llama-3-8B",
quantization_config=bnb_config,
device_map="auto",
)
# Prepare model for k-bit training (handles gradient checkpointing)
model = prepare_model_for_kbit_training(model)
# Apply LoRA on top of quantized model
lora_config = LoraConfig(
r=16,
lora_alpha=32,
target_modules="all-linear", # Target all linear layers
lora_dropout=0.05,
bias="none",
task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
# Training with paged optimizer
training_args = TrainingArguments(
output_dir="./qlora-output",
per_device_train_batch_size=4,
gradient_accumulation_steps=4,
num_train_epochs=3,
learning_rate=2e-4,
bf16=True,
optim="paged_adamw_8bit", # Paged optimizer
lr_scheduler_type="cosine",
warmup_ratio=0.05,
logging_steps=10,
save_strategy="steps",
save_steps=100,
gradient_checkpointing=True, # Save even more memory
max_grad_norm=0.3, # QLoRA paper recommendation
)
# The rest follows the same SFTTrainer pattern, reusing the formatted
# `dataset` and tokenizer setup from Section 4
trainer = SFTTrainer(
model=model,
args=training_args,
train_dataset=dataset,
max_seq_length=2048,
)
trainer.train()
QLoRA training is slower than standard LoRA (roughly 30-40% slower per step) because of the overhead of dequantizing weights for each forward pass. The tradeoff is worthwhile when your GPU memory is the binding constraint, but if you have enough VRAM for regular LoRA in BF16, that will train faster.
6. Adapter Merging Strategies
After training, you have two options for deployment: keep the adapter separate (and load it dynamically) or merge it permanently into the base model weights. Each approach has tradeoffs.
from peft import PeftModel, AutoPeftModelForCausalLM
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
# === Option A: Load adapter separately ===
base_model = AutoModelForCausalLM.from_pretrained(
"meta-llama/Meta-Llama-3-8B",
torch_dtype=torch.bfloat16,
device_map="auto",
)
# Load different adapters dynamically
model = PeftModel.from_pretrained(base_model, "./lora-medical")
# Switch to another adapter
model.load_adapter("./lora-legal", adapter_name="legal")
model.set_adapter("legal")
# === Option B: Merge and save ===
model = AutoPeftModelForCausalLM.from_pretrained(
"./lora-medical",
torch_dtype=torch.bfloat16,
device_map="auto",
)
merged_model = model.merge_and_unload()
merged_model.save_pretrained("./llama3-medical-merged")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
tokenizer.save_pretrained("./llama3-medical-merged")
# === Option C: Merge QLoRA (requires dequantization) ===
# Load the QLoRA model in full precision for merging
model = AutoPeftModelForCausalLM.from_pretrained(
"./qlora-output",
torch_dtype=torch.float16,
device_map="auto",
low_cpu_mem_usage=True,
)
merged = model.merge_and_unload()
merged.save_pretrained("./merged-model")
print("Merged model saved. Upload to HF Hub or serve with vLLM.")
When merging QLoRA adapters, you must load the base model in higher precision (FP16 or BF16) first, then merge. This is because the merge operation (W' = W + (α/r) · BA) needs sufficient numerical precision. Merging in 4-bit would introduce unacceptable quantization noise. After merging, you can re-quantize the merged model to GGUF or AWQ for efficient serving.
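The merge itself is just the update rule applied once per adapted matrix. Here is a minimal sketch of what merge_and_unload computes for a single layer, using standalone tensors for illustration:
import torch

d, k, r, alpha = 512, 512, 16, 32
W = torch.randn(d, k)
A = torch.randn(r, k)
B = torch.randn(d, r)  # nonzero after training

W_merged = W + (alpha / r) * (B @ A)  # fold the adapter into one dense matrix
x = torch.randn(k)
# The merged path matches the base-plus-adapter path up to float rounding.
print(torch.allclose(W_merged @ x,
                     W @ x + (alpha / r) * (B @ (A @ x)),
                     rtol=1e-4, atol=1e-4))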
7. LoRA Hyperparameter Tuning Guide
Finding optimal LoRA hyperparameters requires systematic experimentation. Here is a practical guide based on what works across a wide range of tasks and model sizes.
| Hyperparameter | Default | When to Increase | When to Decrease |
|---|---|---|---|
| r (rank) | 16 | Complex tasks, large datasets, reasoning | Simple format changes, small datasets |
| lora_alpha | 2 × r | Model not adapting enough | Training diverging, loss spiking |
| lora_dropout | 0.05 | Overfitting (val loss rises) | Large dataset, underfitting |
| learning_rate | 2e-4 | Underfitting, slow convergence | Divergence, loss oscillation |
| max_grad_norm | 1.0 | Very stable training | Gradient spikes (try 0.3) |
LoRA learning rates are typically 5-10x higher than full fine-tuning learning rates. This is because only a small fraction of parameters are being updated, so each update needs to have a larger effect. A learning rate of 2e-4 for LoRA corresponds roughly to 2e-5 for full fine-tuning in terms of per-step model change.
8. The PEFT Library Ecosystem
The Hugging Face peft library provides a unified interface for all parameter-efficient methods. Beyond basic LoRA, it supports loading adapters from the Hub, combining adapters, and quantized training workflows.
import torch
from peft import (
PeftModel,
PeftConfig,
get_peft_model,
LoraConfig,
TaskType,
AutoPeftModelForCausalLM,
)
# Load a LoRA adapter from Hugging Face Hub
model = AutoPeftModelForCausalLM.from_pretrained(
"username/my-lora-adapter", # Adapter repo on HF Hub
device_map="auto",
torch_dtype=torch.bfloat16,
)
# Inspect adapter configuration
config = PeftConfig.from_pretrained("username/my-lora-adapter")
print(f"Base model: {config.base_model_name_or_path}")
print(f"Rank: {config.r}, Alpha: {config.lora_alpha}")
print(f"Target modules: {config.target_modules}")
# Push adapter to Hub (only saves the small adapter weights)
model.push_to_hub("username/my-lora-adapter")
# Adapter size: typically 50-200 MB vs 14+ GB for full model
Section 14.1 Quiz
1. In the LoRA decomposition W' = W + BA, what are the dimensions of matrices B and A for a weight matrix of size d×k with rank r?
2. Why is matrix B initialized to zeros and A initialized with random values, rather than the reverse?
3. What is the role of the alpha/r scaling factor in LoRA, and how should you adjust alpha when changing rank?
4. What three innovations does QLoRA combine, and why is each one necessary?
5. When should you keep LoRA adapters separate versus merging them into the base model?
Key Takeaways
- LoRA decomposes weight updates into two small matrices (W' = W + BA), reducing trainable parameters by 100x or more while matching full fine-tuning quality on most tasks.
- Rank (r) controls the capacity of the adaptation. Start with r=16 for most tasks; increase to 32 or 64 only for complex reasoning tasks with sufficient data.
- Alpha scaling (α/r) acts as a learning rate multiplier. Set α = 2r as a default, and adjust based on training stability and downstream performance.
- Target all linear layers (attention + MLP) for best quality. Targeting only attention layers is faster but may sacrifice 1-3% accuracy on complex tasks.
- QLoRA enables 70B fine-tuning on a single GPU by combining NF4 quantization, double quantization, and paged optimizers, at the cost of ~30% slower training.
- Adapter merging converts the LoRA model into a standard model format. Merge in high precision (FP16/BF16), then re-quantize for serving if needed.
- LoRA learning rates are 5-10x higher than full fine-tuning rates because fewer parameters share the gradient signal.