Module 14 · Section 14.2

Advanced PEFT Methods

Beyond LoRA: DoRA, LoRA+, Prefix Tuning, adapters, IA3, and multi-adapter serving
★ Big Picture

LoRA dominates the PEFT landscape, but it is not the only option. Researchers have developed numerous alternatives that offer different tradeoffs in parameter count, training speed, inference overhead, and task specialization. DoRA improves LoRA by decomposing weights into magnitude and direction components. LoRA+ uses different learning rates for the A and B matrices. Prefix Tuning and Prompt Tuning prepend learnable tokens rather than modifying weights. IA3 achieves extreme parameter efficiency by learning only rescaling vectors. Understanding these alternatives helps you select the right tool for each scenario, particularly when operating under tight memory, latency, or multi-tenant serving constraints.

1. DoRA: Weight-Decomposed Low-Rank Adaptation

DoRA (2024) improves on LoRA by decomposing the pretrained weight into magnitude and direction components before applying the low-rank update. Concretely, each column w of the weight matrix is rewritten as w = m · (w / ||w||), where m is a learnable magnitude and the unit-norm direction receives the standard LoRA update BA. This decomposition aligns more closely with how full fine-tuning actually modifies weights, yielding better performance at the same rank.

In practice, DoRA consistently outperforms LoRA by 1-3% across benchmarks when using the same rank and target modules, with only a marginal increase in trainable parameters (the additional magnitude vectors are tiny). The training speed is nearly identical to LoRA.
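The decomposition can be made concrete with a tiny numpy sketch (assumed toy shapes; real implementations apply this per weight matrix with column-wise norms, not the PEFT library's actual internals):

```python
import numpy as np

# Toy DoRA update: W' = m * (W + BA) / ||W + BA||, with column-wise norms.
rng = np.random.default_rng(0)
d_out, d_in, r = 6, 4, 2

W = rng.normal(size=(d_out, d_in))        # frozen pretrained weight
m = np.linalg.norm(W, axis=0)             # learnable magnitude, init to column norms
B = np.zeros((d_out, r))                  # LoRA B, zero-initialised
A = rng.normal(size=(r, d_in)) * 0.1      # LoRA A

V = W + B @ A                             # direction component with low-rank update
W_dora = m * V / np.linalg.norm(V, axis=0)

# At initialisation (B = 0) the decomposition reproduces W exactly.
assert np.allclose(W_dora, W)
```

Because m is initialised to the column norms of W and B starts at zero, training begins from the unmodified pretrained model, just as in standard LoRA.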

[Figure: update rules side by side — standard LoRA: W' = W + BA (updates the weight directly in the combined space); DoRA: W' = m · (V + BA) / ||V + BA|| (separates magnitude from direction for better learning).]
Figure 1: DoRA separates weight magnitude (m) from direction, applying LoRA only to the directional component.
from peft import LoraConfig, get_peft_model

# DoRA configuration: simply enable use_dora flag
dora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    lora_dropout=0.05,
    use_dora=True,          # Enable DoRA
    task_type="CAUSAL_LM",
)

model = get_peft_model(base_model, dora_config)
model.print_trainable_parameters()
# Slightly more params than LoRA due to magnitude vectors,
# but typically ~1-3% better accuracy at same rank.
◆ Key Insight

DoRA is a drop-in replacement for LoRA. In the PEFT library, you simply add use_dora=True to your existing LoRA configuration. The overhead is minimal (a few extra parameters per layer for the magnitude vector), but the quality improvement is consistent. If you are already using LoRA and want a free accuracy boost, try DoRA first.

2. LoRA+: Asymmetric Learning Rates

LoRA+ (2024) addresses a subtle inefficiency in standard LoRA training. In the decomposition W' = W + BA, matrices B and A play different roles: A projects inputs into the low-rank space, while B projects back to the full space. LoRA+ assigns a higher learning rate to matrix B (typically 2x to 16x higher) than to matrix A, based on the theoretical analysis that the optimal learning rate ratio should scale with the model width.

In practice, LoRA+ improves convergence speed by 1.5x to 2x compared to standard LoRA, often reaching the same final quality in fewer steps. This is particularly valuable when training compute is the bottleneck.

from peft import LoraConfig, get_peft_model
from transformers import TrainingArguments
import torch

# LoRA+ uses different learning rates for A and B matrices
# This is implemented via parameter groups in the optimizer
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules="all-linear",
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base_model, lora_config)

# Separate parameter groups for A and B matrices
lora_A_params = []
lora_B_params = []
other_params = []

for name, param in model.named_parameters():
    if not param.requires_grad:
        continue
    if "lora_A" in name:
        lora_A_params.append(param)
    elif "lora_B" in name:
        lora_B_params.append(param)
    else:
        other_params.append(param)

# LoRA+ recommendation: lr_B = lr_ratio * lr_A
lr_A = 2e-4
lr_ratio = 8     # B learns 8x faster than A

optimizer = torch.optim.AdamW([
    {"params": lora_A_params, "lr": lr_A},
    {"params": lora_B_params, "lr": lr_A * lr_ratio},
    {"params": other_params,  "lr": lr_A},
], weight_decay=0.01)

3. Prefix Tuning & P-Tuning

3.1 Prefix Tuning

Prefix Tuning prepends a set of learnable "virtual tokens" to the key and value matrices at every attention layer. These prefix vectors are optimized during training while the rest of the model remains frozen. Because the prefixes operate entirely within the attention mechanism, they can steer model behavior without modifying any weight matrices.

The prefix length (number of virtual tokens) controls the method's capacity. Typical values range from 10 to 100. Longer prefixes allow more expressive adaptation but increase the effective sequence length and thus the computational cost of each forward pass.
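A back-of-envelope count shows why prefixes stay cheap. The sketch below assumes 7B-class dimensions (32 layers, hidden size 4096), which are illustrative, not taken from any specific model card:

```python
# Prefix Tuning learns one K vector and one V vector per virtual token per layer.
num_layers, hidden, n_prefix = 32, 4096, 30

params = n_prefix * num_layers * 2 * hidden   # 2 = one K and one V vector each
print(f"{params:,}")                          # 7,864,320 -> roughly 0.1% of a 7B model
```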

[Figure: trainable prefix vectors P1 … Pn prepended to the frozen input tokens' keys and values inside an attention layer.]
Figure 2: Prefix Tuning prepends learnable key-value pairs to each attention layer, steering attention without modifying weights.
from peft import PrefixTuningConfig, get_peft_model, TaskType

# Prefix Tuning configuration
prefix_config = PrefixTuningConfig(
    task_type=TaskType.CAUSAL_LM,
    num_virtual_tokens=30,        # Number of prefix tokens
    prefix_projection=True,       # Use MLP to project prefix (more stable)
    encoder_hidden_size=1024,     # Hidden size of projection MLP
)

model = get_peft_model(base_model, prefix_config)
model.print_trainable_parameters()
# Typically 0.1-0.5% of total parameters

3.2 Prompt Tuning

Prompt Tuning (from Google, 2021) is even simpler than Prefix Tuning: it prepends learnable embeddings to the input only at the embedding layer, not at every attention layer. This makes it the most parameter-efficient method (often under 0.01% of parameters) but limits its expressiveness. Prompt Tuning works best for large models (100B+), where a single model can serve many tasks by simply swapping the learned prompt prefix.

from peft import PromptTuningConfig, get_peft_model, TaskType
from peft import PromptTuningInit

# Prompt Tuning: learns soft tokens at input layer only
prompt_config = PromptTuningConfig(
    task_type=TaskType.CAUSAL_LM,
    num_virtual_tokens=20,
    prompt_tuning_init=PromptTuningInit.TEXT,  # Initialize from text
    prompt_tuning_init_text="Classify the following text as positive or negative:",
    tokenizer_name_or_path="meta-llama/Meta-Llama-3-8B",
)

model = get_peft_model(base_model, prompt_config)
model.print_trainable_parameters()
# trainable params: ~80K (extremely small)
ⓘ Note

Prompt Tuning and Prefix Tuning reduce the effective context window by the number of virtual tokens. If you use 30 prefix tokens with a 4096-token context, your actual usable context is 4066 tokens. For most applications this is negligible, but keep it in mind when working near the context limit.

4. Adapter Layers

Adapter methods (Houlsby et al., 2019) insert small bottleneck modules between existing transformer layers. Each adapter consists of a down-projection, a nonlinear activation, and an up-projection. The adapter bottleneck dimension controls the parameter count, similar to how rank works in LoRA.

Unlike LoRA (which modifies existing weight matrices), adapters add new layers to the network. This means they introduce a small amount of inference latency, because the adapter computation happens sequentially. However, adapters are very flexible and can be inserted at various points in the architecture.
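A minimal numpy sketch of a Houlsby-style bottleneck adapter (assumed dimensions; ReLU stands in for the nonlinearity, and the zero-initialised up-projection makes the adapter an identity function at the start of training):

```python
import numpy as np

d, r = 8, 2                               # hidden size, bottleneck dim (r << d)
rng = np.random.default_rng(0)
W_down = rng.normal(size=(d, r)) * 0.1    # down-projection
W_up = np.zeros((r, d))                   # up-projection, zero-init => identity at start

def adapter(h):
    z = np.maximum(0.0, h @ W_down)       # down-project, then nonlinearity
    return h + z @ W_up                   # up-project with residual connection

h = rng.normal(size=(1, d))
assert np.allclose(adapter(h), h)         # zero-init adapter leaves activations unchanged
```

The bottleneck dimension r plays the same capacity-controlling role that rank plays in LoRA, but the computation runs sequentially after the frozen layer rather than in parallel with it, which is where the extra inference latency comes from.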

from peft import AdaptionPromptConfig, get_peft_model

# Note: For bottleneck adapters, use the adapters library
# from adapters import AutoAdapterModel

# Example with LLaMA-Adapter style (via PEFT)
adapter_config = AdaptionPromptConfig(
    adapter_len=10,            # Length of adapter prompt
    adapter_layers=30,         # Number of layers to add adapters
    task_type="CAUSAL_LM",
)

model = get_peft_model(base_model, adapter_config)
model.print_trainable_parameters()

5. IA3: Infused Adapter by Inhibiting and Amplifying Inner Activations

IA3 (introduced in the T-Few paper, "Few-Shot Parameter-Efficient Fine-Tuning is Better and Cheaper than In-Context Learning," 2022) takes parameter efficiency to the extreme. Instead of learning new matrices or inserting new layers, IA3 learns only three rescaling vectors per layer that modulate the keys, values, and intermediate activations in the feedforward layers. The total number of trainable parameters is typically about 10x smaller than LoRA's.

The tradeoff is that IA3's limited capacity makes it best suited for simple adaptation tasks (format changes, style transfer) rather than complex domain adaptation. It excels in few-shot settings where overfitting is a concern.

from peft import IA3Config, get_peft_model, TaskType

ia3_config = IA3Config(
    task_type=TaskType.CAUSAL_LM,
    target_modules=["k_proj", "v_proj", "down_proj"],
    feedforward_modules=["down_proj"],
)

model = get_peft_model(base_model, ia3_config)
model.print_trainable_parameters()
# trainable params: ~500K for a 7B model (0.007%)
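The mechanism itself is just an element-wise rescale, which a toy numpy sketch makes clear (assumed shapes for illustration, not the PEFT internals):

```python
import numpy as np

seq, d_k = 3, 4
K = np.ones((seq, d_k))                  # keys for a short sequence
l_k = np.array([0.5, 1.0, 2.0, 1.0])     # learned key-rescaling vector (init to ones)

K_ia3 = K * l_k                          # broadcast over the sequence dimension

assert K_ia3.shape == K.shape            # shape unchanged; only per-dimension scale
```

Because the vectors are initialised to ones, IA3 also starts from the unmodified pretrained model, and the tiny parameter count is what makes it so resistant to overfitting in few-shot settings.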

6. Comprehensive PEFT Method Comparison

| Method | Params (%) | Memory | Inference Overhead | Best For |
| --- | --- | --- | --- | --- |
| LoRA | 0.1-0.5% | Low | Zero (after merge) | General purpose, most tasks |
| QLoRA | 0.1-0.5% | Very Low | Zero (after merge) | Large models on limited GPU |
| DoRA | 0.1-0.5% | Low | Zero (after merge) | When LoRA quality is insufficient |
| LoRA+ | 0.1-0.5% | Low | Zero (after merge) | Faster convergence needed |
| Prefix Tuning | 0.1-0.5% | Low | Small (longer KV cache) | NLU tasks, multi-task serving |
| Prompt Tuning | <0.01% | Very Low | Negligible | Very large models, simple tasks |
| Adapters | 0.5-3% | Medium | Small (sequential) | Compositional multi-task |
| IA3 | <0.01% | Very Low | Negligible | Few-shot, style adaptation |
⚠ Warning

Prompt Tuning and IA3 achieve extreme parameter efficiency, but they are significantly less capable than LoRA for complex adaptation tasks. If your task requires learning new knowledge (domain-specific terminology, code patterns, specialized reasoning), LoRA or DoRA with a reasonable rank (16-64) will substantially outperform these lighter methods. Reserve IA3 and Prompt Tuning for scenarios where simplicity or parameter count is the primary constraint.

7. Multi-Adapter Serving

One of LoRA's most powerful production features is the ability to serve many adapters from a single base model. This enables multi-tenant deployments where each customer or task gets its own fine-tuned behavior without duplicating the base model weights. Two main systems support this at scale: LoRAX (from Predibase) and S-LoRA.

7.1 Architecture Overview

[Figure: a single base model copy in VRAM shared by Medical, Legal, Finance, and Code LoRA adapters; a request router directs each request to the correct adapter by tenant ID or task type and batches requests across adapters for GPU utilization.]
Figure 3: Multi-adapter serving loads one base model and dynamically applies per-request LoRA adapters.

7.2 LoRAX and S-LoRA

LoRAX (Predibase) is a production-grade serving system that can host hundreds of fine-tuned LoRA adapters on a single GPU. It keeps the base model in GPU memory and dynamically loads adapter weights per request. Key features include adapter weight caching, batched inference across different adapters, and automatic adapter management.

S-LoRA (from UC Berkeley) takes a more research-oriented approach, using unified paging to manage adapter memory and custom CUDA kernels for batched LoRA computation. S-LoRA can serve thousands of adapters simultaneously, with adapters stored in a tiered memory system (GPU, CPU, disk) and paged in on demand.
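The routing and cross-adapter batching idea can be sketched in a few lines of plain Python. The tenant-to-adapter map below is hypothetical; real systems like LoRAX and S-LoRA perform this grouping inside the serving engine so each group runs as one batched forward pass:

```python
from collections import defaultdict

# Hypothetical tenant -> adapter mapping, for illustration only.
TENANT_ADAPTERS = {"hospital-a": "medical", "lawfirm-b": "legal", "bank-c": "finance"}

def batch_by_adapter(requests):
    """Group prompts by adapter so each group can share one batched forward pass."""
    batches = defaultdict(list)
    for req in requests:
        adapter = TENANT_ADAPTERS.get(req["tenant_id"], "base")
        batches[adapter].append(req["prompt"])
    return dict(batches)

reqs = [
    {"tenant_id": "hospital-a", "prompt": "Summarize this chart."},
    {"tenant_id": "lawfirm-b", "prompt": "Draft a clause."},
    {"tenant_id": "hospital-a", "prompt": "Explain this lab result."},
]
assert batch_by_adapter(reqs) == {
    "medical": ["Summarize this chart.", "Explain this lab result."],
    "legal": ["Draft a clause."],
}
```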

◆ Key Insight

Multi-adapter serving is one of the strongest arguments for LoRA over other PEFT methods in production. A single A100 GPU can serve a base 7B model with hundreds of LoRA adapters, effectively providing hundreds of specialized models at the cost of one. This is far more economical than deploying separate merged models for each use case.

8. Choosing the Right PEFT Method

With so many PEFT options available, the decision can feel overwhelming. Here is a practical decision framework based on your constraints and requirements.

| Scenario | Recommended Method | Reasoning |
| --- | --- | --- |
| General fine-tuning (default) | LoRA (r=16) | Best quality/efficiency tradeoff, widest ecosystem support |
| Limited GPU memory | QLoRA | 4-bit base model frees VRAM for larger models or batches |
| Need extra quality over LoRA | DoRA | Drop-in upgrade, consistent 1-3% improvement |
| Training speed is critical | LoRA+ | 1.5-2x faster convergence, same final quality |
| Multi-tenant serving (100+ tasks) | LoRA + LoRAX | Hot-swappable adapters from single base |
| Extreme parameter budget | IA3 | Learns only rescaling vectors, minimal overfitting |
| Very large model (100B+), simple task | Prompt Tuning | Ultra-lightweight, scales well with model size |
| NLU classification tasks | Prefix Tuning | Strong at steering attention for classification |
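The table reads as a priority-ordered rule set, which a toy helper can encode (assumed heuristics distilled from the rows above, not a library API):

```python
def choose_peft(memory_limited=False, need_extra_quality=False,
                fast_training=False, multi_tenant=False,
                tiny_param_budget=False):
    """Toy encoding of the decision table, checked most-constraining first."""
    if tiny_param_budget:
        return "IA3"
    if memory_limited:
        return "QLoRA"
    if multi_tenant:
        return "LoRA + LoRAX"
    if need_extra_quality:
        return "DoRA"
    if fast_training:
        return "LoRA+"
    return "LoRA (r=16)"               # the sensible default

assert choose_peft() == "LoRA (r=16)"
assert choose_peft(memory_limited=True) == "QLoRA"
assert choose_peft(need_extra_quality=True) == "DoRA"
```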
ⓘ Note

When in doubt, start with LoRA. It has the widest library support, the most documentation, and works well across virtually all tasks and model sizes. Move to specialized methods only when you have a specific constraint (memory, serving architecture, parameter count) that LoRA cannot satisfy.

Section 14.2 Quiz

1. How does DoRA differ from standard LoRA, and when would you prefer it?

Show Answer
DoRA decomposes the pretrained weight into magnitude and direction components, then applies LoRA only to the direction. This more closely mirrors how full fine-tuning modifies weights. In the PEFT library, you enable it with use_dora=True. Prefer DoRA when you want a free 1-3% accuracy improvement with minimal extra cost over standard LoRA.

2. What is the key idea behind LoRA+, and what practical benefit does it provide?

Show Answer
LoRA+ assigns different learning rates to the A and B matrices in the LoRA decomposition. Based on theoretical analysis, matrix B should have a higher learning rate (typically 2-16x) than matrix A. The practical benefit is 1.5-2x faster convergence, reaching the same quality in fewer training steps.

3. What is the fundamental difference between Prefix Tuning and Prompt Tuning?

Show Answer
Prefix Tuning prepends learnable key-value vectors to every attention layer in the model. Prompt Tuning prepends learnable embeddings only at the input embedding layer. Prefix Tuning is more expressive (modifies attention at every layer) but has more parameters. Prompt Tuning is extremely lightweight and works best with very large models.

4. Why is multi-adapter serving a uniquely strong advantage of LoRA over other PEFT methods?

Show Answer
LoRA adapters are small weight matrices that can be applied additively to frozen base model weights. This means a single base model in GPU memory can serve hundreds of different adapters by swapping them per request. Systems like LoRAX and S-LoRA make this efficient at scale. Adapter-based methods add sequential computation, making them harder to batch. Prefix and Prompt Tuning can also be swapped, but LoRA has far better tooling and ecosystem support for this use case.

5. In what scenario would IA3 be a better choice than LoRA?

Show Answer
IA3 is preferred when you have an extremely tight parameter budget (it trains roughly 10x fewer parameters than LoRA), when you are fine-tuning on very small datasets where overfitting is a concern, or when the task is simple (style transfer, format change) and does not require learning substantial new knowledge. For complex domain adaptation or reasoning tasks, LoRA will significantly outperform IA3.

Key Takeaways