Module 13 · Section 13.1

When and Why to Fine-Tune

A decision framework for choosing between prompting, RAG, and fine-tuning, with guidelines for understanding when each approach delivers the best return on investment
★ Big Picture

Fine-tuning is powerful, but it is not always the right tool. Before investing in data collection, GPU hours, and training infrastructure, you need a clear framework for deciding whether fine-tuning will actually solve your problem better than prompt engineering or retrieval-augmented generation. This section provides that framework, covering the core use cases where fine-tuning excels, the different flavors of fine-tuning (full, parameter-efficient, continual pre-training), and the pitfalls that catch teams who jump to fine-tuning prematurely.

1. The Adaptation Spectrum

When a pre-trained language model does not meet your needs out of the box, you have several options for adapting it. These options form a spectrum from lightweight (no training required) to heavyweight (full model retraining). Understanding where each technique sits on this spectrum is essential for making cost-effective decisions.

1.1 Prompting, RAG, and Fine-Tuning

The three primary approaches to model adaptation differ in their complexity, cost, and the types of improvements they can deliver. Prompt engineering is the simplest: you craft instructions that guide the model toward the desired behavior at inference time. RAG augments the model with external knowledge by retrieving relevant documents and injecting them into the prompt. Fine-tuning modifies the model weights themselves through additional training on task-specific data.

|  | Prompt Engineering | RAG | Fine-Tuning |
|---|---|---|---|
| Setup | No training needed | Retrieval infrastructure | Training data + GPUs |
| Iteration | Minutes to iterate | Days to build pipeline | Days to weeks |
| Knowledge | Limited by context window | Dynamic knowledge | Baked-in knowledge |
| Inference | Higher inference cost | Retrieval latency added | Lower inference cost |

Moving from lightweight to heavyweight, setup cost, training time, and control over behavior increase, while iteration speed and flexibility to change decrease.
Figure 13.1: The adaptation spectrum from lightweight prompting to heavyweight fine-tuning
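
To make the spectrum concrete, the sketch below answers the same question under each approach. Here `base_llm`, `finetuned_llm`, and `retrieve` are hypothetical stubs standing in for a model API and a vector store, not real libraries.

# Hypothetical stand-ins so the sketch runs; swap in real model calls.
def base_llm(prompt: str) -> str:
    """Stub for a general-purpose hosted model."""
    return "<model response>"

def finetuned_llm(prompt: str) -> str:
    """Stub for a model whose weights were adapted to the task."""
    return "<model response>"

def retrieve(query: str, top_k: int = 3) -> list:
    """Stub for a vector-store search."""
    return ["<doc 1>", "<doc 2>", "<doc 3>"][:top_k]

question = "What is our refund window for annual plans?"

# Prompting: all guidance lives in the prompt at inference time
prompt_answer = base_llm(
    "You are a billing assistant. Answer concisely.\n"
    f"Question: {question}"
)

# RAG: retrieve relevant documents and inject them as context
context = "\n".join(retrieve(question))
rag_answer = base_llm(
    f"Answer using only this context:\n{context}\n\nQuestion: {question}"
)

# Fine-tuning: knowledge and style live in the weights, so the
# inference-time prompt shrinks to the bare question
ft_answer = finetuned_llm(question)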

1.2 The Decision Framework

The following decision framework helps you determine which approach to try first. The key insight is that you should start with the lightest approach that could work and only move to heavier approaches when you have evidence that simpler methods fall short.

from dataclasses import dataclass

@dataclass
class Task:
    """Illustrative task attributes driving the decision."""
    can_be_described_in_prompt: bool = True
    prompting_meets_quality_bar: bool = False  # set after a prompting baseline eval
    requires_external_knowledge: bool = False
    knowledge_changes_frequently: bool = False
    knowledge_is_static: bool = False
    dataset_size: int = 0
    requires_specific_style: bool = False
    requires_specific_format: bool = False
    few_shot_examples_work: bool = False
    latency_budget_ms: float = 1000.0
    cost_per_query_budget: float = 0.01

def choose_adaptation_strategy(task: Task) -> str:
    """Decision framework for choosing between prompting, RAG, and fine-tuning."""

    # Step 1: Can prompting solve it?
    if task.can_be_described_in_prompt and task.prompting_meets_quality_bar:
        return "prompting"  # Simplest solution that works

    # Step 2: Is the gap about missing knowledge?
    if task.requires_external_knowledge:
        if task.knowledge_changes_frequently:
            return "RAG"  # Dynamic knowledge needs retrieval
        if task.knowledge_is_static and task.dataset_size > 10_000:
            return "fine-tuning"  # Large static knowledge: bake it in

    # Step 3: Is the gap about behavior or style?
    if task.requires_specific_style or task.requires_specific_format:
        if task.few_shot_examples_work:
            return "prompting"  # Few-shot can handle simple format changes
        return "fine-tuning"  # Complex style/format needs weight updates

    # Step 4: Is the gap about latency or cost?
    if task.latency_budget_ms < 200 or task.cost_per_query_budget < 0.001:
        return "fine-tuning"  # Smaller fine-tuned model is faster and cheaper

    # Step 5: Combine approaches
    return "RAG + fine-tuning"  # Many production systems use both
🔑 Key Insight

Start simple, escalate with evidence. The most common mistake teams make is jumping straight to fine-tuning without first trying prompt engineering and few-shot examples. A well-crafted prompt with 5 to 10 examples can often match or exceed a poorly fine-tuned model. Only fine-tune when you have clear evidence that prompting is insufficient and you can articulate why it is insufficient (style, format, latency, cost, or domain knowledge).
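
One cheap way to gather that evidence is to build a few-shot baseline and measure it before any training. A minimal sketch, with an illustrative extraction task (the examples are invented, not from a real dataset):

# A few-shot baseline to try before fine-tuning.
FEW_SHOT_PROMPT = """Extract the product and sentiment as JSON.

Review: The X200 headphones died after two days.
Output: {"product": "X200 headphones", "sentiment": "negative"}

Review: Love my new K5 keyboard, the keys feel great.
Output: {"product": "K5 keyboard", "sentiment": "positive"}

Review: <REVIEW>
Output:"""

def build_prompt(review: str) -> str:
    """Fill the few-shot template with a new input.
    (.replace avoids escaping the JSON braces that .format would choke on)"""
    return FEW_SHOT_PROMPT.replace("<REVIEW>", review)

# Evaluate this baseline on a held-out set first; escalate to fine-tuning
# only if it misses the quality bar and you can articulate why.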

2. When Fine-Tuning Excels

Fine-tuning is not a universal solution, but there are specific scenarios where it consistently outperforms prompting and RAG. Understanding these scenarios helps you invest your training budget where it will have the greatest impact.

2.1 Style and Tone Adaptation

When your application requires a consistent voice, persona, or writing style that is difficult to maintain through prompting alone, fine-tuning is often the best solution. A customer support chatbot that must always respond in a specific brand voice, a medical documentation system that must use precise clinical language, or a legal assistant that must follow particular citation conventions are all good candidates for style fine-tuning.

# Example: Style consistency comparison
# Prompting approach (inconsistent across long conversations)
prompt = """You are a friendly customer support agent for TechCorp.
Always use casual, warm language. Never use technical jargon.
Refer to the customer by name when possible.

Customer: My internet keeps dropping every 30 minutes.
Agent:"""

# Fine-tuned approach (consistent by default)
# After fine-tuning on 5,000 TechCorp support transcripts,
# the model inherently produces on-brand responses without
# needing the style instructions in every prompt.

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("techcorp/support-llama-7b")
tokenizer = AutoTokenizer.from_pretrained("techcorp/support-llama-7b")

# No style instructions needed; the model learned the voice
messages = [
    {"role": "user", "content": "My internet keeps dropping every 30 minutes."}
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
)
output = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))

2.2 Domain Knowledge Injection

When your task requires deep understanding of a specialized domain (medicine, law, finance, a particular codebase), fine-tuning can embed that knowledge directly into the model weights. This is particularly valuable when the domain vocabulary and reasoning patterns differ significantly from general text.

| Scenario | Best Approach | Rationale |
|---|---|---|
| Answer questions about company policies | RAG | Policies change frequently; retrieval keeps answers current |
| Generate clinical notes in SOAP format | Fine-tuning | Format and medical terminology are stable and trainable |
| Summarize legal contracts | Fine-tuning + RAG | Legal language patterns (fine-tune) with specific clause lookup (RAG) |
| Customer support with product catalog | RAG | Product details change with each release cycle |
| Code review for internal framework | Fine-tuning | Framework patterns are stable; baking them in reduces prompt size |
| Real-time news Q&A | RAG | Knowledge must be up to the minute |
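
The "Fine-tuning + RAG" row is worth making concrete: fine-tuning supplies the stable domain language and conventions, while retrieval supplies the volatile specifics. A minimal sketch with hypothetical stubs (not a real API):

# Hybrid pattern: fine-tuned model for style/terminology, RAG for specifics.
def retrieve_clauses(contract_id: str, query: str) -> list:
    """Stub for clause lookup in a vector index of contracts."""
    return ["<clause 4.2: termination terms>", "<clause 7.1: liability cap>"]

def summarize_contract(finetuned_llm, contract_id: str, query: str) -> str:
    # The fine-tuned model already knows legal summarization conventions,
    # so the prompt carries only the contract-specific clauses.
    clauses = "\n".join(retrieve_clauses(contract_id, query))
    return finetuned_llm(f"Clauses:\n{clauses}\n\nTask: {query}")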

2.3 Output Format Enforcement

If your application requires strict output formatting (JSON with specific schemas, XML, particular markdown structures), fine-tuning on thousands of correctly formatted examples can make the model more reliable than prompting alone. The model learns the structural patterns at a deeper level than instruction following can achieve.

# Measuring format compliance: prompting vs. fine-tuning
import json
from typing import Dict, List

def evaluate_format_compliance(
    model_outputs: List[str],
    required_schema: Dict
) -> Dict[str, float]:
    """Compare format compliance between prompting and fine-tuning."""
    results = {
        "total": len(model_outputs),
        "valid_json": 0,
        "schema_compliant": 0,
        "field_completeness": []
    }

    required_fields = set(required_schema.get("required", []))

    for output in model_outputs:
        # Check if output is valid JSON
        try:
            parsed = json.loads(output)
            results["valid_json"] += 1
        except json.JSONDecodeError:
            results["field_completeness"].append(0.0)
            continue

        # Check schema compliance
        output_fields = set(parsed.keys())
        # Guard against an empty "required" list to avoid division by zero
        completeness = (
            len(required_fields & output_fields) / len(required_fields)
            if required_fields else 1.0
        )
        results["field_completeness"].append(completeness)

        if required_fields.issubset(output_fields):
            results["schema_compliant"] += 1

    results["json_rate"] = results["valid_json"] / results["total"]
    results["compliance_rate"] = results["schema_compliant"] / results["total"]
    results["avg_completeness"] = sum(results["field_completeness"]) / results["total"]

    return results

# Typical results:
# Prompting (GPT-4):  json_rate=0.95, compliance_rate=0.82
# Fine-tuned (Llama): json_rate=0.99, compliance_rate=0.97

2.4 Latency and Cost Optimization

Fine-tuning a smaller model to match the performance of a larger model on your specific task is one of the most compelling economic arguments for fine-tuning. A fine-tuned 7B parameter model that matches GPT-4 quality on your narrow task can reduce inference cost by 10x to 50x and latency by 3x to 5x.

[Figure: cost per 1K queries ($0.1 to $50, log scale) vs. task-specific quality (F1, 0.70 to 0.95) for Llama-7B (base), fine-tuned Llama-7B, GPT-4o-mini, GPT-4o, and GPT-4]
Figure 13.2: Fine-tuning a small model can approach large model quality at a fraction of the cost
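
The economics reduce to a break-even count: the one-time training cost divided by the per-query saving. A back-of-envelope sketch with illustrative numbers, not vendor pricing:

def breakeven_queries(
    training_cost_usd: float,           # one-time: data labeling + GPU hours
    large_model_cost_per_query: float,  # e.g., frontier API pricing
    small_model_cost_per_query: float,  # self-hosted fine-tuned 7B
) -> float:
    """Number of queries before fine-tuning pays for itself."""
    savings_per_query = large_model_cost_per_query - small_model_cost_per_query
    return training_cost_usd / savings_per_query

# Illustrative: $2,000 of training vs. a 20x per-query saving
n = breakeven_queries(2000.0, 0.01, 0.0005)
print(f"Break-even after {n:,.0f} queries")  # Break-even after 210,526 queries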

3. Full Fine-Tuning vs. Parameter-Efficient Fine-Tuning

Once you decide to fine-tune, the next decision is whether to update all model parameters (full fine-tuning) or only a small subset (parameter-efficient fine-tuning, or PEFT). This choice has significant implications for compute cost, storage, and the risk of catastrophic forgetting.

| Aspect | Full Fine-Tuning | Parameter-Efficient (LoRA/QLoRA) |
|---|---|---|
| Parameters updated | All (100%) | 0.1% to 2% |
| GPU memory (7B model) | ~60 GB (FP16) | ~16 GB (QLoRA 4-bit) |
| Training time | Hours to days | Minutes to hours |
| Storage per checkpoint | 14 GB (7B FP16) | 50 to 200 MB (adapter only) |
| Forgetting risk | Higher | Lower (base model frozen) |
| Task performance | Slightly higher ceiling | Within 1 to 3% of full fine-tuning |
| Multi-task serving | Separate model per task | Shared base + swappable adapters |
📝 Note

Parameter-efficient fine-tuning (PEFT) techniques like LoRA are covered in detail in Module 14. This section focuses on the conceptual decision of when to fine-tune; Module 14 covers the how of doing it efficiently. For the SFT workflows in Section 13.3, we cover full fine-tuning; the same principles apply when using LoRA adapters.

# Quick comparison: resource requirements
def estimate_training_resources(
    model_size_billions: float,
    method: str = "full",  # "full", "lora", "qlora"
    precision: str = "fp16"
) -> dict:
    """Estimate GPU memory and storage for fine-tuning."""
    bytes_per_param = {"fp32": 4, "fp16": 2, "bf16": 2, "int8": 1, "int4": 0.5}

    param_bytes = bytes_per_param.get(precision, 2)
    # Use decimal GB (1 GB = 1e9 bytes) so results match reported checkpoint sizes
    model_memory_gb = model_size_billions * param_bytes

    if method == "full":
        # Model + gradients + optimizer states (AdamW: momentum + variance)
        training_memory_gb = model_memory_gb * 4  # Rough 4x multiplier
        trainable_params = model_size_billions * 1e9
        checkpoint_gb = model_memory_gb
    elif method == "lora":
        # Frozen model + small adapter gradients/optimizer
        trainable_params = model_size_billions * 1e9 * 0.01  # ~1% of params
        training_memory_gb = model_memory_gb + 2  # Base + adapter overhead
        checkpoint_gb = 0.1  # Adapter only
    elif method == "qlora":
        # 4-bit quantized base model + adapter
        model_memory_gb = model_size_billions * 0.5
        trainable_params = model_size_billions * 1e9 * 0.01
        training_memory_gb = model_memory_gb + 2
        checkpoint_gb = 0.1
    else:
        raise ValueError(f"Unknown method: {method!r}")

    return {
        "method": method,
        "model_size": f"{model_size_billions}B",
        "training_memory_gb": round(training_memory_gb, 1),
        "trainable_params": f"{trainable_params/1e6:.1f}M",
        "checkpoint_size_gb": round(checkpoint_gb, 1),
        "min_gpu": "A100 80GB" if training_memory_gb > 40 else "A100 40GB"
            if training_memory_gb > 20 else "RTX 4090 24GB"
            if training_memory_gb > 16 else "RTX 3090 24GB"
    }

# Compare methods for a 7B model
for method in ["full", "lora", "qlora"]:
    result = estimate_training_resources(7.0, method=method)
    print(f"{method:6s}: {result['training_memory_gb']:5.1f} GB, "
          f"{result['trainable_params']:>8s} params, "
          f"checkpoint: {result['checkpoint_size_gb']} GB")
full  :  56.0 GB,  7000.0M params, checkpoint: 14.0 GB
lora  :  16.0 GB,    70.0M params, checkpoint: 0.1 GB
qlora :   5.5 GB,    70.0M params, checkpoint: 0.1 GB

4. Catastrophic Forgetting

Catastrophic forgetting is the phenomenon where a model, after being fine-tuned on a specific task, loses its ability to perform well on other tasks it could previously handle. This happens because gradient updates that improve performance on the fine-tuning data can overwrite weights that encode general knowledge.

4.1 Symptoms and Causes

The most common symptoms of catastrophic forgetting include degraded performance on general benchmarks (MMLU, HellaSwag), loss of instruction-following ability, increased repetition or degenerate outputs, and inability to handle prompts outside the fine-tuning distribution. The primary causes are training for too many epochs, using a learning rate that is too high, training on a dataset that is too narrow in distribution, and failing to include regularization.

[Figure: performance vs. training steps; the target-task curve rises while the general-ability curve declines, with an optimal zone where both remain high]
Figure 13.3: As task-specific performance improves, general capabilities may degrade. The optimal checkpoint balances both.

4.2 Mitigation Strategies

# Strategies for mitigating catastrophic forgetting
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class ForgettingMitigationConfig:
    """Configuration for preventing catastrophic forgetting."""

    # 1. Learning rate: use a low learning rate for fine-tuning
    learning_rate: float = 2e-5  # 10x lower than pre-training

    # 2. Short training: fewer epochs reduce overwriting
    num_epochs: int = 3  # Rarely need more than 3-5

    # 3. Data mixing: include general-purpose data
    task_data_ratio: float = 0.7   # 70% task-specific
    general_data_ratio: float = 0.3  # 30% general (e.g., OpenAssistant)

    # 4. Regularization
    weight_decay: float = 0.01
    max_grad_norm: float = 1.0

    # 5. Evaluation on general benchmarks during training
    eval_general_benchmarks: bool = True
    general_eval_datasets: Optional[List[str]] = None

    def __post_init__(self):
        if self.general_eval_datasets is None:
            self.general_eval_datasets = [
                "mmlu",         # General knowledge
                "hellaswag",    # Commonsense reasoning
                "arc_easy",     # Science questions
            ]

    def get_data_mix(self, task_samples: int) -> dict:
        """Calculate how many general samples to mix in."""
        general_samples = int(
            task_samples * self.general_data_ratio / self.task_data_ratio
        )
        return {
            "task_samples": task_samples,
            "general_samples": general_samples,
            "total": task_samples + general_samples,
            "effective_task_ratio": task_samples / (task_samples + general_samples)
        }

config = ForgettingMitigationConfig()
mix = config.get_data_mix(task_samples=5000)
print(f"Task: {mix['task_samples']}, General: {mix['general_samples']}, "
      f"Total: {mix['total']}")
Task: 5000, General: 2142, Total: 7142
⚠ Warning

Do not skip general evaluation. Many teams only measure performance on their target task during fine-tuning and discover too late that the model has lost critical general capabilities. Always evaluate on at least 2 to 3 general benchmarks at every checkpoint. If general performance drops more than 5% from the base model, you are likely overtraining.
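
One way to operationalize the 5% rule is to filter checkpoints by their worst general-benchmark regression before ranking them on the target task. A sketch; the score dictionaries, benchmark keys, and threshold are assumptions:

from typing import Dict, Optional

def select_checkpoint(
    checkpoints: Dict[str, Dict[str, float]],  # name -> benchmark scores
    base_scores: Dict[str, float],             # base model's general scores
    max_relative_drop: float = 0.05,           # the 5% rule
) -> Optional[str]:
    """Pick the best target-task checkpoint that keeps general ability."""
    best_name, best_task_score = None, float("-inf")
    for name, scores in checkpoints.items():
        # Worst relative regression across general benchmarks vs. the base
        worst_drop = max(
            (base_scores[b] - scores[b]) / base_scores[b] for b in base_scores
        )
        if worst_drop > max_relative_drop:
            continue  # likely overtrained: general ability regressed too far
        if scores["target_task"] > best_task_score:
            best_name, best_task_score = name, scores["target_task"]
    return best_name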

5. Continual Pre-Training vs. Instruction Fine-Tuning

Fine-tuning comes in two distinct flavors that serve different purposes. Continual pre-training (also called domain-adaptive pre-training) extends the original pre-training objective on domain-specific text. Instruction fine-tuning (also called supervised fine-tuning or SFT) trains the model to follow instructions and produce specific outputs. Understanding the difference is critical for choosing the right approach.

5.1 Continual Pre-Training

Continual pre-training uses the same next-token prediction objective as the original pre-training, but on a domain-specific corpus. The model learns the vocabulary, concepts, and reasoning patterns of the target domain without any explicit instruction/output pairs. This is useful when the model lacks fundamental domain knowledge.
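
On the data side, continual pre-training needs nothing more than raw text packed into fixed-length blocks. A minimal sketch using a Hugging Face tokenizer; the checkpoint name is a placeholder:

from transformers import AutoTokenizer

# Placeholder checkpoint; any causal LM tokenizer works the same way
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

def pack_corpus(documents, block_size: int = 2048):
    """Concatenate raw domain text and split into fixed-length blocks.
    There is no instruction/output structure: the training signal is
    simply predicting the next token of domain text."""
    ids = []
    for doc in documents:
        ids.extend(tokenizer(doc)["input_ids"])
    n_blocks = len(ids) // block_size  # drop the ragged tail
    return [ids[i * block_size:(i + 1) * block_size] for i in range(n_blocks)]

# Each block trains with labels == input_ids (shifted inside the model)
blocks = pack_corpus(["<medical paper 1>", "<medical paper 2>"])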

5.2 Instruction Fine-Tuning (SFT)

Instruction fine-tuning trains the model on input/output pairs where each input is a user instruction or query and each output is the desired response. This teaches the model to follow instructions, produce specific output formats, and adopt particular behaviors. Most practical fine-tuning falls into this category.
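
By contrast, one SFT training example is a structured pair, typically rendered through the model's chat template so that only the response tokens are supervised. A minimal sketch; the field names follow a common convention, not a fixed standard:

# One SFT example: a structured pair, not raw text
sft_example = {
    "instruction": "Summarize the patient's chief complaint in one sentence.",
    "input": "Pt is a 54yo M presenting with intermittent chest pain...",
    "output": "A 54-year-old man presents with intermittent chest pain.",
}

def to_chat_messages(example: dict) -> list:
    """Render the pair as chat messages for a chat template."""
    user_turn = example["instruction"]
    if example.get("input"):
        user_turn += "\n\n" + example["input"]
    return [
        {"role": "user", "content": user_turn},
        {"role": "assistant", "content": example["output"]},  # supervised part
    ]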

| Aspect | Continual Pre-Training | Instruction Fine-Tuning (SFT) |
|---|---|---|
| Training objective | Next-token prediction (causal LM) | Supervised on instruction/output pairs |
| Data format | Raw text (documents, papers) | Structured pairs (instruction, response) |
| Data quantity | Millions to billions of tokens | Thousands to tens of thousands of examples |
| Purpose | Inject domain knowledge | Teach behavior and format |
| Typical use | Medical, legal, financial models | Chatbots, task-specific assistants |
| Training duration | Days to weeks | Hours to a day |
| Example | Train on 10B tokens of medical literature | Train on 10K medical Q&A pairs |
🔑 Key Insight

The two-stage pipeline. For domain-specific applications, the most effective approach is often a two-stage pipeline: first, continual pre-training on domain text to inject knowledge, then instruction fine-tuning to teach the model how to use that knowledge in response to user queries. This separates the "what to know" stage from the "how to behave" stage and typically produces better results than either stage alone.

Section 13.1 Quiz

1. A company needs a model that answers questions about internal policies that change quarterly. Which approach is most appropriate?
Answer: RAG is the best choice here. Since the policies change frequently (quarterly), retrieval-augmented generation allows the system to always serve the most current information without retraining. Fine-tuning would require retraining every quarter and risks serving stale information between updates.
2. What is the primary risk of fine-tuning a model for too many epochs on a narrow dataset?
Answer: Catastrophic forgetting. Training for too many epochs on a narrow dataset causes the model to overwrite general knowledge encoded in its weights. The model may excel at the specific task but lose its ability to handle general prompts, follow diverse instructions, or reason about topics outside the training distribution.
3. A startup wants to deploy a model that generates JSON output with a strict schema for 100,000 requests per day. Prompt engineering yields 85% schema compliance. What should they try next?
Answer: Fine-tuning on a dataset of correctly formatted JSON outputs. At 100K requests/day, the 15% failure rate from prompting means 15,000 failed requests daily. Fine-tuning typically achieves 97%+ schema compliance, reducing failures to 3,000 or fewer. The cost of fine-tuning is quickly recovered through reduced error handling and retry costs.
4. How does QLoRA reduce the GPU memory required for fine-tuning a 7B parameter model compared to full fine-tuning?
Answer: QLoRA reduces memory in two ways. First, the base model is quantized to 4-bit precision, reducing its memory footprint by approximately 4x compared to FP16. Second, only a small set of low-rank adapter parameters (roughly 1% of total parameters) require gradient computation and optimizer states. Together, this reduces GPU memory from approximately 56 GB to roughly 5.5 GB, making fine-tuning feasible on consumer GPUs.
5. What is the difference between continual pre-training and instruction fine-tuning?
Answer: Continual pre-training extends the original pre-training objective (next-token prediction) on domain-specific raw text, injecting domain knowledge and vocabulary into the model. Instruction fine-tuning trains on structured input/output pairs, teaching the model to follow instructions and produce specific response formats. They serve different purposes: continual pre-training teaches "what to know" while instruction fine-tuning teaches "how to behave."

Key Takeaways