Module 13 · Section 13.1

When and Why to Fine-Tune

A decision framework for choosing between prompting, RAG, and fine-tuning, with guidelines for understanding when each approach delivers the best return on investment
★ Big Picture

Fine-tuning is powerful, but it is not always the right tool. Before investing in data collection, GPU hours, and training infrastructure, you need a clear framework for deciding whether fine-tuning will actually solve your problem better than prompt engineering or retrieval-augmented generation. This section provides that framework, covering the core use cases where fine-tuning excels, the different flavors of fine-tuning (full, parameter-efficient, continual pre-training), and the pitfalls that catch teams who jump to fine-tuning prematurely.

1. The Adaptation Spectrum

When a pre-trained language model does not meet your needs out of the box, you have several options for adapting it. These options form a spectrum from lightweight (no training required) to heavyweight (full model retraining). Understanding where each technique sits on this spectrum is essential for making cost-effective decisions.

1.1 Prompting, RAG, and Fine-Tuning

The three primary approaches to model adaptation differ in their complexity, cost, and the types of improvements they can deliver. Prompt engineering is the simplest: you craft instructions that guide the model toward the desired behavior at inference time. RAG augments the model with external knowledge by retrieving relevant documents and injecting them into the prompt. Fine-tuning modifies the model weights themselves through additional training on task-specific data.

|  | Prompt Engineering | RAG | Fine-Tuning |
|---|---|---|---|
| Setup | No training needed | Retrieval infrastructure | Training data + GPUs |
| Iteration | Minutes to iterate | Days to build pipeline | Days to weeks |
| Knowledge | Limited by context window | Dynamic knowledge | Baked-in knowledge |
| Inference | Higher inference cost | Retrieval latency added | Lower inference cost |

Moving from lightweight to heavyweight, setup cost, training time, and control over behavior increase, while iteration speed and flexibility to change decrease.
Figure 13.1: The adaptation spectrum from lightweight prompting to heavyweight fine-tuning
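
To make the spectrum concrete, the sketch below answers the same question under each approach. Here `base_llm`, `finetuned_llm`, and `retrieve` are hypothetical stubs standing in for a model API and a vector store, not real libraries.

# Hypothetical stand-ins so the sketch runs; swap in real model calls.
def base_llm(prompt: str) -> str:
    """Stub for a general-purpose hosted model."""
    return "<model response>"

def finetuned_llm(prompt: str) -> str:
    """Stub for a model whose weights were adapted to the task."""
    return "<model response>"

def retrieve(query: str, top_k: int = 3) -> list:
    """Stub for a vector-store search."""
    return ["<doc 1>", "<doc 2>", "<doc 3>"][:top_k]

question = "What is our refund window for annual plans?"

# Prompting: all guidance lives in the prompt at inference time
prompt_answer = base_llm(
    "You are a billing assistant. Answer concisely.\n"
    f"Question: {question}"
)

# RAG: retrieve relevant documents and inject them as context
context = "\n".join(retrieve(question))
rag_answer = base_llm(
    f"Answer using only this context:\n{context}\n\nQuestion: {question}"
)

# Fine-tuning: knowledge and style live in the weights, so the
# inference-time prompt shrinks to the bare question
ft_answer = finetuned_llm(question)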

1.2 The Decision Framework

The following decision framework helps you determine which approach to try first. The key insight is that you should start with the lightest approach that could work and only move to heavier approaches when you have evidence that simpler methods fall short.

from dataclasses import dataclass

@dataclass
class Task:
    """Illustrative task attributes driving the decision."""
    can_be_described_in_prompt: bool = True
    prompting_meets_quality_bar: bool = False  # set after a prompting baseline eval
    requires_external_knowledge: bool = False
    knowledge_changes_frequently: bool = False
    knowledge_is_static: bool = False
    dataset_size: int = 0
    requires_specific_style: bool = False
    requires_specific_format: bool = False
    few_shot_examples_work: bool = False
    latency_budget_ms: float = 1000.0
    cost_per_query_budget: float = 0.01

def choose_adaptation_strategy(task: Task) -> str:
    """Decision framework for choosing between prompting, RAG, and fine-tuning."""

    # Step 1: Can prompting solve it?
    if task.can_be_described_in_prompt and task.prompting_meets_quality_bar:
        return "prompting"  # Simplest solution that works

    # Step 2: Is the gap about missing knowledge?
    if task.requires_external_knowledge:
        if task.knowledge_changes_frequently:
            return "RAG"  # Dynamic knowledge needs retrieval
        if task.knowledge_is_static and task.dataset_size > 10_000:
            return "fine-tuning"  # Large static knowledge: bake it in

    # Step 3: Is the gap about behavior or style?
    if task.requires_specific_style or task.requires_specific_format:
        if task.few_shot_examples_work:
            return "prompting"  # Few-shot can handle simple format changes
        return "fine-tuning"  # Complex style/format needs weight updates

    # Step 4: Is the gap about latency or cost?
    if task.latency_budget_ms < 200 or task.cost_per_query_budget < 0.001:
        return "fine-tuning"  # Smaller fine-tuned model is faster and cheaper

    # Step 5: Combine approaches
    return "RAG + fine-tuning"  # Many production systems use both
🔑 Key Insight

Start simple, escalate with evidence. The most common mistake teams make is jumping straight to fine-tuning without first trying prompt engineering and few-shot examples. A well-crafted prompt with 5 to 10 examples can often match or exceed a poorly fine-tuned model. Only fine-tune when you have clear evidence that prompting is insufficient and you can articulate why it is insufficient (style, format, latency, cost, or domain knowledge).
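
One cheap way to gather that evidence is to build a few-shot baseline and measure it before any training. A minimal sketch, with an illustrative extraction task (the examples are invented, not from a real dataset):

# A few-shot baseline to try before fine-tuning.
FEW_SHOT_PROMPT = """Extract the product and sentiment as JSON.

Review: The X200 headphones died after two days.
Output: {"product": "X200 headphones", "sentiment": "negative"}

Review: Love my new K5 keyboard, the keys feel great.
Output: {"product": "K5 keyboard", "sentiment": "positive"}

Review: <REVIEW>
Output:"""

def build_prompt(review: str) -> str:
    """Fill the few-shot template with a new input.
    (.replace avoids escaping the JSON braces that .format would choke on)"""
    return FEW_SHOT_PROMPT.replace("<REVIEW>", review)

# Evaluate this baseline on a held-out set first; escalate to fine-tuning
# only if it misses the quality bar and you can articulate why.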

2. When Fine-Tuning Excels

Fine-tuning is not a universal solution, but there are specific scenarios where it consistently outperforms prompting and RAG. Understanding these scenarios helps you invest your training budget where it will have the greatest impact.

2.1 Style and Tone Adaptation

When your application requires a consistent voice, persona, or writing style that is difficult to maintain through prompting alone, fine-tuning is often the best solution. A customer support chatbot that must always respond in a specific brand voice, a medical documentation system that must use precise clinical language, or a legal assistant that must follow particular citation conventions are all good candidates for style fine-tuning.

# Example: Style consistency comparison
# Prompting approach (inconsistent across long conversations)
prompt = """You are a friendly customer support agent for TechCorp.
Always use casual, warm language. Never use technical jargon.
Refer to the customer by name when possible.

Customer: My internet keeps dropping every 30 minutes.
Agent:"""

# Fine-tuned approach (consistent by default)
# After fine-tuning on 5,000 TechCorp support transcripts,
# the model inherently produces on-brand responses without
# needing the style instructions in every prompt.

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("techcorp/support-llama-7b")
tokenizer = AutoTokenizer.from_pretrained("techcorp/support-llama-7b")

# No style instructions needed; the model learned the voice
messages = [
    {"role": "user", "content": "My internet keeps dropping every 30 minutes."}
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
)
output = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))

2.2 Domain Knowledge Injection

When your task requires deep understanding of a specialized domain (medicine, law, finance, a particular codebase), fine-tuning can embed that knowledge directly into the model weights. This is particularly valuable when the domain vocabulary and reasoning patterns differ significantly from general text.

| Scenario | Best Approach | Rationale |
|---|---|---|
| Answer questions about company policies | RAG | Policies change frequently; retrieval keeps answers current |
| Generate clinical notes in SOAP format | Fine-tuning | Format and medical terminology are stable and trainable |
| Summarize legal contracts | Fine-tuning + RAG | Legal language patterns (fine-tune) with specific clause lookup (RAG) |
| Customer support with product catalog | RAG | Product details change with each release cycle |
| Code review for internal framework | Fine-tuning | Framework patterns are stable; baking them in reduces prompt size |
| Real-time news Q&A | RAG | Knowledge must be up to the minute |
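
The "Fine-tuning + RAG" row is worth making concrete: fine-tuning supplies the stable domain language and conventions, while retrieval supplies the volatile specifics. A minimal sketch with hypothetical stubs (not a real API):

# Hybrid pattern: fine-tuned model for style/terminology, RAG for specifics.
def retrieve_clauses(contract_id: str, query: str) -> list:
    """Stub for clause lookup in a vector index of contracts."""
    return ["<clause 4.2: termination terms>", "<clause 7.1: liability cap>"]

def summarize_contract(finetuned_llm, contract_id: str, query: str) -> str:
    # The fine-tuned model already knows legal summarization conventions,
    # so the prompt carries only the contract-specific clauses.
    clauses = "\n".join(retrieve_clauses(contract_id, query))
    return finetuned_llm(f"Clauses:\n{clauses}\n\nTask: {query}")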

2.3 Output Format Enforcement

If your application requires strict output formatting (JSON with specific schemas, XML, particular markdown structures), fine-tuning on thousands of correctly formatted examples can make the model more reliable than prompting alone. The model learns the structural patterns at a deeper level than instruction following can achieve.

# Measuring format compliance: prompting vs. fine-tuning
import json
from typing import Dict, List

def evaluate_format_compliance(
    model_outputs: List[str],
    required_schema: Dict
) -> Dict[str, float]:
    """Compare format compliance between prompting and fine-tuning."""
    results = {
        "total": len(model_outputs),
        "valid_json": 0,
        "schema_compliant": 0,
        "field_completeness": []
    }

    required_fields = set(required_schema.get("required", []))

    for output in model_outputs:
        # Check if output is valid JSON
        try:
            parsed = json.loads(output)
            results["valid_json"] += 1
        except json.JSONDecodeError:
            results["field_completeness"].append(0.0)
            continue

        # Check schema compliance
        output_fields = set(parsed.keys())
        # Guard against an empty "required" list to avoid division by zero
        completeness = (
            len(required_fields & output_fields) / len(required_fields)
            if required_fields else 1.0
        )
        results["field_completeness"].append(completeness)

        if required_fields.issubset(output_fields):
            results["schema_compliant"] += 1

    results["json_rate"] = results["valid_json"] / results["total"]
    results["compliance_rate"] = results["schema_compliant"] / results["total"]
    results["avg_completeness"] = sum(results["field_completeness"]) / results["total"]

    return results

# Typical results:
# Prompting (GPT-4):  json_rate=0.95, compliance_rate=0.82
# Fine-tuned (Llama): json_rate=0.99, compliance_rate=0.97

2.4 Latency and Cost Optimization

Fine-tuning a smaller model to match the performance of a larger model on your specific task is one of the most compelling economic arguments for fine-tuning. A fine-tuned 7B parameter model that matches GPT-4 quality on your narrow task can reduce inference cost by 10x to 50x and latency by 3x to 5x.

[Figure: cost per 1K queries ($0.1 to $50, log scale) vs. task-specific quality (F1, 0.70 to 0.95) for Llama-7B (base), fine-tuned Llama-7B, GPT-4o-mini, GPT-4o, and GPT-4]
Figure 13.2: Fine-tuning a small model can approach large model quality at a fraction of the cost
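
The economics reduce to a break-even count: the one-time training cost divided by the per-query saving. A back-of-envelope sketch with illustrative numbers, not vendor pricing:

def breakeven_queries(
    training_cost_usd: float,           # one-time: data labeling + GPU hours
    large_model_cost_per_query: float,  # e.g., frontier API pricing
    small_model_cost_per_query: float,  # self-hosted fine-tuned 7B
) -> float:
    """Number of queries before fine-tuning pays for itself."""
    savings_per_query = large_model_cost_per_query - small_model_cost_per_query
    return training_cost_usd / savings_per_query

# Illustrative: $2,000 of training vs. a 20x per-query saving
n = breakeven_queries(2000.0, 0.01, 0.0005)
print(f"Break-even after {n:,.0f} queries")  # Break-even after 210,526 queries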

3. Full Fine-Tuning vs. Parameter-Efficient Fine-Tuning

Once you decide to fine-tune, the next decision is whether to update all model parameters (full fine-tuning) or only a small subset (parameter-efficient fine-tuning, or PEFT). This choice has significant implications for compute cost, storage, and the risk of catastrophic forgetting.

| Aspect | Full Fine-Tuning | Parameter-Efficient (LoRA/QLoRA) |
|---|---|---|
| Parameters updated | All (100%) | 0.1% to 2% |
| GPU memory (7B model) | ~60 GB (FP16) | ~16 GB (QLoRA 4-bit) |
| Training time | Hours to days | Minutes to hours |
| Storage per checkpoint | 14 GB (7B FP16) | 50 to 200 MB (adapter only) |
| Forgetting risk | Higher | Lower (base model frozen) |
| Task performance | Slightly higher ceiling | Within 1 to 3% of full fine-tuning |
| Multi-task serving | Separate model per task | Shared base + swappable adapters |
📝 Note

Parameter-efficient fine-tuning (PEFT) techniques like LoRA are covered in detail in Module 14. This section focuses on the conceptual decision of when to fine-tune; Module 14 covers the how of doing it efficiently. For the SFT workflows in Section 13.3, we cover full fine-tuning; the same principles apply when using LoRA adapters.

# Quick comparison: resource requirements
def estimate_training_resources(
    model_size_billions: float,
    method: str = "full",  # "full", "lora", "qlora"
    precision: str = "fp16"
) -> dict:
    """Estimate GPU memory and storage for fine-tuning."""
    bytes_per_param = {"fp32": 4, "fp16": 2, "bf16": 2, "int8": 1, "int4": 0.5}

    param_bytes = bytes_per_param.get(precision, 2)
    # Use decimal GB (1 GB = 1e9 bytes) so results match reported checkpoint sizes
    model_memory_gb = model_size_billions * param_bytes

    if method == "full":
        # Model + gradients + optimizer states (AdamW: momentum + variance)
        training_memory_gb = model_memory_gb * 4  # Rough 4x multiplier
        trainable_params = model_size_billions * 1e9
        checkpoint_gb = model_memory_gb
    elif method == "lora":
        # Frozen model + small adapter gradients/optimizer
        trainable_params = model_size_billions * 1e9 * 0.01  # ~1% of params
        training_memory_gb = model_memory_gb + 2  # Base + adapter overhead
        checkpoint_gb = 0.1  # Adapter only
    elif method == "qlora":
        # 4-bit quantized base model + adapter
        model_memory_gb = model_size_billions * 0.5
        trainable_params = model_size_billions * 1e9 * 0.01
        training_memory_gb = model_memory_gb + 2
        checkpoint_gb = 0.1
    else:
        raise ValueError(f"Unknown method: {method!r}")

    return {
        "method": method,
        "model_size": f"{model_size_billions}B",
        "training_memory_gb": round(training_memory_gb, 1),
        "trainable_params": f"{trainable_params/1e6:.1f}M",
        "checkpoint_size_gb": round(checkpoint_gb, 1),
        "min_gpu": "A100 80GB" if training_memory_gb > 40 else "A100 40GB"
            if training_memory_gb > 20 else "RTX 4090 24GB"
            if training_memory_gb > 16 else "RTX 3090 24GB"
    }

# Compare methods for a 7B model
for method in ["full", "lora", "qlora"]:
    result = estimate_training_resources(7.0, method=method)
    print(f"{method:6s}: {result['training_memory_gb']:5.1f} GB, "
          f"{result['trainable_params']:>8s} params, "
          f"checkpoint: {result['checkpoint_size_gb']} GB")
full  :  56.0 GB,  7000.0M params, checkpoint: 14.0 GB
lora  :  16.0 GB,    70.0M params, checkpoint: 0.1 GB
qlora :   5.5 GB,    70.0M params, checkpoint: 0.1 GB

4. Catastrophic Forgetting

Catastrophic forgetting is the phenomenon where a model, after being fine-tuned on a specific task, loses its ability to perform well on other tasks it could previously handle. This happens because gradient updates that improve performance on the fine-tuning data can overwrite weights that encode general knowledge.

4.1 Symptoms and Causes

The most common symptoms of catastrophic forgetting include degraded performance on general benchmarks (MMLU, HellaSwag), loss of instruction-following ability, increased repetition or degenerate outputs, and inability to handle prompts outside the fine-tuning distribution. The primary causes are training for too many epochs, using a learning rate that is too high, training on a dataset that is too narrow in distribution, and failing to include regularization.

[Figure: performance vs. training steps; the target-task curve rises while the general-ability curve declines, with an optimal zone where both remain high]
Figure 13.3: As task-specific performance improves, general capabilities may degrade. The optimal checkpoint balances both.

4.2 Mitigation Strategies

# Strategies for mitigating catastrophic forgetting
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class ForgettingMitigationConfig:
    """Configuration for preventing catastrophic forgetting."""

    # 1. Learning rate: use a low learning rate for fine-tuning
    learning_rate: float = 2e-5  # 10x lower than pre-training

    # 2. Short training: fewer epochs reduce overwriting
    num_epochs: int = 3  # Rarely need more than 3-5

    # 3. Data mixing: include general-purpose data
    task_data_ratio: float = 0.7   # 70% task-specific
    general_data_ratio: float = 0.3  # 30% general (e.g., OpenAssistant)

    # 4. Regularization
    weight_decay: float = 0.01
    max_grad_norm: float = 1.0

    # 5. Evaluation on general benchmarks during training
    eval_general_benchmarks: bool = True
    general_eval_datasets: Optional[List[str]] = None

    def __post_init__(self):
        if self.general_eval_datasets is None:
            self.general_eval_datasets = [
                "mmlu",         # General knowledge
                "hellaswag",    # Commonsense reasoning
                "arc_easy",     # Science questions
            ]

    def get_data_mix(self, task_samples: int) -> dict:
        """Calculate how many general samples to mix in."""
        general_samples = int(
            task_samples * self.general_data_ratio / self.task_data_ratio
        )
        return {
            "task_samples": task_samples,
            "general_samples": general_samples,
            "total": task_samples + general_samples,
            "effective_task_ratio": task_samples / (task_samples + general_samples)
        }

config = ForgettingMitigationConfig()
mix = config.get_data_mix(task_samples=5000)
print(f"Task: {mix['task_samples']}, General: {mix['general_samples']}, "
      f"Total: {mix['total']}")
Task: 5000, General: 2142, Total: 7142
⚠ Warning

Do not skip general evaluation. Many teams only measure performance on their target task during fine-tuning and discover too late that the model has lost critical general capabilities. Always evaluate on at least 2 to 3 general benchmarks at every checkpoint. If general performance drops more than 5% from the base model, you are likely overtraining.
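
One way to operationalize the 5% rule is to filter checkpoints by their worst general-benchmark regression before ranking them on the target task. A sketch; the score dictionaries, benchmark keys, and threshold are assumptions:

from typing import Dict, Optional

def select_checkpoint(
    checkpoints: Dict[str, Dict[str, float]],  # name -> benchmark scores
    base_scores: Dict[str, float],             # base model's general scores
    max_relative_drop: float = 0.05,           # the 5% rule
) -> Optional[str]:
    """Pick the best target-task checkpoint that keeps general ability."""
    best_name, best_task_score = None, float("-inf")
    for name, scores in checkpoints.items():
        # Worst relative regression across general benchmarks vs. the base
        worst_drop = max(
            (base_scores[b] - scores[b]) / base_scores[b] for b in base_scores
        )
        if worst_drop > max_relative_drop:
            continue  # likely overtrained: general ability regressed too far
        if scores["target_task"] > best_task_score:
            best_name, best_task_score = name, scores["target_task"]
    return best_name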

5. Continual Pre-Training vs. Instruction Fine-Tuning

Fine-tuning comes in two distinct flavors that serve different purposes. Continual pre-training (also called domain-adaptive pre-training) extends the original pre-training objective on domain-specific text. Instruction fine-tuning (also called supervised fine-tuning or SFT) trains the model to follow instructions and produce specific outputs. Understanding the difference is critical for choosing the right approach.

5.1 Continual Pre-Training

Continual pre-training uses the same next-token prediction objective as the original pre-training, but on a domain-specific corpus. The model learns the vocabulary, concepts, and reasoning patterns of the target domain without any explicit instruction/output pairs. This is useful when the model lacks fundamental domain knowledge.
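
On the data side, continual pre-training needs nothing more than raw text packed into fixed-length blocks. A minimal sketch using a Hugging Face tokenizer; the checkpoint name is a placeholder:

from transformers import AutoTokenizer

# Placeholder checkpoint; any causal LM tokenizer works the same way
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

def pack_corpus(documents, block_size: int = 2048):
    """Concatenate raw domain text and split into fixed-length blocks.
    There is no instruction/output structure: the training signal is
    simply predicting the next token of domain text."""
    ids = []
    for doc in documents:
        ids.extend(tokenizer(doc)["input_ids"])
    n_blocks = len(ids) // block_size  # drop the ragged tail
    return [ids[i * block_size:(i + 1) * block_size] for i in range(n_blocks)]

# Each block trains with labels == input_ids (shifted inside the model)
blocks = pack_corpus(["<medical paper 1>", "<medical paper 2>"])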

5.2 Instruction Fine-Tuning (SFT)

Instruction fine-tuning trains the model on input/output pairs where each input is a user instruction or query and each output is the desired response. This teaches the model to follow instructions, produce specific output formats, and adopt particular behaviors. Most practical fine-tuning falls into this category.
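
By contrast, one SFT training example is a structured pair, typically rendered through the model's chat template so that only the response tokens are supervised. A minimal sketch; the field names follow a common convention, not a fixed standard:

# One SFT example: a structured pair, not raw text
sft_example = {
    "instruction": "Summarize the patient's chief complaint in one sentence.",
    "input": "Pt is a 54yo M presenting with intermittent chest pain...",
    "output": "A 54-year-old man presents with intermittent chest pain.",
}

def to_chat_messages(example: dict) -> list:
    """Render the pair as chat messages for a chat template."""
    user_turn = example["instruction"]
    if example.get("input"):
        user_turn += "\n\n" + example["input"]
    return [
        {"role": "user", "content": user_turn},
        {"role": "assistant", "content": example["output"]},  # supervised part
    ]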

| Aspect | Continual Pre-Training | Instruction Fine-Tuning (SFT) |
|---|---|---|
| Training objective | Next-token prediction (causal LM) | Supervised on instruction/output pairs |
| Data format | Raw text (documents, papers) | Structured pairs (instruction, response) |
| Data quantity | Millions to billions of tokens | Thousands to tens of thousands of examples |
| Purpose | Inject domain knowledge | Teach behavior and format |
| Typical use | Medical, legal, financial models | Chatbots, task-specific assistants |
| Training duration | Days to weeks | Hours to a day |
| Example | Train on 10B tokens of medical literature | Train on 10K medical Q&A pairs |
🔑 Key Insight

The two-stage pipeline. For domain-specific applications, the most effective approach is often a two-stage pipeline: first, continual pre-training on domain text to inject knowledge, then instruction fine-tuning to teach the model how to use that knowledge in response to user queries. This separates the "what to know" stage from the "how to behave" stage and typically produces better results than either stage alone.

Section 13.1 Quiz

1. A company needs a model that answers questions about internal policies that change quarterly. Which approach is most appropriate?
Answer: RAG is the best choice here. Since the policies change frequently (quarterly), retrieval-augmented generation allows the system to always serve the most current information without retraining. Fine-tuning would require retraining every quarter and risks serving stale information between updates.
2. What is the primary risk of fine-tuning a model for too many epochs on a narrow dataset?
Answer: Catastrophic forgetting. Training for too many epochs on a narrow dataset causes the model to overwrite general knowledge encoded in its weights. The model may excel at the specific task but lose its ability to handle general prompts, follow diverse instructions, or reason about topics outside the training distribution.
3. A startup wants to deploy a model that generates JSON output with a strict schema for 100,000 requests per day. Prompt engineering yields 85% schema compliance. What should they try next?
Answer: Fine-tuning on a dataset of correctly formatted JSON outputs. At 100K requests/day, the 15% failure rate from prompting means 15,000 failed requests daily. Fine-tuning typically achieves 97%+ schema compliance, reducing failures to 3,000 or fewer. The cost of fine-tuning is quickly recovered through reduced error handling and retry costs.
4. How does QLoRA reduce the GPU memory required for fine-tuning a 7B parameter model compared to full fine-tuning?
Answer: QLoRA reduces memory in two ways. First, the base model is quantized to 4-bit precision, reducing its memory footprint by approximately 4x compared to FP16. Second, only a small set of low-rank adapter parameters (roughly 1% of total parameters) require gradient computation and optimizer states. Together, this reduces GPU memory from approximately 56 GB to roughly 5.5 GB, making fine-tuning feasible on consumer GPUs.
5. What is the difference between continual pre-training and instruction fine-tuning?
Answer: Continual pre-training extends the original pre-training objective (next-token prediction) on domain-specific raw text, injecting domain knowledge and vocabulary into the model. Instruction fine-tuning trains on structured input/output pairs, teaching the model to follow instructions and produce specific response formats. They serve different purposes: continual pre-training teaches "what to know" while instruction fine-tuning teaches "how to behave."

Key Takeaways