Module 13 · Section 13.2

Data Preparation for Fine-Tuning

Mastering dataset formats, chat templates, train/validation splits, data mixing strategies, and sequence packing for efficient training
★ Big Picture

Data quality is the single biggest lever in fine-tuning. A model fine-tuned on 1,000 high-quality examples will almost always outperform one fine-tuned on 10,000 noisy examples. This section covers the practical details of preparing training data: the standard dataset formats that tools expect, how chat templates control what the model actually sees during training, strategies for splitting and balancing your data, and sequence packing to maximize GPU utilization.

1. Standard Dataset Formats

The fine-tuning ecosystem has converged on a handful of standard formats. Each training framework (TRL, Axolotl, LLaMA-Factory) expects data in one of these formats. Understanding the differences helps you prepare data that works across tools without painful conversion steps.

1.1 Alpaca Format

The Alpaca format, introduced by Stanford's Alpaca project, is the simplest instruction-tuning format. Each example consists of an instruction, an optional input (context), and the expected output. This format works well for single-turn tasks like summarization, translation, and question answering.

# Alpaca format: instruction, input (optional), output
alpaca_examples = [
    {
        "instruction": "Summarize the following research paper abstract.",
        "input": "We present a novel approach to protein structure prediction "
                 "using graph neural networks. Our method achieves state-of-the-art "
                 "results on CASP14 benchmarks, improving GDT-TS scores by 12% over "
                 "existing methods. We demonstrate that incorporating evolutionary "
                 "coupling information as edge features significantly enhances "
                 "prediction accuracy for proteins with low sequence identity.",
        "output": "The paper introduces a graph neural network approach for protein "
                  "structure prediction that achieves a 12% improvement in GDT-TS "
                  "scores on CASP14 benchmarks. The key innovation is using "
                  "evolutionary coupling information as edge features, which "
                  "particularly helps with low-sequence-identity proteins."
    },
    {
        "instruction": "Classify the sentiment of this product review.",
        "input": "The battery life is incredible but the screen is too dim outdoors.",
        "output": "Mixed sentiment. Positive: battery life. Negative: screen brightness."
    },
    {
        # No input field: instruction-only format
        "instruction": "What are three common causes of overfitting in neural networks?",
        "output": "1. Training for too many epochs without early stopping.\n"
                  "2. Using a model with excessive capacity relative to the dataset size.\n"
                  "3. Insufficient regularization (no dropout, weight decay, or data augmentation)."
    }
]

# Save as JSONL (one JSON object per line)
import json

with open("train_alpaca.jsonl", "w") as f:
    for example in alpaca_examples:
        f.write(json.dumps(example) + "\n")

1.2 ShareGPT / Conversational Format

The ShareGPT format supports multi-turn conversations, making it the standard for training chat models. Each example contains a list of conversation turns with alternating roles (human/assistant or user/assistant). This format naturally captures the back-and-forth dynamics of real conversations.

# ShareGPT format: multi-turn conversations
sharegpt_examples = [
    {
        "conversations": [
            {
                "from": "human",
                "value": "I have a pandas DataFrame with columns 'date', 'product', "
                         "and 'revenue'. How do I find the top 5 products by total revenue?"
            },
            {
                "from": "gpt",
                "value": "You can use groupby with sum, then sort and take the head:\n\n"
                         "```python\n"
                         "top_5 = (df.groupby('product')['revenue']\n"
                         "           .sum()\n"
                         "           .sort_values(ascending=False)\n"
                         "           .head(5))\n"
                         "```\n\n"
                         "This groups by product, sums the revenue for each, sorts in "
                         "descending order, and returns the top 5."
            },
            {
                "from": "human",
                "value": "What if I also want to see the number of transactions per product?"
            },
            {
                "from": "gpt",
                "value": "Use agg to compute multiple statistics at once:\n\n"
                         "```python\n"
                         "top_5 = (df.groupby('product')\n"
                         "           .agg(\n"
                         "               total_revenue=('revenue', 'sum'),\n"
                         "               num_transactions=('revenue', 'count')\n"
                         "           )\n"
                         "           .sort_values('total_revenue', ascending=False)\n"
                         "           .head(5))\n"
                         "```\n\n"
                         "This gives you both the total revenue and the transaction count "
                         "for each product in a single DataFrame."
            }
        ]
    }
]

# Save as JSONL
with open("train_sharegpt.jsonl", "w") as f:
    for example in sharegpt_examples:
        f.write(json.dumps(example) + "\n")

1.3 ChatML / Messages Format

The ChatML (Chat Markup Language) format uses the OpenAI-style messages array with explicit role labels. This has become the de facto standard for most modern training frameworks including TRL and is directly compatible with the OpenAI fine-tuning API.

# ChatML / Messages format (OpenAI-compatible)
chatml_examples = [
    {
        "messages": [
            {
                "role": "system",
                "content": "You are a medical coding assistant. Assign ICD-10 codes "
                           "to clinical descriptions. Always provide the code and a "
                           "brief justification."
            },
            {
                "role": "user",
                "content": "Patient presents with acute lower back pain radiating to "
                           "the left leg, worse with sitting. MRI shows L4-L5 disc "
                           "herniation with nerve root compression."
            },
            {
                "role": "assistant",
                "content": "ICD-10 Code: M51.16\n\n"
                           "Justification: M51.16 covers intervertebral disc "
                           "degeneration with radiculopathy in the lumbar region. The "
                           "radiating leg pain indicates radiculopathy, and the MRI "
                           "confirms L4-L5 disc herniation with nerve compression, "
                           "which falls under this code."
            }
        ]
    }
]

# Validate the format
def validate_chatml(examples):
    """Validate ChatML format for common issues."""
    issues = []
    for i, ex in enumerate(examples):
        messages = ex.get("messages", [])
        if not messages:
            issues.append(f"Example {i}: empty messages")
            continue

        # Check role sequence
        roles = [m["role"] for m in messages]
        if roles[-1] != "assistant":
            issues.append(f"Example {i}: last message must be from assistant")

        # Check for empty content
        for j, msg in enumerate(messages):
            if not msg.get("content", "").strip():
                issues.append(f"Example {i}, message {j}: empty content")

    return issues

issues = validate_chatml(chatml_examples)
print(f"Validation: {'PASSED' if not issues else issues}")
Validation: PASSED
Format             Multi-turn  System Prompt  Frameworks                Best For
Alpaca             No          No             Axolotl, LLaMA-Factory    Single-turn tasks
ShareGPT           Yes         Optional      Axolotl, LLaMA-Factory    Multi-turn chat data
ChatML / Messages  Yes         Yes           TRL, OpenAI API, Axolotl  Universal; production systems
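Conversion between these formats is mechanical. A minimal sketch (the role mapping below is an assumption based on the `from`/`value` keys shown above) that lifts ShareGPT conversations into the ChatML messages format:

```python
# Role mapping is an assumption based on the ShareGPT keys shown above.
ROLE_MAP = {"system": "system", "human": "user", "gpt": "assistant"}

def sharegpt_to_chatml(example: dict) -> dict:
    """Lift ShareGPT 'from'/'value' turns into OpenAI-style 'role'/'content' messages."""
    return {
        "messages": [
            {"role": ROLE_MAP[turn["from"]], "content": turn["value"]}
            for turn in example["conversations"]
        ]
    }

converted = sharegpt_to_chatml({
    "conversations": [
        {"from": "human", "value": "Hello!"},
        {"from": "gpt", "value": "Hi there!"},
    ]
})
print([m["role"] for m in converted["messages"]])  # ['user', 'assistant']
```

The reverse direction is just as simple, which is why ChatML makes a good canonical storage format: convert into it once, then export to whichever format a given trainer expects.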

2. Chat Templates

A chat template defines how structured messages (system, user, assistant) are converted into the raw token sequence that the model actually processes during training and inference. Getting the chat template right is critical: a mismatch between training and inference templates is one of the most common causes of poor fine-tuning results.

[Figure: structured messages (system: "You are helpful" / user: "Hello!" / assistant: "Hi there!") flow through apply_chat_template() into a ChatML token sequence delimited by <|im_start|> and <|im_end|> tokens]
Figure 13.4: Chat templates convert structured messages into the token format the model was trained to expect
from transformers import AutoTokenizer

# Load a tokenizer with its built-in chat template
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

messages = [
    {"role": "system", "content": "You are a helpful coding assistant."},
    {"role": "user", "content": "Write a Python function to reverse a string."},
    {"role": "assistant", "content": "def reverse_string(s):\n    return s[::-1]"}
]

# Apply the chat template to see the raw text
formatted = tokenizer.apply_chat_template(
    messages,
    tokenize=False,    # Return string, not token IDs
    add_generation_prompt=False  # Training: no trailing prompt
)
print(formatted)

# For training, we typically want tokenized output with labels
tokenized = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    return_dict=True,
    return_tensors="pt"
)
print(f"\nToken count: {tokenized['input_ids'].shape[1]}")
⚠ Warning

Template mismatch is a silent killer. If you fine-tune with one chat template but serve with another, the model will produce degraded output because the token patterns it learned do not match what it sees at inference time. Always verify that your training template matches your serving template by printing the formatted text and inspecting it manually before starting a training run.
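One lightweight guard is to render the same probe conversation through both the training-side and serving-side templates and compare the results exactly. The `chatml_render` function below is a hypothetical stand-in for `tokenizer.apply_chat_template` on each side:

```python
def templates_match(render_train, render_serve, probe_messages) -> bool:
    """Render one probe conversation through both templates and compare exactly."""
    return render_train(probe_messages) == render_serve(probe_messages)

# Hypothetical renderer standing in for tokenizer.apply_chat_template(...)
# on the training side and the serving side.
def chatml_render(messages):
    return "".join(
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    )

probe = [{"role": "user", "content": "ping"}]
print(templates_match(chatml_render, chatml_render, probe))  # True
```

Run a check like this in CI whenever the tokenizer, model, or serving stack is upgraded; template drift between library versions is easy to miss otherwise.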

3. Train / Validation / Test Splits

Proper data splitting prevents overfitting and gives you reliable signals about model quality. The standard approach for fine-tuning datasets differs from classical ML because fine-tuning datasets tend to be smaller and training is computationally expensive.

3.1 Recommended Split Ratios

Dataset Size       Train  Validation  Test   Notes
< 1,000            80%    10%         10%    Every example counts; consider cross-validation
1,000 to 10,000    85%    7.5%        7.5%   Standard ratio; good for most fine-tuning tasks
10,000 to 100,000  90%    5%          5%     Validation set of 500+ is sufficient for stable metrics
> 100,000          95%    2.5%        2.5%   Large validation sets waste compute; 2,500 examples is plenty
from datasets import Dataset, DatasetDict
from sklearn.model_selection import train_test_split
import json

def create_splits(
    data_path: str,
    train_ratio: float = 0.85,
    val_ratio: float = 0.075,
    test_ratio: float = 0.075,
    seed: int = 42,
    stratify_key: str = None
) -> DatasetDict:
    """Create train/validation/test splits from a JSONL file."""
    # Load data
    with open(data_path) as f:
        examples = [json.loads(line) for line in f]

    # Optional stratification (useful for classification tasks)
    stratify = None
    if stratify_key:
        stratify = [ex.get(stratify_key) for ex in examples]

    # First split: train vs. (val + test)
    train_data, remaining = train_test_split(
        examples,
        test_size=(val_ratio + test_ratio),
        random_state=seed,
        stratify=stratify
    )

    # Second split: val vs. test
    relative_test = test_ratio / (val_ratio + test_ratio)
    stratify_remaining = None
    if stratify_key:
        stratify_remaining = [ex.get(stratify_key) for ex in remaining]

    val_data, test_data = train_test_split(
        remaining,
        test_size=relative_test,
        random_state=seed,
        stratify=stratify_remaining
    )

    # Convert to HuggingFace Datasets
    splits = DatasetDict({
        "train": Dataset.from_list(train_data),
        "validation": Dataset.from_list(val_data),
        "test": Dataset.from_list(test_data),
    })

    print(f"Split sizes: train={len(train_data)}, "
          f"val={len(val_data)}, test={len(test_data)}")
    return splits

# Usage
# splits = create_splits("train_chatml.jsonl", stratify_key="task_type")
📝 Note

Avoid data leakage in multi-turn conversations. When splitting conversational data, ensure that all turns from a single conversation stay in the same split. Splitting individual turns across train and validation sets creates a form of data leakage where the model has seen partial conversations during training that it is then evaluated on.
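One way to enforce this is to split on a conversation identifier rather than on individual rows. A sketch, assuming each example carries a hypothetical `conversation_id` field:

```python
import random

def split_by_conversation(examples, conv_key="conversation_id",
                          val_ratio=0.1, seed=42):
    """Assign whole conversations to splits so no conversation straddles
    train and validation."""
    conv_ids = sorted({ex[conv_key] for ex in examples})
    rng = random.Random(seed)
    rng.shuffle(conv_ids)
    n_val = max(1, int(len(conv_ids) * val_ratio))
    val_ids = set(conv_ids[:n_val])
    train = [ex for ex in examples if ex[conv_key] not in val_ids]
    val = [ex for ex in examples if ex[conv_key] in val_ids]
    return train, val

# Ten turns spread over five conversations
examples = [{"conversation_id": i // 2, "turn": i % 2} for i in range(10)]
train, val = split_by_conversation(examples, val_ratio=0.2)
print(len(train), len(val))  # 8 2
```

The same group-aware idea applies to any correlated examples, such as multiple questions about the same document: split on the document, not the question.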

4. Data Mixing and Balancing

Real-world fine-tuning datasets often combine multiple tasks, topics, or data sources. How you mix these components significantly affects training outcomes. The goal is to create a balanced training signal that teaches the model all desired capabilities without any single category dominating the gradient updates.

4.1 Multi-Task Mixing

[Figure: raw sources (Summarization 12,000 · Classification 50,000 · Code Generation 8,000 · General Chat 5,000) are rebalanced to target ratios (Summarization 30%, Classification 25%, Code Gen 25%, General Chat 20%; ~10K per category) and fully shuffled into ~40K training examples with no task clustering]
Figure 13.5: Data mixing balances task categories to prevent any single task from dominating training
from datasets import Dataset, concatenate_datasets
import random

def mix_datasets(
    datasets: dict,          # {"task_name": Dataset}
    target_ratios: dict,     # {"task_name": 0.3}
    total_size: int = None,  # Target total size
    seed: int = 42
) -> Dataset:
    """Mix multiple datasets with specified ratios."""
    random.seed(seed)

    if total_size is None:
        # Default: the largest total such that no dataset must be oversampled,
        # i.e. the smallest len(dataset) / ratio across tasks (looked up by
        # name so the two dicts need not share key order)
        min_effective = min(
            len(ds) / target_ratios[name]
            for name, ds in datasets.items()
        )
        total_size = int(min_effective)

    mixed_parts = []
    for task_name, dataset in datasets.items():
        ratio = target_ratios[task_name]
        n_samples = int(total_size * ratio)

        if n_samples <= len(dataset):
            # Subsample
            indices = random.sample(range(len(dataset)), n_samples)
            sampled = dataset.select(indices)
        else:
            # Oversample (repeat with shuffling)
            repeats = n_samples // len(dataset) + 1
            indices = list(range(len(dataset))) * repeats
            random.shuffle(indices)
            sampled = dataset.select(indices[:n_samples])

        mixed_parts.append(sampled)
        print(f"  {task_name}: {len(dataset)} -> {n_samples} samples ({ratio:.0%})")

    # Concatenate and shuffle
    combined = concatenate_datasets(mixed_parts)
    combined = combined.shuffle(seed=seed)

    print(f"\nTotal mixed dataset: {len(combined)} examples")
    return combined
🔑 Key Insight

Upsample rare tasks; downsample common ones. If you have 50,000 classification examples but only 5,000 summarization examples, training on the raw distribution will bias the model heavily toward classification. Use square-root sampling or explicit ratio targets to ensure all tasks receive adequate representation. A common heuristic is to take the square root of each category's count and normalize to get sampling probabilities.
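The square-root heuristic can be sketched in a few lines:

```python
import math

def sqrt_sampling_ratios(counts: dict) -> dict:
    """Weight each category by sqrt(count), then normalize to probabilities."""
    weights = {name: math.sqrt(n) for name, n in counts.items()}
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}

ratios = sqrt_sampling_ratios({"classification": 50_000, "summarization": 5_000})
for name, r in ratios.items():
    print(f"{name}: {r:.0%}")
# classification: 76%
# summarization: 24%
```

Compared with the raw 91%/9% distribution, square-root sampling dampens the imbalance without erasing it, so common tasks still receive proportionally more gradient signal.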

5. Sequence Packing

By default, training batches pad all sequences to the length of the longest sequence in the batch. This wastes significant compute when your dataset contains sequences of varying lengths. Sequence packing solves this by concatenating multiple short examples into a single sequence of the target length, separated by special tokens.

5.1 Why Packing Matters

Consider a dataset where the average sequence length is 256 tokens but the maximum is 2,048. Without packing, every batch pads all sequences to 2,048 tokens, meaning 87.5% of the compute is wasted on padding tokens. With packing, you fit roughly 8 short examples into a single 2,048-token sequence, achieving near-100% GPU utilization.
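The arithmetic behind those numbers:

```python
# Average vs. maximum sequence length from the example above
avg_len, max_len = 256, 2048

# Without packing: every sequence is padded to max_len
padding_waste = 1 - avg_len / max_len
print(f"{padding_waste:.1%}")   # 87.5%

# With packing: how many average-length examples fit in one sequence
examples_per_pack = max_len // avg_len
print(examples_per_pack)        # 8
```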

from typing import List, Dict
import numpy as np

def pack_sequences(
    examples: List[Dict],
    tokenizer,
    max_seq_length: int = 2048,
    pad_token_id: int = None
) -> List[Dict]:
    """Pack multiple examples into fixed-length sequences."""
    if pad_token_id is None:
        pad_token_id = tokenizer.pad_token_id or tokenizer.eos_token_id

    packed = []
    current_input_ids = []
    current_attention_mask = []
    current_labels = []

    for example in examples:
        tokens = tokenizer(
            example["text"],
            truncation=True,
            max_length=max_seq_length,
            add_special_tokens=True
        )
        example_ids = tokens["input_ids"]

        # Check if adding this example would exceed max length
        if len(current_input_ids) + len(example_ids) > max_seq_length:
            # Pad the current sequence and save it
            pad_length = max_seq_length - len(current_input_ids)
            current_input_ids.extend([pad_token_id] * pad_length)
            current_attention_mask.extend([0] * pad_length)
            current_labels.extend([-100] * pad_length)

            packed.append({
                "input_ids": current_input_ids,
                "attention_mask": current_attention_mask,
                "labels": current_labels
            })

            # Start a new packed sequence
            current_input_ids = []
            current_attention_mask = []
            current_labels = []

        # Add this example to the current sequence
        current_input_ids.extend(example_ids)
        current_attention_mask.extend([1] * len(example_ids))
        current_labels.extend(example_ids)  # Causal LM: labels = input_ids

    # Save the last sequence if non-empty
    if current_input_ids:
        pad_length = max_seq_length - len(current_input_ids)
        current_input_ids.extend([pad_token_id] * pad_length)
        current_attention_mask.extend([0] * pad_length)
        current_labels.extend([-100] * pad_length)
        packed.append({
            "input_ids": current_input_ids,
            "attention_mask": current_attention_mask,
            "labels": current_labels
        })

    return packed

# Calculate efficiency improvement
def packing_efficiency(lengths: List[int], max_length: int) -> dict:
    """Compare padding waste vs. packing efficiency."""
    # Without packing: pad each to max_length
    padded_tokens = len(lengths) * max_length
    useful_tokens_padded = sum(lengths)
    pad_efficiency = useful_tokens_padded / padded_tokens

    # With packing: fit multiple examples per sequence
    packed_sequences = 0
    current_length = 0
    for length in sorted(lengths):
        if current_length + length > max_length:
            packed_sequences += 1
            current_length = 0
        current_length += length
    if current_length > 0:
        packed_sequences += 1

    packed_tokens = packed_sequences * max_length
    pack_efficiency = useful_tokens_padded / packed_tokens

    return {
        "sequences_without_packing": len(lengths),
        "sequences_with_packing": packed_sequences,
        "efficiency_without_packing": f"{pad_efficiency:.1%}",
        "efficiency_with_packing": f"{pack_efficiency:.1%}",
        "speedup": f"{len(lengths) / packed_sequences:.1f}x"
    }

# Example with realistic distribution
np.random.seed(42)
lengths = np.random.lognormal(mean=5.5, sigma=0.8, size=10000).astype(int)
lengths = np.clip(lengths, 50, 2048)
result = packing_efficiency(lengths.tolist(), max_length=2048)
for k, v in result.items():
    print(f"  {k}: {v}")
  sequences_without_packing: 10000
  sequences_with_packing: 1702
  efficiency_without_packing: 16.6%
  efficiency_with_packing: 97.5%
  speedup: 5.9x
📝 Note

TRL handles packing automatically. When using TRL's SFTTrainer, set packing=True in the SFTConfig to enable automatic sequence packing. The trainer will concatenate examples with EOS token separators and handle the attention mask correctly so that examples do not attend to each other within a packed sequence.
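A minimal configuration sketch (requires `trl` installed; argument names such as `max_length`, formerly `max_seq_length`, have shifted across TRL versions, so verify against your installed version's docs):

```python
# Sketch only: check your TRL version's documentation for exact argument names.
from trl import SFTConfig, SFTTrainer

config = SFTConfig(
    output_dir="sft-packed",
    packing=True,      # concatenate short examples up to the target length
    max_length=2048,   # packed sequence length
)
# trainer = SFTTrainer(model=model, args=config, train_dataset=train_ds)
# trainer.train()
```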

6. Data Quality Checklist

Before starting any fine-tuning run, walk through this checklist to catch common data issues that lead to poor training outcomes.

def data_quality_audit(dataset, tokenizer, max_seq_length=2048):
    """Run a comprehensive data quality audit before training."""
    report = {
        "total_examples": len(dataset),
        "issues": [],
        "warnings": [],
        "stats": {}
    }

    lengths = []
    empty_count = 0
    duplicate_count = 0
    seen_hashes = set()

    for i, example in enumerate(dataset):
        messages = example.get("messages", [])

        # Check for empty messages
        for msg in messages:
            if not msg.get("content", "").strip():
                empty_count += 1

        # Check for duplicates (hash-based)
        content_hash = hash(str(messages))
        if content_hash in seen_hashes:
            duplicate_count += 1
        seen_hashes.add(content_hash)

        # Tokenize and check length
        text = tokenizer.apply_chat_template(messages, tokenize=False)
        tokens = tokenizer(text)["input_ids"]
        lengths.append(len(tokens))

        # Check for truncation
        if len(tokens) > max_seq_length:
            report["warnings"].append(
                f"Example {i}: {len(tokens)} tokens (will be truncated)"
            )

    # Summary statistics
    import numpy as np
    lengths = np.array(lengths)
    report["stats"] = {
        "mean_length": f"{lengths.mean():.0f}",
        "median_length": f"{np.median(lengths):.0f}",
        "p95_length": f"{np.percentile(lengths, 95):.0f}",
        "max_length": f"{lengths.max():.0f}",
        "truncated_pct": f"{(lengths > max_seq_length).mean():.1%}",
        "empty_messages": empty_count,
        "duplicates": duplicate_count,
    }

    # Flag issues
    if empty_count > 0:
        report["issues"].append(f"{empty_count} empty messages found")
    if duplicate_count > len(dataset) * 0.05:
        report["issues"].append(f"{duplicate_count} duplicates ({duplicate_count/len(dataset):.1%})")
    if (lengths > max_seq_length).mean() > 0.1:
        report["issues"].append(f"{(lengths > max_seq_length).mean():.1%} examples will be truncated")

    return report
⚠ Warning

Garbage in, garbage out. No amount of hyperparameter tuning or clever training tricks can compensate for low-quality training data. Invest time in data cleaning, deduplication, and manual review of a random sample before each training run. A one-hour manual review of 100 random examples will often reveal systematic issues (incorrect labels, formatting inconsistencies, truncated responses) that would otherwise waste days of training compute.
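A reproducible sampling helper makes that manual review easy to repeat across dataset revisions. A sketch, assuming JSONL input as produced earlier in this section:

```python
import json
import random

def sample_for_review(path: str, n: int = 100, seed: int = 0) -> list:
    """Draw a reproducible random sample from a JSONL file for manual review."""
    with open(path) as f:
        examples = [json.loads(line) for line in f]
    rng = random.Random(seed)
    return rng.sample(examples, min(n, len(examples)))

# Usage: skim each sampled example by hand before launching training
# for ex in sample_for_review("train_chatml.jsonl", n=100):
#     print(json.dumps(ex, indent=2)[:500])
```

Fixing the seed means a second reviewer, or a re-run after cleaning, sees the same sample, which makes before/after comparisons meaningful.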

Section 13.2 Quiz

1. What is the key difference between Alpaca format and ShareGPT format?
Show Answer
Alpaca format supports only single-turn interactions with an instruction, optional input, and output. ShareGPT format supports multi-turn conversations as a list of message objects with alternating roles (human/gpt). ShareGPT is better for training chat models that need to handle back-and-forth dialogue.
2. Why is it critical that the chat template used during training matches the one used at inference time?
Show Answer
The chat template determines the exact token sequence the model processes. If the training template uses different special tokens, role markers, or formatting than the inference template, the model encounters token patterns at inference time that it never learned during training. This mismatch degrades output quality, often causing the model to generate garbled or repetitive text.
3. A dataset has 50,000 classification examples and 2,000 summarization examples. What mixing strategy would you recommend?
Show Answer
Use square-root sampling or explicit ratio targets to balance the categories. With square-root sampling, classification would get sqrt(50000) ≈ 224 relative weight and summarization sqrt(2000) ≈ 45. Normalizing: classification gets 83% and summarization 17%. Alternatively, set explicit ratios like 60/40 or 50/50. The key is to avoid the raw 96%/4% distribution, which would cause the model to barely learn summarization.
4. What speedup does sequence packing typically provide, and how does it work?
Show Answer
Sequence packing typically provides a 3x to 8x speedup by concatenating multiple short examples into single fixed-length sequences, separated by EOS tokens. Without packing, short examples are padded to the maximum sequence length, wasting compute on padding tokens. With packing, GPU utilization approaches 100% because nearly every token in the batch is a meaningful training signal rather than padding.
5. What are the three most important data quality checks to perform before starting a fine-tuning run?
Show Answer
The three most critical checks are: (1) Deduplication, as duplicate examples cause the model to memorize rather than generalize. (2) Length analysis, to understand how many examples will be truncated at the chosen max_seq_length and adjust accordingly. (3) Manual review of a random sample (at least 50 to 100 examples) to catch systematic issues like incorrect labels, formatting problems, or low-quality responses that automated checks miss.

Key Takeaways