Data quality is the single biggest lever in fine-tuning. A model fine-tuned on 1,000 high-quality examples will almost always outperform one fine-tuned on 10,000 noisy examples. This section covers the practical details of preparing training data: the standard dataset formats that tools expect, how chat templates control what the model actually sees during training, strategies for splitting and balancing your data, and sequence packing to maximize GPU utilization.
## 1. Standard Dataset Formats
The fine-tuning ecosystem has converged on a handful of standard formats. Each training framework (TRL, Axolotl, LLaMA-Factory) expects data in one of these formats. Understanding the differences helps you prepare data that works across tools without painful conversion steps.
### 1.1 Alpaca Format
The Alpaca format, introduced by Stanford's Alpaca project, is the simplest instruction-tuning format. Each example consists of an instruction, an optional input (context), and the expected output. This format works well for single-turn tasks like summarization, translation, and question answering.
```python
# Alpaca format: instruction, input (optional), output
alpaca_examples = [
    {
        "instruction": "Summarize the following research paper abstract.",
        "input": "We present a novel approach to protein structure prediction "
                 "using graph neural networks. Our method achieves state-of-the-art "
                 "results on CASP14 benchmarks, improving GDT-TS scores by 12% over "
                 "existing methods. We demonstrate that incorporating evolutionary "
                 "coupling information as edge features significantly enhances "
                 "prediction accuracy for proteins with low sequence identity.",
        "output": "The paper introduces a graph neural network approach for protein "
                  "structure prediction that achieves a 12% improvement in GDT-TS "
                  "scores on CASP14 benchmarks. The key innovation is using "
                  "evolutionary coupling information as edge features, which "
                  "particularly helps with low-sequence-identity proteins."
    },
    {
        "instruction": "Classify the sentiment of this product review.",
        "input": "The battery life is incredible but the screen is too dim outdoors.",
        "output": "Mixed sentiment. Positive: battery life. Negative: screen brightness."
    },
    {
        # No input field: instruction-only format
        "instruction": "What are three common causes of overfitting in neural networks?",
        "output": "1. Training for too many epochs without early stopping.\n"
                  "2. Using a model with excessive capacity relative to the dataset size.\n"
                  "3. Insufficient regularization (no dropout, weight decay, or data augmentation)."
    }
]

# Save as JSONL (one JSON object per line)
import json

with open("train_alpaca.jsonl", "w") as f:
    for example in alpaca_examples:
        f.write(json.dumps(example) + "\n")
```
### 1.2 ShareGPT / Conversational Format
The ShareGPT format supports multi-turn conversations, making it the standard for training chat models. Each example contains a list of conversation turns with alternating roles (human/assistant or user/assistant). This format naturally captures the back-and-forth dynamics of real conversations.
```python
# ShareGPT format: multi-turn conversations
sharegpt_examples = [
    {
        "conversations": [
            {
                "from": "human",
                "value": "I have a pandas DataFrame with columns 'date', 'product', "
                         "and 'revenue'. How do I find the top 5 products by total revenue?"
            },
            {
                "from": "gpt",
                "value": "You can use groupby with sum, then sort and take the head:\n\n"
                         "```python\n"
                         "top_5 = (df.groupby('product')['revenue']\n"
                         "           .sum()\n"
                         "           .sort_values(ascending=False)\n"
                         "           .head(5))\n"
                         "```\n\n"
                         "This groups by product, sums the revenue for each, sorts in "
                         "descending order, and returns the top 5."
            },
            {
                "from": "human",
                "value": "What if I also want to see the number of transactions per product?"
            },
            {
                "from": "gpt",
                "value": "Use agg to compute multiple statistics at once:\n\n"
                         "```python\n"
                         "top_5 = (df.groupby('product')\n"
                         "           .agg(\n"
                         "               total_revenue=('revenue', 'sum'),\n"
                         "               num_transactions=('revenue', 'count')\n"
                         "           )\n"
                         "           .sort_values('total_revenue', ascending=False)\n"
                         "           .head(5))\n"
                         "```\n\n"
                         "This gives you both the total revenue and the transaction count "
                         "for each product in a single DataFrame."
            }
        ]
    }
]

# Save as JSONL
import json

with open("train_sharegpt.jsonl", "w") as f:
    for example in sharegpt_examples:
        f.write(json.dumps(example) + "\n")
```
### 1.3 ChatML / Messages Format
The ChatML (Chat Markup Language) format uses the OpenAI-style messages array with explicit role labels. This has become the de facto standard for most modern training frameworks including TRL and is directly compatible with the OpenAI fine-tuning API.
```python
# ChatML / Messages format (OpenAI-compatible)
chatml_examples = [
    {
        "messages": [
            {
                "role": "system",
                "content": "You are a medical coding assistant. Assign ICD-10 codes "
                           "to clinical descriptions. Always provide the code and a "
                           "brief justification."
            },
            {
                "role": "user",
                "content": "Patient presents with acute lower back pain radiating to "
                           "the left leg, worse with sitting. MRI shows L4-L5 disc "
                           "herniation with nerve root compression."
            },
            {
                "role": "assistant",
                "content": "ICD-10 Code: M51.16\n\n"
                           "Justification: M51.16 covers intervertebral disc "
                           "degeneration with radiculopathy in the lumbar region. The "
                           "radiating leg pain indicates radiculopathy, and the MRI "
                           "confirms L4-L5 disc herniation with nerve compression, "
                           "which falls under this code."
            }
        ]
    }
]

# Validate the format
def validate_chatml(examples):
    """Validate ChatML format for common issues."""
    issues = []
    for i, ex in enumerate(examples):
        messages = ex.get("messages", [])
        if not messages:
            issues.append(f"Example {i}: empty messages")
            continue
        # Check role sequence
        roles = [m["role"] for m in messages]
        if roles[-1] != "assistant":
            issues.append(f"Example {i}: last message must be from assistant")
        # Check for empty content
        for j, msg in enumerate(messages):
            if not msg.get("content", "").strip():
                issues.append(f"Example {i}, message {j}: empty content")
    return issues

issues = validate_chatml(chatml_examples)
print(f"Validation: {'PASSED' if not issues else issues}")
```
| Format | Multi-turn | System Prompt | Frameworks | Best For |
|---|---|---|---|---|
| Alpaca | No | No | Axolotl, LLaMA-Factory | Single-turn tasks |
| ShareGPT | Yes | Optional | Axolotl, LLaMA-Factory | Multi-turn chat data |
| ChatML / Messages | Yes | Yes | TRL, OpenAI API, Axolotl | Universal; production systems |
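Because the formats share the same underlying information, conversion between them is mechanical. The sketch below maps an Alpaca record to the messages format; `alpaca_to_messages` is an illustrative helper written for this section, not part of any framework, and it folds the optional `input` field into the user turn since the messages format has no separate context slot.

```python
def alpaca_to_messages(example: dict) -> dict:
    """Convert one Alpaca record to the ChatML/messages format."""
    user_content = example["instruction"]
    if example.get("input"):
        # Append the context below the instruction
        user_content += "\n\n" + example["input"]
    return {
        "messages": [
            {"role": "user", "content": user_content},
            {"role": "assistant", "content": example["output"]},
        ]
    }

converted = alpaca_to_messages({
    "instruction": "Classify the sentiment of this product review.",
    "input": "The battery life is incredible but the screen is too dim outdoors.",
    "output": "Mixed sentiment. Positive: battery life. Negative: screen brightness.",
})
```

Going the other direction (messages to Alpaca) is lossy for multi-turn conversations, which is one reason the messages format is the safer default.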
## 2. Chat Templates
A chat template defines how structured messages (system, user, assistant) are converted into the raw token sequence that the model actually processes during training and inference. Getting the chat template right is critical: a mismatch between training and inference templates is one of the most common causes of poor fine-tuning results.
```python
from transformers import AutoTokenizer

# Load a tokenizer with its built-in chat template
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

messages = [
    {"role": "system", "content": "You are a helpful coding assistant."},
    {"role": "user", "content": "Write a Python function to reverse a string."},
    {"role": "assistant", "content": "def reverse_string(s):\n    return s[::-1]"}
]

# Apply the chat template to see the raw text
formatted = tokenizer.apply_chat_template(
    messages,
    tokenize=False,              # Return string, not token IDs
    add_generation_prompt=False  # Training: no trailing prompt
)
print(formatted)

# For training, we typically want tokenized output with labels
tokenized = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    return_dict=True,
    return_tensors="pt"
)
print(f"\nToken count: {tokenized['input_ids'].shape[1]}")
```
Template mismatch is a silent killer. If you fine-tune with one chat template but serve with another, the model will produce degraded output because the token patterns it learned do not match what it sees at inference time. Always verify that your training template matches your serving template by printing the formatted text and inspecting it manually before starting a training run.
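One cheap sanity check captures the core invariant: the serving-time prompt (the template applied with a generation prompt, before the assistant reply) must be an exact prefix of the training-time text. The template strings below are simplified stand-ins for illustration; in practice you would compare the two outputs of `apply_chat_template` for your actual tokenizer.

```python
def templates_match(train_text: str, serve_prompt: str) -> bool:
    """The serving prompt must be a prefix of the training text; otherwise
    the model sees token patterns at inference it never saw in training."""
    return train_text.startswith(serve_prompt)

# Simplified ChatML-style stand-ins (illustrative, not a real model's template)
serve_prompt = "<|user|>\nHi\n<|assistant|>\n"
train_text = serve_prompt + "Hello!\n<|end|>"
assert templates_match(train_text, serve_prompt)

# A serving template with an extra system header fails the check
bad_prompt = "<|system|>\n\n<|user|>\nHi\n<|assistant|>\n"
assert not templates_match(train_text, bad_prompt)
```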
## 3. Train / Validation / Test Splits
Proper data splitting prevents overfitting and gives you reliable signals about model quality. The standard approach for fine-tuning datasets differs from classical ML because fine-tuning datasets tend to be smaller and training is computationally expensive.
### 3.1 Recommended Split Ratios
| Dataset Size | Train | Validation | Test | Notes |
|---|---|---|---|---|
| < 1,000 | 80% | 10% | 10% | Every example counts; consider cross-validation |
| 1,000 to 10,000 | 85% | 7.5% | 7.5% | Standard ratio; good for most fine-tuning tasks |
| 10,000 to 100,000 | 90% | 5% | 5% | Validation set of 500+ is sufficient for stable metrics |
| > 100,000 | 95% | 2.5% | 2.5% | Large validation sets waste compute; 2,500 examples is plenty |
```python
from typing import Optional

from datasets import Dataset, DatasetDict
from sklearn.model_selection import train_test_split
import json

def create_splits(
    data_path: str,
    train_ratio: float = 0.85,
    val_ratio: float = 0.075,
    test_ratio: float = 0.075,
    seed: int = 42,
    stratify_key: Optional[str] = None
) -> DatasetDict:
    """Create train/validation/test splits from a JSONL file."""
    # Load data
    with open(data_path) as f:
        examples = [json.loads(line) for line in f]

    # Optional stratification (useful for classification tasks)
    stratify = None
    if stratify_key:
        stratify = [ex.get(stratify_key) for ex in examples]

    # First split: train vs. (val + test)
    train_data, remaining = train_test_split(
        examples,
        test_size=(val_ratio + test_ratio),
        random_state=seed,
        stratify=stratify
    )

    # Second split: val vs. test
    relative_test = test_ratio / (val_ratio + test_ratio)
    stratify_remaining = None
    if stratify_key:
        stratify_remaining = [ex.get(stratify_key) for ex in remaining]
    val_data, test_data = train_test_split(
        remaining,
        test_size=relative_test,
        random_state=seed,
        stratify=stratify_remaining
    )

    # Convert to HuggingFace Datasets
    splits = DatasetDict({
        "train": Dataset.from_list(train_data),
        "validation": Dataset.from_list(val_data),
        "test": Dataset.from_list(test_data),
    })
    print(f"Split sizes: train={len(train_data)}, "
          f"val={len(val_data)}, test={len(test_data)}")
    return splits

# Usage
# splits = create_splits("train_chatml.jsonl", stratify_key="task_type")
```
Avoid data leakage in multi-turn conversations. When splitting conversational data, ensure that all turns from a single conversation stay in the same split. Splitting individual turns across train and validation sets creates a form of data leakage where the model has seen partial conversations during training that it is then evaluated on.
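A simple way to enforce this is to split the set of conversation identifiers rather than the records themselves. The sketch below assumes each record carries a `conversation_id` field (a name chosen here for illustration):

```python
import random

def split_by_conversation(examples, val_frac=0.1, seed=42):
    """Split records so all turns of a conversation land in the same split."""
    conv_ids = sorted({ex["conversation_id"] for ex in examples})
    rng = random.Random(seed)
    rng.shuffle(conv_ids)
    n_val = max(1, int(len(conv_ids) * val_frac))
    val_ids = set(conv_ids[:n_val])
    train = [ex for ex in examples if ex["conversation_id"] not in val_ids]
    val = [ex for ex in examples if ex["conversation_id"] in val_ids]
    return train, val

# 10 conversations of 3 turns each
data = [{"conversation_id": i // 3, "turn": i % 3} for i in range(30)]
train, val = split_by_conversation(data)
# No conversation appears in both splits
assert {ex["conversation_id"] for ex in train}.isdisjoint(
    {ex["conversation_id"] for ex in val})
```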
## 4. Data Mixing and Balancing
Real-world fine-tuning datasets often combine multiple tasks, topics, or data sources. How you mix these components significantly affects training outcomes. The goal is to create a balanced training signal that teaches the model all desired capabilities without any single category dominating the gradient updates.
### 4.1 Multi-Task Mixing
```python
from typing import Optional

from datasets import Dataset, concatenate_datasets
import random

def mix_datasets(
    datasets: dict,               # {"task_name": Dataset}
    target_ratios: dict,          # {"task_name": 0.3}
    total_size: Optional[int] = None,  # Target total size
    seed: int = 42
) -> Dataset:
    """Mix multiple datasets with specified ratios."""
    random.seed(seed)
    if total_size is None:
        # Default: the largest total that the smallest dataset (relative
        # to its target ratio) can support without oversampling
        min_effective = min(
            len(ds) / target_ratios[name]
            for name, ds in datasets.items()
        )
        total_size = int(min_effective)

    mixed_parts = []
    for task_name, dataset in datasets.items():
        ratio = target_ratios[task_name]
        n_samples = int(total_size * ratio)
        if n_samples <= len(dataset):
            # Subsample
            indices = random.sample(range(len(dataset)), n_samples)
            sampled = dataset.select(indices)
        else:
            # Oversample (repeat with shuffling)
            repeats = n_samples // len(dataset) + 1
            indices = list(range(len(dataset))) * repeats
            random.shuffle(indices)
            sampled = dataset.select(indices[:n_samples])
        mixed_parts.append(sampled)
        print(f"  {task_name}: {len(dataset)} -> {n_samples} samples ({ratio:.0%})")

    # Concatenate and shuffle
    combined = concatenate_datasets(mixed_parts)
    combined = combined.shuffle(seed=seed)
    print(f"\nTotal mixed dataset: {len(combined)} examples")
    return combined
```
Upsample rare tasks; downsample common ones. If you have 50,000 classification examples but only 5,000 summarization examples, training on the raw distribution will bias the model heavily toward classification. Use square-root sampling or explicit ratio targets to ensure all tasks receive adequate representation. A common heuristic is to take the square root of each category's count and normalize to get sampling probabilities.
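That heuristic is a few lines of arithmetic. The category counts below are made up for illustration:

```python
import math

def sqrt_sampling_probs(counts: dict) -> dict:
    """Square-root sampling: p_i = sqrt(n_i) / sum_j sqrt(n_j).

    Compresses the gap between large and small categories while still
    giving larger categories somewhat more weight."""
    roots = {task: math.sqrt(n) for task, n in counts.items()}
    total = sum(roots.values())
    return {task: r / total for task, r in roots.items()}

probs = sqrt_sampling_probs({"classification": 50_000, "summarization": 5_000})
# The raw distribution would give classification ~91% of samples;
# square-root sampling brings it down to roughly 76%.
for task, p in probs.items():
    print(f"{task}: {p:.1%}")
```

The resulting probabilities can be fed straight into `mix_datasets` above as `target_ratios`.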
## 5. Sequence Packing
By default, training batches pad all sequences to the length of the longest sequence in the batch. This wastes significant compute when your dataset contains sequences of varying lengths. Sequence packing solves this by concatenating multiple short examples into a single sequence of the target length, separated by special tokens.
### 5.1 Why Packing Matters
Consider a dataset where the average sequence length is 256 tokens but the maximum is 2,048. Without packing, every batch pads all sequences to 2,048 tokens, meaning 87% of the compute is wasted on padding tokens. With packing, you fit roughly 8 short examples into a single 2,048-token sequence, achieving near-100% GPU utilization.
```python
from typing import Dict, List, Optional
import numpy as np

def pack_sequences(
    examples: List[Dict],
    tokenizer,
    max_seq_length: int = 2048,
    pad_token_id: Optional[int] = None
) -> List[Dict]:
    """Pack multiple examples into fixed-length sequences."""
    if pad_token_id is None:
        pad_token_id = tokenizer.pad_token_id or tokenizer.eos_token_id

    packed = []
    current_input_ids = []
    current_attention_mask = []
    current_labels = []

    for example in examples:
        tokens = tokenizer(
            example["text"],
            truncation=True,
            max_length=max_seq_length,
            add_special_tokens=True
        )
        example_ids = tokens["input_ids"]

        # Check if adding this example would exceed max length
        if len(current_input_ids) + len(example_ids) > max_seq_length:
            # Pad the current sequence and save it
            pad_length = max_seq_length - len(current_input_ids)
            current_input_ids.extend([pad_token_id] * pad_length)
            current_attention_mask.extend([0] * pad_length)
            current_labels.extend([-100] * pad_length)
            packed.append({
                "input_ids": current_input_ids,
                "attention_mask": current_attention_mask,
                "labels": current_labels
            })
            # Start a new packed sequence
            current_input_ids = []
            current_attention_mask = []
            current_labels = []

        # Add this example to the current sequence
        current_input_ids.extend(example_ids)
        current_attention_mask.extend([1] * len(example_ids))
        current_labels.extend(example_ids)  # Causal LM: labels = input_ids

    # Save the last sequence if non-empty
    if current_input_ids:
        pad_length = max_seq_length - len(current_input_ids)
        current_input_ids.extend([pad_token_id] * pad_length)
        current_attention_mask.extend([0] * pad_length)
        current_labels.extend([-100] * pad_length)
        packed.append({
            "input_ids": current_input_ids,
            "attention_mask": current_attention_mask,
            "labels": current_labels
        })
    return packed

# Calculate efficiency improvement
def packing_efficiency(lengths: List[int], max_length: int) -> dict:
    """Compare padding waste vs. packing efficiency."""
    # Without packing: pad each sequence to max_length
    padded_tokens = len(lengths) * max_length
    useful_tokens = sum(lengths)
    pad_efficiency = useful_tokens / padded_tokens

    # With packing: greedily fit multiple examples per sequence
    packed_sequences = 0
    current_length = 0
    for length in sorted(lengths):
        if current_length + length > max_length:
            packed_sequences += 1
            current_length = 0
        current_length += length
    if current_length > 0:
        packed_sequences += 1
    packed_tokens = packed_sequences * max_length
    pack_efficiency = useful_tokens / packed_tokens

    return {
        "sequences_without_packing": len(lengths),
        "sequences_with_packing": packed_sequences,
        "efficiency_without_packing": f"{pad_efficiency:.1%}",
        "efficiency_with_packing": f"{pack_efficiency:.1%}",
        "speedup": f"{len(lengths) / packed_sequences:.1f}x"
    }

# Example with a realistic length distribution
np.random.seed(42)
lengths = np.random.lognormal(mean=5.5, sigma=0.8, size=10000).astype(int)
lengths = np.clip(lengths, 50, 2048)
result = packing_efficiency(lengths.tolist(), max_length=2048)
for k, v in result.items():
    print(f"  {k}: {v}")
```
TRL handles packing automatically. When using TRL's SFTTrainer, set packing=True in the SFTConfig to enable automatic sequence packing. The trainer will concatenate examples with EOS token separators and handle the attention mask correctly so that examples do not attend to each other within a packed sequence.
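In configuration terms this is a single flag. A minimal sketch, assuming a recent TRL version where `SFTConfig` exposes `packing` (check your installed version's signature before relying on it):

```python
from trl import SFTConfig, SFTTrainer

config = SFTConfig(
    output_dir="out",
    packing=True,  # concatenate short examples into full-length sequences
)
# trainer = SFTTrainer(model=model, args=config, train_dataset=dataset)
# trainer.train()
```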
## 6. Data Quality Checklist
Before starting any fine-tuning run, walk through this checklist to catch common data issues that lead to poor training outcomes.
```python
import numpy as np

def data_quality_audit(dataset, tokenizer, max_seq_length=2048):
    """Run a comprehensive data quality audit before training."""
    report = {
        "total_examples": len(dataset),
        "issues": [],
        "warnings": [],
        "stats": {}
    }
    lengths = []
    empty_count = 0
    duplicate_count = 0
    seen_hashes = set()

    for i, example in enumerate(dataset):
        messages = example.get("messages", [])

        # Check for empty messages
        for msg in messages:
            if not msg.get("content", "").strip():
                empty_count += 1

        # Check for duplicates (hash-based)
        content_hash = hash(str(messages))
        if content_hash in seen_hashes:
            duplicate_count += 1
        seen_hashes.add(content_hash)

        # Tokenize and check length
        text = tokenizer.apply_chat_template(messages, tokenize=False)
        tokens = tokenizer(text)["input_ids"]
        lengths.append(len(tokens))

        # Check for truncation
        if len(tokens) > max_seq_length:
            report["warnings"].append(
                f"Example {i}: {len(tokens)} tokens (will be truncated)"
            )

    # Summary statistics
    lengths = np.array(lengths)
    report["stats"] = {
        "mean_length": f"{lengths.mean():.0f}",
        "median_length": f"{np.median(lengths):.0f}",
        "p95_length": f"{np.percentile(lengths, 95):.0f}",
        "max_length": f"{lengths.max():.0f}",
        "truncated_pct": f"{(lengths > max_seq_length).mean():.1%}",
        "empty_messages": empty_count,
        "duplicates": duplicate_count,
    }

    # Flag issues
    if empty_count > 0:
        report["issues"].append(f"{empty_count} empty messages found")
    if duplicate_count > len(dataset) * 0.05:
        report["issues"].append(
            f"{duplicate_count} duplicates ({duplicate_count / len(dataset):.1%})"
        )
    if (lengths > max_seq_length).mean() > 0.1:
        report["issues"].append(
            f"{(lengths > max_seq_length).mean():.1%} of examples will be truncated"
        )
    return report
```
Garbage in, garbage out. No amount of hyperparameter tuning or clever training tricks can compensate for low-quality training data. Invest time in data cleaning, deduplication, and manual review of a random sample before each training run. A one-hour manual review of 100 random examples will often reveal systematic issues (incorrect labels, formatting inconsistencies, truncated responses) that would otherwise waste days of training compute.
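The review pass itself needs no tooling beyond a reproducible random sample written somewhere readable. A minimal sketch (the output file name is arbitrary):

```python
import json
import random

def sample_for_review(examples, n=100, seed=0, path="review_sample.jsonl"):
    """Write a reproducible random sample to a JSONL file for manual review."""
    rng = random.Random(seed)
    sample = rng.sample(examples, min(n, len(examples)))
    with open(path, "w") as f:
        for ex in sample:
            f.write(json.dumps(ex) + "\n")
    return sample

# Illustrative data: 500 synthetic examples
data = [{"messages": [{"role": "user", "content": f"q{i}"}]} for i in range(500)]
sample = sample_for_review(data, n=100)
print(f"Wrote {len(sample)} examples for review")
```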
## Section 13.2 Quiz
## Key Takeaways
- Use ChatML/messages format as the default for new projects; it is the most widely supported across training frameworks and provider APIs.
- Chat templates are critical: always verify that training and inference templates match by printing and inspecting the formatted text.
- Balance multi-task datasets using square-root sampling or explicit ratio targets; never train on the raw distribution when category sizes are highly imbalanced.
- Enable sequence packing (set `packing=True` in TRL) for a 3x to 8x training throughput improvement with no quality cost.
- Audit data quality before every training run: check for duplicates, empty messages, truncation rates, and manually review a sample of 50 to 100 examples.
- Keep conversation integrity when splitting: all turns from a single conversation must stay in the same split to avoid data leakage.