From manual curation to automated factories. The most successful open-source models (Llama, Phi, Mistral) were trained on datasets built by sophisticated generation pipelines, not by armies of human annotators. These pipelines use LLMs themselves as data generators, applying techniques like Self-Instruct (generate instructions from a seed set), Evol-Instruct (progressively evolve instructions to increase complexity), and persona-driven generation (simulate diverse expert perspectives). This section teaches you to build these pipelines from scratch.
1. Self-Instruct: Bootstrapping from Seeds
Self-Instruct, introduced by Wang et al. (2023), is the foundational technique for LLM-based data generation. The approach starts with a small set of human-written seed instructions (typically 100 to 200) and uses an LLM to generate new instructions, classify them, and produce responses. The key innovation is that the LLM generates both the task description and the solution, creating complete training examples with minimal human involvement.
1.1 The Self-Instruct Pipeline
import json
import random
from openai import OpenAI
client = OpenAI()
# Seed instructions (in practice, use 150-200 diverse examples)
SEED_INSTRUCTIONS = [
"Write a Python function that reverses a linked list.",
"Explain the difference between TCP and UDP protocols.",
"Summarize the key principles of object-oriented programming.",
"Convert the following CSV data into a JSON format.",
"What are the pros and cons of microservices architecture?"
]
def self_instruct_generate(
seed_pool: list[str],
num_examples: int = 8,
model: str = "gpt-4o"
) -> dict:
"""Generate a new instruction-response pair using Self-Instruct."""
# Step 1: Sample from the seed pool
sampled = random.sample(seed_pool, min(num_examples, len(seed_pool)))
examples_text = "\n".join(f"{i+1}. {inst}" for i, inst in enumerate(sampled))
# Step 2: Generate a new instruction
gen_prompt = f"""Here are {len(sampled)} example task instructions:
{examples_text}
Generate a completely NEW and DIFFERENT task instruction that:
- Is distinct from all the examples above
- Is specific and actionable
- Can be answered in a single response
- Covers a different topic or skill
New instruction:"""
gen_response = client.chat.completions.create(
model=model,
messages=[{"role": "user", "content": gen_prompt}],
temperature=1.0,
max_tokens=200
)
new_instruction = gen_response.choices[0].message.content.strip()
# Step 3: Generate the response
resp_response = client.chat.completions.create(
model=model,
messages=[
{"role": "system", "content": "Provide a thorough, accurate, "
"and well-structured response to the following instruction."},
{"role": "user", "content": new_instruction}
],
temperature=0.7,
max_tokens=1024
)
response_text = resp_response.choices[0].message.content.strip()
return {
"instruction": new_instruction,
"response": response_text,
"source": "self-instruct",
"seed_count": len(sampled)
}
# Generate a batch
pool = SEED_INSTRUCTIONS.copy()
generated = []
for i in range(3):
pair = self_instruct_generate(pool)
generated.append(pair)
pool.append(pair["instruction"]) # Add back to pool
print(f"Generated {i+1}: {pair['instruction'][:60]}...")
The Self-Instruct paper showed that, starting from just 175 human-written seed tasks, the pipeline could generate roughly 52,000 new instructions paired with about 82,000 input-output instances. The critical mechanism is the bootstrapping loop: each batch of accepted instructions is added back to the pool, so the pool's diversity grows over time. The same feedback loop can also drive mode collapse if left unmonitored, so periodic diversity checks are essential.
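One lightweight diversity check is to reject candidate instructions that are too similar to anything already in the pool. The Self-Instruct paper filters candidates whose ROUGE-L overlap with an existing instruction reaches 0.7; the sketch below approximates that with a pure-Python longest-common-subsequence F1 score, so it needs no extra dependencies (a production pipeline would typically use a proper ROUGE implementation).

```python
def lcs_len(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, ta in enumerate(a):
        for j, tb in enumerate(b):
            dp[i + 1][j + 1] = (dp[i][j] + 1 if ta == tb
                                else max(dp[i][j + 1], dp[i + 1][j]))
    return dp[-1][-1]

def rouge_l_f1(cand: str, ref: str) -> float:
    """ROUGE-L F1 between two strings (whitespace tokenization)."""
    a, b = cand.lower().split(), ref.lower().split()
    if not a or not b:
        return 0.0
    lcs = lcs_len(a, b)
    if lcs == 0:
        return 0.0
    p, r = lcs / len(a), lcs / len(b)
    return 2 * p * r / (p + r)

def is_diverse_enough(candidate: str, pool: list[str],
                      threshold: float = 0.7) -> bool:
    """Accept a candidate only if no pool instruction is too similar."""
    return all(rouge_l_f1(candidate, inst) < threshold for inst in pool)

# A near-duplicate of a pool entry is rejected; an unrelated task passes.
pool = ["Write a Python function that reverses a linked list."]
print(is_diverse_enough("Write a Python function that reverses a string.", pool))  # → False
print(is_diverse_enough("Explain how DNS resolution works.", pool))  # → True
```

Calling `is_diverse_enough` before `pool.append(...)` in the bootstrapping loop keeps near-duplicates out of the pool, which is exactly where mode collapse starts.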
2. Evol-Instruct: Progressive Complexity Evolution
Evol-Instruct, developed for the WizardLM project, takes a different approach. Instead of generating new instructions from scratch, it starts with existing simple instructions and evolves them through a series of transformation operations that increase complexity, add constraints, or deepen reasoning requirements. This produces a natural curriculum from easy to hard examples.
2.1 Evolution Operations
| Operation | Description | Example Transformation |
|---|---|---|
| Add Constraints | Add requirements or restrictions | "Sort a list" becomes "Sort a list of dictionaries by multiple keys with custom comparators" |
| Deepen | Require more reasoning steps | "Explain recursion" becomes "Explain how recursion handles the Tower of Hanoi problem with a step-by-step trace" |
| Concretize | Replace abstract with specific | "Analyze data" becomes "Analyze monthly sales data for seasonal trends using pandas" |
| Increase Reasoning | Require multi-step logic | "What is Big-O?" becomes "Compare the time complexity of merge sort vs. quicksort in best, average, and worst cases, explaining why" |
| Complicate Input | Make the input data harder | "Parse JSON" becomes "Parse nested JSON with inconsistent schemas and missing fields" |
EVOLUTION_OPERATIONS = {
"add_constraints": """Rewrite the following instruction by adding 2-3
specific constraints or requirements that make it more challenging.
The evolved instruction should require more careful thinking.
Original: {instruction}
Evolved (with added constraints):""",
"deepen": """Rewrite the following instruction to require deeper
reasoning, more steps, or more thorough analysis. The evolved version
should test understanding rather than surface knowledge.
Original: {instruction}
Evolved (deepened):""",
"concretize": """Rewrite the following instruction to be more specific
and concrete. Replace any abstract or vague terms with specific
technologies, datasets, scenarios, or examples.
Original: {instruction}
Evolved (concretized):""",
"increase_reasoning": """Rewrite the following instruction to require
multi-step reasoning, comparison, or synthesis of multiple concepts.
The evolved version should require connecting ideas together.
Original: {instruction}
Evolved (increased reasoning):""",
    "complicate_input": """Rewrite the following instruction so that the
input data it operates on is harder to handle: nested structures,
inconsistent formats, edge cases, or missing values.
Original: {instruction}
Evolved (complicated input):""",
}
def evol_instruct(
instruction: str,
operation: str,
model: str = "gpt-4o"
) -> str:
"""Apply an evolution operation to an instruction."""
prompt_template = EVOLUTION_OPERATIONS[operation]
prompt = prompt_template.format(instruction=instruction)
response = client.chat.completions.create(
model=model,
messages=[{"role": "user", "content": prompt}],
temperature=0.8,
max_tokens=300
)
return response.choices[0].message.content.strip()
def evolve_instruction_chain(
seed: str,
num_evolutions: int = 3,
model: str = "gpt-4o"
) -> list[dict]:
"""Evolve an instruction through multiple rounds."""
operations = list(EVOLUTION_OPERATIONS.keys())
chain = [{"round": 0, "instruction": seed, "operation": "seed"}]
current = seed
for i in range(num_evolutions):
op = random.choice(operations)
evolved = evol_instruct(current, op, model)
chain.append({
"round": i + 1,
"instruction": evolved,
"operation": op
})
current = evolved
return chain
# Example evolution chain
seed = "Write a function to sort a list."
chain = evolve_instruction_chain(seed, num_evolutions=3)
for step in chain:
print(f"Round {step['round']} ({step['operation']}):")
print(f" {step['instruction'][:80]}...")
print()
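Not every evolution succeeds: the model may echo the original, refuse, or emit an empty rewrite. WizardLM pairs evolution with an elimination step that discards such failures; the heuristic filter below is a minimal sketch of that idea (the length threshold and refusal phrases are illustrative, not taken from the paper).

```python
REFUSAL_MARKERS = ("i'm sorry", "i cannot", "as an ai")  # illustrative list

def is_valid_evolution(original: str, evolved: str) -> bool:
    """Heuristically reject failed evolutions before keeping them."""
    ev = evolved.strip()
    if len(ev) < 10:                            # empty or near-empty rewrite
        return False
    if ev.lower() == original.strip().lower():  # model echoed the input
        return False
    if any(m in ev.lower() for m in REFUSAL_MARKERS):  # refusal leaked in
        return False
    return True

print(is_valid_evolution("Sort a list.", "Sort a list."))           # → False
print(is_valid_evolution("Sort a list.",
                         "Sort a list of dicts by several keys."))  # → True
```

Wiring this check into `evolve_instruction_chain` (retry the round, or keep the previous instruction, whenever the check fails) prevents one bad evolution from corrupting the rest of the chain.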
3. Multi-Turn Conversation Synthesis
Chat models require multi-turn conversation data in which context builds across turns, follow-up questions reference prior answers, and the conversation flows naturally. Generating high-quality multi-turn data is considerably harder than generating single-turn instruction pairs because every turn must stay coherent with the full conversation history.
def generate_conversation(
topic: str,
persona: str,
num_turns: int = 4,
model: str = "gpt-4o"
) -> list[dict]:
"""Generate a multi-turn conversation with natural follow-ups."""
system_msg = f"""You are simulating a realistic conversation between a
user and an AI assistant. The user has the following persona: {persona}
Topic: {topic}
Generate a natural {num_turns}-turn conversation where:
- Each user message builds on the previous assistant response
- The user asks increasingly specific follow-up questions
- The assistant provides detailed, helpful responses
- The conversation feels natural, not scripted
Format each turn as:
USER: [message]
ASSISTANT: [response]"""
response = client.chat.completions.create(
model=model,
messages=[{"role": "user", "content": system_msg}],
temperature=0.85,
max_tokens=2048
)
# Parse turns from the generated conversation
text = response.choices[0].message.content
turns = []
current_role = None
current_text = []
for line in text.split("\n"):
if line.startswith("USER:"):
if current_role:
turns.append({"role": current_role,
"content": "\n".join(current_text).strip()})
current_role = "user"
current_text = [line.replace("USER:", "").strip()]
elif line.startswith("ASSISTANT:"):
if current_role:
turns.append({"role": current_role,
"content": "\n".join(current_text).strip()})
current_role = "assistant"
current_text = [line.replace("ASSISTANT:", "").strip()]
else:
current_text.append(line)
if current_role:
turns.append({"role": current_role,
"content": "\n".join(current_text).strip()})
return turns
# Generate diverse conversations
conversations = [
generate_conversation(
"optimizing PostgreSQL queries",
"junior backend developer with 1 year experience"
),
generate_conversation(
"building a recommendation system",
"data scientist transitioning from academia to industry"
),
]
for i, conv in enumerate(conversations):
print(f"Conversation {i+1}: {len(conv)} turns")
for turn in conv[:2]:
print(f" {turn['role']}: {turn['content'][:60]}...")
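Parsed conversations are only usable for training if they have the structure the trainer expects. A cheap structural check, run before any LLM-based coherence scoring, is to verify that roles alternate, the conversation opens with a user turn, and no turn is empty; this sketch assumes the turn format produced by `generate_conversation` above.

```python
def validate_turns(turns: list[dict]) -> bool:
    """Check alternating user/assistant structure with non-empty turns."""
    if not turns or turns[0]["role"] != "user":
        return False
    for i, turn in enumerate(turns):
        expected = "user" if i % 2 == 0 else "assistant"
        if turn["role"] != expected or not turn["content"].strip():
            return False
    return True

good = [{"role": "user", "content": "How do indexes work?"},
        {"role": "assistant", "content": "An index is a sorted structure..."}]
bad = [{"role": "assistant", "content": "Hello!"}]
print(validate_turns(good))  # → True
print(validate_turns(bad))   # → False
```

Conversations that fail this check are usually parsing artifacts (the model drifted from the `USER:`/`ASSISTANT:` format) and are cheaper to regenerate than to repair.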
4. Persona-Driven Generation
One of the most effective techniques for increasing diversity in synthetic data is persona-driven generation. Instead of generating all data with the same system prompt, you create a library of diverse personas that simulate different users, expertise levels, communication styles, and backgrounds. Each persona produces instructions and conversations that reflect its unique perspective.
import itertools
PERSONA_DIMENSIONS = {
"expertise": ["beginner", "intermediate", "senior", "expert"],
"role": [
"software engineer", "data scientist", "product manager",
"student", "researcher", "DevOps engineer"
],
"communication_style": [
"concise and direct",
"detailed and thorough",
"casual and conversational",
"formal and precise"
],
"context": [
"working on a startup MVP",
"maintaining a legacy enterprise system",
"preparing for a technical interview",
"writing a research paper",
"building a side project"
]
}
def build_persona(dimensions: dict) -> str:
"""Construct a persona description from dimension choices."""
return (
f"A {dimensions['expertise']}-level {dimensions['role']} who "
f"communicates in a {dimensions['communication_style']} manner. "
f"Currently {dimensions['context']}."
)
def generate_with_persona(persona: str, topic: str) -> dict:
"""Generate an instruction from a specific persona's perspective."""
prompt = f"""You are role-playing as the following persona:
{persona}
Given this persona, write a realistic question or task instruction
that this person would actually ask about: {topic}
The question should reflect the persona's expertise level,
communication style, and current context. Be authentic.
Question:"""
response = client.chat.completions.create(
model="gpt-4o",
messages=[{"role": "user", "content": prompt}],
temperature=0.9,
max_tokens=200
)
return {
"persona": persona,
"topic": topic,
"instruction": response.choices[0].message.content.strip()
}
# Generate diverse data by sampling persona combinations
def sample_personas(n: int = 10) -> list[dict]:
"""Sample n diverse persona combinations."""
personas = []
for _ in range(n):
dims = {
key: random.choice(values)
for key, values in PERSONA_DIMENSIONS.items()
}
personas.append(dims)
return personas
persona_samples = sample_personas(5)
for dims in persona_samples:
persona = build_persona(dims)
result = generate_with_persona(persona, "database indexing")
print(f"Persona: {persona[:60]}...")
print(f" Q: {result['instruction'][:70]}...")
print()
With 4 expertise levels, 6 roles, 4 communication styles, and 5 contexts, the persona space contains 480 unique combinations. Even a modest sample of 50 to 100 persona combinations produces significantly more diverse data than a single-persona approach. Studies on the Orca and Phi datasets showed that persona-driven generation improved downstream model performance by 5% to 15% on diversity-sensitive benchmarks.
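When you want guaranteed coverage rather than random draws, the persona space can be enumerated systematically with `itertools.product`, which also confirms the 4 × 6 × 4 × 5 = 480 count (the dimension table is restated here so the sketch is self-contained):

```python
import itertools

PERSONA_DIMENSIONS = {
    "expertise": ["beginner", "intermediate", "senior", "expert"],
    "role": ["software engineer", "data scientist", "product manager",
             "student", "researcher", "DevOps engineer"],
    "communication_style": ["concise and direct", "detailed and thorough",
                            "casual and conversational", "formal and precise"],
    "context": ["working on a startup MVP",
                "maintaining a legacy enterprise system",
                "preparing for a technical interview",
                "writing a research paper", "building a side project"],
}

keys = list(PERSONA_DIMENSIONS)
all_personas = [dict(zip(keys, combo))
                for combo in itertools.product(*PERSONA_DIMENSIONS.values())]
print(len(all_personas))  # → 480

# Stride through the full space for even coverage instead of random draws
subset = all_personas[::10]  # every 10th combination
print(len(subset))           # → 48
```

Striding (or shuffling once and slicing) guarantees every dimension value appears in the sample, which random sampling of 50 combinations does not.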
5. Domain-Specific Generation Strategies
Generic generation pipelines work well for general-purpose instruction data, but domain-specific tasks (medical, legal, financial, scientific) require additional structure. Domain-specific generation uses schema-guided prompting, terminology constraints, and document-grounded generation to produce accurate, specialized data.
| Strategy | Approach | Best For |
|---|---|---|
| Schema-Guided | Provide domain ontology/schema as context | Medical coding, legal classification |
| Document-Grounded | Generate QA pairs from domain documents | Technical documentation, research papers |
| Template + Fill | Domain templates with LLM-filled slots | Clinical notes, financial reports |
| Terminology-Constrained | Enforce domain vocabulary usage | Legal contracts, medical records |
| Expert Review Loop | Generate, expert reviews, regenerate | High-stakes domains with low error tolerance |
def domain_grounded_generation(
document: str,
domain: str,
num_pairs: int = 3,
model: str = "gpt-4o"
) -> list[dict]:
"""Generate QA pairs grounded in a domain document."""
prompt = f"""You are an expert in {domain}. Given the following document,
generate {num_pairs} question-answer pairs that test understanding of the
key concepts. Each question should:
- Be answerable from the document content
- Range from factual recall to analytical reasoning
- Use proper domain terminology
- Be relevant to a practitioner in this field
Document:
{document[:3000]}
Generate exactly {num_pairs} pairs in this format:
Q1: [question]
A1: [detailed answer with references to the document]
Q2: ...
A2: ..."""
response = client.chat.completions.create(
model=model,
messages=[{"role": "user", "content": prompt}],
temperature=0.7,
max_tokens=2048
)
# Parse QA pairs
text = response.choices[0].message.content
pairs = []
lines = text.split("\n")
current_q, current_a = None, None
for line in lines:
if line.startswith("Q") and ":" in line[:4]:
if current_q and current_a:
pairs.append({"question": current_q, "answer": current_a})
current_q = line.split(":", 1)[1].strip()
current_a = None
elif line.startswith("A") and ":" in line[:4]:
current_a = line.split(":", 1)[1].strip()
if current_q and current_a:
pairs.append({"question": current_q, "answer": current_a})
return pairs
# Example: Generate from a technical document
sample_doc = """
PostgreSQL uses a cost-based query optimizer that evaluates multiple
execution plans and selects the one with the lowest estimated cost.
The optimizer considers sequential scan cost, index scan cost, join
strategies (nested loop, hash join, merge join), and statistics
collected by ANALYZE. The work_mem parameter controls how much memory
is available for sort operations before spilling to disk.
"""
pairs = domain_grounded_generation(sample_doc, "database engineering")
for p in pairs:
print(f"Q: {p['question'][:70]}...")
print(f"A: {p['answer'][:70]}...")
print()
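The table's Template + Fill strategy can be sketched in the same spirit: a domain expert fixes the document structure, and the LLM only fills constrained slots, keeping layout and terminology under control. The template text and slot values below are hypothetical illustrations, not a real clinical schema; in a full pipeline each slot value would come from an LLM call (as in the earlier examples) rather than a hard-coded dict.

```python
# Illustrative template: the section names and example values are
# hypothetical, not drawn from any real clinical system.
CLINICAL_NOTE_TEMPLATE = (
    "Chief complaint: {complaint}\n"
    "History: {history}\n"
    "Assessment: {assessment}\n"
    "Plan: {plan}"
)

def fill_template(template: str, slots: dict[str, str]) -> str:
    """Fill template slots; an LLM would generate each slot value from a
    prompt naming the slot and its domain constraints."""
    return template.format(**slots)

note = fill_template(CLINICAL_NOTE_TEMPLATE, {
    "complaint": "intermittent chest pain on exertion",
    "history": "No prior cardiac events; family history of CAD.",
    "assessment": "Possible stable angina; ECG and troponin ordered.",
    "plan": "Cardiology referral; follow up in one week.",
})
print(note)
```

Because the template never changes, every generated record parses identically downstream, which is the main advantage over free-form generation in structured domains.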
6. Preference and Ranking Data Generation
Alignment training methods like RLHF and DPO require preference data: pairs of responses where one is preferred over the other. Generating this data synthetically requires careful design to ensure the quality gap between chosen and rejected responses is realistic (not too obvious, not too subtle).
Avoid trivially distinguishable pairs. If the rejected response is clearly terrible (e.g., random text or completely off-topic), the model learns an easy shortcut rather than developing nuanced preference understanding. The best preference datasets have subtle quality differences: a response that is mostly correct but misses a key detail, or one that is accurate but poorly organized. The UltraFeedback dataset showed that models trained on subtly contrasting pairs outperformed those trained on obvious quality gaps.
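One common recipe for producing subtle gaps is to generate both responses with the same model but asymmetric system prompts: the chosen response from a careful prompt, the rejected one from a prompt that deliberately induces a mild flaw. The sketch below builds the two chat requests in the message format used throughout this section; the specific flaw instructions are illustrative, and each message list would be sent to `client.chat.completions.create` as in the earlier examples.

```python
import random

# Illustrative "mild flaw" system prompts: each degrades quality subtly,
# not catastrophically, so the preference gap stays realistic.
FLAW_PROMPTS = [
    "Answer correctly but omit one important caveat or edge case.",
    "Answer correctly but with poor organization and no structure.",
    "Answer mostly correctly but stay vaguer than necessary on key details.",
]

CHOSEN_PROMPT = "Provide an accurate, complete, well-structured response."

def build_pair_requests(instruction: str) -> dict:
    """Build the two chat requests for one (chosen, rejected) pair."""
    flaw = random.choice(FLAW_PROMPTS)
    return {
        "chosen": [{"role": "system", "content": CHOSEN_PROMPT},
                   {"role": "user", "content": instruction}],
        "rejected": [{"role": "system", "content": flaw},
                     {"role": "user", "content": instruction}],
    }

reqs = build_pair_requests("Explain database transaction isolation levels.")
print(reqs["chosen"][0]["content"])
print(reqs["rejected"][0]["content"] in FLAW_PROMPTS)  # → True
```

Because both sides answer the identical user instruction at the same temperature, any quality difference comes from the induced flaw alone, which is what makes the resulting pairs subtle rather than trivially distinguishable.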
Key Takeaways
- Self-Instruct bootstraps datasets from small seed sets by using LLMs to generate new instructions, classify them, produce responses, and add survivors back to the pool. This created the Alpaca dataset from just 175 seeds.
- Evol-Instruct progressively increases complexity through five operations (add constraints, deepen, concretize, increase reasoning, complicate input), producing natural difficulty curricula.
- Multi-turn conversation synthesis requires follow-up planners and cross-turn coherence checks to ensure context builds naturally across exchanges.
- Persona-driven generation multiplies diversity by simulating different expertise levels, roles, styles, and contexts. The combinatorial space of personas produces far more varied data than single-prompt approaches.
- Domain-specific pipelines need schema-guided prompting, document grounding, and terminology constraints to produce accurate specialized data.
- Preference data for alignment should have subtle, not obvious, quality differences between chosen and rejected responses to train nuanced preference models.