Generation is easy; curation is where the value lies. In a raw synthetic dataset, typically 20% to 40% of examples are low quality, duplicated, or potentially harmful. The curation pipeline transforms this raw output into a clean, diverse, high-quality training set. This section covers the three pillars of data curation: quality scoring (using LLM-as-judge to rate each example), deduplication (removing exact and near-duplicate content at multiple granularities), and filtering (enforcing constraints on length, language, toxicity, and topical relevance). Together, these steps typically improve downstream model performance by 10% to 25% compared to training on uncurated data.
1. Automated Quality Scoring with LLM-as-Judge
The first step in curation is scoring every example in your synthetic dataset on multiple quality dimensions. While human review of every example is impractical at scale, LLM-as-judge provides a reliable proxy that correlates well with human judgments (typically 0.7 to 0.85 Spearman correlation) and can process thousands of examples per hour.
1.1 Multi-Dimensional Scoring Rubric
| Dimension | Scale | What It Measures | Common Failure Modes |
|---|---|---|---|
| Instruction Clarity | 1-5 | Is the instruction unambiguous and well-formed? | Vague tasks, missing context, contradictory requirements |
| Response Quality | 1-5 | Is the response accurate, complete, and well-organized? | Hallucinations, incomplete answers, poor structure |
| Instruction-Response Alignment | 1-5 | Does the response actually answer the instruction? | Topic drift, answering a different question |
| Complexity | 1-5 | How challenging is the instruction? | Trivially simple tasks, overly repetitive patterns |
| Safety | Pass/Fail | Does the content violate safety policies? | Harmful advice, PII leakage, biased content |
```python
import json
from dataclasses import dataclass

from openai import OpenAI

client = OpenAI()


@dataclass
class QualityScore:
    instruction_clarity: int
    response_quality: int
    alignment: int
    complexity: int
    safety_pass: bool
    reasoning: str

    @property
    def composite(self) -> float:
        """Weighted composite score (excluding safety, which is binary)."""
        if not self.safety_pass:
            return 0.0
        return (
            0.20 * self.instruction_clarity +
            0.35 * self.response_quality +
            0.25 * self.alignment +
            0.20 * self.complexity
        ) / 5.0  # Normalize to 0-1


def score_example(instruction: str, response: str,
                  model: str = "gpt-4o") -> QualityScore:
    """Score a single instruction-response pair on multiple dimensions."""
    prompt = f"""Evaluate this instruction-response pair on the following
dimensions. Think through each dimension carefully before scoring.

INSTRUCTION: {instruction}

RESPONSE: {response}

Score each dimension:
- instruction_clarity (1-5): Is the instruction clear and unambiguous?
- response_quality (1-5): Is the response accurate, complete, well-organized?
- alignment (1-5): Does the response directly address the instruction?
- complexity (1-5): How challenging is the task? (1=trivial, 5=expert-level)
- safety_pass (true/false): Is the content free of harmful/biased material?

Provide your analysis, then scores as JSON:
{{
  "reasoning": "your analysis of each dimension",
  "instruction_clarity": <1-5>,
  "response_quality": <1-5>,
  "alignment": <1-5>,
  "complexity": <1-5>,
  "safety_pass": <true/false>
}}"""
    result = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0.1,
        response_format={"type": "json_object"},
    )
    data = json.loads(result.choices[0].message.content)
    return QualityScore(**data)


def batch_score_dataset(
    dataset: list[dict],
    min_composite: float = 0.6,
    model: str = "gpt-4o",
) -> tuple[list[dict], list[dict]]:
    """Score and partition a dataset into accepted and rejected examples."""
    accepted, rejected = [], []
    for example in dataset:
        score = score_example(
            example["instruction"], example["response"], model
        )
        example["quality_score"] = score.composite
        example["quality_details"] = {
            "clarity": score.instruction_clarity,
            "quality": score.response_quality,
            "alignment": score.alignment,
            "complexity": score.complexity,
            "safety": score.safety_pass,
        }
        if score.composite >= min_composite and score.safety_pass:
            accepted.append(example)
        else:
            rejected.append(example)
    return accepted, rejected


# Example usage
sample_data = [
    {"instruction": "Explain how B-tree indexing works in databases.",
     "response": "B-tree indexes organize data in a balanced tree structure "
                 "where each node can have multiple children. Leaf nodes "
                 "contain pointers to the actual data rows. Lookups are "
                 "O(log n) because the tree stays balanced through splits "
                 "and merges during insertions and deletions."},
    {"instruction": "Do something.",
     "response": "Sure, I did something."},
]

accepted, rejected = batch_score_dataset(sample_data)
print(f"Accepted: {len(accepted)}, Rejected: {len(rejected)}")
for ex in accepted:
    print(f"  Score: {ex['quality_score']:.3f} | {ex['instruction'][:50]}...")
```
Quality scoring should be calibrated against human judgments before use at scale. Score a random sample of 100 to 200 examples with both the LLM judge and human annotators, then compute the correlation. If Spearman correlation falls below 0.65, revise your rubric or switch to a stronger judge model. Re-calibrate periodically as your data distribution changes.
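Calibration boils down to a rank correlation between judge scores and human scores on the same sample. A minimal, dependency-free sketch (the `judge` and `human` lists below are hypothetical; in practice you would use `scipy.stats.spearmanr` on your annotated sample):

```python
def rank(values: list[float]) -> list[float]:
    """Assign 1-based ranks, averaging ranks across ties."""
    indexed = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(indexed):
        j = i
        # Extend j over the run of tied values
        while j + 1 < len(indexed) and values[indexed[j + 1]] == values[indexed[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[indexed[k]] = avg
        i = j + 1
    return ranks


def spearman(x: list[float], y: list[float]) -> float:
    """Spearman correlation = Pearson correlation of the ranks."""
    rx, ry = rank(x), rank(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)


# Hypothetical calibration sample: LLM-judge composites vs. human 1-5 ratings
judge = [0.9, 0.4, 0.7, 0.2, 0.8]
human = [5, 2, 4, 1, 4]
rho = spearman(judge, human)
print(f"Spearman rho: {rho:.3f}  (revise the rubric if below 0.65)")
```

Spearman (rather than Pearson) is the right choice here because the judge's composite and the human ratings live on different scales; only their ordering needs to agree.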
2. Deduplication Strategies
Synthetic data generation frequently produces duplicates and near-duplicates. LLMs tend to generate similar responses to similar prompts, especially at lower temperatures. Even small amounts of duplication can distort the training distribution, causing the model to overfit on repeated patterns. Deduplication operates at three levels of granularity.
2.1 Three Levels of Deduplication
```python
import hashlib


def exact_dedup(examples: list[dict], key: str = "instruction") -> list[dict]:
    """Remove exact duplicates based on normalized text hash."""
    seen = set()
    unique = []
    for ex in examples:
        # Normalize: lowercase, strip whitespace, collapse spaces
        normalized = " ".join(ex[key].lower().split())
        text_hash = hashlib.sha256(normalized.encode()).hexdigest()
        if text_hash not in seen:
            seen.add(text_hash)
            unique.append(ex)
    removed = len(examples) - len(unique)
    print(f"Exact dedup: {len(examples)} -> {len(unique)} "
          f"({removed} removed, {removed/len(examples)*100:.1f}%)")
    return unique


def minhash_dedup(
    examples: list[dict],
    key: str = "instruction",
    num_perm: int = 128,
    threshold: float = 0.7,
) -> list[dict]:
    """Remove near-duplicates using MinHash with n-gram shingling."""
    from datasketch import MinHash, MinHashLSH

    lsh = MinHashLSH(threshold=threshold, num_perm=num_perm)
    minhashes = []

    # Create a MinHash for each example and index it in the LSH
    for i, ex in enumerate(examples):
        mh = MinHash(num_perm=num_perm)
        # 3-gram character shingles
        text = ex[key].lower()
        for j in range(len(text) - 2):
            shingle = text[j:j + 3]
            mh.update(shingle.encode("utf-8"))
        minhashes.append(mh)
        lsh.insert(f"doc_{i}", mh)  # Keys are unique, so insert cannot collide

    # Find clusters of similar documents
    keep_indices = set()
    processed = set()
    for i in range(len(examples)):
        if i in processed:
            continue
        similar = lsh.query(minhashes[i])
        cluster_indices = [int(s.split("_")[1]) for s in similar]
        # Keep the first (or best quality) in each cluster
        best = min(cluster_indices)  # Keep first occurrence
        keep_indices.add(best)
        processed.update(cluster_indices)

    unique = [examples[i] for i in sorted(keep_indices)]
    removed = len(examples) - len(unique)
    print(f"MinHash dedup: {len(examples)} -> {len(unique)} "
          f"({removed} removed, {removed/len(examples)*100:.1f}%)")
    return unique


def semantic_dedup(
    examples: list[dict],
    key: str = "instruction",
    threshold: float = 0.92,
    model: str = "text-embedding-3-small",
) -> list[dict]:
    """Remove semantic duplicates using embedding similarity."""
    import numpy as np

    # Get embeddings (reuses the OpenAI client from the scoring section)
    texts = [ex[key] for ex in examples]
    response = client.embeddings.create(model=model, input=texts)
    embeddings = np.array([e.embedding for e in response.data])

    # Normalize rows so the dot product equals cosine similarity
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    normalized = embeddings / norms

    # Find pairs above threshold, greedily keeping the earlier example
    similarity_matrix = normalized @ normalized.T
    keep = set(range(len(examples)))
    for i in range(len(examples)):
        if i not in keep:
            continue
        for j in range(i + 1, len(examples)):
            if j not in keep:
                continue
            if similarity_matrix[i][j] > threshold:
                keep.discard(j)  # Remove the later duplicate

    unique = [examples[i] for i in sorted(keep)]
    removed = len(examples) - len(unique)
    print(f"Semantic dedup: {len(examples)} -> {len(unique)} "
          f"({removed} removed, {removed/len(examples)*100:.1f}%)")
    return unique
```
Order your deduplication stages by cost. Always run exact dedup first (near zero cost), then MinHash near-dedup (cheap), and semantic dedup last (requires embedding API calls). Running semantic dedup on the full dataset before filtering exact duplicates wastes significant compute on comparisons that would have been caught by a simple hash check.
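That cheapest-first ordering can be expressed as a small driver that threads the survivors of each stage into the next. This is a sketch: `run_dedup_stages` and the toy `exact_stage` below are names introduced here for illustration, and in practice the stage list would hold the `exact_dedup`, `minhash_dedup`, and `semantic_dedup` functions defined above, in that order.

```python
from typing import Callable

DedupStage = Callable[[list[dict]], list[dict]]


def run_dedup_stages(
    examples: list[dict],
    stages: list[tuple[str, DedupStage]],
) -> list[dict]:
    """Run dedup stages cheapest-first; each stage sees only survivors."""
    for name, stage in stages:
        before = len(examples)
        examples = stage(examples)
        print(f"{name}: removed {before - len(examples)}, "
              f"{len(examples)} remain")
    return examples


def exact_stage(examples: list[dict]) -> list[dict]:
    """Toy exact-dedup stage: dedup on whitespace/case-normalized text."""
    seen, unique = set(), []
    for ex in examples:
        key = " ".join(ex["instruction"].lower().split())
        if key not in seen:
            seen.add(key)
            unique.append(ex)
    return unique


# Hypothetical input: the second item is an exact dup after normalization
data = [
    {"instruction": "Explain Docker"},
    {"instruction": "explain   docker"},
    {"instruction": "Explain Kubernetes"},
]
survivors = run_dedup_stages(data, [("exact", exact_stage)])
print(len(survivors))  # -> 2
```

Because each stage only receives the previous stage's survivors, the expensive semantic stage ends up embedding the smallest possible set of candidates.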
3. Multi-Dimensional Filtering
After deduplication, filtering removes examples that fail quality checks along specific dimensions: text length, language, toxicity, topic relevance, and formatting. Each filter targets a specific failure mode that would degrade training quality.
3.1 Filtering Pipeline
```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class FilterResult:
    passed: bool
    reason: str = ""


@dataclass
class FilterPipeline:
    """Configurable pipeline of data quality filters."""
    filters: list[tuple[str, Callable]] = field(default_factory=list)

    def add_filter(self, name: str, fn: Callable):
        self.filters.append((name, fn))
        return self  # Allow chaining

    def run(self, examples: list[dict]) -> tuple[list[dict], dict]:
        """Run all filters, returning accepted examples and statistics."""
        stats = {name: 0 for name, _ in self.filters}
        accepted = []
        for ex in examples:
            passed_all = True
            for name, fn in self.filters:
                result = fn(ex)
                if not result.passed:
                    stats[name] += 1
                    passed_all = False
                    break  # Fail fast on first filter failure
            if passed_all:
                accepted.append(ex)
        total = len(examples)
        print(f"Filtering: {total} -> {len(accepted)} accepted")
        for name, count in stats.items():
            print(f"  {name}: {count} removed ({count/total*100:.1f}%)")
        return accepted, stats


# Define individual filters
def length_filter(ex: dict, min_tokens: int = 20,
                  max_tokens: int = 2048) -> FilterResult:
    """Filter by response length in approximate tokens."""
    token_count = len(ex.get("response", "").split()) * 1.3  # rough estimate
    if token_count < min_tokens:
        return FilterResult(False, f"Too short: ~{token_count:.0f} tokens")
    if token_count > max_tokens:
        return FilterResult(False, f"Too long: ~{token_count:.0f} tokens")
    return FilterResult(True)


def quality_score_filter(ex: dict,
                         min_score: float = 0.6) -> FilterResult:
    """Filter by pre-computed quality score."""
    score = ex.get("quality_score", 0)
    if score < min_score:
        return FilterResult(False, f"Low quality: {score:.3f}")
    return FilterResult(True)


def repetition_filter(ex: dict,
                      max_repeat_ratio: float = 0.3) -> FilterResult:
    """Filter responses with excessive repeated phrases."""
    response = ex.get("response", "")
    words = response.lower().split()
    if len(words) < 10:
        return FilterResult(True)
    # Check for repeated 4-grams
    ngrams = [" ".join(words[i:i + 4]) for i in range(len(words) - 3)]
    counts = Counter(ngrams)
    max_count = max(counts.values()) if counts else 0
    repeat_ratio = max_count / len(ngrams) if ngrams else 0
    if repeat_ratio > max_repeat_ratio:
        return FilterResult(False, f"Repetitive: {repeat_ratio:.2f} ratio")
    return FilterResult(True)


# Build and run the pipeline
pipeline = FilterPipeline()
pipeline.add_filter("length", length_filter)
pipeline.add_filter("quality", quality_score_filter)
pipeline.add_filter("repetition", repetition_filter)

# Example run (would normally be on thousands of examples)
sample = [
    {"instruction": "Explain Docker", "response": "Docker is...",
     "quality_score": 0.3},  # Too short + low quality
    {"instruction": "Explain K8s", "response": "Kubernetes is an " * 100,
     "quality_score": 0.7},  # Repetitive
    {"instruction": "Explain REST APIs",
     "response": "REST APIs use HTTP methods to expose resources. "
                 "GET retrieves data, POST creates new resources, PUT updates "
                 "existing ones, and DELETE removes them. RESTful design "
                 "follows principles like statelessness and uniform "
                 "interfaces.",
     "quality_score": 0.85},  # Good
]

accepted, stats = pipeline.run(sample)
```
4. Argilla for Human-in-the-Loop Curation
Argilla is an open-source data curation platform designed specifically for NLP and LLM data workflows. It provides a web UI for reviewing, annotating, and correcting synthetic data, combined with Python SDK integration for programmatic workflows. Argilla bridges the gap between automated quality scoring and human judgment by presenting borderline examples to human reviewers.
```python
import argilla as rg

# Connect to a running Argilla instance (v2 SDK; default local credentials)
client = rg.Argilla(api_url="http://localhost:6900", api_key="argilla.apikey")

# Create a dataset for reviewing synthetic data quality
settings = rg.Settings(
    guidelines="Review synthetic instruction-response pairs for quality. "
               "Score each dimension 1-5 and flag any safety concerns.",
    fields=[
        rg.TextField(name="instruction", title="Instruction"),
        rg.TextField(name="response", title="Response"),
    ],
    questions=[
        rg.RatingQuestion(
            name="instruction_clarity",
            title="Instruction Clarity",
            description="Is the instruction clear and unambiguous?",
            values=[1, 2, 3, 4, 5],
        ),
        rg.RatingQuestion(
            name="response_quality",
            title="Response Quality",
            description="Is the response accurate, complete, and well-organized?",
            values=[1, 2, 3, 4, 5],
        ),
        rg.LabelQuestion(
            name="safety",
            title="Safety Check",
            labels=["safe", "unsafe", "borderline"],
        ),
        rg.TextQuestion(
            name="comments",
            title="Comments",
            description="Any notes about this example?",
            required=False,
        ),
    ],
    metadata=[
        rg.FloatMetadataProperty(name="llm_quality_score",
                                 title="LLM Quality Score"),
        rg.TermsMetadataProperty(name="source",
                                 title="Generation Source"),
    ],
)

dataset = rg.Dataset(name="synthetic_data_review",
                     settings=settings, client=client)
dataset.create()

# Upload synthetic examples for human review
records = [
    rg.Record(
        fields={
            "instruction": "Explain the CAP theorem in distributed systems.",
            "response": "The CAP theorem states that a distributed system "
                        "can provide at most two of three guarantees: "
                        "Consistency, Availability, and Partition tolerance..."
        },
        metadata={
            "llm_quality_score": 0.82,
            "source": "self-instruct",
        },
    ),
]
dataset.records.log(records)
print(f"Uploaded {len(records)} records for review")
```
A practical workflow routes only borderline examples to human review: those with LLM quality scores between 0.5 and 0.7. Examples above 0.7 are auto-accepted, and examples below 0.5 are auto-rejected. This focuses expensive human attention where it adds the most value and can reduce review volume by 60% to 80% while maintaining high data quality.
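That routing rule is a three-way partition on the LLM quality score. A minimal sketch (`triage` is a hypothetical helper introduced here; the thresholds and the `quality_score` key follow the scoring code earlier in this section):

```python
def triage(
    examples: list[dict],
    low: float = 0.5,
    high: float = 0.7,
) -> tuple[list[dict], list[dict], list[dict]]:
    """Partition by LLM quality score: auto-accept, human review, auto-reject."""
    auto_accept, needs_review, auto_reject = [], [], []
    for ex in examples:
        score = ex.get("quality_score", 0.0)
        if score >= high:
            auto_accept.append(ex)     # confident pass: skip human review
        elif score >= low:
            needs_review.append(ex)    # borderline: send to Argilla
        else:
            auto_reject.append(ex)     # confident fail: drop
    return auto_accept, needs_review, auto_reject


# Hypothetical batch of scored examples
batch = [{"quality_score": s} for s in (0.82, 0.63, 0.55, 0.41)]
acc, rev, rej = triage(batch)
print(len(acc), len(rev), len(rej))  # -> 1 2 1
```

Only the `needs_review` partition would be logged to the Argilla dataset above; the other two buckets bypass human review entirely.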
5. Distilabel for Production Pipelines
Distilabel is an open-source framework from Argilla (now part of Hugging Face) for building scalable synthetic data generation and curation pipelines. It provides pre-built components for common generation patterns (Self-Instruct, Evol-Instruct, UltraFeedback-style scoring) and handles batching, rate limiting, and error recovery automatically.
```python
from distilabel.llms import OpenAILLM
from distilabel.pipeline import Pipeline
from distilabel.steps import LoadDataFromDicts
from distilabel.steps.tasks import TextGeneration, UltraFeedback

# Seed data
seed_instructions = [
    {"instruction": "Explain how garbage collection works in Python."},
    {"instruction": "What are the tradeoffs between SQL and NoSQL databases?"},
    {"instruction": "Describe the observer pattern with a practical example."},
]

# Build a Distilabel pipeline for synthetic data generation + scoring
with Pipeline(name="synthetic-data-pipeline") as pipeline:
    # Step 0: Load the seed instructions
    load = LoadDataFromDicts(name="load_seeds", data=seed_instructions)

    # Step 1: Generate responses to seed instructions
    generate = TextGeneration(
        name="generate_response",
        llm=OpenAILLM(model="gpt-4o-mini"),
        system_prompt="You are a knowledgeable assistant. Provide detailed, "
                      "accurate responses to technical questions.",
        num_generations=3,  # Generate 3 candidates per instruction
    )

    # Step 2: Score the generated responses using UltraFeedback criteria
    score = UltraFeedback(
        name="quality_scoring",
        llm=OpenAILLM(model="gpt-4o"),
        aspect="overall-rating",  # Score on overall quality 1-5
    )

    # Wire the steps: load -> generate -> score
    load >> generate >> score

# In production, run with:
# distiset = pipeline.run()
# distiset["default"]["train"].to_pandas()           # Analyze in pandas
# distiset.push_to_hub("my-org/synthetic-dataset")   # Push to HF Hub

print("Pipeline configured with load + generate + score steps")
print(f"Seed instructions: {len(seed_instructions)}")
print(f"Expected outputs: {len(seed_instructions) * 3} candidates, scored")
```
Key Takeaways
- Quality scoring with LLM-as-judge evaluates each example on instruction clarity, response quality, alignment, complexity, and safety. Calibrate against human judgments (target Spearman correlation above 0.65) before scaling.
- Deduplication operates at three levels: exact hash matching (cheapest, catches 5% to 15%), MinHash near-duplicate detection (catches 10% to 25%), and semantic embedding similarity (catches 15% to 35%). Always run them in order from cheapest to most expensive.
- Multi-dimensional filtering removes examples that fail length, language, toxicity, PII, topic, or repetition checks. Each filter targets a specific failure mode that would degrade training quality.
- Argilla provides human-in-the-loop curation with a web UI for reviewing borderline examples. Routing only borderline cases (quality scores 0.5 to 0.7) to human review reduces volume by 60% to 80%.
- Distilabel automates the full pipeline from generation through scoring, with pre-built components, rate limiting, and Hugging Face Hub integration.
- Curation typically improves downstream performance by 10% to 25% compared to training on uncurated synthetic data.