Generation is easy; curation is where the value lies. In a raw synthetic dataset, typically 20% to 40% of examples are low quality, duplicated, or potentially harmful. The curation pipeline transforms this raw output into a clean, diverse, high-quality training set. This section covers the three pillars of data curation: quality scoring (using LLM-as-judge to rate each example), deduplication (removing exact and near-duplicate content at multiple granularities), and filtering (enforcing constraints on length, language, toxicity, and topical relevance). Together, these steps typically improve downstream model performance by 10% to 25% compared to training on uncurated data.
1. Automated Quality Scoring with LLM-as-Judge
The first step in curation is scoring every example in your synthetic dataset on multiple quality dimensions. While human review of every example is impractical at scale, LLM-as-judge provides a reliable proxy that correlates well with human judgments (typically 0.7 to 0.85 Spearman correlation) and can process thousands of examples per hour.
1.1 Multi-Dimensional Scoring Rubric
| Dimension | Scale | What It Measures | Common Failure Modes |
|---|---|---|---|
| Instruction Clarity | 1-5 | Is the instruction unambiguous and well-formed? | Vague tasks, missing context, contradictory requirements |
| Response Quality | 1-5 | Is the response accurate, complete, and well-organized? | Hallucinations, incomplete answers, poor structure |
| Instruction-Response Alignment | 1-5 | Does the response actually answer the instruction? | Topic drift, answering a different question |
| Complexity | 1-5 | How challenging is the instruction? | Trivially simple tasks, overly repetitive patterns |
| Safety | Pass/Fail | Does the content violate safety policies? | Harmful advice, PII leakage, biased content |
```python
import json
from dataclasses import dataclass

from openai import OpenAI

client = OpenAI()


@dataclass
class QualityScore:
    instruction_clarity: int
    response_quality: int
    alignment: int
    complexity: int
    safety_pass: bool
    reasoning: str

    @property
    def composite(self) -> float:
        """Weighted composite score (excluding safety, which is binary)."""
        if not self.safety_pass:
            return 0.0
        return (
            0.20 * self.instruction_clarity +
            0.35 * self.response_quality +
            0.25 * self.alignment +
            0.20 * self.complexity
        ) / 5.0  # Normalize to 0-1


def score_example(instruction: str, response: str,
                  model: str = "gpt-4o") -> QualityScore:
    """Score a single instruction-response pair on multiple dimensions."""
    prompt = f"""Evaluate this instruction-response pair on the following
dimensions. Think through each dimension carefully before scoring.

INSTRUCTION: {instruction}

RESPONSE: {response}

Score each dimension:
- instruction_clarity (1-5): Is the instruction clear and unambiguous?
- response_quality (1-5): Is the response accurate, complete, well-organized?
- alignment (1-5): Does the response directly address the instruction?
- complexity (1-5): How challenging is the task? (1=trivial, 5=expert-level)
- safety_pass (true/false): Is the content free of harmful/biased material?

Provide your analysis, then scores as JSON:
{{
  "reasoning": "your analysis of each dimension",
  "instruction_clarity": <1-5>,
  "response_quality": <1-5>,
  "alignment": <1-5>,
  "complexity": <1-5>,
  "safety_pass": <true/false>
}}"""
    result = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0.1,
        response_format={"type": "json_object"},
    )
    data = json.loads(result.choices[0].message.content)
    return QualityScore(**data)


def batch_score_dataset(
    dataset: list[dict],
    min_composite: float = 0.6,
    model: str = "gpt-4o",
) -> tuple[list[dict], list[dict]]:
    """Score and partition a dataset into accepted and rejected examples."""
    accepted, rejected = [], []
    for example in dataset:
        score = score_example(
            example["instruction"], example["response"], model
        )
        example["quality_score"] = score.composite
        example["quality_details"] = {
            "clarity": score.instruction_clarity,
            "quality": score.response_quality,
            "alignment": score.alignment,
            "complexity": score.complexity,
            "safety": score.safety_pass,
        }
        if score.composite >= min_composite and score.safety_pass:
            accepted.append(example)
        else:
            rejected.append(example)
    return accepted, rejected


# Example usage
sample_data = [
    {"instruction": "Explain how B-tree indexing works in databases.",
     "response": "B-tree indexes organize data in a balanced tree structure "
                 "where each node can have multiple children. Leaf nodes "
                 "contain pointers to the actual data rows. Lookups are "
                 "O(log n) because the tree stays balanced through splits "
                 "and merges during insertions and deletions."},
    {"instruction": "Do something.",
     "response": "Sure, I did something."},
]

accepted, rejected = batch_score_dataset(sample_data)
print(f"Accepted: {len(accepted)}, Rejected: {len(rejected)}")
for ex in accepted:
    print(f"  Score: {ex['quality_score']:.3f} | {ex['instruction'][:50]}...")
```
Quality scoring should be calibrated against human judgments before use at scale. Score a random sample of 100 to 200 examples with both the LLM judge and human annotators, then compute the correlation. If Spearman correlation falls below 0.65, revise your rubric or switch to a stronger judge model. Re-calibrate periodically as your data distribution changes.
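Calibration boils down to a rank correlation between judge scores and human scores on the same sample. A minimal, dependency-free sketch (the `judge` and `human` lists below are hypothetical; in practice you would use `scipy.stats.spearmanr` on your annotated sample):

```python
def rank(values: list[float]) -> list[float]:
    """Assign 1-based ranks, averaging ranks across ties."""
    indexed = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(indexed):
        j = i
        # Extend j over the run of tied values
        while j + 1 < len(indexed) and values[indexed[j + 1]] == values[indexed[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[indexed[k]] = avg
        i = j + 1
    return ranks


def spearman(x: list[float], y: list[float]) -> float:
    """Spearman correlation = Pearson correlation of the ranks."""
    rx, ry = rank(x), rank(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)


# Hypothetical calibration sample: LLM-judge composites vs. human 1-5 ratings
judge = [0.9, 0.4, 0.7, 0.2, 0.8]
human = [5, 2, 4, 1, 4]
rho = spearman(judge, human)
print(f"Spearman rho: {rho:.3f}  (revise the rubric if below 0.65)")
```

Spearman (rather than Pearson) is the right choice here because the judge's composite and the human ratings live on different scales; only their ordering needs to agree.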
2. Deduplication Strategies
Synthetic data generation frequently produces duplicates and near-duplicates. LLMs tend to generate similar responses to similar prompts, especially at lower temperatures. Even small amounts of duplication can distort the training distribution, causing the model to overfit on repeated patterns. Deduplication operates at three levels of granularity.
2.1 Three Levels of Deduplication
```python
import hashlib


def exact_dedup(examples: list[dict], key: str = "instruction") -> list[dict]:
    """Remove exact duplicates based on normalized text hash."""
    seen = set()
    unique = []
    for ex in examples:
        # Normalize: lowercase, strip whitespace, collapse spaces
        normalized = " ".join(ex[key].lower().split())
        text_hash = hashlib.sha256(normalized.encode()).hexdigest()
        if text_hash not in seen:
            seen.add(text_hash)
            unique.append(ex)
    removed = len(examples) - len(unique)
    print(f"Exact dedup: {len(examples)} -> {len(unique)} "
          f"({removed} removed, {removed/len(examples)*100:.1f}%)")
    return unique


def minhash_dedup(
    examples: list[dict],
    key: str = "instruction",
    num_perm: int = 128,
    threshold: float = 0.7,
) -> list[dict]:
    """Remove near-duplicates using MinHash with n-gram shingling."""
    from datasketch import MinHash, MinHashLSH

    lsh = MinHashLSH(threshold=threshold, num_perm=num_perm)
    minhashes = []

    # Create a MinHash for each example and index it in the LSH
    for i, ex in enumerate(examples):
        mh = MinHash(num_perm=num_perm)
        # 3-gram character shingles
        text = ex[key].lower()
        for j in range(len(text) - 2):
            shingle = text[j:j + 3]
            mh.update(shingle.encode("utf-8"))
        minhashes.append(mh)
        lsh.insert(f"doc_{i}", mh)  # Keys are unique, so insert cannot collide

    # Find clusters of similar documents
    keep_indices = set()
    processed = set()
    for i in range(len(examples)):
        if i in processed:
            continue
        similar = lsh.query(minhashes[i])
        cluster_indices = [int(s.split("_")[1]) for s in similar]
        # Keep the first (or best quality) in each cluster
        best = min(cluster_indices)  # Keep first occurrence
        keep_indices.add(best)
        processed.update(cluster_indices)

    unique = [examples[i] for i in sorted(keep_indices)]
    removed = len(examples) - len(unique)
    print(f"MinHash dedup: {len(examples)} -> {len(unique)} "
          f"({removed} removed, {removed/len(examples)*100:.1f}%)")
    return unique


def semantic_dedup(
    examples: list[dict],
    key: str = "instruction",
    threshold: float = 0.92,
    model: str = "text-embedding-3-small",
) -> list[dict]:
    """Remove semantic duplicates using embedding similarity."""
    import numpy as np

    # Get embeddings (reuses the OpenAI client from the scoring section)
    texts = [ex[key] for ex in examples]
    response = client.embeddings.create(model=model, input=texts)
    embeddings = np.array([e.embedding for e in response.data])

    # Normalize rows so the dot product equals cosine similarity
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    normalized = embeddings / norms

    # Find pairs above threshold, greedily keeping the earlier example
    similarity_matrix = normalized @ normalized.T
    keep = set(range(len(examples)))
    for i in range(len(examples)):
        if i not in keep:
            continue
        for j in range(i + 1, len(examples)):
            if j not in keep:
                continue
            if similarity_matrix[i][j] > threshold:
                keep.discard(j)  # Remove the later duplicate

    unique = [examples[i] for i in sorted(keep)]
    removed = len(examples) - len(unique)
    print(f"Semantic dedup: {len(examples)} -> {len(unique)} "
          f"({removed} removed, {removed/len(examples)*100:.1f}%)")
    return unique
```
Order your deduplication stages by cost. Always run exact dedup first (near zero cost), then MinHash near-dedup (cheap), and semantic dedup last (requires embedding API calls). Running semantic dedup on the full dataset before filtering exact duplicates wastes significant compute on comparisons that would have been caught by a simple hash check.
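That cheapest-first ordering can be expressed as a small driver that threads the survivors of each stage into the next. This is a sketch: `run_dedup_stages` and the toy `exact_stage` below are names introduced here for illustration, and in practice the stage list would hold the `exact_dedup`, `minhash_dedup`, and `semantic_dedup` functions defined above, in that order.

```python
from typing import Callable

DedupStage = Callable[[list[dict]], list[dict]]


def run_dedup_stages(
    examples: list[dict],
    stages: list[tuple[str, DedupStage]],
) -> list[dict]:
    """Run dedup stages cheapest-first; each stage sees only survivors."""
    for name, stage in stages:
        before = len(examples)
        examples = stage(examples)
        print(f"{name}: removed {before - len(examples)}, "
              f"{len(examples)} remain")
    return examples


def exact_stage(examples: list[dict]) -> list[dict]:
    """Toy exact-dedup stage: dedup on whitespace/case-normalized text."""
    seen, unique = set(), []
    for ex in examples:
        key = " ".join(ex["instruction"].lower().split())
        if key not in seen:
            seen.add(key)
            unique.append(ex)
    return unique


# Hypothetical input: the second item is an exact dup after normalization
data = [
    {"instruction": "Explain Docker"},
    {"instruction": "explain   docker"},
    {"instruction": "Explain Kubernetes"},
]
survivors = run_dedup_stages(data, [("exact", exact_stage)])
print(len(survivors))  # -> 2
```

Because each stage only receives the previous stage's survivors, the expensive semantic stage ends up embedding the smallest possible set of candidates.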
3. Multi-Dimensional Filtering
After deduplication, filtering removes examples that fail quality checks along specific dimensions: text length, language, toxicity, topic relevance, and formatting. Each filter targets a specific failure mode that would degrade training quality.
3.1 Filtering Pipeline
```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class FilterResult:
    passed: bool
    reason: str = ""


@dataclass
class FilterPipeline:
    """Configurable pipeline of data quality filters."""
    filters: list[tuple[str, Callable]] = field(default_factory=list)

    def add_filter(self, name: str, fn: Callable):
        self.filters.append((name, fn))
        return self  # Allow chaining

    def run(self, examples: list[dict]) -> tuple[list[dict], dict]:
        """Run all filters, returning accepted examples and statistics."""
        stats = {name: 0 for name, _ in self.filters}
        accepted = []
        for ex in examples:
            passed_all = True
            for name, fn in self.filters:
                result = fn(ex)
                if not result.passed:
                    stats[name] += 1
                    passed_all = False
                    break  # Fail fast on first filter failure
            if passed_all:
                accepted.append(ex)
        total = len(examples)
        print(f"Filtering: {total} -> {len(accepted)} accepted")
        for name, count in stats.items():
            print(f"  {name}: {count} removed ({count/total*100:.1f}%)")
        return accepted, stats


# Define individual filters
def length_filter(ex: dict, min_tokens: int = 20,
                  max_tokens: int = 2048) -> FilterResult:
    """Filter by response length in approximate tokens."""
    token_count = len(ex.get("response", "").split()) * 1.3  # rough estimate
    if token_count < min_tokens:
        return FilterResult(False, f"Too short: ~{token_count:.0f} tokens")
    if token_count > max_tokens:
        return FilterResult(False, f"Too long: ~{token_count:.0f} tokens")
    return FilterResult(True)


def quality_score_filter(ex: dict,
                         min_score: float = 0.6) -> FilterResult:
    """Filter by pre-computed quality score."""
    score = ex.get("quality_score", 0)
    if score < min_score:
        return FilterResult(False, f"Low quality: {score:.3f}")
    return FilterResult(True)


def repetition_filter(ex: dict,
                      max_repeat_ratio: float = 0.3) -> FilterResult:
    """Filter responses with excessive repeated phrases."""
    response = ex.get("response", "")
    words = response.lower().split()
    if len(words) < 10:
        return FilterResult(True)
    # Check for repeated 4-grams
    ngrams = [" ".join(words[i:i + 4]) for i in range(len(words) - 3)]
    counts = Counter(ngrams)
    max_count = max(counts.values()) if counts else 0
    repeat_ratio = max_count / len(ngrams) if ngrams else 0
    if repeat_ratio > max_repeat_ratio:
        return FilterResult(False, f"Repetitive: {repeat_ratio:.2f} ratio")
    return FilterResult(True)


# Build and run the pipeline
pipeline = FilterPipeline()
pipeline.add_filter("length", length_filter)
pipeline.add_filter("quality", quality_score_filter)
pipeline.add_filter("repetition", repetition_filter)

# Example run (would normally be on thousands of examples)
sample = [
    {"instruction": "Explain Docker", "response": "Docker is...",
     "quality_score": 0.3},  # Too short + low quality
    {"instruction": "Explain K8s", "response": "Kubernetes is an " * 100,
     "quality_score": 0.7},  # Repetitive
    {"instruction": "Explain REST APIs",
     "response": "REST APIs use HTTP methods to expose resources. "
                 "GET retrieves data, POST creates new resources, PUT updates "
                 "existing ones, and DELETE removes them. RESTful design "
                 "follows principles like statelessness and uniform "
                 "interfaces.",
     "quality_score": 0.85},  # Good
]

accepted, stats = pipeline.run(sample)
```
4. Argilla for Human-in-the-Loop Curation
Argilla is an open-source data curation platform designed specifically for NLP and LLM data workflows. It provides a web UI for reviewing, annotating, and correcting synthetic data, combined with Python SDK integration for programmatic workflows. Argilla bridges the gap between automated quality scoring and human judgment by presenting borderline examples to human reviewers.
```python
import argilla as rg

# Connect to a running Argilla instance (v2 SDK; default local credentials)
client = rg.Argilla(api_url="http://localhost:6900", api_key="argilla.apikey")

# Create a dataset for reviewing synthetic data quality
settings = rg.Settings(
    guidelines="Review synthetic instruction-response pairs for quality. "
               "Score each dimension 1-5 and flag any safety concerns.",
    fields=[
        rg.TextField(name="instruction", title="Instruction"),
        rg.TextField(name="response", title="Response"),
    ],
    questions=[
        rg.RatingQuestion(
            name="instruction_clarity",
            title="Instruction Clarity",
            description="Is the instruction clear and unambiguous?",
            values=[1, 2, 3, 4, 5],
        ),
        rg.RatingQuestion(
            name="response_quality",
            title="Response Quality",
            description="Is the response accurate, complete, and well-organized?",
            values=[1, 2, 3, 4, 5],
        ),
        rg.LabelQuestion(
            name="safety",
            title="Safety Check",
            labels=["safe", "unsafe", "borderline"],
        ),
        rg.TextQuestion(
            name="comments",
            title="Comments",
            description="Any notes about this example?",
            required=False,
        ),
    ],
    metadata=[
        rg.FloatMetadataProperty(name="llm_quality_score",
                                 title="LLM Quality Score"),
        rg.TermsMetadataProperty(name="source",
                                 title="Generation Source"),
    ],
)

dataset = rg.Dataset(name="synthetic_data_review",
                     settings=settings, client=client)
dataset.create()

# Upload synthetic examples for human review
records = [
    rg.Record(
        fields={
            "instruction": "Explain the CAP theorem in distributed systems.",
            "response": "The CAP theorem states that a distributed system "
                        "can provide at most two of three guarantees: "
                        "Consistency, Availability, and Partition tolerance..."
        },
        metadata={
            "llm_quality_score": 0.82,
            "source": "self-instruct",
        },
    ),
]
dataset.records.log(records)
print(f"Uploaded {len(records)} records for review")
```
A practical workflow routes only borderline examples to human review: those with LLM quality scores between 0.5 and 0.7. Examples above 0.7 are auto-accepted, and examples below 0.5 are auto-rejected. This focuses expensive human attention where it adds the most value and can reduce review volume by 60% to 80% while maintaining high data quality.
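That routing rule is a three-way partition on the LLM quality score. A minimal sketch (`triage` is a hypothetical helper introduced here; the thresholds and the `quality_score` key follow the scoring code earlier in this section):

```python
def triage(
    examples: list[dict],
    low: float = 0.5,
    high: float = 0.7,
) -> tuple[list[dict], list[dict], list[dict]]:
    """Partition by LLM quality score: auto-accept, human review, auto-reject."""
    auto_accept, needs_review, auto_reject = [], [], []
    for ex in examples:
        score = ex.get("quality_score", 0.0)
        if score >= high:
            auto_accept.append(ex)     # confident pass: skip human review
        elif score >= low:
            needs_review.append(ex)    # borderline: send to Argilla
        else:
            auto_reject.append(ex)     # confident fail: drop
    return auto_accept, needs_review, auto_reject


# Hypothetical batch of scored examples
batch = [{"quality_score": s} for s in (0.82, 0.63, 0.55, 0.41)]
acc, rev, rej = triage(batch)
print(len(acc), len(rev), len(rej))  # -> 1 2 1
```

Only the `needs_review` partition would be logged to the Argilla dataset above; the other two buckets bypass human review entirely.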
5. Distilabel for Production Pipelines
Distilabel is an open-source framework from Argilla (now part of Hugging Face) for building scalable synthetic data generation and curation pipelines. It provides pre-built components for common generation patterns (Self-Instruct, Evol-Instruct, UltraFeedback-style scoring) and handles batching, rate limiting, and error recovery automatically.
```python
from distilabel.llms import OpenAILLM
from distilabel.pipeline import Pipeline
from distilabel.steps import LoadDataFromDicts
from distilabel.steps.tasks import TextGeneration, UltraFeedback

# Seed data
seed_instructions = [
    {"instruction": "Explain how garbage collection works in Python."},
    {"instruction": "What are the tradeoffs between SQL and NoSQL databases?"},
    {"instruction": "Describe the observer pattern with a practical example."},
]

# Build a Distilabel pipeline for synthetic data generation + scoring
with Pipeline(name="synthetic-data-pipeline") as pipeline:
    # Step 0: Load the seed instructions
    load = LoadDataFromDicts(name="load_seeds", data=seed_instructions)

    # Step 1: Generate responses to seed instructions
    generate = TextGeneration(
        name="generate_response",
        llm=OpenAILLM(model="gpt-4o-mini"),
        system_prompt="You are a knowledgeable assistant. Provide detailed, "
                      "accurate responses to technical questions.",
        num_generations=3,  # Generate 3 candidates per instruction
    )

    # Step 2: Score the generated responses using UltraFeedback criteria
    score = UltraFeedback(
        name="quality_scoring",
        llm=OpenAILLM(model="gpt-4o"),
        aspect="overall-rating",  # Score on overall quality 1-5
    )

    # Wire the steps: load -> generate -> score
    load >> generate >> score

# In production, run with:
# distiset = pipeline.run()
# distiset["default"]["train"].to_pandas()           # Analyze in pandas
# distiset.push_to_hub("my-org/synthetic-dataset")   # Push to HF Hub

print("Pipeline configured with load + generate + score steps")
print(f"Seed instructions: {len(seed_instructions)}")
print(f"Expected outputs: {len(seed_instructions) * 3} candidates, scored")
```
Key Takeaways
- Quality scoring with LLM-as-judge evaluates each example on instruction clarity, response quality, alignment, complexity, and safety. Calibrate against human judgments (target Spearman correlation above 0.65) before scaling.
- Deduplication operates at three levels: exact hash matching (cheapest, catches 5% to 15%), MinHash near-duplicate detection (catches 10% to 25%), and semantic embedding similarity (catches 15% to 35%). Always run them in order from cheapest to most expensive.
- Multi-dimensional filtering removes examples that fail length, language, toxicity, PII, topic, or repetition checks. Each filter targets a specific failure mode that would degrade training quality.
- Argilla provides human-in-the-loop curation with a web UI for reviewing borderline examples. Routing only borderline cases (quality scores 0.5 to 0.7) to human review reduces volume by 60% to 80%.
- Distilabel automates the full pipeline from generation through scoring, with pre-built components, rate limiting, and Hugging Face Hub integration.
- Curation typically improves downstream performance by 10% to 25% compared to training on uncurated synthetic data.