Module 06 · Section 6.4

Data Curation at Scale

How to source, clean, deduplicate, and compose the trillions of tokens that teach a language model

The internet is humanity's attic: priceless treasures buried under mountains of junk. Data curation is the art of sorting through it all without accidentally teaching your model to sell cryptocurrency.

— A Weary Data Janitor
★ Big Picture

Data is the foundation of every LLM. While architecture and scaling get the headlines, data quality is often the single largest determinant of model quality. A well-curated 1T token dataset will produce a better model than a poorly filtered 10T token dataset. This section covers the full data curation pipeline: where the data comes from, how duplicates are removed, how quality is assessed, how domain proportions are balanced, and how to handle toxic or private content at web scale.

⚙ Prerequisites

This section is largely self-contained. Familiarity with tokenization from Module 02 helps for understanding token counts. The discussion of perplexity-based filtering assumes basic language model concepts from Section 1.1.

1. Pre-training Data Sources

Modern LLMs are trained on a mixture of data drawn from several broad categories. The composition of this mixture profoundly influences the model's strengths and weaknesses.

Source              Scale                    Quality             Key Datasets
Web crawl           Trillions of tokens      Variable (noisy)    Common Crawl, FineWeb, DCLM
Books               Tens of billions         High                Books3, Project Gutenberg
Code                Hundreds of billions     Medium to high      The Stack, GitHub
Scientific papers   Tens of billions         High                Semantic Scholar, arXiv
Wikipedia           ~4B tokens (English)     Very high           Wikipedia dumps
Curated web         Hundreds of billions     High (filtered)     RedPajama, Dolma

Common Crawl is by far the largest publicly available source, containing petabytes of raw HTML from billions of web pages collected since 2008. However, raw Common Crawl is overwhelmingly low quality: advertisements, boilerplate navigation text, spam, pornography, and machine-generated content dominate. Turning this raw crawl into useful training data requires a sophisticated curation pipeline.

2. The Data Curation Pipeline

[Figure: data curation pipeline. Stages: Web Crawl (~100 TB raw) → Text Extract (trafilatura, ~20 TB text) → Dedup (MinHash LSH, ~12 TB unique) → Quality Filter (heuristic + ML, ~5 TB) → Safety Filter (toxicity + PII, ~4 TB) → Domain Mix (proportions, ~3 TB final). Typical reduction: raw crawl is filtered down to ~3% of original volume.]
Figure 6.4.1: A typical data curation pipeline showing the progressive filtering from raw web crawl to final training corpus.

3. Text Extraction

Raw HTML must be converted to clean text before any further processing. This is harder than it sounds. Web pages contain navigation menus, sidebars, advertisements, cookie banners, JavaScript artifacts, and boilerplate footers that vastly outnumber the actual content. Tools like trafilatura and resiliparse use structural heuristics to identify the main content block and strip everything else. The FineWeb project demonstrated that switching from the simple jusText extractor to trafilatura produced measurably better downstream performance.
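trafilatura's actual heuristics are considerably more sophisticated, but the structural idea can be sketched with the standard library's HTMLParser. The MainTextExtractor class, the set of skipped tags, and the sample HTML below are all invented for illustration:

```python
from html.parser import HTMLParser

class MainTextExtractor(HTMLParser):
    """Toy extractor: keep text inside <p> tags, skip boilerplate elements."""
    SKIP = {"script", "style", "nav", "header", "footer", "aside"}

    def __init__(self):
        super().__init__()
        self.skip_depth = 0       # nesting depth inside skipped elements
        self.in_paragraph = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1
        elif tag == "p":
            self.in_paragraph = True

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth > 0:
            self.skip_depth -= 1
        elif tag == "p":
            self.in_paragraph = False

    def handle_data(self, data):
        # Keep text only when inside <p> and outside any skipped element
        if self.in_paragraph and self.skip_depth == 0:
            text = data.strip()
            if text:
                self.chunks.append(text)

html = """
<html><body>
  <nav><p>Home | About | Contact</p></nav>
  <p>Transformers process tokens in parallel.</p>
  <footer><p>Copyright 2024</p></footer>
</body></html>
"""
parser = MainTextExtractor()
parser.feed(html)
print(" ".join(parser.chunks))  # → Transformers process tokens in parallel.
```

Real extractors score candidate blocks by text density, link density, and tag context rather than whitelisting a single tag, which is what lets them survive the messy markup of actual web pages.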

4. Deduplication

Web crawl data is extraordinarily redundant. The same news article, boilerplate legal text, or copied content may appear thousands of times. Duplicate data wastes training compute, biases the model toward overrepresented content, and increases memorization risk. Deduplication operates at three levels of granularity.

Exact Deduplication

The simplest approach: compute a hash (MD5, SHA-256) of each document and discard exact duplicates. This is fast but misses near-duplicates that differ by a few characters (timestamps, bylines, formatting).
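A minimal sketch of exact deduplication via content hashing (the exact_dedup helper and sample corpus are hypothetical):

```python
import hashlib

def exact_dedup(documents):
    """Keep the first occurrence of each exact document, drop the rest."""
    seen, unique = set(), []
    for doc in documents:
        # Hash after trivial whitespace normalization
        digest = hashlib.sha256(doc.strip().encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique

corpus = [
    "Breaking news: a model was trained today.",
    "Breaking news: a model was trained today.",    # exact duplicate
    "  Breaking news: a model was trained today.",  # whitespace-only difference
    "An entirely different article.",
]
print(len(exact_dedup(corpus)))  # → 2
```

Note that even the whitespace normalization here is a judgment call: anything beyond byte-exact matching (lowercasing, punctuation stripping) starts to blur into near-duplicate detection.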

Near-Duplicate Detection with MinHash

MinHash with Locality-Sensitive Hashing (LSH) is the standard technique for finding near-duplicate documents at scale. The core idea: represent each document as a set of n-grams, compute a compact signature (MinHash), and use LSH to efficiently find documents with high Jaccard similarity.

import hashlib
from collections import defaultdict

def get_ngrams(text, n=5):
    """Extract word n-grams (shingles) from text."""
    words = text.lower().split()
    return set(
        " ".join(words[i:i+n])
        for i in range(len(words) - n + 1)
    )

def minhash_signature(ngrams, num_hashes=128):
    """Compute MinHash signature for a set of n-grams."""
    signature = []
    for i in range(num_hashes):
        min_hash = float('inf')
        for ngram in ngrams:
            # Hash with a different seed for each hash function
            h = int(hashlib.sha256(
                f"{i}:{ngram}".encode()
            ).hexdigest(), 16)
            min_hash = min(min_hash, h)
        signature.append(min_hash)
    return signature

def lsh_buckets(signature, bands=16):
    """Split signature into bands for LSH bucketing."""
    rows_per_band = len(signature) // bands
    buckets = []
    for b in range(bands):
        start = b * rows_per_band
        band_hash = hash(tuple(signature[start:start + rows_per_band]))
        buckets.append((b, band_hash))
    return buckets

# Example: find near-duplicates
docs = [
    "The quick brown fox jumps over the lazy dog in the park",
    "The quick brown fox jumps over a lazy dog in the park",  # near-dup
    "Machine learning models require large amounts of data",
]

bucket_index = defaultdict(list)
for doc_id, doc in enumerate(docs):
    ngrams = get_ngrams(doc, n=3)
    sig = minhash_signature(ngrams, num_hashes=64)
    # Use many narrow 2-row bands so this tiny demo reliably catches
    # the near-duplicate pair (Jaccard ≈ 0.54 on word trigrams)
    for bucket in lsh_buckets(sig, bands=32):
        bucket_index[bucket].append(doc_id)

# Find candidate pairs that share a bucket
candidates = set()
for docs_in_bucket in bucket_index.values():
    if len(docs_in_bucket) > 1:
        for i in range(len(docs_in_bucket)):
            for j in range(i+1, len(docs_in_bucket)):
                candidates.add((docs_in_bucket[i], docs_in_bucket[j]))
print(f"Near-duplicate candidates: {candidates}")
Near-duplicate candidates: {(0, 1)}

Substring-Level Deduplication

Document-level deduplication misses repeated paragraphs that appear across otherwise unique documents (e.g., license headers, terms of service). Substring deduplication uses suffix arrays to find repeated sequences of n or more tokens that appear in multiple documents, then removes all but one occurrence. The RefinedWeb and DCLM datasets demonstrated that substring deduplication consistently improves model quality.
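A toy illustration of the suffix idea, using a sorted list of token suffixes in place of a true suffix array (the repeated_token_spans helper and example documents are invented; production systems operate on token IDs over terabytes of text):

```python
def repeated_token_spans(docs, min_len=4):
    """Find token sequences of length >= min_len that occur in more than
    one document, via a sorted suffix list (a stand-in for a suffix array)."""
    # Collect (token_suffix, doc_id) pairs for every position in every doc
    suffixes = []
    for doc_id, doc in enumerate(docs):
        tokens = doc.split()
        for i in range(len(tokens)):
            suffixes.append((tokens[i:], doc_id))
    suffixes.sort(key=lambda s: s[0])

    repeats = set()
    # After sorting, suffixes sharing long prefixes sit next to each other
    for (a, doc_a), (b, doc_b) in zip(suffixes, suffixes[1:]):
        if doc_a == doc_b:
            continue  # only cross-document repeats matter here
        lcp = 0
        while lcp < min(len(a), len(b)) and a[lcp] == b[lcp]:
            lcp += 1
        if lcp >= min_len:
            repeats.add(" ".join(a[:lcp]))
    return repeats

docs = [
    "This article is licensed under a Creative Commons license . Cats purr .",
    "Dogs bark . This article is licensed under a Creative Commons license .",
]
spans = repeated_token_spans(docs)
print(max(spans, key=len))  # → This article is licensed under a Creative Commons license .
```

A real implementation builds the suffix array in linear time, computes LCP arrays instead of pairwise scans, and removes all but one occurrence of each repeated span from the corpus.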

5. Quality Filtering

⚡ Key Insight

97% of the internet is not worth training on. A typical web crawl starts at 100+ TB of raw HTML. After deduplication, quality filtering, and domain mixing, the final training dataset is often 3 TB or less. The vast majority of web content is boilerplate, spam, near-duplicates, or low-quality text that would degrade model performance.

After deduplication, the remaining text still varies enormously in quality. Quality filtering separates informative, well-written content from spam, gibberish, and low-effort text. Three complementary strategies are commonly used together.

Heuristic Filters

Rule-based filters are fast and interpretable. Common heuristics include removing documents that are too short (under 100 words), have excessive punctuation or capitalization ratios, contain too many URLs or special characters, have an abnormally low alphabetic character ratio, or have too many repeated lines or paragraphs.

Perplexity-Based Filtering

A small language model (often a KenLM n-gram model trained on Wikipedia) is used to score each document's perplexity. Documents with very high perplexity (incoherent text) or very low perplexity (repetitive boilerplate) are discarded. The CCNet pipeline introduced this approach and demonstrated significant quality improvements.
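A minimal sketch of the idea, with an add-one-smoothed unigram model standing in for KenLM (the helper names, reference text, and sample documents are invented):

```python
import math
from collections import Counter

def train_unigram(reference_text):
    """Fit an add-one-smoothed unigram model; a toy stand-in for the
    KenLM n-gram models used by pipelines such as CCNet."""
    counts = Counter(reference_text.lower().split())
    total = sum(counts.values())
    vocab = len(counts)
    # Add-one smoothing so unseen words get nonzero probability
    return lambda w: (counts[w] + 1) / (total + vocab)

def perplexity(prob, text):
    """Per-word perplexity of text under the model."""
    words = text.lower().split()
    log_prob = sum(math.log(prob(w)) for w in words)
    return math.exp(-log_prob / max(len(words), 1))

reference = "the model learns patterns from data and the data teaches the model"
prob = train_unigram(reference)

coherent = "the model learns from the data"
gibberish = "zxq fjord quux blarg wibble snork"
print(f"coherent:  {perplexity(prob, coherent):5.1f}")
print(f"gibberish: {perplexity(prob, gibberish):5.1f}")
```

The gibberish document scores the maximum possible perplexity under this model, while the coherent one scores much lower; a real pipeline would discard both tails of the score distribution, since extremely low perplexity usually signals repetitive boilerplate.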

Classifier-Based Filtering

A binary classifier is trained to distinguish "high-quality" text (e.g., Wikipedia, books) from "low-quality" text (random web samples). The FineWeb-Edu dataset used a quality classifier trained on educational-content annotations to produce a subset of FineWeb specifically optimized for knowledge-intensive tasks. DCLM used a fastText classifier trained on curated reference text to score documents on a quality scale.
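The classifier approach can be sketched with a hashed bag-of-words logistic regression, a rough pure-Python stand-in for fastText (all function names, hyperparameters, and training sentences below are invented for illustration):

```python
import hashlib
import math

def features(text, dim=1024):
    """Hashed bag-of-words vector (the same 'hashing trick' fastText uses)."""
    vec = [0.0] * dim
    for word in text.lower().split():
        h = int(hashlib.md5(word.encode()).hexdigest(), 16) % dim
        vec[h] += 1.0
    return vec

def train_classifier(pos, neg, dim=1024, epochs=50, lr=0.5):
    """Logistic regression trained with plain SGD on hashed features."""
    w, b = [0.0] * dim, 0.0
    data = [(features(t, dim), 1.0) for t in pos] + \
           [(features(t, dim), 0.0) for t in neg]
    for _ in range(epochs):
        for x, y in data:
            z = max(min(b + sum(wi * xi for wi, xi in zip(w, x)), 30.0), -30.0)
            p = 1.0 / (1.0 + math.exp(-z))
            g = p - y  # gradient of log-loss w.r.t. the logit
            b -= lr * g
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
    return w, b

def quality_score(text, w, b, dim=1024):
    z = b + sum(wi * xi for wi, xi in zip(w, features(text, dim)))
    return 1.0 / (1.0 + math.exp(-max(min(z, 30.0), -30.0)))

# Tiny made-up training sets standing in for reference vs. random web text
high_quality = [
    "the history of mathematics spans thousands of years",
    "photosynthesis converts light energy into chemical energy",
    "the treaty was signed after lengthy negotiations",
]
low_quality = [
    "click here to win a free prize now",
    "buy cheap pills online no prescription needed",
    "congratulations you have been selected to win",
]
w, b = train_classifier(high_quality, low_quality)

good = quality_score("the history of the treaty spans many years", w, b)
spam = quality_score("click now to win a free prize online", w, b)
print(f"encyclopedic text: {good:.3f}, spammy text: {spam:.3f}")
```

In production the score is used as a soft signal: documents are kept with probability proportional to the score, or bucketed into quality tiers, rather than filtered with a hard threshold.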

⚡ Key Insight

Quality filtering is the highest-leverage intervention in data curation. The FineWeb project showed that aggressive quality filtering on Common Crawl can match the performance of curated datasets like C4 and The Pile, despite starting from much noisier raw material. The DCLM project demonstrated that a well-trained quality classifier can improve benchmark performance by several percentage points over heuristic filtering alone.

# Minimal quality filtering pipeline
import re
from collections import Counter

def heuristic_quality_filter(doc: str) -> dict:
    """Apply heuristic quality filters to a document."""
    words = doc.split()
    lines = doc.strip().split("\n")
    chars = len(doc)
    word_count = len(words)

    # Length check
    if word_count < 50:
        return {"pass": False, "reason": "too_short"}

    # Alphabetic character ratio
    alpha_ratio = sum(c.isalpha() for c in doc) / max(chars, 1)
    if alpha_ratio < 0.6:
        return {"pass": False, "reason": "low_alpha"}

    # Repeated line ratio (boilerplate detection)
    line_counts = Counter(lines)
    repeated = sum(c - 1 for c in line_counts.values() if c > 1)
    if repeated / max(len(lines), 1) > 0.3:
        return {"pass": False, "reason": "repetitive"}

    # URL density (spam detection)
    url_count = len(re.findall(r"https?://", doc))
    if url_count / max(word_count, 1) > 0.1:
        return {"pass": False, "reason": "url_heavy"}

    return {"pass": True, "words": word_count, "alpha": round(alpha_ratio, 3)}

# Test on sample documents
samples = [
    ("Good article", "The transformer architecture has revolutionized NLP by enabling parallel attention. " * 8),
    ("Too short", "Click here now"),
    ("Spam", "Visit https://a.com and https://b.com and https://c.com now " * 5),
    ("Repetitive", ("Buy now!\n" * 20) + "Some padding text to reach minimum length for a valid document check."),
]
for label, doc in samples:
    result = heuristic_quality_filter(doc)
    print(f"{label:>15}: {result}")
   Good article: {'pass': True, 'words': 80, 'alpha': 0.869}
      Too short: {'pass': False, 'reason': 'too_short'}
           Spam: {'pass': False, 'reason': 'url_heavy'}
     Repetitive: {'pass': False, 'reason': 'repetitive'}
📰 Paper Spotlight: FineWeb (Penedo et al., 2024)

The FineWeb dataset from Hugging Face represents the state of the art in open data curation. Starting from 96 Common Crawl snapshots (over 100 TB of raw HTML), the team applied URL filtering, language identification, MinHash deduplication, and quality scoring to produce a 15 trillion token English corpus. Its educational subset, FineWeb-Edu, further classifies documents by educational value using a classifier trained on LLM annotations. Models trained on FineWeb-Edu outperform those trained on full FineWeb by 2 to 4 points on knowledge benchmarks.

6. Data Mixing and Domain Proportions

The final training corpus is typically a weighted mixture of data from different domains. The proportions of this mixture significantly affect the model's capabilities. For instance, increasing the fraction of code data improves reasoning and structured output abilities. Increasing the fraction of scientific text improves factual knowledge.

The optimal mixing proportions are typically found through ablation experiments on smaller models. DoReMi (Xie et al., 2023) proposed an automated approach: train a small proxy model with uniform mixing, then use the distribution of training loss across domains to reweight the mixture, upsampling domains where the model struggles and downsampling domains that are already well-learned.

# Simplified domain mixing with weighted sampling
import numpy as np

# Domain proportions (must sum to 1.0)
domain_weights = {
    "web":         0.55,
    "code":        0.15,
    "books":       0.10,
    "wikipedia":   0.05,
    "scientific":  0.08,
    "math":        0.04,
    "conversation": 0.03,
}

def sample_batch(domain_weights, batch_size=1024):
    """Sample a training batch according to domain proportions."""
    domains = list(domain_weights.keys())
    probs = list(domain_weights.values())
    # Each sample in the batch comes from a domain
    batch_domains = np.random.choice(domains, size=batch_size, p=probs)
    counts = {d: int((batch_domains == d).sum()) for d in domains}
    return counts

batch = sample_batch(domain_weights)
for domain, count in sorted(batch.items(), key=lambda x: -x[1]):
    bar = "#" * (count // 10)
    print(f"  {domain:<15} {count:4d} samples  {bar}")
  web             563 samples  ########################################################
  code            154 samples  ###############
  books           100 samples  ##########
  scientific       82 samples  ########
  wikipedia        50 samples  #####
  conversation     42 samples  ####
  math             33 samples  ###
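The DoReMi-style reweighting described above can be sketched as a multiplicative-weights update (the step size, domain weights, and excess-loss numbers are hypothetical; the actual method trains a proxy model under a minimax objective against a reference model):

```python
import math

def doremi_reweight(weights, excess_loss, step=0.5):
    """One multiplicative-weights update in the spirit of DoReMi:
    upweight domains where the proxy model's excess loss is high,
    then renormalize so the mixture sums to 1."""
    updated = {d: w * math.exp(step * excess_loss[d]) for d, w in weights.items()}
    total = sum(updated.values())
    return {d: w / total for d, w in updated.items()}

weights = {"web": 0.60, "code": 0.20, "books": 0.20}
# Hypothetical per-domain excess loss of the proxy model vs. a reference model
excess_loss = {"web": 0.1, "code": 0.8, "books": 0.3}

new_weights = doremi_reweight(weights, excess_loss)
for d in weights:
    print(f"{d:<6} {weights[d]:.2f} -> {new_weights[d]:.2f}")
```

Here the code domain, where the proxy model lags most, gains weight at the expense of web text that is already well-learned.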

7. Toxicity and PII Removal

Pre-training data must be filtered for toxic content (hate speech, explicit material, harassment) and personally identifiable information (PII) such as phone numbers, email addresses, and social security numbers. Toxicity classifiers like the Jigsaw Perspective API or custom fastText models flag documents above a toxicity threshold. PII removal typically uses regex patterns for structured identifiers combined with named entity recognition for names and addresses.
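A minimal regex-based scrubber for structured identifiers might look like the following (the patterns are illustrative and far from production-grade; real pipelines use much broader rule sets plus NER for names and addresses):

```python
import re

# Illustrative patterns only; production systems cover many more formats
PII_PATTERNS = {
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "PHONE": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
    "SSN":   re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def scrub_pii(text):
    """Replace matched PII spans with typed placeholder tokens."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

doc = "Contact jane.doe@example.com or call 555-867-5309. SSN 123-45-6789."
print(scrub_pii(doc))
# → Contact [EMAIL] or call [PHONE]. SSN [SSN].
```

Replacing PII with typed placeholders, rather than deleting it, preserves sentence structure so the surrounding text remains usable for training.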

⚠ Tension: Safety vs. Capability

Overly aggressive toxicity filtering can remove legitimate content about sensitive topics (medical discussions, legal cases, historical events) and hurt the model's ability to understand and reason about these subjects. Most pipelines use a threshold rather than a binary filter, removing only the most toxic content while preserving borderline cases.

8. Data Pruning and Influence Functions

Beyond filtering for quality, recent research explores selecting the most informative training examples. Data pruning removes redundant or uninformative samples to train on a smaller, higher-quality subset without sacrificing performance.

Influence functions provide a principled approach: they estimate how much each training example contributes to the model's performance on a validation set. Formally, the influence of training example z_i on the loss at validation point z_test is:

I(z_i, z_test) = −∇_θ L(z_test)ᵀ H_θ⁻¹ ∇_θ L(z_i)

where H_θ is the Hessian of the training loss. Computing exact influence functions is prohibitively expensive for large models (the Hessian has N² entries for a model with N parameters), so practical approaches use approximations such as LiSSA (Linear time Stochastic Second-Order Algorithm) or track gradient statistics during training as proxies for influence.
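The gradient-statistics proxy can be illustrated by dropping the inverse Hessian (replacing it with the identity), which reduces influence to a dot product of gradients. The toy logistic-regression weights and data below are entirely made up:

```python
import numpy as np

def logistic_grad(w, x, y):
    """Per-example gradient of log-loss for logistic regression."""
    p = 1.0 / (1.0 + np.exp(-(x @ w)))
    return (p - y) * x

# Toy model and data (all numbers invented for illustration)
w = np.array([0.5, -0.25])
train = [
    (np.array([2.0, 0.5]), 1.0),   # gradient aligned with the validation point
    (np.array([-1.5, 0.2]), 1.0),  # gradient conflicts with it
    (np.array([0.1, -0.1]), 0.0),
]
val_x, val_y = np.array([1.8, 0.4]), 1.0

# First-order proxy: I ≈ -grad L(z_test) · grad L(z_i), i.e. the influence
# formula with the identity in place of H^-1. Negative values mean
# upweighting z_i is expected to *reduce* validation loss (helpful example).
g_val = logistic_grad(w, val_x, val_y)
influences = [-(g_val @ logistic_grad(w, x, y)) for x, y in train]
for i, inf in enumerate(influences):
    print(f"train example {i}: influence proxy {inf:+.4f}")
```

Gradient-dot-product scores of this kind can be accumulated cheaply during training, which is why they are a popular substitute for exact influence at LLM scale.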

9. Synthetic Data for Pre-training

As high-quality natural text becomes scarce, synthetic data generated by LLMs themselves has become an increasingly important part of the pre-training pipeline. The key insight is that a capable model can generate high-quality training data for a less capable model (or even for future training stages of itself).

Microsoft's Phi-3 and Phi-4 models demonstrated this powerfully: they were trained substantially on synthetic "textbook-quality" data generated by larger models. Despite being small (3.8B parameters), Phi-3 rivaled models many times its size on reasoning benchmarks, largely because its training data was exceptionally high quality. The key was generating structured, educational content (explanations, worked examples, reasoning chains) rather than simply paraphrasing web text.

Quality control for synthetic data is critical. Without careful filtering, model-generated text can amplify biases, introduce subtle errors, or cause distribution collapse (where the synthetic data converges to a narrow mode of the generating model). Effective pipelines therefore apply the same deduplication and quality filters to synthetic text that they apply to web data, and check generated content for diversity and factual consistency before including it.

🌱 Open Problem: The Data Wall

Epoch AI projects that high-quality text data available on the public internet will be effectively exhausted by 2026 to 2028. Current large models already train on significant fractions of all available web text. This "data wall" is driving three parallel responses: (1) synthetic data generation at scale, (2) multimodal training that incorporates images, video, and audio alongside text, and (3) improved data efficiency through better curation and deduplication. Legal challenges around copyrighted training data add further pressure. How the field navigates this constraint will shape the next generation of LLMs.

10. Major Open Datasets

Dataset         Size               Key Feature
The Pile        825 GB             22 diverse subsets, academic focus
RedPajama v2    30T tokens (raw)   Open reproduction of LLaMA data
FineWeb         15T tokens         Best open Common Crawl processing
FineWeb-Edu     1.3T tokens        Educational quality subset
DCLM            4T tokens          Classifier-curated, strong baselines
Dolma           3T tokens          Open, for OLMo model family
The Stack v2    67.5 TB code       Permissively licensed source code

Check Your Understanding

1. Why is deduplication critical for LLM training quality?
Duplicate data wastes training compute by showing the model the same content multiple times without new information. It also biases the model toward memorizing overrepresented text (especially templated content like legal boilerplate), increases the risk of verbatim memorization (a privacy and copyright concern), and distorts the effective domain mixture. Deduplication at both document and substring levels consistently improves downstream benchmark performance.
2. How does MinHash approximate Jaccard similarity?
MinHash exploits the mathematical property that the probability of two sets having the same minimum hash value under a random hash function equals their Jaccard similarity (|A ∩ B| / |A ∪ B|). By applying many independent hash functions and recording the minimum hash for each, you get a compact signature. The fraction of matching entries between two signatures estimates their Jaccard similarity. LSH then groups documents into buckets by banding the signature, so only documents with high similarity end up in the same bucket, avoiding the need for all-pairs comparison.
3. Explain the tradeoff between aggressive quality filtering and data diversity.
Aggressive quality filtering (e.g., keeping only Wikipedia-like text) produces very clean data but reduces diversity. The model may become excellent at formal, encyclopedic prose but struggle with informal text, dialogue, slang, technical jargon, or code. It can also introduce bias by favoring well-represented languages, topics, and perspectives. The best pipelines use tiered filtering: remove clearly low-quality content with heuristics, then apply softer quality scoring rather than hard thresholds, preserving a range of styles and topics.
4. Why are influence functions impractical to compute exactly for modern LLMs?
Influence functions require computing the inverse Hessian matrix of the training loss, which has dimensions N × N where N is the number of model parameters. For a 7B parameter model, this matrix would have 49 × 10¹⁸ entries, requiring exabytes of memory. Even the Hessian-vector products needed for iterative approximations are expensive at this scale. Practical alternatives use gradient similarity proxies, track loss statistics during training, or apply influence function approximations to smaller proxy models and transfer the findings.

Key Takeaways