The internet is humanity's attic: priceless treasures buried under mountains of junk. Data curation is the art of sorting through it all without accidentally teaching your model to sell cryptocurrency.
— A Weary Data Janitor

Data is the foundation of every LLM. While architecture and scaling get the headlines, data quality is often the single largest determinant of model quality. A well-curated 1T-token dataset can produce a better model than a poorly filtered 10T-token one. This section covers the full data curation pipeline: where the data comes from, how duplicates are removed, how quality is assessed, how domain proportions are balanced, and how to handle toxic or private content at web scale.
This section is largely self-contained. Familiarity with tokenization from Module 02 helps for understanding token counts. The discussion of perplexity-based filtering assumes basic language model concepts from Section 1.1.
1. Pre-training Data Sources
Modern LLMs are trained on a mixture of data drawn from several broad categories. The composition of this mixture profoundly influences the model's strengths and weaknesses.
| Source | Scale | Quality | Key Datasets |
|---|---|---|---|
| Web crawl | Trillions of tokens | Variable (noisy) | Common Crawl, FineWeb, DCLM |
| Books | Tens of billions | High | Books3, Project Gutenberg |
| Code | Hundreds of billions | Medium to high | The Stack, GitHub |
| Scientific papers | Tens of billions | High | Semantic Scholar, arXiv |
| Wikipedia | ~4B tokens (English) | Very high | Wikipedia dumps |
| Curated web | Hundreds of billions | High (filtered) | RedPajama, Dolma |
Common Crawl is by far the largest publicly available source, containing petabytes of raw HTML from billions of web pages collected since 2008. However, raw Common Crawl is overwhelmingly low quality: advertisements, boilerplate navigation text, spam, pornography, and machine-generated content dominate. Turning this raw crawl into useful training data requires a sophisticated curation pipeline.
2. The Data Curation Pipeline
3. Text Extraction
Raw HTML must be converted to clean text before any further processing. This is harder than it sounds. Web pages contain navigation menus, sidebars, advertisements, cookie banners, JavaScript artifacts, and boilerplate footers that vastly outnumber the actual content. Tools like trafilatura and resiliparse use structural heuristics to identify the main content block and strip everything else. The FineWeb project demonstrated that switching from the simple jusText extractor to trafilatura produced measurably better downstream performance.
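The core idea can be illustrated with a toy extractor built on Python's standard-library HTMLParser. Real tools like trafilatura use far richer structural and text-density heuristics; the tag list and minimum-word threshold below are illustrative assumptions, not any tool's actual rules.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Toy main-content extractor: collect text outside boilerplate tags."""
    SKIP_TAGS = {"script", "style", "nav", "header", "footer", "aside"}

    def __init__(self):
        super().__init__()
        self.depth_skipped = 0  # how many skip-tags we are currently inside
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self.depth_skipped += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self.depth_skipped > 0:
            self.depth_skipped -= 1

    def handle_data(self, data):
        if self.depth_skipped == 0 and data.strip():
            self.chunks.append(data.strip())

def extract_text(html, min_words=5):
    """Keep only chunks long enough to look like prose, not link text."""
    parser = TextExtractor()
    parser.feed(html)
    return "\n".join(c for c in parser.chunks if len(c.split()) >= min_words)

html = """<html><head><style>p{color:red}</style></head><body>
<nav><a href="/">Home</a> <a href="/about">About</a></nav>
<p>This article explains how web text is extracted for training corpora.</p>
<footer>Copyright 2024 Example Inc.</footer></body></html>"""
print(extract_text(html))
```

The navigation links, stylesheet, and footer are dropped; only the article sentence survives. Production extractors additionally score blocks by link density and text length rather than relying on fixed tag names.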
4. Deduplication
Web crawl data is extraordinarily redundant. The same news article, boilerplate legal text, or copied content may appear thousands of times. Duplicate data wastes training compute, biases the model toward overrepresented content, and increases memorization risk. Deduplication operates at three levels of granularity.
Exact Deduplication
The simplest approach: compute a hash (MD5, SHA-256) of each document and discard exact duplicates. This is fast but misses near-duplicates that differ by a few characters (timestamps, bylines, formatting).
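A minimal sketch of exact deduplication (the whitespace normalization step is an illustrative addition, not part of any particular pipeline):

```python
import hashlib

def exact_dedup(docs):
    """Keep the first occurrence of each distinct document, drop exact repeats."""
    seen = set()
    kept = []
    for doc in docs:
        # Normalize whitespace so trivially reformatted copies also collapse
        digest = hashlib.sha256(" ".join(doc.split()).encode()).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept

docs = [
    "The model was trained on web text.",
    "The model  was trained on web text.",  # same after normalization
    "An entirely different document.",
]
print(len(exact_dedup(docs)), "documents kept")
```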
Near-Duplicate Detection with MinHash
MinHash with Locality-Sensitive Hashing (LSH) is the standard technique for finding near-duplicate documents at scale. The core idea: represent each document as a set of n-grams, compute a compact signature (MinHash), and use LSH to efficiently find documents with high Jaccard similarity.
```python
import hashlib
from collections import defaultdict

def get_ngrams(text, n=5):
    """Extract word n-grams from text."""
    words = text.lower().split()
    return set(
        " ".join(words[i:i+n])
        for i in range(len(words) - n + 1)
    )

def minhash_signature(ngrams, num_hashes=128):
    """Compute MinHash signature for a set of n-grams."""
    signature = []
    for i in range(num_hashes):
        min_hash = float('inf')
        for ngram in ngrams:
            # Hash with a different seed for each hash function
            h = int(hashlib.sha256(
                f"{i}:{ngram}".encode()
            ).hexdigest(), 16)
            min_hash = min(min_hash, h)
        signature.append(min_hash)
    return signature

def lsh_buckets(signature, bands=16):
    """Split signature into bands for LSH bucketing."""
    rows_per_band = len(signature) // bands
    buckets = []
    for b in range(bands):
        start = b * rows_per_band
        band_hash = hash(tuple(signature[start:start + rows_per_band]))
        buckets.append((b, band_hash))
    return buckets

# Example: find near-duplicates
docs = [
    "The quick brown fox jumps over the lazy dog in the park",
    "The quick brown fox jumps over a lazy dog in the park",  # near-dup
    "Machine learning models require large amounts of data",
]
bucket_index = defaultdict(list)
for doc_id, doc in enumerate(docs):
    ngrams = get_ngrams(doc, n=3)
    sig = minhash_signature(ngrams, num_hashes=64)
    for bucket in lsh_buckets(sig, bands=8):
        bucket_index[bucket].append(doc_id)

# Find candidate pairs that share a bucket
candidates = set()
for docs_in_bucket in bucket_index.values():
    if len(docs_in_bucket) > 1:
        for i in range(len(docs_in_bucket)):
            for j in range(i + 1, len(docs_in_bucket)):
                candidates.add((docs_in_bucket[i], docs_in_bucket[j]))

print(f"Near-duplicate candidates: {candidates}")
```
Substring-Level Deduplication
Document-level deduplication misses repeated paragraphs that appear across otherwise unique documents (e.g., license headers, terms of service). Substring deduplication uses suffix arrays to find repeated sequences of n or more tokens that appear in multiple documents, then removes all but one occurrence. The RefinedWeb and DCLM datasets demonstrated that substring deduplication consistently improves model quality.
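Real substring deduplication builds a suffix array over the entire corpus to find maximal repeated sequences. As a simplified stand-in, hashing fixed-length token n-grams illustrates how repeated spans such as license headers surface across otherwise unique documents (the span length and example text here are illustrative):

```python
from collections import defaultdict

def repeated_spans(docs, n=8):
    """Find token n-grams that occur in more than one document.

    Suffix-array pipelines (e.g. RefinedWeb) find maximal repeated
    substrings; indexing fixed-length n-grams is a simplified stand-in.
    """
    index = defaultdict(set)
    for doc_id, doc in enumerate(docs):
        tokens = doc.split()
        for i in range(len(tokens) - n + 1):
            index[tuple(tokens[i:i + n])].add(doc_id)
    return {" ".join(g) for g, ids in index.items() if len(ids) > 1}

license_header = "Licensed under the Apache License Version 2.0 see LICENSE for details"
docs = [
    license_header + " . This file implements the tokenizer used in training.",
    license_header + " . A completely unrelated module for data loading.",
]
shared = repeated_spans(docs, n=8)
print(len(shared), "repeated 8-gram spans found")
```

In a real pipeline, all but one occurrence of each repeated span would then be cut out of the surviving documents.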
5. Quality Filtering
Key Insight: 97% of the internet is not worth training on. A typical web crawl starts at 100+ TB of raw HTML. After deduplication, quality filtering, and domain mixing, the final training dataset is often 3 TB or less. The vast majority of web content is boilerplate, spam, near-duplicates, or low-quality text that would degrade model performance.
After deduplication, the remaining text still varies enormously in quality. Quality filtering separates informative, well-written content from spam, gibberish, and low-effort text. Three complementary strategies are commonly used together.
Heuristic Filters
Rule-based filters are fast and interpretable. Common heuristics include removing documents that are too short (under 100 words), have excessive punctuation or capitalization ratios, contain too many URLs or special characters, have an abnormally low alphabetic character ratio, or have too many repeated lines or paragraphs.
Perplexity-Based Filtering
A small language model (often a KenLM n-gram model trained on Wikipedia) is used to score each document's perplexity. Documents with very high perplexity (incoherent text) or very low perplexity (repetitive boilerplate) are discarded. The CCNet pipeline introduced this approach and demonstrated significant quality improvements.
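A toy version of the idea, with an add-one-smoothed unigram model standing in for the Wikipedia-trained KenLM model; the reference corpus and documents below are purely illustrative:

```python
import math
from collections import Counter

def train_unigram(corpus):
    """Fit a unigram LM with add-one smoothing on reference text
    (a stand-in for the Wikipedia-trained KenLM model used by CCNet)."""
    counts = Counter(corpus.lower().split())
    total = sum(counts.values())
    vocab = len(counts) + 1  # +1 reserves mass for unknown words
    return lambda w: (counts.get(w, 0) + 1) / (total + vocab)

def perplexity(prob, doc):
    """Per-word perplexity of a document under the unigram model."""
    words = doc.lower().split()
    log_prob = sum(math.log(prob(w)) for w in words)
    return math.exp(-log_prob / max(len(words), 1))

reference = "the model learns language from large amounts of text data " * 50
prob = train_unigram(reference)

docs = {
    "coherent":  "the model learns from text data",
    "gibberish": "xqzv blorp fnarg wibble zzyzx",
}
for label, doc in docs.items():
    print(f"{label}: perplexity = {perplexity(prob, doc):.1f}")
```

The gibberish document scores a far higher perplexity than the coherent one; a pipeline would discard documents above an upper threshold and, to catch repetitive boilerplate, below a lower one.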
Classifier-Based Filtering
A binary classifier is trained to distinguish "high-quality" text (e.g., Wikipedia, books) from "low-quality" text (random web samples). The FineWeb-Edu dataset used a quality classifier trained on educational content annotations to produce a subset of FineWeb specifically optimized for knowledge-intensive tasks. DCLM used a fastText classifier trained on references from Wikipedia and OpenWebText to score documents on a quality scale.
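As a sketch of how such a classifier scores documents, here is a toy Naive Bayes version. Production pipelines like DCLM train fastText models on millions of labeled examples; the tiny training sets below are invented for illustration only.

```python
import math
from collections import Counter

def train_nb(high_docs, low_docs):
    """Train a toy Naive Bayes quality scorer: log-odds of 'high quality'
    under add-one-smoothed per-class word likelihoods."""
    high = Counter(w for d in high_docs for w in d.lower().split())
    low = Counter(w for d in low_docs for w in d.lower().split())
    vocab = set(high) | set(low)
    h_total, l_total = sum(high.values()), sum(low.values())

    def score(doc):
        s = 0.0
        for w in doc.lower().split():
            p_h = (high.get(w, 0) + 1) / (h_total + len(vocab))
            p_l = (low.get(w, 0) + 1) / (l_total + len(vocab))
            s += math.log(p_h / p_l)
        return s
    return score

high_docs = ["the theorem follows from the definition of continuity",
             "photosynthesis converts light energy into chemical energy"]
low_docs = ["click here to win free prizes now",
            "buy cheap followers click now free deal"]
score = train_nb(high_docs, low_docs)

print(round(score("energy from light"), 2))      # positive: reads educational
print(round(score("click for free prizes"), 2))  # negative: reads like spam
```

A pipeline would keep documents whose score clears a threshold, or use the score to weight sampling probabilities.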
Quality filtering is the highest-leverage intervention in data curation. The FineWeb project showed that aggressive quality filtering on Common Crawl can match the performance of curated datasets like C4 and The Pile, despite starting from much noisier raw material. The DCLM project demonstrated that a well-trained quality classifier can improve benchmark performance by several percentage points over heuristic filtering alone.
```python
# Minimal quality filtering pipeline
import re
from collections import Counter

def heuristic_quality_filter(doc: str) -> dict:
    """Apply heuristic quality filters to a document."""
    words = doc.split()
    lines = doc.strip().split("\n")
    chars = len(doc)
    word_count = len(words)

    # Length check
    if word_count < 50:
        return {"pass": False, "reason": "too_short"}

    # Alphabetic character ratio
    alpha_ratio = sum(c.isalpha() for c in doc) / max(chars, 1)
    if alpha_ratio < 0.6:
        return {"pass": False, "reason": "low_alpha"}

    # Repeated line ratio (boilerplate detection)
    line_counts = Counter(lines)
    repeated = sum(c - 1 for c in line_counts.values() if c > 1)
    if repeated / max(len(lines), 1) > 0.3:
        return {"pass": False, "reason": "repetitive"}

    # URL density (spam detection)
    url_count = len(re.findall(r"https?://", doc))
    if url_count / max(word_count, 1) > 0.1:
        return {"pass": False, "reason": "url_heavy"}

    return {"pass": True, "words": word_count, "alpha": round(alpha_ratio, 3)}

# Test on sample documents
samples = [
    ("Good article", "The transformer architecture has revolutionized NLP by enabling parallel attention. " * 8),
    ("Too short", "Click here now"),
    ("Spam", "Visit https://a.com and https://b.com and https://c.com now " * 6),
    ("Repetitive", ("Buy now!\n" * 20) + "Some padding text to reach minimum length for a valid document check."),
]
for label, doc in samples:
    result = heuristic_quality_filter(doc)
    print(f"{label:>15}: {result}")
```
The FineWeb dataset from Hugging Face represents the state of the art in open data curation. Starting from 96 Common Crawl snapshots (over 100 TB of raw HTML), the team applied URL filtering, language identification, MinHash deduplication, and quality scoring to produce a 15 trillion token English corpus. Its educational subset, FineWeb-Edu, further classifies documents by educational value using a classifier trained on LLM annotations. Models trained on FineWeb-Edu outperform those trained on full FineWeb by 2 to 4 points on knowledge benchmarks.
6. Data Mixing and Domain Proportions
The final training corpus is typically a weighted mixture of data from different domains. The proportions of this mixture significantly affect the model's capabilities. For instance, increasing the fraction of code data improves reasoning and structured output abilities. Increasing the fraction of scientific text improves factual knowledge.
The optimal mixing proportions are typically found through ablation experiments on smaller models. DoReMi (Xie et al., 2023) proposed an automated approach: train a small proxy model with uniform mixing, then use the distribution of training loss across domains to reweight the mixture, upsampling domains where the model struggles and downsampling domains that are already well-learned.
```python
# Simplified domain mixing with weighted sampling
import numpy as np

# Domain proportions (must sum to 1.0)
domain_weights = {
    "web": 0.55,
    "code": 0.15,
    "books": 0.10,
    "wikipedia": 0.05,
    "scientific": 0.08,
    "math": 0.04,
    "conversation": 0.03,
}

def sample_batch(domain_weights, batch_size=1024):
    """Sample a training batch according to domain proportions."""
    domains = list(domain_weights.keys())
    probs = list(domain_weights.values())
    # Each sample in the batch comes from a domain
    batch_domains = np.random.choice(domains, size=batch_size, p=probs)
    counts = {d: int((batch_domains == d).sum()) for d in domains}
    return counts

batch = sample_batch(domain_weights)
for domain, count in sorted(batch.items(), key=lambda x: -x[1]):
    bar = "#" * (count // 10)
    print(f"  {domain:<15} {count:4d} samples {bar}")
```
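The DoReMi reweighting idea can be sketched as an exponentiated-gradient update on the mixture weights. This is a drastic simplification: the real method plays a minimax game between a proxy and a reference model trained online, and the per-domain excess-loss numbers below are hypothetical.

```python
import numpy as np

def doremi_step(weights, excess_loss, lr=1.0):
    """One simplified DoReMi-style step: upweight domains where the proxy
    model's loss exceeds the reference model's (excess loss), renormalize."""
    w = weights * np.exp(lr * excess_loss)  # exponentiated-gradient update
    return w / w.sum()

domains = ["web", "code", "books", "scientific"]
weights = np.full(4, 0.25)                    # start from uniform mixing
excess = np.array([0.05, 0.40, 0.10, 0.30])   # hypothetical excess losses

for _ in range(5):
    weights = doremi_step(weights, excess)

for d, w in zip(domains, weights):
    print(f"{d:<12} {w:.3f}")
```

Domains where the proxy model lags the reference (here, code) end up upsampled, while already well-learned domains shrink.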
7. Toxicity and PII Removal
Pre-training data must be filtered for toxic content (hate speech, explicit material, harassment) and personally identifiable information (PII) such as phone numbers, email addresses, and social security numbers. Toxicity classifiers like the Jigsaw Perspective API or custom fastText models flag documents above a toxicity threshold. PII removal typically uses regex patterns for structured identifiers combined with named entity recognition for names and addresses.
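Regex-based redaction of structured identifiers can be sketched as follows. The patterns are simplified illustrations and would need hardening (international phone formats, obfuscated emails) for production use:

```python
import re

# Simplified patterns for structured identifiers; names and addresses
# additionally require NER, which regexes cannot handle
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "PHONE": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
    "SSN":   re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def redact_pii(text):
    """Replace structured PII with typed placeholder tokens."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

doc = "Contact jane.doe@example.com or 555-867-5309, SSN 123-45-6789."
print(redact_pii(doc))
```

Replacing matches with typed placeholders, rather than deleting them, keeps the surrounding sentence grammatical so the document remains usable for training.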
Overly aggressive toxicity filtering can remove legitimate content about sensitive topics (medical discussions, legal cases, historical events) and hurt the model's ability to understand and reason about these subjects. Most pipelines use a threshold rather than a binary filter, removing only the most toxic content while preserving borderline cases.
8. Data Pruning and Influence Functions
Beyond filtering for quality, recent research explores selecting the most informative training examples. Data pruning removes redundant or uninformative samples to train on a smaller, higher-quality subset without sacrificing performance.
Influence functions provide a principled approach: they estimate how much each training example contributes to the model's performance on a validation set. Formally, the influence of training example $z_i$ on the loss at validation point $z_{\text{test}}$ is:

$$\mathcal{I}(z_i, z_{\text{test}}) = -\nabla_\theta L(z_{\text{test}}, \hat{\theta})^\top \, H_{\hat{\theta}}^{-1} \, \nabla_\theta L(z_i, \hat{\theta})$$

where $H_{\hat{\theta}}$ is the Hessian of the training loss at the learned parameters $\hat{\theta}$. Computing exact influence functions is prohibitively expensive for large models (the Hessian has $N^2$ entries for a model with $N$ parameters), so practical approaches use approximations such as LiSSA (Linear time Stochastic Second-Order Algorithm) or track gradient statistics during training as proxies for influence.
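One such proxy, gradient similarity in the style of TracIn, can be sketched on a toy logistic model. The model and data below are invented for illustration, and the Hessian-inverse term of the exact formula is simply dropped:

```python
import numpy as np

def grad_logistic(w, x, y):
    """Gradient of the logistic loss for one example under a linear model."""
    p = 1.0 / (1.0 + np.exp(-x @ w))
    return (p - y) * x

# Toy setup: a linear classifier, a few training points, one validation point
rng = np.random.default_rng(0)
w = rng.normal(size=3)
train = [(rng.normal(size=3), y) for y in [0, 1, 0, 1]]
x_val, y_val = rng.normal(size=3), 1

# Gradient-similarity proxy: training examples whose loss gradient aligns
# with the validation gradient are estimated to be most helpful
g_val = grad_logistic(w, x_val, y_val)
scores = [g_val @ grad_logistic(w, x, y) for x, y in train]
ranked = np.argsort(scores)[::-1]
print("training examples ranked by estimated influence:", ranked)
```

For pruning, examples with consistently low (or negative) estimated influence are candidates for removal from the training set.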
9. Synthetic Data for Pre-training
As high-quality natural text becomes scarce, synthetic data generated by LLMs themselves has become an increasingly important part of the pre-training pipeline. The key insight is that a capable model can generate high-quality training data for a less capable model (or even for future training stages of itself).
Microsoft's Phi-3 and Phi-4 models demonstrated this powerfully: they were trained substantially on synthetic "textbook-quality" data generated by larger models. Despite being small (3.8B parameters), Phi-3 rivaled models many times its size on reasoning benchmarks, largely because its training data was exceptionally high quality. The key was generating structured, educational content (explanations, worked examples, reasoning chains) rather than simply paraphrasing web text.
Quality control for synthetic data is critical. Without careful filtering, model-generated text can amplify biases, introduce subtle errors, or create distribution collapse (where the synthetic data converges to a narrow mode of the generating model). Effective strategies include:
- Verification: Use a separate model or tool to check factual claims in generated text
- Diversity enforcement: Vary prompts, topics, and generation parameters to prevent mode collapse
- Human curation: Use human reviewers to audit a sample of synthetic data for quality
- Mixing: Blend synthetic data with natural data rather than using it exclusively
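The diversity-enforcement idea can be sketched as a greedy n-gram-overlap filter over generated samples; the threshold and example texts are illustrative, not drawn from any published pipeline:

```python
def ngram_overlap(a, b, n=3):
    """Jaccard overlap between the word n-gram sets of two samples."""
    def grams(t):
        w = t.lower().split()
        return {tuple(w[i:i + n]) for i in range(len(w) - n + 1)}
    ga, gb = grams(a), grams(b)
    return len(ga & gb) / max(len(ga | gb), 1)

def diversity_filter(samples, max_overlap=0.5):
    """Greedily keep samples whose overlap with every kept sample stays
    below a threshold -- one guard against mode collapse in synthetic data."""
    kept = []
    for s in samples:
        if all(ngram_overlap(s, k) < max_overlap for k in kept):
            kept.append(s)
    return kept

samples = [
    "Photosynthesis converts sunlight into chemical energy stored in glucose.",
    "Photosynthesis converts sunlight into chemical energy stored as sugar.",  # near-clone
    "Plate tectonics explains how continents drift over geological time.",
]
print(len(diversity_filter(samples)), "of", len(samples), "samples kept")
```

The near-clone is rejected while the unrelated sample is kept; at scale, the same MinHash machinery used for web deduplication serves this role more efficiently than pairwise comparison.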
Epoch AI projects that high-quality text data available on the public internet will be effectively exhausted by 2026 to 2028. Current large models already train on significant fractions of all available web text. This "data wall" is driving three parallel responses: (1) synthetic data generation at scale, (2) multimodal training that incorporates images, video, and audio alongside text, and (3) improved data efficiency through better curation and deduplication. Legal challenges around copyrighted training data add further pressure. How the field navigates this constraint will shape the next generation of LLMs.
10. Major Open Datasets
| Dataset | Size | Key Feature |
|---|---|---|
| The Pile | 825 GB | 22 diverse subsets, academic focus |
| RedPajama v2 | 30T tokens (raw) | Open reproduction of LLaMA data |
| FineWeb | 15T tokens | Best open Common Crawl processing |
| FineWeb-Edu | 1.3T tokens | Educational quality subset |
| DCLM | 4T tokens | Classifier-curated, strong baselines |
| Dolma | 3T tokens | Open, for OLMo model family |
| The Stack v2 | 67.5 TB code | Permissively licensed source code |
Key Takeaways
- Data quality dominates data quantity. Aggressive curation of a smaller dataset often outperforms naive scaling of a larger one.
- Deduplication at document and substring levels is essential. MinHash with LSH is the standard for finding near-duplicates at scale.
- Quality filtering combines heuristic rules, perplexity scoring, and trained classifiers. Classifier-based filtering (FineWeb-Edu, DCLM) provides the largest quality improvements.
- Domain mixing proportions directly shape model capabilities. Increasing code and math data improves reasoning; over-representing web text dilutes specialized knowledge.
- Toxicity and PII removal require careful calibration to avoid removing legitimate content about sensitive topics.
- Data pruning and influence functions represent the frontier of data curation, selecting the most informative examples for training.