Part II: Understanding LLMs
This chapter takes you behind the curtain of modern language model development. While the Transformer architecture (Module 04) provides the blueprint, the real story of LLMs is one of scale: billions of parameters trained on trillions of tokens, consuming millions of GPU hours. Understanding how this process works is essential for anyone building with or reasoning about these systems.
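To get a feel for these numbers, the widely used approximation C ≈ 6·N·D (training FLOPs ≈ 6 × parameters × tokens) lets us sketch a back-of-envelope cost estimate. The hardware figures below (A100-class peak throughput, 40% utilization) are illustrative assumptions, not measurements:

```python
# Back-of-envelope training cost using the standard C ≈ 6·N·D approximation
# (training FLOPs ≈ 6 × parameters × tokens).

def training_gpu_hours(params, tokens, peak_flops=312e12, mfu=0.4):
    """Estimate single-GPU-equivalent hours to train a dense transformer.

    params:     model parameter count (N)
    tokens:     training token count (D)
    peak_flops: per-GPU peak throughput (assumed: A100 bf16, 312 TFLOP/s)
    mfu:        model FLOPs utilization actually achieved (assumed 40%)
    """
    total_flops = 6 * params * tokens            # C ≈ 6·N·D
    seconds = total_flops / (peak_flops * mfu)   # wall-clock on one GPU
    return seconds / 3600

# A 70-billion-parameter model trained on 2 trillion tokens:
hours = training_gpu_hours(70e9, 2e12)
print(f"{hours:,.0f} GPU-hours")  # on the order of millions
```

Spread across a few thousand GPUs, an estimate like this translates to weeks or months of wall-clock time, which is why the distributed training infrastructure covered later in this chapter matters so much.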
We begin by surveying the landmark models that shaped the field, from BERT to GPT-4. We then dissect the pre-training objectives that teach models to understand and generate language. Next, we explore the scaling laws that govern how model performance improves with more compute, data, and parameters, and the data curation pipelines that supply the raw material. We cover the optimization algorithms and distributed training infrastructure that make billion-parameter training feasible. Finally, we examine the fascinating theoretical question of how in-context learning actually works inside transformers.