Every few months someone trains a model so large it makes the previous one look like a pocket calculator. And every few months, the previous one is still running in production somewhere, unbothered.
Why study historical models? The landscape of large language models did not emerge overnight. Each landmark model introduced a crucial innovation: bidirectional pre-training, massive scale, the text-to-text framework, or emergent in-context learning. Understanding these models in sequence reveals the compounding insights that led to today's systems. By the end of this section, you will be able to explain why each model mattered and how its ideas persist in current architectures.
This section assumes familiarity with the Transformer architecture (encoder, decoder, and attention mechanisms) covered in Section 4.1. Tokenization concepts from Module 02 (BPE, WordPiece, SentencePiece) will also be referenced throughout.
1. BERT: Bidirectional Understanding
In October 2018, Google released BERT (Bidirectional Encoder Representations from Transformers), a model that fundamentally changed how the NLP community thought about pre-training. Before BERT, language models were trained left-to-right (or right-to-left), seeing only one direction of context at a time. BERT's key innovation was the masked language modeling (MLM) objective, which allowed the model to attend to both left and right context simultaneously.
How BERT Works
BERT takes a sequence of tokens, randomly masks 15% of them, and trains the model to predict the original tokens from the surrounding context. This bidirectional conditioning is powerful because understanding language often requires seeing both what comes before and after a word. Consider the sentence: "The bank was steep and muddy." You need the word "steep" (which comes after "bank") to determine that "bank" refers to a riverbank, not a financial institution.
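The masking procedure can be sketched in a few lines. One detail worth adding: in the original paper, of the 15% of positions selected, 80% are replaced with [MASK], 10% with a random token, and 10% left unchanged, so the model cannot rely on [MASK] always marking the prediction sites. The helper below is an illustrative toy over whole-word tokens, not BERT's actual WordPiece pipeline.

```python
import random

def mask_tokens(tokens, mask_rate=0.15, seed=0):
    """BERT-style masking: of the selected positions, 80% become [MASK],
    10% a random token from the sequence, 10% stay unchanged."""
    rng = random.Random(seed)
    masked, targets = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() < mask_rate:
            targets[i] = tok                    # model must recover the original token here
            r = rng.random()
            if r < 0.8:
                masked[i] = "[MASK]"
            elif r < 0.9:
                masked[i] = rng.choice(tokens)  # random replacement
            # else: leave the token unchanged (but still predict it)
    return masked, targets

masked, targets = mask_tokens("the bank was steep and muddy".split())
print(masked)
print(targets)
```

The loss is computed only at the selected positions (where `targets` is not `None`); unselected positions contribute nothing.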
Architecturally, BERT is a stack of Transformer encoder layers. BERT-Base uses 12 layers, 768 hidden dimensions, and 12 attention heads (110M parameters). BERT-Large scales this to 24 layers, 1024 hidden dimensions, and 16 heads (340M parameters). The model was trained on BookCorpus (800M words) and English Wikipedia (2,500M words).
```python
# Loading and using BERT for masked language modeling
from transformers import BertTokenizer, BertForMaskedLM
import torch

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForMaskedLM.from_pretrained('bert-base-uncased')

# Mask a token and predict it
text = "The capital of France is [MASK]."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Locate the [MASK] position and rank the vocabulary by logit
mask_idx = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
logits = outputs.logits[0, mask_idx, :]
top_tokens = logits.topk(5).indices[0]
for tok in top_tokens:
    print(f"  {tokenizer.decode([tok])}")
```
BERT Variants
RoBERTa (2019) demonstrated that BERT was significantly undertrained. By removing the next-sentence prediction objective, training on more data (160GB vs. 16GB), using larger batches, and training longer, RoBERTa achieved substantially better results with the same architecture. This was an important lesson: training procedure matters as much as architecture.
ALBERT (2019) tackled parameter efficiency through two techniques: factorized embedding parameterization (decoupling the vocabulary embedding size from the hidden layer size) and cross-layer parameter sharing. ALBERT-xxlarge outperformed BERT-Large while using only about 70% as many parameters.
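The savings from factorized embeddings are easy to verify with back-of-the-envelope arithmetic. The numbers below follow the ALBERT paper's setup: vocabulary V = 30,000, embedding size E = 128, and hidden size H = 4096 for the xxlarge configuration.

```python
# ALBERT factorizes the V x H embedding matrix into V x E plus E x H.
V, E, H = 30_000, 128, 4096

bert_style = V * H               # direct V x H embedding table
albert_style = V * E + E * H     # factorized: V x E lookup, then E -> H projection

print(f"direct:     {bert_style:,} parameters")
print(f"factorized: {albert_style:,} parameters")
print(f"reduction:  {1 - albert_style / bert_style:.1%}")
```

The embedding table alone shrinks by over 95%; the overall model reduction is smaller because most parameters live in the Transformer layers, which is why ALBERT pairs this with cross-layer sharing.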
DeBERTa (2020) introduced disentangled attention, which represents each token using two separate vectors encoding content and position. This allowed the model to compute attention scores based on content-to-content, content-to-position, and position-to-content interactions independently. DeBERTa also added an enhanced mask decoder that incorporates absolute position information in the final prediction layer.
2. The GPT Series: Scaling Autoregressive Models
While BERT championed bidirectional encoding, OpenAI pursued a different path: unidirectional, autoregressive language modeling. This design choice, initially seen as a limitation, would prove transformative when combined with scale.
GPT-1 (2018): The Transfer Learning Proof of Concept
GPT-1 demonstrated that a decoder-only Transformer trained on raw text could learn useful representations that transfer to downstream tasks. With 117M parameters trained on BookCorpus, GPT-1 was modest in size. Its contribution was conceptual: unsupervised pre-training followed by supervised fine-tuning produced strong results across a range of NLP tasks, from textual entailment to question answering.
GPT-2 (2019): Emergent Zero-Shot Capabilities
GPT-2 scaled to 1.5 billion parameters and was trained on WebText, a 40GB dataset of web pages linked from Reddit posts with at least 3 karma. The critical discovery was that the model could perform tasks it was never explicitly trained for. By simply conditioning on a prompt like "Translate English to French:", GPT-2 could translate, summarize, and answer questions, all without any task-specific fine-tuning. This was the first compelling demonstration of what we now call zero-shot learning.
```python
# GPT-2: zero-shot text generation
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2')

# Zero-shot task: summarization via prompting
prompt = """Article: The researchers found that training language models on
more data consistently improved performance across all tasks.

TL;DR:"""
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_new_tokens=30,
    temperature=0.7,
    do_sample=True,
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
GPT-3 (2020): The In-Context Learning Revolution
GPT-3 was a watershed moment. At 175 billion parameters, trained on 300 billion tokens, it demonstrated that scale alone could produce qualitatively new capabilities. The most significant was in-context learning (ICL): by providing a few examples in the prompt, GPT-3 could perform tasks with no gradient updates whatsoever. This "few-shot" paradigm upended the traditional train-then-fine-tune workflow.
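Few-shot prompting needs no special machinery: the demonstrations are simply concatenated ahead of the query, and the model's next-token prediction does the rest. The sketch below builds such a prompt; the `=>` separator and the word pairs are illustrative choices, not the paper's exact template.

```python
# Build a GPT-3-style few-shot prompt: k demonstrations followed by a query.
# No gradient updates occur -- the "learning" happens entirely in the forward pass.
demonstrations = [
    ("cheese", "fromage"),
    ("house", "maison"),
    ("cat", "chat"),
]

def few_shot_prompt(examples, query, task="Translate English to French:"):
    lines = [task]
    for src, tgt in examples:
        lines.append(f"{src} => {tgt}")
    lines.append(f"{query} =>")   # the model completes this final line
    return "\n".join(lines)

print(few_shot_prompt(demonstrations, "dog"))
```

Zero-shot, one-shot, and few-shot prompting differ only in how many demonstration lines this function would emit.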
GPT-3 came in eight sizes, from 125M to 175B parameters, giving the field striking empirical evidence for smooth scaling laws in language modeling: performance on virtually every benchmark improved predictably with model size, following power-law curves.
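The power-law shape is easy to sketch numerically. The constants below are the loss-versus-parameters fit reported by Kaplan et al. (2020), reproduced here only to show the functional form, not GPT-3's actual benchmark numbers.

```python
# Kaplan et al. (2020) fit test loss against non-embedding parameters N as
# L(N) = (N_c / N) ** alpha, with N_c ~ 8.8e13 and alpha ~ 0.076.
N_C, ALPHA = 8.8e13, 0.076

def predicted_loss(n_params):
    return (N_C / n_params) ** ALPHA

# Model sizes roughly matching the GPT-3 family endpoints
for n in [125e6, 1.3e9, 13e9, 175e9]:
    print(f"N = {n:10.2e}  ->  predicted loss {predicted_loss(n):.3f}")
```

Each tenfold increase in parameters shaves off a predictable slice of loss, which is what made the curves useful for planning ever-larger training runs.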
GPT-3 revealed that task-specific behavior could emerge from sheer scale without task-specific training. A model trained only to predict the next token could answer questions, translate languages, write code, and perform arithmetic, simply because these capabilities were implicit in the pre-training data. This insight drives the entire foundation model paradigm: invest heavily in pre-training, and downstream capabilities follow.
InstructGPT and ChatGPT (2022): Aligning with Human Intent
Raw language models predict likely text, not helpful text. InstructGPT addressed this gap through reinforcement learning from human feedback (RLHF). The process had three stages: supervised fine-tuning on human-written demonstrations, training a reward model on human preference comparisons, and optimizing the language model against that reward using PPO. The resulting model was more helpful, less toxic, and better at following instructions, despite being far smaller (1.3B parameters) than GPT-3.
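The second stage above can be made concrete. The reward model is trained on pairwise human comparisons with a Bradley-Terry-style loss, -log sigmoid(r_chosen - r_rejected). The sketch below applies that loss to scalar stand-ins for reward-model outputs; a real reward model would produce these scores from full prompt-response pairs.

```python
import math

def preference_loss(r_chosen, r_rejected):
    """Bradley-Terry preference loss: -log sigmoid(r_chosen - r_rejected)."""
    return -math.log(1.0 / (1.0 + math.exp(-(r_chosen - r_rejected))))

# When the reward model already scores the human-preferred response higher,
# the loss is small; when it ranks the pair the wrong way, the loss is large.
print(preference_loss(2.0, 0.5))   # correct ranking -> low loss
print(preference_loss(0.5, 2.0))   # inverted ranking -> high loss
```

Minimizing this loss over many comparison pairs pushes the reward model to score preferred responses higher, and that learned scalar is what PPO then optimizes against in stage three.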
GPT-4 (2023): Multimodal and Capable
GPT-4 extended the paradigm to multimodal inputs (text and images) while achieving near-human performance on professional exams like the bar exam and medical licensing tests. While OpenAI did not disclose architectural details, the model demonstrated that the scaling hypothesis continued to hold: more compute, more data, and more careful alignment produced qualitatively better systems.
3. T5 and the Text-to-Text Framework
Google's T5 (Text-to-Text Transfer Transformer, 2019) introduced a unifying principle: every NLP task can be framed as converting one text string into another. Classification becomes "sentiment: this movie is great" producing "positive". Translation becomes "translate English to German: Hello" producing "Hallo". Question answering becomes "question: What is the capital of France? context: ..." producing "Paris".
This framework was powerful because it allowed a single model architecture (encoder-decoder Transformer) and a single training objective (predict the target text) to be applied uniformly across tasks. The T5 paper systematically explored many architectural and training choices, providing the field with a comprehensive empirical study.
```python
# T5: text-to-text approach for multiple tasks
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained('t5-small')
model = T5ForConditionalGeneration.from_pretrained('t5-small')

# The same model handles translation, summarization, and classification
tasks = [
    "translate English to German: The house is wonderful.",
    "summarize: State authorities dispatched combatants to the region.",
    "stsb sentence1: The cat sat. sentence2: The cat rested.",
]

for task in tasks:
    inputs = tokenizer(task, return_tensors="pt", max_length=128, truncation=True)
    outputs = model.generate(**inputs, max_new_tokens=50)
    print(f"Input:  {task[:50]}...")
    print(f"Output: {tokenizer.decode(outputs[0], skip_special_tokens=True)}")
    print()
```
T5 used a span corruption pre-training objective rather than BERT's single-token masking. Instead of masking individual tokens, T5 replaces contiguous spans of text with sentinel tokens and trains the model to reconstruct the original spans. This is more efficient because the model learns to predict multiple tokens per masked position. We cover span corruption in detail in Section 6.2.
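The input/target format is easiest to see on the T5 paper's running example. The sketch below hand-picks the corrupted spans for clarity; T5 samples them randomly (mean span length 3, roughly 15% of tokens corrupted), and the sentinel names follow T5's `<extra_id_N>` convention.

```python
# T5-style span corruption: contiguous spans are replaced with sentinel
# tokens in the input, and the target reconstructs the spans in order.

def corrupt_spans(tokens, spans):
    """spans: non-overlapping, sorted list of (start, end) index pairs."""
    inp, tgt, cursor = [], [], 0
    for sid, (start, end) in enumerate(spans):
        sentinel = f"<extra_id_{sid}>"
        inp.extend(tokens[cursor:start])
        inp.append(sentinel)
        tgt.append(sentinel)
        tgt.extend(tokens[start:end])
        cursor = end
    inp.extend(tokens[cursor:])
    tgt.append(f"<extra_id_{len(spans)}>")   # final sentinel terminates the target
    return " ".join(inp), " ".join(tgt)

tokens = "Thank you for inviting me to your party last week".split()
inp, tgt = corrupt_spans(tokens, [(2, 4), (7, 8)])
print(inp)   # Thank you <extra_id_0> me to your <extra_id_1> last week
print(tgt)   # <extra_id_0> for inviting <extra_id_1> party <extra_id_2>
```

Note the efficiency gain over single-token masking: one sentinel in the input can stand for several target tokens, so each training example supervises more predictions.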
4. Emergence and Scaling: The Capabilities Threshold
One of the most striking discoveries from the GPT-3 era was the phenomenon of emergent capabilities: abilities that appear suddenly as models grow larger, without being explicitly trained. Small models show essentially zero performance on certain tasks (like multi-step arithmetic or chain-of-thought reasoning), while larger models abruptly demonstrate competence once they cross a critical size threshold.
In-Context Learning
In-context learning (ICL) is perhaps the most consequential emergent capability. When you provide a few input-output examples in a prompt and the model correctly handles a new input, the model is performing ICL. This is remarkable because no gradient updates occur; the model's parameters remain frozen. The examples somehow "program" the model through its forward pass alone.
The GPT-3 paper demonstrated that ICL improves smoothly with model scale. While GPT-3 Small (125M) showed minimal few-shot capability, GPT-3 175B could rival or exceed fine-tuned BERT models on many benchmarks with just a handful of examples.
Chain-of-Thought Reasoning
Chain-of-thought (CoT) prompting, introduced by Wei et al. (2022), showed that models could solve complex reasoning problems if prompted to show their work step by step. For example, instead of directly answering "If there are 3 cars in a parking lot and 2 more arrive, how many are there?", the model is prompted to produce intermediate reasoning steps. This capability appears to be emergent: it works well in large models (over 100B parameters) but fails in smaller ones.
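The difference between a standard exemplar and a CoT exemplar lies entirely in the prompt text. A minimal sketch, with wording that is illustrative rather than quoted from Wei et al.'s exemplars:

```python
# A standard few-shot exemplar maps question directly to answer.
standard = (
    "Q: If there are 3 cars in a parking lot and 2 more arrive, "
    "how many are there?\nA: 5"
)

# A chain-of-thought exemplar inserts the intermediate reasoning steps,
# which the model then imitates for the new question.
chain_of_thought = (
    "Q: If there are 3 cars in a parking lot and 2 more arrive, "
    "how many are there?\n"
    "A: There are 3 cars to start. 2 more arrive, so 3 + 2 = 5. "
    "The answer is 5."
)

query = "Q: A juggler has 16 balls and drops half of them. How many remain?\nA:"
prompt = chain_of_thought + "\n\n" + query
print(prompt)
```

In large models, completions conditioned on the CoT exemplar tend to reproduce the step-by-step format, and accuracy on multi-step problems improves markedly.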
The existence of true "emergence" is contested. Schaeffer, Miranda, and Koyejo (2024) argued that many apparent emergent capabilities are artifacts of the chosen evaluation metrics. When using continuous metrics instead of threshold-based accuracy, capabilities often show smooth, predictable improvement rather than sudden phase transitions. We explore this debate further in Section 6.3.
5. The Model Comparison Landscape
By the mid-2020s, the field had settled into several distinct paradigms. The following table summarizes the key landmark models and their defining characteristics.
| Model | Type | Parameters | Key Innovation | Year |
|---|---|---|---|---|
| BERT | Encoder | 110M / 340M | Masked language modeling, bidirectionality | 2018 |
| GPT-2 | Decoder | 1.5B | Zero-shot task transfer | 2019 |
| T5 | Enc-Dec | 11B | Text-to-text unification, span corruption | 2019 |
| GPT-3 | Decoder | 175B | In-context learning, few-shot prompting | 2020 |
| PaLM | Decoder | 540B | Pathways system, breakthrough reasoning | 2022 |
| BLOOM | Decoder | 176B | First open multilingual LLM (46 languages) | 2022 |
| InstructGPT | Decoder | 1.3B | RLHF alignment | 2022 |
| Falcon | Decoder | 180B | Curated web data (RefinedWeb), open training data | 2023 |
| GPT-4 | Decoder | Undisclosed | Multimodal, professional-exam performance | 2023 |
| Llama 2 | Decoder | 7B / 70B | Open-weight high-quality models | 2023 |
6. The Open-Weight Movement
A parallel development reshaped the field from the access side. Meta's release of Llama (2023) and Llama 2 (2023) provided the community with high-quality open-weight models. Unlike GPT-3 and GPT-4, which were accessible only through APIs, Llama allowed researchers and developers to inspect, modify, and fine-tune the models directly. This catalyzed an explosion of derivative work, from Alpaca and Vicuna to specialized domain models.
Before Llama, other landmark open efforts paved the way. BLOOM (2022) was the first large-scale open-science multilingual LLM, covering 46 languages and 13 programming languages. Trained by a consortium of over 1,000 researchers, BLOOM demonstrated that collaborative open science could produce models at the 176B parameter scale. Falcon (2023) from the Technology Innovation Institute showed that data curation was the critical ingredient: its RefinedWeb dataset, carefully filtered from CommonCrawl, powered a 180B parameter model that topped the Open LLM Leaderboard upon release.
On the closed side, Google's PaLM (2022) pushed scale to 540B parameters using the Pathways training infrastructure, demonstrating breakthrough reasoning capabilities including chain-of-thought solving of math word problems. PaLM later evolved into Gemini, Google's multimodal frontier model family.
The open-weight movement demonstrated that the key insights behind powerful language models were not in secret architectures but in training data quality, scale, and careful engineering. Llama 2 70B, trained on 2 trillion tokens, achieved competitive performance with GPT-3.5 while being freely available for research and commercial use.
The landmark models tell a clear story: scale is a reliable lever for capability. Each generation grew larger and trained on more data, and each generation exhibited qualitatively new abilities. But this is not the whole story. RoBERTa showed that training procedure matters. InstructGPT showed that alignment matters. Chinchilla (Section 6.3) showed that the balance between parameters and data matters. The best practitioners combine all these insights.
Key Takeaways
- BERT introduced bidirectional pre-training through masked language modeling, defining the encoder-only paradigm for understanding tasks.
- GPT-2 discovered zero-shot capabilities in scaled autoregressive models, and GPT-3 demonstrated that in-context learning could replace fine-tuning for many tasks.
- T5 unified NLP under the text-to-text framework, showing that one architecture and training objective could handle any task.
- InstructGPT proved that alignment via RLHF was essential for making raw language models practically useful.
- Emergent capabilities (in-context learning, chain-of-thought reasoning) appeared at scale, though the nature and reality of emergence remains debated.
- The open-weight movement (BLOOM, Falcon, Llama) democratized access to powerful models, enabling a broad ecosystem of fine-tuned variants.