Module 06 · Section 6.1

The Landmark Models

From BERT to GPT-4: tracing the key models that defined modern language AI

Every few months someone trains a model so large it makes the previous one look like a pocket calculator. And every few months, the previous one is still running in production somewhere, unbothered.

★ Big Picture

Why study historical models? The landscape of large language models did not emerge overnight. Each landmark model introduced a crucial innovation, whether it was bidirectional pre-training, massive scale, the text-to-text framework, or emergent in-context learning. Understanding these models in sequence reveals the compounding insights that led to today's systems. By the end of this section, you will be able to explain why each model mattered and how its ideas persist in current architectures.

⚙ Prerequisites

This section assumes familiarity with the Transformer architecture (encoder, decoder, and attention mechanisms) covered in Section 4.1. Tokenization concepts from Module 02 (BPE, WordPiece, SentencePiece) will also be referenced throughout.

1. BERT: Bidirectional Understanding

In October 2018, Google released BERT (Bidirectional Encoder Representations from Transformers), a model that fundamentally changed how the NLP community thought about pre-training. Before BERT, language models were trained left-to-right (or right-to-left), seeing only one direction of context at a time. BERT's key innovation was the masked language modeling (MLM) objective, which allowed the model to attend to both left and right context simultaneously.

How BERT Works

BERT takes a sequence of tokens, randomly masks 15% of them, and trains the model to predict the original tokens from the surrounding context. This bidirectional conditioning is powerful because understanding language often requires seeing both what comes before and after a word. Consider the sentence: "The bank was steep and muddy." You need the word "steep" (which comes after "bank") to determine that "bank" refers to a riverbank, not a financial institution.
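In practice the procedure is slightly subtler: of the 15% of positions selected, BERT replaces 80% with [MASK], swaps 10% for a random token, and leaves 10% unchanged, so the model cannot assume the [MASK] symbol always marks the prediction site. A minimal sketch of that selection logic (the token IDs and vocabulary size are made up, and `bert_mask` is an illustrative helper, not a library function; -100 follows the common ignore-index convention for masked-LM labels):

```python
import random

def bert_mask(token_ids, vocab_size, mask_id, mask_prob=0.15):
    """BERT-style masking: select ~15% of positions; of those,
    80% -> [MASK], 10% -> random token, 10% -> left unchanged.
    Returns (corrupted inputs, labels with -100 at unselected positions)."""
    inputs = list(token_ids)
    labels = [-100] * len(token_ids)  # -100 = ignored by the loss
    for i, tok in enumerate(token_ids):
        if random.random() >= mask_prob:
            continue
        labels[i] = tok  # the model must predict the original token here
        r = random.random()
        if r < 0.8:
            inputs[i] = mask_id                       # 80%: replace with [MASK]
        elif r < 0.9:
            inputs[i] = random.randrange(vocab_size)  # 10%: random token
        # else: 10% keep the original token unchanged
    return inputs, labels

random.seed(0)
corrupted, labels = bert_mask(list(range(20)), vocab_size=30000, mask_id=103)
print(corrupted)
print(labels)
```

Only the positions with a non-negative label contribute to the MLM loss; everything else is context.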

Architecturally, BERT is a stack of Transformer encoder layers. BERT-Base uses 12 layers, 768 hidden dimensions, and 12 attention heads (110M parameters). BERT-Large scales this to 24 layers, 1024 hidden dimensions, and 16 heads (340M parameters). The model was trained on BookCorpus (800M words) and English Wikipedia (2,500M words).

# Loading and using BERT for masked language modeling
from transformers import BertTokenizer, BertForMaskedLM
import torch

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForMaskedLM.from_pretrained('bert-base-uncased')

# Mask a token and predict it
text = "The capital of France is [MASK]."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)
    mask_idx = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
    logits = outputs.logits[0, mask_idx, :]
    top_tokens = logits.topk(5).indices[0]

for tok in top_tokens:
    print(f"  {tokenizer.decode([int(tok)])}")
# Example output: paris, lyon, lille, toulouse, marseille

BERT Variants

RoBERTa (2019) demonstrated that BERT was significantly undertrained. By removing the next-sentence prediction objective, training on more data (160GB vs. 16GB), using larger batches, and training longer, RoBERTa achieved substantially better results with the same architecture. This was an important lesson: training procedure matters as much as architecture.

ALBERT (2019) tackled parameter efficiency through two techniques: factorized embedding parameterization (separating the vocabulary embedding size from the hidden layer size) and cross-layer parameter sharing. ALBERT-xxlarge achieved state-of-the-art results with roughly 30% fewer parameters than BERT-Large (about 235M vs. 334M).
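The savings from factorized embeddings alone are easy to verify with back-of-the-envelope arithmetic. The sketch below uses BERT's ~30K WordPiece vocabulary, ALBERT-xxlarge's 4096 hidden size, and ALBERT's embedding size of 128:

```python
V = 30000  # vocabulary size (BERT's WordPiece vocab is ~30K)
H = 4096   # hidden size (ALBERT-xxlarge)
E = 128    # factorized embedding size used by ALBERT

tied = V * H                 # BERT-style: one V x H embedding matrix
factorized = V * E + E * H   # ALBERT: V x E lookup, then E x H projection

print(f"tied:       {tied:,}")        # 122,880,000 embedding parameters
print(f"factorized: {factorized:,}")  # 4,364,288 embedding parameters
print(f"savings:    {1 - factorized / tied:.1%}")
```

The embedding table shrinks by more than an order of magnitude; combined with cross-layer sharing, this is what keeps ALBERT's total parameter count low even at large hidden sizes.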

DeBERTa (2020) introduced disentangled attention, which represents each token using two separate vectors encoding content and position. This allowed the model to compute attention scores based on content-to-content, content-to-position, and position-to-content interactions independently. DeBERTa also added an enhanced mask decoder that incorporates absolute position information in the final prediction layer.

[Timeline: 2018 BERT (110M / 340M, bidirectional MLM) → 2019 RoBERTa (355M, better training) → 2020 DeBERTa (1.5B, disentangled attention) → 2021 DeBERTa V3 (304M, ELECTRA-style pre-training + DeBERTa) → 2024 ModernBERT (395M, FlashAttention + RoPE)]
Figure 6.1.1: Timeline of encoder-only model evolution, showing key innovations at each step.

2. The GPT Series: Scaling Autoregressive Models

While BERT championed bidirectional encoding, OpenAI pursued a different path: unidirectional, autoregressive language modeling. This design choice, initially seen as a limitation, would prove transformative when combined with scale.

GPT-1 (2018): The Transfer Learning Proof of Concept

GPT-1 demonstrated that a decoder-only Transformer trained on raw text could learn useful representations that transfer to downstream tasks. With 117M parameters trained on BookCorpus, GPT-1 was modest in size. Its contribution was conceptual: unsupervised pre-training followed by supervised fine-tuning produced strong results across a range of NLP tasks, from textual entailment to question answering.

GPT-2 (2019): Emergent Zero-Shot Capabilities

GPT-2 scaled to 1.5 billion parameters and was trained on WebText, a 40GB dataset of web pages linked from Reddit posts with at least 3 karma. The critical discovery was that the model could perform tasks it was never explicitly trained for. By simply conditioning on a prompt like "Translate English to French:", GPT-2 could translate, summarize, and answer questions, all without any task-specific fine-tuning. This was the first compelling demonstration of what we now call zero-shot learning.

# GPT-2: Zero-shot text generation
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2')

# Zero-shot task: summarization via prompting
prompt = """Article: The researchers found that training language models
on more data consistently improved performance across all tasks.
TL;DR:"""

inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_new_tokens=30,
    temperature=0.7,
    do_sample=True,
    pad_token_id=tokenizer.eos_token_id
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

GPT-3 (2020): The In-Context Learning Revolution

GPT-3 was a watershed moment. At 175 billion parameters, trained on 300 billion tokens, it demonstrated that scale alone could produce qualitatively new capabilities. The most significant was in-context learning (ICL): by providing a few examples in the prompt, GPT-3 could perform tasks with no gradient updates whatsoever. This "few-shot" paradigm upended the traditional train-then-fine-tune workflow.

GPT-3 came in several sizes, from 125M to 175B parameters, and the paper showed that performance on virtually every benchmark improved predictably as model size increased, following the power-law curves described in the scaling-law work of Kaplan et al. (2020).
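Power-law scaling means loss falls along a straight line in log-log space. The curve below is purely illustrative (the constant and exponent are invented for the demonstration, not the paper's fitted values):

```python
def power_law_loss(n_params, n_c=1e13, alpha=0.08):
    """Illustrative scaling curve of the form L(N) = (N_c / N) ** alpha.
    n_c and alpha are made-up constants for demonstration only."""
    return (n_c / n_params) ** alpha

# Loss decreases smoothly as parameter count grows by orders of magnitude
for n in [125e6, 1.3e9, 13e9, 175e9]:
    print(f"{n:>13,.0f} params -> loss {power_law_loss(n):.3f}")
```

The practical upshot: on such a curve, each order-of-magnitude increase in parameters buys a roughly constant multiplicative reduction in loss.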

💡 Key Insight

GPT-3 revealed that task-specific behavior could emerge from sheer scale without task-specific training. A model trained only to predict the next token could answer questions, translate languages, write code, and perform arithmetic, simply because these capabilities were implicit in the pre-training data. This insight drives the entire foundation model paradigm: invest heavily in pre-training, and downstream capabilities follow.

InstructGPT and ChatGPT (2022): Aligning with Human Intent

Raw language models predict likely text, not helpful text. InstructGPT addressed this gap through reinforcement learning from human feedback (RLHF). The process had three stages: supervised fine-tuning on human-written demonstrations, training a reward model on human preference comparisons, and optimizing the language model against that reward using PPO. The resulting model was more helpful, less toxic, and better at following instructions, despite being far smaller (1.3B parameters) than GPT-3.
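The reward model in stage two is trained on pairwise comparisons with a Bradley-Terry-style objective: maximize the log-probability that the human-preferred response receives the higher score. A minimal sketch, with plain scalar scores standing in for a full reward model's outputs:

```python
import math

def preference_loss(r_chosen, r_rejected):
    """Pairwise reward-model loss used in RLHF stage two:
    -log sigmoid(r_chosen - r_rejected), averaged over pairs."""
    def logsigmoid(x):
        # numerically stable log(1 / (1 + e^-x))
        return -math.log1p(math.exp(-x)) if x >= 0 else x - math.log1p(math.exp(x))
    losses = [-logsigmoid(c - r) for c, r in zip(r_chosen, r_rejected)]
    return sum(losses) / len(losses)

# Toy scores a reward model might assign to preferred vs. rejected responses
print(preference_loss([2.0, 1.5, 0.3], [0.5, 1.0, 0.9]))
```

The loss is near zero when preferred responses already score much higher, and grows as the ordering is violated; stage three then optimizes the policy against the trained reward model with PPO.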

GPT-4 (2023): Multimodal and Capable

GPT-4 extended the paradigm to multimodal inputs (text and images) while achieving near-human performance on professional exams like the bar exam and medical licensing tests. While OpenAI did not disclose architectural details, the model demonstrated that the scaling hypothesis continued to hold: more compute, more data, and more careful alignment produced qualitatively better systems.

[Chart: parameter counts on a log scale (100M to 1T+), 2018–2023 — GPT-1 (117M), BERT (340M), GPT-2 (1.5B), T5-11B, GPT-3 (175B), InstructGPT, GPT-4 (est. 1T+)]
Figure 6.1.2: The exponential growth in model parameters from GPT-1 to GPT-4, alongside BERT and T5 for reference.

3. T5 and the Text-to-Text Framework

Google's T5 (Text-to-Text Transfer Transformer, 2019) introduced a unifying principle: every NLP task can be framed as converting one text string into another. Classification becomes "sentiment: this movie is great" producing "positive". Translation becomes "translate English to German: Hello" producing "Hallo". Question answering becomes "question: What is the capital of France? context: ..." producing "Paris".

This framework was powerful because it allowed a single model architecture (encoder-decoder Transformer) and a single training objective (predict the target text) to be applied uniformly across tasks. The T5 paper systematically explored many architectural and training choices, providing the field with a comprehensive empirical study.

# T5: Text-to-Text approach for multiple tasks
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained('t5-small')
model = T5ForConditionalGeneration.from_pretrained('t5-small')

# Same model handles translation, summarization, classification
tasks = [
    "translate English to German: The house is wonderful.",
    "summarize: State authorities dispatched combatants to the region.",
    "stsb sentence1: The cat sat. sentence2: The cat rested.",
]

for task in tasks:
    inputs = tokenizer(task, return_tensors="pt", max_length=128, truncation=True)
    outputs = model.generate(**inputs, max_new_tokens=50)
    print(f"Input:  {task[:50]}...")
    print(f"Output: {tokenizer.decode(outputs[0], skip_special_tokens=True)}")
    print()

📝 Note

T5 used a span corruption pre-training objective rather than BERT's single-token masking. Instead of masking individual tokens, T5 replaces contiguous spans of text with sentinel tokens and trains the model to reconstruct the original spans. This is more efficient because the model learns to predict multiple tokens per masked position. We cover span corruption in detail in Section 6.2.
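A toy version of span corruption, assuming whitespace tokenization and T5-style sentinel markers (<extra_id_0>, <extra_id_1>, ...); the `span_corrupt` helper and its fixed span positions are illustrative, while real T5 samples span locations and lengths randomly:

```python
def span_corrupt(tokens, spans):
    """Replace each (start, length) span with a sentinel token and
    build the target sequence the model must reconstruct (T5-style)."""
    inp, tgt = [], []
    pos = 0
    for i, (start, length) in enumerate(spans):
        sentinel = f"<extra_id_{i}>"
        inp.extend(tokens[pos:start])   # keep text up to the span
        inp.append(sentinel)            # stand-in for the dropped span
        tgt.append(sentinel)
        tgt.extend(tokens[start:start + length])  # span goes to the target
        pos = start + length
    inp.extend(tokens[pos:])
    tgt.append(f"<extra_id_{len(spans)}>")  # final end-of-targets sentinel
    return " ".join(inp), " ".join(tgt)

tokens = "Thank you for inviting me to your party last week".split()
inp, tgt = span_corrupt(tokens, [(2, 2), (8, 1)])
print(inp)  # Thank you <extra_id_0> me to your party <extra_id_1> week
print(tgt)  # <extra_id_0> for inviting <extra_id_1> last <extra_id_2>
```

Because each sentinel can hide multiple tokens, the target is shorter than with token-level masking, and every decoded position carries training signal.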

4. Emergence and Scaling: The Capabilities Threshold

One of the most striking discoveries from the GPT-3 era was the phenomenon of emergent capabilities: abilities that appear suddenly as models grow larger, without being explicitly trained. Small models show essentially zero performance on certain tasks (like multi-step arithmetic or chain-of-thought reasoning), while larger models abruptly demonstrate competence once they cross a critical size threshold.

In-Context Learning

In-context learning (ICL) is perhaps the most consequential emergent capability. When you provide a few input-output examples in a prompt and the model correctly handles a new input, the model is performing ICL. This is remarkable because no gradient updates occur; the model's parameters remain frozen. The examples somehow "program" the model through its forward pass alone.

The GPT-3 paper demonstrated that ICL improves smoothly with model scale. While GPT-3 Small (125M) showed minimal few-shot capability, GPT-3 175B could rival or exceed fine-tuned BERT models on many benchmarks with just a handful of examples.
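Few-shot prompting requires no special API, only string assembly. A sketch of the pattern (the sentiment task, example texts, and the `few_shot_prompt` helper are all invented for illustration):

```python
def few_shot_prompt(examples, query,
                    task="Classify the sentiment as positive or negative."):
    """Assemble a few-shot prompt: instruction, worked examples, new input."""
    lines = [task, ""]
    for text, label in examples:
        lines.append(f"Review: {text}")
        lines.append(f"Sentiment: {label}")
        lines.append("")
    lines.append(f"Review: {query}")
    lines.append("Sentiment:")  # the model completes from here
    return "\n".join(lines)

prompt = few_shot_prompt(
    [("A delightful film.", "positive"), ("Two hours I want back.", "negative")],
    "Surprisingly moving.",
)
print(prompt)
```

The examples never update any weights; they steer the model entirely through the forward pass, which is exactly what makes ICL remarkable.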

Chain-of-Thought Reasoning

Chain-of-thought (CoT) prompting, introduced by Wei et al. (2022), showed that models could solve complex reasoning problems if prompted to show their work step by step. For example, instead of directly answering "If there are 3 cars in a parking lot and 2 more arrive, how many are there?", the model is prompted to produce intermediate reasoning steps. This capability appears to be emergent: it works well in large models (over 100B parameters) but fails in smaller ones.
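The difference is entirely in the exemplars: a standard exemplar maps question directly to answer, while a CoT exemplar inserts the intermediate steps. The wording below is illustrative, in the style of Wei et al. (2022):

```python
standard_exemplar = (
    "Q: If there are 3 cars in a parking lot and 2 more arrive, "
    "how many cars are in the parking lot?\n"
    "A: 5"
)

cot_exemplar = (
    "Q: If there are 3 cars in a parking lot and 2 more arrive, "
    "how many cars are in the parking lot?\n"
    "A: There are originally 3 cars. 2 more cars arrive. "
    "3 + 2 = 5. The answer is 5."
)

# Prepend the CoT exemplar to a new question; a sufficiently large model
# tends to imitate the step-by-step format it was shown.
new_question = "Q: A baker has 7 trays of 12 rolls each. How many rolls?\nA:"
print(cot_exemplar + "\n\n" + new_question)
```

The reasoning steps give the model room to compute intermediate results in its own output before committing to a final answer.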

⚠ Debate Alert

The existence of true "emergence" is contested. Schaeffer, Miranda, and Koyejo (2023) argued that many apparent emergent capabilities are artifacts of the chosen evaluation metrics. When evaluated with continuous metrics instead of threshold-based accuracy, capabilities often show smooth, predictable improvement rather than sudden phase transitions. We explore this debate further in Section 6.3.

5. The Model Comparison Landscape

By the mid-2020s, the field had settled into several distinct paradigms. The following table summarizes the key landmark models and their defining characteristics.

Model | Type | Parameters | Key Innovation | Year
BERT | Encoder | 110M / 340M | Masked language modeling, bidirectionality | 2018
GPT-2 | Decoder | 1.5B | Zero-shot task transfer | 2019
T5 | Enc-Dec | 11B | Text-to-text unification, span corruption | 2019
GPT-3 | Decoder | 175B | In-context learning, few-shot prompting | 2020
PaLM | Decoder | 540B | Pathways system, breakthrough reasoning | 2022
BLOOM | Decoder | 176B | First open multilingual LLM (46 languages) | 2022
InstructGPT | Decoder | 1.3B | RLHF alignment | 2022
Falcon | Decoder | 180B | Curated web data (RefinedWeb), open training data | 2023
GPT-4 | Decoder | Undisclosed | Multimodal, professional-exam performance | 2023
Llama 2 | Decoder | 7B / 70B | Open-weight, high-quality models | 2023

6. The Open-Weight Movement

A parallel development reshaped the field from the access side. Meta's release of Llama (2023) and Llama 2 (2023) provided the community with high-quality open-weight models. Unlike GPT-3 and GPT-4, which were accessible only through APIs, Llama allowed researchers and developers to inspect, modify, and fine-tune the models directly. This catalyzed an explosion of derivative work, from Alpaca and Vicuna to specialized domain models.

Before Llama, other landmark open efforts paved the way. BLOOM (2022) was the first large-scale open-science multilingual LLM, covering 46 languages and 13 programming languages. Trained by a consortium of over 1,000 researchers, BLOOM demonstrated that collaborative open science could produce models at the 176B parameter scale. Falcon (2023) from the Technology Innovation Institute showed that data curation was the critical ingredient: its RefinedWeb dataset, carefully filtered from CommonCrawl, powered a 180B parameter model that topped the Open LLM Leaderboard upon release.

On the closed side, Google's PaLM (2022) pushed scale to 540B parameters using the Pathways training infrastructure, demonstrating breakthrough reasoning capabilities including chain-of-thought solving of math word problems. PaLM later evolved into Gemini, Google's multimodal frontier model family.

The open-weight movement demonstrated that the key insights behind powerful language models were not in secret architectures but in training data quality, scale, and careful engineering. Llama 2 70B, trained on 2 trillion tokens, achieved competitive performance with GPT-3.5 while being freely available for research and commercial use.

💡 Key Insight

The landmark models tell a clear story: scale is a reliable lever for capability. Each generation grew larger and trained on more data, and each generation exhibited qualitatively new abilities. But this is not the whole story. RoBERTa showed that training procedure matters. InstructGPT showed that alignment matters. Chinchilla (Section 6.3) showed that the balance between parameters and data matters. The best practitioners combine all these insights.

Check Your Understanding

1. What is the fundamental difference between BERT's and GPT's pre-training approach?
Show Answer
BERT uses masked language modeling (MLM), where it attends bidirectionally to both left and right context and predicts randomly masked tokens. GPT uses causal language modeling (CLM), where it processes tokens left-to-right and predicts only the next token, never attending to future positions. BERT excels at understanding tasks (classification, NER), while GPT excels at generation tasks.
2. Why was GPT-3's in-context learning considered a paradigm shift?
Show Answer
In-context learning eliminated the need for task-specific fine-tuning. By providing a few examples in the prompt, GPT-3 could perform new tasks without any gradient updates or parameter changes. This meant a single pre-trained model could serve as a general-purpose tool, replacing the previous paradigm of training separate models (or fine-tuning) for each downstream task.
3. What problem did InstructGPT solve, and how?
Show Answer
InstructGPT addressed the alignment problem: raw language models predict likely text rather than helpful text. It used three stages of training: (1) supervised fine-tuning on human-written responses, (2) training a reward model from human preference comparisons, and (3) optimizing the language model against the reward model using PPO. A small (1.3B parameter) InstructGPT model was preferred by humans over the much larger 175B GPT-3.
4. What was T5's unifying contribution to NLP?
Show Answer
T5 reframed every NLP task as a text-to-text problem. Whether the task is translation, summarization, classification, or question answering, both input and output are treated as text strings. This allowed a single encoder-decoder architecture with a single training procedure to be applied uniformly across all tasks, eliminating the need for task-specific output heads or architectures.
5. Why is the debate about emergent capabilities important for practitioners?
Show Answer
If emergence is real (discontinuous jumps in capability at certain scales), then predicting what a model can do requires actually training at that scale; you cannot extrapolate from smaller models. If emergence is a measurement artifact (Schaeffer et al. 2023), then capabilities improve smoothly and predictably, making it easier to plan what size model you need for a given task. This distinction affects resource allocation, risk assessment, and model selection decisions.

Key Takeaways