1. The Paper That Changed Everything
This is the architecture inside every AI you have ever used. ChatGPT, Claude, Gemini, Llama: they are all Transformers. In June 2017, Vaswani et al. published "Attention Is All You Need," proposing a sequence-to-sequence model that dropped recurrence and convolutions altogether. At its core, the Transformer relies on a single mechanism repeated many times: scaled dot-product attention, combined with simple position-wise feed-forward networks. This section walks through the complete architecture, explaining not just what each component does, but why it exists and how each design choice shapes the flow of information through the network.
The original Transformer is an encoder-decoder model. The encoder reads the entire input sequence in parallel (no sequential bottleneck like an RNN), and the decoder generates the output sequence one token at a time, attending both to the encoder output and to previously generated tokens. While modern LLMs typically use only the decoder half, understanding the full architecture is essential. Many design principles carry over directly.
The Transformer's core insight is that attention alone, applied across all pairs of positions simultaneously, can capture dependencies of arbitrary range without the vanishing gradient problem that plagues RNNs. The cost is quadratic in sequence length, a tradeoff that later sections of this module will address.
2. Information Theory: The Language of Learning
Before diving into the Transformer's mechanics, we need the mathematical vocabulary that describes how well a model is learning. Every time you see a training loss curve, a perplexity score, or a KL divergence penalty in RLHF, you are looking at information theory at work. Claude Shannon formalized these ideas in 1948, and they remain the foundation of how we measure, train, and evaluate language models.
We introduce information theory now because the Transformer's training objective (cross-entropy loss) and its evaluation metric (perplexity) both come directly from these concepts. Understanding them first will make every subsequent discussion of training, loss landscapes, and model comparison more concrete.
2.1 Entropy: Measuring Uncertainty
Entropy quantifies how much uncertainty (or "surprise") a random variable carries. For a discrete random variable X with possible outcomes x and probabilities P(x):

H(X) = −∑ P(x) log₂ P(x)
The unit is bits when we use log base 2. Each bit represents one yes/no question needed to determine the outcome.
Example: the coin flip. A fair coin has P(heads) = P(tails) = 0.5:

H = −(0.5 log₂ 0.5 + 0.5 log₂ 0.5) = 1 bit

One bit: you need exactly one yes/no question ("Is it heads?") to determine the result. Now consider a loaded coin with P(heads) = 0.9, P(tails) = 0.1:

H = −(0.9 log₂ 0.9 + 0.1 log₂ 0.1) ≈ 0.469 bits
Less uncertainty means lower entropy. You already have a strong guess (heads), so less information is needed to pin down the outcome.
Entropy is maximized when all outcomes are equally likely, and minimized (zero) when the outcome is certain. For language, high entropy means the next token is hard to predict; low entropy means the model is confident. A perfect language model's entropy would match the true entropy of the language.
2.2 Cross-Entropy: The Loss We Minimize
In practice, we do not know the true distribution P of natural language. Instead, our model defines a learned distribution Q. Cross-entropy measures how many bits the model Q needs to encode data drawn from the true distribution P:

H(P, Q) = −∑ P(x) log₂ Q(x)
When Q matches P perfectly, cross-entropy equals entropy: H(P, P) = H(P). Any imperfection in Q pushes cross-entropy higher than entropy. The gap between them is the KL divergence (see below).
In LLM training, P is the one-hot distribution over the correct next token, and Q is the model's softmax output. This simplifies cross-entropy to:

H(P, Q) = −log₂ Q(correct token)
If the model assigns probability 0.9 to the correct token, the loss is about 0.15 bits. If it assigns only 0.01, the loss jumps to about 6.64 bits. Small probabilities get magnified dramatically, which is why the model learns quickly from confident mistakes. (The magnification table in Section 4.4 illustrates this effect in detail.)
2.3 Perplexity: An Intuitive Scorecard
Perplexity converts cross-entropy into a more interpretable number:

Perplexity = 2^H(P, Q)
Perplexity of 100 means the model is, on average, "as confused as if it were choosing uniformly among 100 equally likely options" at every token. Lower is better. A perfect model on English text would have a perplexity equal to 2^H(English), roughly estimated at 20 to 50 depending on the domain.
Historical landmarks help calibrate intuition:
- GPT-2 (2019, 1.5B parameters): perplexity around 30 on standard benchmarks.
- GPT-3 (2020, 175B parameters): perplexity around 20, a significant improvement.
- Modern frontier models: perplexities in the low teens on common benchmarks, though exact numbers depend heavily on the evaluation dataset.
2.4 KL Divergence: Measuring the Gap
Kullback-Leibler divergence measures how much extra cost (in bits) we pay by using the approximate distribution Q instead of the true distribution P:

D_KL(P ‖ Q) = ∑ P(x) log₂ [P(x) / Q(x)]
Three essential properties:
- Non-negative: D_KL(P ‖ Q) ≥ 0, with equality only when P = Q.
- Not symmetric: D_KL(P ‖ Q) ≠ D_KL(Q ‖ P) in general. The direction matters.
- Decomposes cross-entropy: H(P, Q) = H(P) + D_KL(P ‖ Q). Minimizing cross-entropy is therefore equivalent to minimizing KL divergence, since the entropy of the true distribution is a constant we cannot change.
In Module 16 (RLHF and alignment), KL divergence plays a critical role: the reward model encourages the fine-tuned policy to improve, while a KL penalty keeps it from straying too far from the base model. Without this constraint, the model can "hack" the reward signal by producing degenerate outputs that score high on the reward model but are incoherent.
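To make this concrete, here is a minimal sketch of how a KL-style penalty is typically folded into the reward during RLHF-style fine-tuning. The function name, the coefficient beta, and the use of summed log-probability differences as the KL estimate are illustrative assumptions, not the exact objective of any particular paper:

import numpy as np

def kl_penalized_reward(reward, policy_logprobs, base_logprobs, beta=0.1):
    """Sketch: reward-model score minus a KL-style penalty toward the base model.

    reward          -- scalar score from the reward model (assumed given)
    policy_logprobs -- log-probs the fine-tuned policy assigned to the sampled tokens
    base_logprobs   -- log-probs the frozen base model assigned to the same tokens
    beta            -- penalty strength (illustrative value)
    """
    # The summed per-token log-ratio log(pi(token) / p_base(token)) is a standard
    # single-sample estimate of the KL term for this sequence.
    kl_estimate = np.sum(np.asarray(policy_logprobs) - np.asarray(base_logprobs))
    return reward - beta * kl_estimate

A larger beta keeps the policy closer to the base model; beta = 0 removes the constraint and invites the reward hacking described above.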
2.5 Mutual Information (Brief)
Mutual information I(X; Y) measures how much knowing one variable reduces uncertainty about another:

I(X; Y) = H(X) + H(Y) − H(X, Y) = H(Y) − H(Y | X)
If X and Y are independent, mutual information is zero. If knowing X completely determines Y, mutual information equals H(Y). In the context of LLMs, mutual information appears in probing studies (Module 17), where researchers measure how much information about a linguistic property (syntax, semantics) is captured in the model's hidden representations. It also informs information-theoretic evaluation metrics that go beyond simple perplexity.
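As a quick illustration (a minimal standalone sketch, not tied to any particular probing study), mutual information can be computed directly from a joint probability table via the identity I(X; Y) = H(X) + H(Y) − H(X, Y):

import numpy as np

def mutual_information(joint):
    """I(X; Y) = H(X) + H(Y) - H(X, Y), from a 2-D joint probability table."""
    joint = np.asarray(joint, dtype=np.float64)
    px, py = joint.sum(axis=1), joint.sum(axis=0)  # marginal distributions

    def h(p):
        p = p[p > 0]
        return -np.sum(p * np.log2(p))

    return h(px) + h(py) - h(joint.ravel())

# Knowing X fully determines Y: I(X;Y) = H(Y) = 1 bit.
print(mutual_information([[0.5, 0.0], [0.0, 0.5]]))      # 1.0
# Independent variables: the joint factorizes, so I(X;Y) = 0.
print(mutual_information([[0.25, 0.25], [0.25, 0.25]]))  # 0.0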
2.6 Code Example: Computing the Metrics
import numpy as np
# --- Entropy ---
def entropy(probs):
"""H(X) = -sum P(x) log2 P(x), ignoring zero probabilities."""
probs = np.array(probs, dtype=np.float64)
mask = probs > 0
return -np.sum(probs[mask] * np.log2(probs[mask]))
fair_coin = [0.5, 0.5]
loaded_coin = [0.9, 0.1]
print(f"Fair coin entropy: {entropy(fair_coin):.4f} bits") # 1.0000
print(f"Loaded coin entropy: {entropy(loaded_coin):.4f} bits") # 0.4690
# --- Cross-Entropy ---
def cross_entropy(p, q):
"""H(P, Q) = -sum P(x) log2 Q(x)."""
p, q = np.array(p, dtype=np.float64), np.array(q, dtype=np.float64)
mask = p > 0
return -np.sum(p[mask] * np.log2(q[mask]))
# True distribution vs. model prediction
p_true = [0.0, 1.0, 0.0] # correct token is index 1
q_good = [0.05, 0.90, 0.05] # confident model
q_bad = [0.30, 0.40, 0.30] # uncertain model
print(f"\nCross-entropy (good model): {cross_entropy(p_true, q_good):.4f} bits") # 0.1520
print(f"Cross-entropy (bad model): {cross_entropy(p_true, q_bad):.4f} bits") # 1.3219
# --- Perplexity ---
def perplexity(p, q):
"""Perplexity = 2^(cross-entropy)."""
return 2 ** cross_entropy(p, q)
print(f"\nPerplexity (good model): {perplexity(p_true, q_good):.2f}") # 1.11
print(f"Perplexity (bad model): {perplexity(p_true, q_bad):.2f}") # 2.50
# --- KL Divergence ---
def kl_divergence(p, q):
"""D_KL(P || Q) = sum P(x) log2(P(x) / Q(x))."""
p, q = np.array(p, dtype=np.float64), np.array(q, dtype=np.float64)
mask = p > 0
return np.sum(p[mask] * np.log2(p[mask] / q[mask]))
p_lang = [0.7, 0.2, 0.1]
q_model = [0.5, 0.3, 0.2]
print(f"\nKL divergence D_KL(P||Q): {kl_divergence(p_lang, q_model):.4f} bits")
# Verify: cross-entropy = entropy + KL
ce = cross_entropy(p_lang, q_model)
h = entropy(p_lang)
kl = kl_divergence(p_lang, q_model)
print(f"H(P,Q)={ce:.4f} H(P)={h:.4f} D_KL={kl:.4f} H(P)+D_KL={h+kl:.4f}")
2.7 Visualizing the Relationships
2.8 Comparison Table
| Metric | Formula | Interpretation | Where Used in This Course |
|---|---|---|---|
| Entropy | H(P) = −∑ P(x) log₂ P(x) | Inherent uncertainty in the true distribution | Theoretical lower bound on loss (Sec. 4.1, Module 14) |
| Cross-Entropy | H(P, Q) = −∑ P(x) log₂ Q(x) | Cost of encoding P using model Q | Training loss for all LLMs (Modules 4, 8, 14) |
| Perplexity | 2^H(P, Q) | Effective vocabulary size of model's uncertainty | Evaluation metric (Modules 5, 14, 15) |
| KL Divergence | ∑ P(x) log₂ [P(x) / Q(x)] | Extra bits wasted by using Q instead of P | RLHF penalty (Module 16), distillation (Module 15) |
| Mutual Information | H(X) + H(Y) − H(X, Y) | Shared information between two variables | Probing studies (Module 17), information-theoretic eval |
3. High-Level Architecture
The Transformer consists of two stacks: an encoder (N=6 identical layers) and a decoder (N=6 identical layers). Each encoder layer has two sub-layers: (1) a multi-head self-attention mechanism and (2) a position-wise feed-forward network. Each decoder layer has three sub-layers: (1) masked multi-head self-attention, (2) multi-head cross-attention over the encoder output, and (3) a position-wise feed-forward network. Every sub-layer is wrapped in a residual connection followed by layer normalization.
4. Input Representation and Positional Encoding
4.1 Token Embeddings
The first step is converting discrete tokens into continuous vectors. A learned embedding matrix W_E ∈ ℝ^(V × d) maps each token index to a d-dimensional vector. In the original paper, d = 512 and V ≈ 37,000 (BPE tokens for English-German translation). The embedding weights are multiplied by √d to bring their scale in line with the positional encodings that are added next.
# Token embedding with scaling
import torch
import torch.nn as nn
class TokenEmbedding(nn.Module):
def __init__(self, vocab_size, d_model):
super().__init__()
self.embed = nn.Embedding(vocab_size, d_model)
self.scale = d_model ** 0.5
def forward(self, x):
return self.embed(x) * self.scale
4.2 Why We Need Positional Encoding
Self-attention is a set operation: it is permutation-equivariant, meaning that if you shuffle the input tokens, the outputs shuffle in the same way. Without any notion of position, the model cannot distinguish "the cat sat on the mat" from "mat the on sat cat the." Positional encoding injects ordering information into the representation.
4.3 Sinusoidal Positional Encoding
The original paper uses a fixed (non-learned) encoding based on sine and cosine functions of different frequencies:

PE(pos, 2i) = sin(pos / 10000^(2i/d))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d))
Here pos is the position in the sequence and i is the dimension index. Each dimension oscillates at a different frequency, forming a unique "barcode" for each position. The key property: for any fixed offset k, the encoding at position pos + k can be written as a linear function of the encoding at position pos. This allows the model to learn relative position patterns through linear projections.
import math
class SinusoidalPE(nn.Module):
def __init__(self, d_model, max_len=5000):
super().__init__()
pe = torch.zeros(max_len, d_model)
position = torch.arange(0, max_len).unsqueeze(1).float()
div_term = torch.exp(
torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model)
)
pe[:, 0::2] = torch.sin(position * div_term)
pe[:, 1::2] = torch.cos(position * div_term)
self.register_buffer('pe', pe.unsqueeze(0)) # (1, max_len, d_model)
def forward(self, x):
# x: (batch, seq_len, d_model)
return x + self.pe[:, :x.size(1)]
GPT-2 and many later models use learned positional embeddings instead, which are simply an additional embedding table indexed by position. Empirically, both approaches work comparably for training-length sequences, but sinusoidal encodings can extrapolate to longer sequences more gracefully. Modern approaches like RoPE (Rotary Position Embedding) combine the best of both worlds and are discussed in a later module.
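For contrast with the sinusoidal version above, here is a minimal sketch of the learned-position alternative (the class name and max_len default are illustrative; torch and nn are imported earlier in this section):

class LearnedPE(nn.Module):
    """Learned positional embeddings: one trainable vector per position index."""
    def __init__(self, d_model, max_len=1024):
        super().__init__()
        self.pe = nn.Embedding(max_len, d_model)

    def forward(self, x):
        # x: (batch, seq_len, d_model)
        positions = torch.arange(x.size(1), device=x.device)
        return x + self.pe(positions)  # broadcasts over the batch dimension

Unlike the sinusoidal table, these vectors are undefined beyond max_len, which is one reason they extrapolate poorly to longer sequences.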
5. Scaled Dot-Product Attention (Revisited)
We covered attention in Module 3, but let us revisit it through the lens of the full Transformer. The attention function maps a query and a set of key-value pairs to an output. All are vectors. The output is a weighted sum of the values, where each weight is determined by the compatibility of the query with the corresponding key:

Attention(Q, K, V) = softmax(QKᵀ / √d_k) V
The division by √d_k is crucial. Vaswani et al. give a simple statistical argument: if the components of Q and K are independent random variables with mean 0 and variance 1, their dot product has mean 0 and variance d_k. For d_k = 64, unscaled dot products therefore have a standard deviation of 8, which pushes many softmax inputs into extreme tails where gradients are nearly zero (the saturation problem). Dividing by √d_k = 8 restores unit variance, keeping the softmax in its sensitive regime where small changes in input produce meaningful changes in output.
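A quick numerical check of this argument (a small sketch with random unit-variance vectors, not from the original paper):

import numpy as np

rng = np.random.default_rng(0)
d_k = 64
q = rng.standard_normal((10_000, d_k))   # query components: mean 0, variance 1
k = rng.standard_normal((10_000, d_k))   # key components: mean 0, variance 1

dots = (q * k).sum(axis=1)               # 10,000 sample dot products
print(f"{dots.std():.2f}")                # ~8.0, i.e. sqrt(d_k)
print(f"{(dots / d_k**0.5).std():.2f}")   # ~1.0 after scaling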
5.1 Multi-Head Attention
Instead of performing a single attention function with d-dimensional keys, values, and queries, the Transformer linearly projects them h times with different learned projections, performs attention in parallel on each projection, concatenates the results, and projects again:

MultiHead(Q, K, V) = Concat(head_1, …, head_h) W^O
where head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)
With h = 8 heads and d = 512, each head operates on d_k = d_v = 64 dimensions. The total computational cost is similar to that of single-head attention with the full d = 512, but multiple heads allow the model to jointly attend to information from different representation subspaces at different positions. One head might learn syntactic dependencies while another captures semantic relatedness.
class MultiHeadAttention(nn.Module):
def __init__(self, d_model, n_heads):
super().__init__()
assert d_model % n_heads == 0
self.d_k = d_model // n_heads
self.n_heads = n_heads
self.W_q = nn.Linear(d_model, d_model)
self.W_k = nn.Linear(d_model, d_model)
self.W_v = nn.Linear(d_model, d_model)
self.W_o = nn.Linear(d_model, d_model)
def forward(self, q, k, v, mask=None):
B, T, C = q.shape
# Project and reshape: (B, T, d) -> (B, h, T, d_k)
q = self.W_q(q).view(B, T, self.n_heads, self.d_k).transpose(1, 2)
k = self.W_k(k).view(B, -1, self.n_heads, self.d_k).transpose(1, 2)
v = self.W_v(v).view(B, -1, self.n_heads, self.d_k).transpose(1, 2)
# Scaled dot-product attention
scores = (q @ k.transpose(-2, -1)) / (self.d_k ** 0.5)
if mask is not None:
scores = scores.masked_fill(mask == 0, float('-inf'))
attn = torch.softmax(scores, dim=-1)
# Combine heads
out = (attn @ v).transpose(1, 2).contiguous().view(B, T, C)
return self.W_o(out)
A single attention head computes one set of attention weights. If position 5 needs to attend to both position 2 (for syntax) and position 8 (for coreference), a single softmax distribution forces a compromise. Multiple heads let the model maintain multiple, independent attention patterns simultaneously. Think of each head as a different "question" the model can ask about the context.
6. Position-Wise Feed-Forward Network
After every attention sub-layer, the Transformer applies a simple two-layer feed-forward network to each position independently and identically:

FFN(x) = max(0, x W₁ + b₁) W₂ + b₂
This is applied to each token position separately (hence "position-wise"). The inner dimension is typically 4 times the model dimension: with d = 512, the inner layer has d_ff = 2048 units. The FFN accounts for roughly two-thirds of the parameters in each Transformer layer.
Why is the FFN important? Attention allows tokens to mix information across positions, but it is a linear operation over the value vectors (the softmax produces convex combination weights). The FFN provides the per-token nonlinear transformation that is essential for the model to learn complex functions. Think of attention as routing information and the FFN as processing it.
class FeedForward(nn.Module):
def __init__(self, d_model, d_ff, dropout=0.1):
super().__init__()
self.net = nn.Sequential(
nn.Linear(d_model, d_ff),
nn.ReLU(),
nn.Dropout(dropout),
nn.Linear(d_ff, d_model),
nn.Dropout(dropout),
)
def forward(self, x):
return self.net(x)
If attention is the "reading" step (gathering information from across the sequence), the FFN is the "thinking" step (processing gathered information for each position independently). Geva et al. (2021) showed that FFN layers act as learned key-value memories: each row of the first weight matrix detects a pattern, and the corresponding row of the second weight matrix stores associated knowledge. When the model "knows" that Paris is the capital of France, that knowledge is likely stored in an FFN layer, not in attention.
Most modern Transformers replace the ReLU FFN with a gated variant. The SwiGLU activation (used in LLaMA, PaLM, and others) splits the first linear projection into two branches and multiplies them element-wise: FFN(x) = (x W₁ ⊙ SiLU(x W_gate)) W₂. This consistently improves performance at a modest increase in parameter count.
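A minimal sketch of such a gated FFN (the class name is illustrative, and real models such as LLaMA also shrink d_ff to keep the parameter count comparable to the ReLU version):

class SwiGLUFeedForward(nn.Module):
    """Gated FFN: (x W1 * SiLU(x W_gate)) W2."""
    def __init__(self, d_model, d_ff):
        super().__init__()
        self.W1 = nn.Linear(d_model, d_ff, bias=False)
        self.W_gate = nn.Linear(d_model, d_ff, bias=False)
        self.W2 = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x):
        # Element-wise product of the "content" branch and the SiLU-gated branch
        return self.W2(self.W1(x) * torch.nn.functional.silu(self.W_gate(x)))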
7. Residual Connections
Every sub-layer in the Transformer is wrapped with a residual (skip) connection:

output = x + SubLayer(x)
Residual connections, introduced in ResNet (He et al., 2016), solve the degradation problem in deep networks: as you add more layers, training loss can increase because the optimization landscape becomes harder to navigate. A residual connection provides a gradient highway that allows gradients to flow directly from the output back to earlier layers without attenuation.
7.1 The Information-Theoretic View
From an information flow perspective, residual connections ensure that the original input to each layer is preserved. Each sub-layer only needs to learn the delta (the difference between the desired output and the input). This is a much easier optimization target. If a layer has nothing useful to add, it can learn to output near-zero, effectively becoming an identity function. Without residuals, each layer must learn to pass through all information, including what it does not modify.
In a Transformer with N layers (and therefore 2N sub-layers), the residual connections create 2^(2N) possible paths through the network (each sub-layer can be either included or skipped). This ensemble-like behavior helps explain the robustness of deep Transformers.
8. Layer Normalization
Layer normalization (Ba, Kiros, and Hinton, 2016) normalizes the activations across the feature dimension for each individual token:

LayerNorm(x) = γ ⊙ (x − μ) / (σ + ε) + β
where μ and σ are the mean and standard deviation computed across the feature dimensions of a single token, γ and β are learned scale and shift parameters, and ε is a small constant for numerical stability.
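To make the computation concrete, here is a minimal hand-written sketch (in practice you would use nn.LayerNorm; the eps default here is an assumption):

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize each token's feature vector, then apply learned scale and shift."""
    mu = x.mean(dim=-1, keepdim=True)                    # mean over the feature dimension
    sigma = x.std(dim=-1, keepdim=True, unbiased=False)  # std over the feature dimension
    return gamma * (x - mu) / (sigma + eps) + beta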
8.1 Pre-LN vs. Post-LN
The original paper applies layer normalization after the residual addition (Post-LN): LayerNorm(x + SubLayer(x)). Most modern Transformers use Pre-LN, applying normalization before the sub-layer: x + SubLayer(LayerNorm(x)).
| Property | Post-LN (Original) | Pre-LN (Modern) |
|---|---|---|
| Gradient scale | Depends on depth; can explode | Roughly constant across layers |
| Warmup required? | Yes, critical for stability | Often trains without warmup |
| Final performance | Slightly higher ceiling (some studies) | Slightly lower but more stable |
| Used in | Original Transformer, BERT | GPT-2, GPT-3, LLaMA, most modern LLMs |
Pre-LN is the default for good reason: Post-LN training can diverge catastrophically without learning rate warmup and careful initialization. If you are building a new model and have no compelling reason to use Post-LN, choose Pre-LN. When using Pre-LN, remember to add a final layer normalization after the last Transformer block (before the output projection), since the sub-layer output is not normalized.
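Putting the last three sections together, here is a minimal sketch of a Pre-LN decoder block built from the MultiHeadAttention and FeedForward modules defined earlier (the class name is illustrative):

class PreLNBlock(nn.Module):
    """Pre-LN Transformer block: x + SubLayer(LayerNorm(x)) for both sub-layers."""
    def __init__(self, d_model, n_heads, d_ff):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = MultiHeadAttention(d_model, n_heads)
        self.ln2 = nn.LayerNorm(d_model)
        self.ffn = FeedForward(d_model, d_ff)

    def forward(self, x, mask=None):
        h = self.ln1(x)
        x = x + self.attn(h, h, h, mask)   # attention sub-layer + residual
        x = x + self.ffn(self.ln2(x))      # FFN sub-layer + residual
        return x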
9. Weight Initialization
Proper initialization is critical for training deep Transformers. The standard approach uses Xavier (Glorot) initialization for most weights: values are drawn from a uniform or normal distribution with variance 2 / (fan_in + fan_out). This ensures that the variance of activations stays roughly constant as they propagate forward through layers.
A subtle but important refinement, used in GPT-2 and later models, is to scale the initialization of the output projection in the residual path by 1 / √(2N), where N is the number of layers. The factor of 2 comes from having two residual sub-layers per block (attention and FFN). This keeps the residual stream variance from growing as O(N) through the network.
def init_weights(module):
"""GPT-2 style initialization."""
if isinstance(module, nn.Linear):
nn.init.normal_(module.weight, mean=0.0, std=0.02)
if module.bias is not None:
nn.init.zeros_(module.bias)
elif isinstance(module, nn.Embedding):
nn.init.normal_(module.weight, mean=0.0, std=0.02)
def scale_residual_init(module, n_layers):
"""Scale output projections in residual blocks."""
for name, param in module.named_parameters():
if name.endswith('W_o.weight') or name.endswith('net.3.weight'):
# net.3 is the second linear layer in the FFN
nn.init.normal_(param, mean=0.0, std=0.02 / (2 * n_layers) ** 0.5)
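A typical usage pattern, assuming the blocks above have been assembled into a model (the model construction and layer count here are placeholders for illustration):

n_layers = 12
model = nn.Sequential(*[PreLNBlock(d_model=768, n_heads=12, d_ff=3072) for _ in range(n_layers)])
model.apply(init_weights)             # base init for every Linear and Embedding
scale_residual_init(model, n_layers)  # then shrink the residual output projections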
10. The Causal Mask (Decoder Self-Attention)
In auto-regressive language models (and in the decoder of the original Transformer), each position can only attend to itself and to earlier positions. This is enforced by a causal mask: -inf is added to every attention score in the strictly upper-triangular part of the score matrix (the future positions) before the softmax. The softmax then assigns those positions zero weight, blocking information flow from future tokens.
def causal_mask(seq_len, device):
"""Returns a boolean mask: True = allowed, False = blocked."""
return torch.tril(torch.ones(seq_len, seq_len, device=device, dtype=torch.bool))
# Usage in attention:
# scores.masked_fill_(~mask, float('-inf'))
The mask ensures that the prediction for position t depends only on tokens at positions 0, 1, ..., t. This is what makes the model auto-regressive: during generation, each new token can be produced by conditioning only on the tokens generated so far.
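For a concrete picture, here is the mask for a four-token sequence (an illustrative print using the helper above):

mask = causal_mask(4, device='cpu')
print(mask.int())
# tensor([[1, 0, 0, 0],
#         [1, 1, 0, 0],
#         [1, 1, 1, 0],
#         [1, 1, 1, 1]])

Row t shows which positions token t may attend to: itself and everything before it.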
11. The Complete Forward Pass
Let us trace a single forward pass through a decoder-only Transformer, the architecture used by GPT and most modern LLMs:
- Tokenize the input text into a sequence of integer token IDs.
- Embed the tokens: look up each ID in the embedding table and scale by √d.
- Add positional encoding (sinusoidal, learned, or RoPE).
- For each of the N Transformer blocks:
- Apply Layer Normalization (Pre-LN).
- Compute Masked Multi-Head Self-Attention (with the causal mask).
- Add the residual (skip connection).
- Apply Layer Normalization (Pre-LN).
- Apply the Feed-Forward Network.
- Add the residual.
- Apply a final Layer Normalization.
- Project to vocabulary size with a linear layer (often weight-tied with the embedding matrix).
- Apply softmax to obtain next-token probabilities.
Many models share ("tie") the embedding matrix and the final output projection matrix. Since both map between d-dimensional space and vocabulary space, sharing weights reduces parameter count significantly (by V × d parameters) and provides a useful inductive bias: similar tokens should have similar embeddings and similar output logits.
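The steps above can be assembled into a compact decoder-only model from the components defined in this module. This is a minimal illustrative sketch, not the exact GPT recipe; the class name and hyperparameter defaults are assumptions:

class MiniDecoder(nn.Module):
    def __init__(self, vocab_size, d_model=256, n_heads=4, d_ff=1024,
                 n_layers=4, max_len=512):
        super().__init__()
        self.embed = TokenEmbedding(vocab_size, d_model)
        self.pos = SinusoidalPE(d_model, max_len)
        self.blocks = nn.ModuleList(
            [PreLNBlock(d_model, n_heads, d_ff) for _ in range(n_layers)]
        )
        self.ln_final = nn.LayerNorm(d_model)
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)
        self.lm_head.weight = self.embed.embed.weight  # weight tying with the embedding

    def forward(self, token_ids):
        # token_ids: (batch, seq_len) integer IDs
        x = self.pos(self.embed(token_ids))             # embed tokens, add positions
        mask = causal_mask(token_ids.size(1), token_ids.device)
        for block in self.blocks:                       # N Pre-LN Transformer blocks
            x = block(x, mask)
        x = self.ln_final(x)                            # final LayerNorm (Pre-LN convention)
        return self.lm_head(x)                          # logits; softmax applied by the loss or sampler

# logits = MiniDecoder(vocab_size=1000)(torch.randint(0, 1000, (2, 16)))  # shape (2, 16, 1000)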
12. Information Flow Through the Residual Stream
A powerful mental model for understanding Transformers (popularized by Elhage et al. at Anthropic) is the residual stream perspective. Instead of viewing the Transformer as a sequence of layers, imagine a single stream of vectors (one per position) flowing from the input embedding to the output. Each attention layer and each FFN layer reads from and writes to this stream additively:
Each sub-layer sees the accumulated state of the residual stream up to that point, performs some computation, and adds its contribution back. This means that earlier layers can communicate with later layers directly through the residual stream, without the information needing to "pass through" every intermediate layer. It also means that deleting a layer from a trained Transformer may be less catastrophic than you might expect; the residual stream carries forward most of the information regardless.
This perspective is central to the field of mechanistic interpretability, where researchers decompose the behavior of trained Transformers into the contributions of individual heads and MLP layers. We will return to this in a later module.
13. Putting It All Together: Parameter Counts
For a Transformer with N layers, model dimension d, feed-forward dimension d_ff = 4d, h attention heads, and vocabulary size V, the parameter count is approximately:
| Component | Parameters per Layer | Notes |
|---|---|---|
| Attention (Q, K, V, O) | 4d² | Four weight matrices, each d × d |
| FFN (two linears) | 8d² | d × 4d + 4d × d |
| LayerNorms (2 per block) | 4d | Scale and shift, each of size d |
| Per-layer total | ≈ 12d² | Dominated by FFN + Attention |
| Embedding + Output | 2 × V × d | Halved to V × d with weight tying |
| Total (no tying) | ≈ 12Nd² + 2Vd | Ignoring LayerNorm and bias terms |
For GPT-3 (N=96, d=12288, V=50257), this gives roughly 175 billion parameters. The FFN contributes about twice as many parameters as the attention layers, a ratio that remains consistent across model sizes.
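A quick sanity check of this estimate (a small helper based on the approximation above; it ignores LayerNorm and bias parameters):

def approx_params(n_layers, d_model, vocab_size, tied=False):
    """Approximate parameter count: 12*N*d^2 for the blocks plus embedding/output."""
    blocks = 12 * n_layers * d_model ** 2
    embed = vocab_size * d_model if tied else 2 * vocab_size * d_model
    return blocks + embed

print(f"{approx_params(96, 12288, 50257) / 1e9:.1f}B")  # ~175.2B, the GPT-3 estimate above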
Key Takeaways
- The Transformer processes all positions in parallel using self-attention plus feed-forward networks, avoiding the sequential bottleneck of RNNs.
- Positional encoding injects ordering information that attention alone cannot capture.
- Multi-head attention lets the model attend to multiple aspects of context simultaneously.
- The FFN provides the essential nonlinear per-token transformation; it holds roughly 2/3 of each layer's parameters.
- Residual connections create gradient highways and enable an "ensemble" of paths through the network.
- Pre-LN ordering is preferred in modern models for training stability.
- Careful initialization (Xavier + residual path scaling) prevents variance explosion in deep models.
- The residual stream perspective views each sub-layer as reading from and writing to a shared communication channel.
Check Your Understanding
1. Why does the original Transformer scale dot products by 1/√d_k?
2. What is the difference between Pre-LN and Post-LN, and which is preferred in practice?
3. Why are residual connections essential in deep Transformers?
4. In a 12-layer model with d=768, approximately how many parameters are in the Transformer blocks (excluding embeddings)?
5. What does the "residual stream" perspective mean, and why is it useful?
6. Why does GPT-2 scale the initialization of residual output projections by 1/√(2N)?