Module 04

The Transformer Architecture

Part I: Foundations

Module Overview

This is the central module of the entire course. The Transformer, introduced in the landmark 2017 paper "Attention Is All You Need," is the architecture behind virtually every modern large language model. In this module we will dissect it layer by layer, build one from scratch, survey the many variants that have emerged since, understand the GPU hardware it runs on, and explore the theoretical limits of what Transformers can and cannot compute.

Learning Objectives

By the end of this module you will be able to:

- Read a Transformer implementation and modify it with confidence.
- Reason about the computational cost of each component.
- Explain why architectural choices such as positional encoding, layer normalization, and residual connections are deeply principled rather than arbitrary (these pieces are previewed in the sketch below).
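
As a quick orientation for those objectives, here is a minimal PyTorch sketch of a single Transformer block. All names here (TransformerBlock, d_model, n_heads) are illustrative, the pre-norm layout shown is an assumption (most modern models use it, while the original paper used post-norm), and positional encoding, which is applied to the embeddings before the first block, is omitted. Treat it as a preview, not the reference implementation you will build in Section 4.2.

```python
import torch
import torch.nn as nn


class TransformerBlock(nn.Module):
    """A single pre-norm Transformer block: the repeating unit this module dissects."""

    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        # Layer normalization stabilizes the activations entering each sublayer.
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)
        # Multi-head self-attention lets every token attend to every other token.
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # A position-wise feed-forward network transforms each token independently.
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Residual connections: each sublayer adds a refinement to x instead of
        # replacing it, which keeps gradients flowing through a deep stack.
        h = self.ln1(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        x = x + attn_out
        x = x + self.ffn(self.ln2(x))
        return x


# Usage: a batch of 2 sequences, 16 tokens each, embedded in 512 dimensions.
block = TransformerBlock()
out = block(torch.randn(2, 16, 512))
print(out.shape)  # torch.Size([2, 16, 512])
```

Notice that layer normalization and the residual additions structure the entire forward pass; why they sit exactly where they do is one of the questions this module answers.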

Estimated Time

This is the largest module in the course. Plan for 12 to 16 hours of study and coding across all five sections. Section 4.2 (the implementation lab) alone may take 3 to 4 hours the first time through.

Prerequisites

Sections

- 4.1 The architecture, layer by layer
- 4.2 Implementation lab: building a Transformer from scratch
- 4.3 Transformer variants
- 4.4 The GPU hardware Transformers run on
- 4.5 Theoretical limits: what Transformers can and cannot compute