Module 04

The Transformer Architecture

Part I: Foundations

Module Overview

This is the central module of the entire course. The Transformer, introduced in the landmark 2017 paper "Attention Is All You Need," is the architecture behind virtually every modern large language model. In this module we will dissect it layer by layer, build one from scratch, survey the many variants that have emerged since, understand the GPU hardware it runs on, and explore the theoretical limits of what Transformers can and cannot compute.

Learning Objectives

By the end of this module you will be able to:

- Read a Transformer implementation and modify it with confidence.
- Reason about the computational cost of each component.
- Explain why architectural choices such as positional encoding, layer normalization, and residual connections are deeply principled rather than arbitrary (these pieces are previewed in the sketch below).
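
As a quick orientation for those objectives, here is a minimal PyTorch sketch of a single Transformer block. All names here (TransformerBlock, d_model, n_heads) are illustrative, the pre-norm layout shown is an assumption (most modern models use it, while the original paper used post-norm), and positional encoding, which is applied to the embeddings before the first block, is omitted. Treat it as a preview, not the reference implementation you will build in Section 4.2.

```python
import torch
import torch.nn as nn


class TransformerBlock(nn.Module):
    """A single pre-norm Transformer block: the repeating unit this module dissects."""

    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        # Layer normalization stabilizes the activations entering each sublayer.
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)
        # Multi-head self-attention lets every token attend to every other token.
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # A position-wise feed-forward network transforms each token independently.
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Residual connections: each sublayer adds a refinement to x instead of
        # replacing it, which keeps gradients flowing through a deep stack.
        h = self.ln1(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        x = x + attn_out
        x = x + self.ffn(self.ln2(x))
        return x


# Usage: a batch of 2 sequences, 16 tokens each, embedded in 512 dimensions.
block = TransformerBlock()
out = block(torch.randn(2, 16, 512))
print(out.shape)  # torch.Size([2, 16, 512])
```

Notice that layer normalization and the residual additions structure the entire forward pass; why they sit exactly where they do is one of the questions this module answers.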

Estimated Time

This is the largest module in the course. Plan for 12 to 16 hours of study and coding across all five sections. Section 4.2 (the implementation lab) alone may take 3 to 4 hours the first time through.

Prerequisites

Sections

- 4.1 The architecture, layer by layer
- 4.2 Implementation lab: building a Transformer from scratch
- 4.3 Transformer variants
- 4.4 The GPU hardware Transformers run on
- 4.5 Theoretical limits: what Transformers can and cannot compute