Part IV: Training & Adapting LLMs
As large language models are deployed in high-stakes applications, the question "why did the model produce this output?" becomes critical. Interpretability research aims to open the black box of transformer models, revealing the internal computations that drive predictions, the features that neurons encode, and the circuits that implement specific behaviors.
This module covers the full spectrum of interpretability methods for transformers. It begins with attention analysis and probing classifiers, which offer accessible entry points for understanding model internals. It then advances to mechanistic interpretability, the ambitious program of reverse-engineering neural networks at the level of individual features and circuits. The module also covers practical interpretability tools for debugging, model editing, and representation engineering, as well as formal attribution methods for explaining transformer predictions.
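To make the probing-classifier idea concrete, here is a minimal sketch. The hidden states are synthesized rather than extracted from a real model, and the dimension that encodes the label is planted by hand; in practice you would collect activations from a transformer layer and train the probe on a labeled linguistic property.

```python
import numpy as np

# Hypothetical setup: in a real experiment these would be hidden states
# extracted from a transformer layer; here we synthesize them so the
# example is self-contained.
rng = np.random.default_rng(0)
n, d = 500, 64
labels = rng.integers(0, 2, size=n)      # e.g. "is this token a noun?"
hidden = rng.normal(size=(n, d))
hidden[:, 3] += 3.0 * labels             # dimension 3 linearly encodes the label

# Linear probe: logistic regression trained by gradient descent.
w, b = np.zeros(d), 0.0
for _ in range(300):
    z = hidden @ w + b
    p = 1.0 / (1.0 + np.exp(-z))         # predicted probability of label 1
    grad = p - labels                    # gradient of the log-loss w.r.t. z
    w -= 0.1 * (hidden.T @ grad) / n
    b -= 0.1 * grad.mean()

acc = ((hidden @ w + b > 0) == labels).mean()
print(f"probe accuracy: {acc:.2f}")
```

High probe accuracy suggests the property is linearly decodable from the representation, though, as the probing literature cautions, it does not by itself show the model *uses* that information.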
By the end of this module, you will be able to analyze attention patterns to understand model behavior, use probing classifiers to test what information is encoded in hidden states, apply sparse autoencoders to extract interpretable features, and employ attribution methods to explain individual predictions.
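As a preview of attention analysis, the sketch below computes a causal attention pattern from scratch for a toy single-head example. The query and key matrices are random stand-ins; in a real model they come from learned projections of the residual stream, and libraries such as Hugging Face Transformers can return these patterns directly.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Toy single-head attention over a 4-token sequence (random Q and K
# stand in for learned projections of the hidden states).
rng = np.random.default_rng(0)
T, d = 4, 8
Q = rng.normal(size=(T, d))
K = rng.normal(size=(T, d))

scores = Q @ K.T / np.sqrt(d)
# Causal mask: token i may only attend to positions <= i.
mask = np.triu(np.ones((T, T), dtype=bool), k=1)
scores[mask] = -np.inf
pattern = softmax(scores, axis=-1)  # each row is a distribution over source positions

print(np.round(pattern, 2))
```

Each row of `pattern` sums to 1 and can be visualized as a heatmap; recurring structures in such heatmaps (attending to the previous token, to delimiters, or to syntactic heads) are the raw material of attention analysis.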