Module 16

Alignment: RLHF, DPO & Preference Tuning

Part IV: Training & Adapting LLMs

Chapter Overview

Pretraining and supervised fine-tuning produce capable language models, but raw capability is not the same as usefulness or safety. Alignment is the process of steering an LLM's behavior so that it follows instructions, produces helpful responses, avoids harmful outputs, and generally reflects human preferences. Without alignment, even the most powerful base model may generate toxic, incoherent, or off-topic text.

This module covers the full landscape of preference-based alignment methods. It begins with Reinforcement Learning from Human Feedback (RLHF), the technique that powered ChatGPT's breakthrough, walking through the three-stage pipeline of supervised fine-tuning, reward modeling, and Proximal Policy Optimization (PPO). It then explores modern alternatives like Direct Preference Optimization (DPO) that eliminate the need for a separate reward model, Constitutional AI for scalable self-alignment, and Reinforcement Learning with Verifiable Rewards (RLVR) for training reasoning capabilities.
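To preview the core idea behind DPO before the detailed treatment later in the module: instead of fitting a separate reward model, DPO scores each preference pair directly from log-probabilities under the policy and a frozen reference model. The sketch below is illustrative, not from any particular library; the function name and arguments are assumptions for exposition.

```python
import math

def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Per-example DPO loss for one preference pair (illustrative sketch).

    Each argument is the summed log-probability of a full response under
    either the trainable policy or the frozen reference model.
    """
    # Implicit "reward" of each response: how much the policy has shifted
    # its log-probability relative to the reference model.
    chosen_shift = policy_chosen_logp - ref_chosen_logp
    rejected_shift = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_shift - rejected_shift)
    # Loss is -log(sigmoid(margin)), written as log(1 + exp(-margin)).
    return math.log1p(math.exp(-margin))
```

When the policy still matches the reference, the margin is zero and the loss is log 2 ≈ 0.693; pushing probability toward the chosen response and away from the rejected one drives the loss toward zero, which is exactly the preference-fitting behavior RLHF achieves indirectly through a reward model.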

By the end of this module, you will understand the theoretical foundations and practical engineering of each alignment family, know when to choose one method over another, and be able to implement preference tuning pipelines using current open-source tooling.

Learning Objectives

Prerequisites

Sections