Part IV: Training & Adapting LLMs
Pretraining and supervised fine-tuning produce capable language models, but raw capability is not the same as usefulness or safety. Alignment is the process of steering an LLM's behavior so that it follows instructions, produces helpful responses, avoids harmful outputs, and generally reflects human preferences. Without alignment, even the most powerful base model may generate toxic, incoherent, or off-topic text.
This module covers the full landscape of preference-based alignment methods. It begins with RLHF, the technique that powered ChatGPT's breakthrough, walking through the three-stage pipeline of supervised fine-tuning, reward modeling, and Proximal Policy Optimization (PPO). It then explores modern alternatives: Direct Preference Optimization (DPO), which eliminates the need for a separate reward model; Constitutional AI, for scalable self-alignment; and Reinforcement Learning with Verifiable Rewards (RLVR), for training reasoning capabilities.
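To preview the kind of objective these methods optimize, here is a minimal sketch of the DPO loss on a single preference pair. It assumes the summed token log-probabilities of the chosen and rejected completions (under both the policy being trained and a frozen reference model) have already been computed; real pipelines work on batched tensors with a library such as TRL, but the math is the same.

```python
import math

def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """DPO loss for one (chosen, rejected) completion pair.

    Inputs are summed token log-probabilities of each completion under
    the trainable policy and a frozen reference model.
    """
    # Implicit rewards: how much more the policy favors each completion
    # than the reference model does, scaled by beta.
    chosen_margin = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_margin = beta * (policy_rejected_logp - ref_rejected_logp)
    # Bradley-Terry negative log-likelihood that the chosen completion wins:
    # -log sigmoid(chosen_margin - rejected_margin)
    logit = chosen_margin - rejected_margin
    return -math.log(1.0 / (1.0 + math.exp(-logit)))

# When the policy already prefers the chosen completion more strongly
# than the reference does, the loss falls below log(2); when it prefers
# the rejected one, the loss rises above log(2).
loss_agree = dpo_loss(-10.0, -20.0, -15.0, -15.0)
loss_disagree = dpo_loss(-20.0, -10.0, -15.0, -15.0)
```

Note how no reward model appears anywhere: the log-probability ratios against the reference model act as an implicit reward, which is exactly what lets DPO collapse RLHF's three-stage pipeline into a single supervised-style training loop.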
By the end of this module, you will understand the theoretical foundations and practical engineering of each alignment family, know when to choose one method over another, and be able to implement preference tuning pipelines using current open-source tooling.