Reinforcement learning from human feedback (RLHF) is the technique that turned GPT-3-class base models into InstructGPT and ChatGPT. A pretrained language model can generate fluent text, but it has no notion of helpfulness, safety, or user intent. RLHF introduces human judgment into the training loop: annotators compare model outputs, those comparisons train a reward model, and reinforcement learning steers the policy toward higher-reward behavior. This three-stage pipeline (SFT, reward modeling, PPO) became the standard approach for aligning large language models from 2022 onward, and understanding it is the foundation for the alignment methods that followed.
1. The Alignment Problem
A pretrained language model optimizes a single objective: predict the next token. This objective produces remarkable capabilities in text generation, translation, summarization, and reasoning. However, next-token prediction does not inherently encode any preference for helpful, harmless, or honest behavior. A base model will happily complete a request for harmful content, generate fabricated citations, or produce verbose responses when a concise answer would be more useful.
The alignment problem is the challenge of bridging this gap: how do we take a capable base model and steer its behavior to match human intentions? Supervised fine-tuning (SFT) on curated instruction-response pairs provides a partial solution, teaching the model the format of helpful responses. But SFT alone cannot capture the full spectrum of human preferences, especially for subjective qualities like tone, level of detail, safety boundaries, and response style. RLHF addresses this limitation by using human preferences as a training signal.
2. The Three-Stage RLHF Pipeline
The canonical RLHF pipeline, as described in the InstructGPT paper (Ouyang et al., 2022), consists of three sequential stages. Each stage builds on the output of the previous one, and the entire pipeline transforms a pretrained base model into an aligned assistant.
2.1 Stage 1: Supervised Fine-Tuning (SFT)
The first stage takes a pretrained base model and fine-tunes it on a curated dataset of instruction-response pairs. This step teaches the model the basic format and style of a conversational assistant. The SFT dataset typically contains thousands to tens of thousands of high-quality demonstrations written by human annotators or distilled from stronger models.
SFT alone produces a functional assistant, but its quality is bounded by the demonstration data. The model learns to imitate the average quality of the training responses, which means it cannot exceed the skill level of the annotators. RLHF addresses this ceiling by replacing imitation with optimization toward a learned preference signal.
# Stage 1: Supervised Fine-Tuning with TRL
from trl import SFTTrainer, SFTConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset

model_name = "meta-llama/Llama-3.1-8B"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Llama tokenizers ship without a pad token; reuse EOS for padding
tokenizer.pad_token = tokenizer.eos_token

# Load instruction-following dataset
dataset = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft")

# Format conversations into the chat template
# (base checkpoints may not define one; set tokenizer.chat_template first if needed)
def format_chat(example):
    return {
        "text": tokenizer.apply_chat_template(
            example["messages"], tokenize=False
        )
    }

dataset = dataset.map(format_chat)

# Configure SFT training
sft_config = SFTConfig(
    output_dir="./sft-llama-8b",
    max_seq_length=2048,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,
    learning_rate=2e-5,
    num_train_epochs=1,
    warmup_ratio=0.1,
    logging_steps=10,
    bf16=True,
)

trainer = SFTTrainer(
    model=model,
    args=sft_config,
    train_dataset=dataset,
    tokenizer=tokenizer,
)
trainer.train()
trainer.save_model("./sft-llama-8b-final")
2.2 Stage 2: Reward Model Training
The reward model is the bridge between human judgment and machine optimization. It takes a prompt and a response as input and produces a scalar score indicating how good the response is according to human preferences. Training the reward model requires a dataset of pairwise comparisons: for each prompt, human annotators rank two or more candidate responses from best to worst.
The Bradley-Terry Preference Model
The standard approach models preferences using the Bradley-Terry framework. Given a prompt x
and two responses y_w (preferred) and y_l (rejected), the probability of the
human preferring y_w is modeled as:
P(y_w ≻ y_l | x) = σ(r(x, y_w) − r(x, y_l))
where r(x, y) is the reward model's scalar output and σ is the sigmoid function. The
reward model is trained to maximize the log-likelihood of observed human preferences:
L(r) = −E[log σ(r(x, y_w) − r(x, y_l))]
The Bradley-Terry model only cares about the difference in rewards between two responses, not the absolute values. This means the reward model learns a relative ranking rather than an absolute quality score. A response with reward 5.0 is not inherently "good"; it is simply better than a response with reward 3.0 for the same prompt.
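To make this concrete, here is a minimal sketch of the pairwise loss in PyTorch; `chosen_rewards` and `rejected_rewards` stand for the reward model's scalar outputs on a batch of preference pairs (the names are illustrative, not from any library):
# Bradley-Terry pairwise loss (minimal sketch)
import torch
import torch.nn.functional as F

def bradley_terry_loss(chosen_rewards, rejected_rewards):
    """Negative log-likelihood of the observed preferences.

    chosen_rewards, rejected_rewards: shape (batch,), holding
    r(x, y_w) and r(x, y_l) for each human comparison.
    """
    # -log sigma(r_w - r_l): only the reward *difference* matters
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# A correctly ranked pair (margin +1.0) and a mis-ranked pair (margin -1.0)
chosen = torch.tensor([2.0, 0.5])
rejected = torch.tensor([1.0, 1.5])
print(bradley_terry_loss(chosen, rejected))  # tensor(0.8133) = mean of 0.3133 and 1.3133
TRL's RewardTrainer optimizes this same log-sigmoid objective in the listing below.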
# Stage 2: Reward Model Training
from trl import RewardTrainer, RewardConfig
from transformers import AutoModelForSequenceClassification

# Initialize reward model from the SFT checkpoint
reward_model = AutoModelForSequenceClassification.from_pretrained(
    "./sft-llama-8b-final",
    num_labels=1,  # single scalar reward
)
# The classification head needs a pad token id for batched scoring
reward_model.config.pad_token_id = tokenizer.pad_token_id

# Load preference dataset (chosen / rejected pairs)
pref_dataset = load_dataset(
    "Anthropic/hh-rlhf", split="train"
)
# The dataset has 'chosen' and 'rejected' columns;
# each is a full conversation string
print(f"Training samples: {len(pref_dataset)}")
print(f"Example chosen: {pref_dataset[0]['chosen'][:100]}...")
print(f"Example rejected: {pref_dataset[0]['rejected'][:100]}...")

# Configure reward model training
reward_config = RewardConfig(
    output_dir="./reward-model-llama-8b",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,
    learning_rate=1e-5,
    num_train_epochs=1,
    max_length=2048,
    logging_steps=10,
    bf16=True,
    # Reward model specific
    remove_unused_columns=False,
)

reward_trainer = RewardTrainer(
    model=reward_model,
    args=reward_config,
    train_dataset=pref_dataset,
    tokenizer=tokenizer,
)
reward_trainer.train()
reward_trainer.save_model("./reward-model-llama-8b-final")
2.3 Stage 3: PPO (Proximal Policy Optimization)
The final stage uses reinforcement learning to optimize the SFT model (the "policy") against the reward model. For each training prompt, the policy generates a response, the reward model scores it, and PPO updates the policy weights to increase the expected reward. The critical addition is a KL divergence penalty that prevents the policy from straying too far from the original SFT distribution.
The PPO Objective for Language Models
The PPO objective for RLHF combines the reward model score with a KL divergence penalty:
J(θ) = E_{x∼D, y∼π_θ(·|x)}[ r(x, y) − β · KL(π_θ(y|x) || π_ref(y|x)) ]
The KL penalty serves two purposes. First, it prevents reward hacking, where the policy finds degenerate outputs that score highly on the reward model but are actually low quality (such as repeating specific phrases that the reward model happens to rate highly). Second, it preserves the general capabilities of the base model by keeping the policy close to the SFT distribution.
Without the KL penalty, PPO training almost always collapses. The policy quickly finds reward model exploits and produces repetitive, incoherent text that scores artificially high. The β coefficient must be tuned carefully: too low and the policy hacks the reward; too high and the policy barely moves from the SFT starting point.
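In practice, common implementations fold the KL term into the reward signal rather than computing an exact divergence. A minimal sketch, assuming the per-token log-probabilities of the generated tokens under the policy and the frozen reference have already been collected (detached) during the rollout; `shaped_rewards` is an illustrative name, not a library function:
# KL-shaped per-token reward (minimal sketch)
import torch

def shaped_rewards(reward_score, policy_logprobs, ref_logprobs, beta=0.2):
    """Combine the reward model score with a per-token KL penalty.

    reward_score: scalar from the reward model for the full response
    policy_logprobs, ref_logprobs: per-token log-probs of the generated
    tokens under the current policy and the frozen reference model
    """
    kl_per_token = policy_logprobs - ref_logprobs   # per-token KL estimate
    rewards = -beta * kl_per_token                  # penalize drift on every token
    rewards[-1] = rewards[-1] + reward_score        # RM score credited to the final token
    return rewards
The full stage-3 training loop below leaves this bookkeeping to TRL's PPOTrainer, which applies the KL penalty internally.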
# Stage 3: PPO Training with TRL
from trl import PPOTrainer, PPOConfig, AutoModelForCausalLMWithValueHead
import torch

# Load the SFT model as the policy (with a value head for PPO)
policy_model = AutoModelForCausalLMWithValueHead.from_pretrained(
    "./sft-llama-8b-final"
)
# The reference model is a frozen copy of the SFT model
ref_model = AutoModelForCausalLMWithValueHead.from_pretrained(
    "./sft-llama-8b-final"
)

# Load the trained reward model
from transformers import pipeline
reward_pipe = pipeline(
    "text-classification",
    model="./reward-model-llama-8b-final",
    device_map="auto",
)

# Build the prompt dataset: keep each conversation up to its final
# "Assistant:" turn and tokenize it so every batch provides input_ids
prompts_dataset = load_dataset("Anthropic/hh-rlhf", split="test")

def to_prompt(example):
    text = example["chosen"]
    query = text[: text.rfind("Assistant:") + len("Assistant:")]
    example["query"] = query
    example["input_ids"] = tokenizer.encode(query, truncation=True, max_length=1024)
    return example

prompts_dataset = prompts_dataset.map(to_prompt)
prompts_dataset.set_format(type="torch", columns=["input_ids"], output_all_columns=True)

def collator(batch):
    # Keep per-sample lists of variable-length tensors, as PPOTrainer expects
    return {key: [item[key] for item in batch] for key in batch[0]}

# PPO configuration
ppo_config = PPOConfig(
    output_dir="./ppo-llama-8b",
    learning_rate=1e-6,   # very small LR for stability
    batch_size=64,
    mini_batch_size=8,
    ppo_epochs=4,         # PPO epochs per batch
    kl_penalty="kl",
    init_kl_coef=0.2,     # initial beta for the KL penalty
    target_kl=6.0,        # adaptive KL target
    gamma=1.0,
    lam=0.95,
    cliprange=0.2,        # PPO clipping
    log_with="wandb",
)

ppo_trainer = PPOTrainer(
    config=ppo_config,
    model=policy_model,
    ref_model=ref_model,
    tokenizer=tokenizer,
    dataset=prompts_dataset,   # supplies ppo_trainer.dataloader below
    data_collator=collator,
)

# Training loop
for batch in ppo_trainer.dataloader:
    query_tensors = batch["input_ids"]
    # Generate responses from the current policy
    response_tensors = ppo_trainer.generate(
        query_tensors,
        return_prompt=False,
        do_sample=True,
        max_new_tokens=256,
        temperature=0.7,
        top_p=0.9,
    )
    # Score responses with the reward model
    texts = [tokenizer.decode(r, skip_special_tokens=True) for r in response_tensors]
    batch["response"] = texts
    rewards = [
        torch.tensor(reward_pipe(t)[0]["score"])
        for t in texts
    ]
    # PPO update step
    stats = ppo_trainer.step(query_tensors, response_tensors, rewards)
    ppo_trainer.log_stats(stats, batch, rewards)
3. Reward Model Architecture
The reward model is typically initialized from the same pretrained or SFT model. The key modification is replacing the language model head (which predicts next tokens) with a scalar head that produces a single reward value. In practice, this means adding a linear projection from the final hidden state to a single output neuron.
| Design Choice | Common Approach | Notes |
|---|---|---|
| Initialization | From SFT checkpoint | Preserves language understanding from fine-tuning |
| Output head | Linear(hidden_dim, 1) | Projects last token hidden state to scalar |
| Pooling | Last token | For decoder-only models; CLS token for encoders |
| Training objective | Bradley-Terry pairwise loss | Log-sigmoid of reward difference |
| Size relative to policy | Same size or smaller | InstructGPT used 6B RM for 175B policy |
| Regularization | Dropout, weight decay, margin term | Prevents overfitting to annotator artifacts |
The size of the reward model matters. A reward model that is too small will underfit human preferences and provide a noisy signal. One that is too large adds unnecessary compute cost. OpenAI's InstructGPT paper used a 6B-parameter reward model to train a 175B-parameter policy, demonstrating that the reward model does not need to match the policy size.
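A minimal sketch of the scalar-head design summarized in the table, assuming a decoder-only backbone that exposes hidden states and last-token pooling over non-padding positions; `ScalarRewardModel` is an illustrative class, not a library API:
# Scalar reward head on a decoder-only backbone (minimal sketch)
import torch
import torch.nn as nn

class ScalarRewardModel(nn.Module):
    def __init__(self, base_model, hidden_dim=4096):
        super().__init__()
        self.base_model = base_model              # pretrained or SFT transformer
        self.reward_head = nn.Linear(hidden_dim, 1)

    def forward(self, input_ids, attention_mask):
        outputs = self.base_model(
            input_ids=input_ids,
            attention_mask=attention_mask,
            output_hidden_states=True,
        )
        hidden = outputs.hidden_states[-1]            # (batch, seq, hidden)
        # Pool at the last non-padding token of each sequence
        last_pos = attention_mask.sum(dim=1) - 1      # (batch,)
        pooled = hidden[torch.arange(hidden.size(0)), last_pos]
        return self.reward_head(pooled).squeeze(-1)   # one scalar reward per sequence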
4. Process vs. Outcome Reward Models
Standard reward models (Outcome Reward Models, or ORMs) score the final response as a whole. This provides a single signal for the entire generation. An alternative approach, Process Reward Models (PRMs), scores each step of the reasoning process individually.
PRMs have shown significant advantages for mathematical reasoning tasks. OpenAI's "Let's Verify Step by Step" paper demonstrated that process supervision substantially outperforms outcome supervision for math problem solving. The key advantage is credit assignment: when a multi-step solution is wrong, a PRM can identify which specific step introduced the error, enabling more targeted training signals.
# Simplified Process Reward Model scoring
import torch
import torch.nn as nn

class ProcessRewardModel(nn.Module):
    """Scores each reasoning step individually."""

    def __init__(self, base_model, hidden_dim=4096):
        super().__init__()
        self.base_model = base_model
        self.step_scorer = nn.Linear(hidden_dim, 1)
        self.step_delimiter = "\n"  # steps separated by newlines

    def forward(self, input_ids, attention_mask, step_positions):
        """
        Args:
            input_ids: tokenized prompt + response
            attention_mask: standard attention mask
            step_positions: indices of tokens where each step ends
        Returns:
            step_rewards: reward score for each reasoning step
        """
        # Get hidden states from the base model
        outputs = self.base_model(
            input_ids=input_ids,
            attention_mask=attention_mask,
            output_hidden_states=True,
        )
        hidden_states = outputs.hidden_states[-1]  # last layer
        # Extract hidden states at step boundary positions
        step_hidden = hidden_states[:, step_positions, :]
        # Score each step
        step_rewards = self.step_scorer(step_hidden).squeeze(-1)
        return torch.sigmoid(step_rewards)

    def score_solution(self, input_ids, attention_mask, step_positions):
        """Return per-step and aggregate scores."""
        step_rewards = self.forward(input_ids, attention_mask, step_positions)
        aggregate = step_rewards.min(dim=-1).values  # worst step
        return {
            "step_rewards": step_rewards,
            "aggregate_reward": aggregate,
            "weakest_step": step_rewards.argmin(dim=-1),
        }
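A short usage sketch, assuming `base_model` and `tokenizer` are already loaded and the PRM has been trained; the `step_positions` values here are placeholders for the token indices that would be found by searching for the step delimiter during preprocessing:
# Scoring a newline-delimited solution with the PRM (illustrative values)
solution = "Step 1: 12 * 4 = 48\nStep 2: 48 + 7 = 55\nStep 3: The answer is 55"
prm = ProcessRewardModel(base_model)

enc = tokenizer(solution, return_tensors="pt")
# Placeholder step boundaries: token indices where each of the three steps ends
step_positions = [18, 37, enc["input_ids"].shape[1] - 1]

scores = prm.score_solution(enc["input_ids"], enc["attention_mask"], step_positions)
print(scores["step_rewards"])   # one score in (0, 1) per reasoning step
print(scores["weakest_step"])   # index of the lowest-scoring step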
5. GRPO: Group Relative Policy Optimization
Group Relative Policy Optimization (GRPO), introduced by DeepSeek, simplifies the RLHF pipeline by eliminating the need for a separate value model (the critic network in standard PPO). Instead of training a value network to estimate expected returns, GRPO samples a group of responses for each prompt and uses the group statistics to normalize rewards.
For each prompt, GRPO generates G responses, computes their rewards, and normalizes the rewards within the group to have zero mean and unit variance. This group-level normalization serves as the baseline that a traditional value network would provide. The memory savings are substantial: dropping the critic removes an entire policy-sized trainable model, together with its gradients and optimizer states, from GPU memory, roughly halving the trainable-model footprint of standard PPO.
GRPO's core idea is simple but powerful: instead of learning to predict how good a response will be (the value function), just generate several responses and compare them. If you sample 8 responses to a math problem and 3 get the right answer, you know the correct ones should get positive advantage and the wrong ones negative advantage, without needing a learned critic.
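The arithmetic for that example, with a simple 0/1 correctness reward:
# Group-relative advantages for 8 sampled responses, 3 of them correct
import torch

rewards = torch.tensor([1., 1., 1., 0., 0., 0., 0., 0.])  # 0/1 correctness reward
advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
print(advantages)  # correct responses get ≈ +1.21, incorrect ones ≈ -0.72
The full training step below applies the same normalization inside the sampling loop.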
# GRPO: Group Relative Policy Optimization (simplified)
import torch
import torch.nn.functional as F

def compute_logprobs(model, prompt_ids, sequence_ids):
    """Minimal helper assumed by grpo_loss below (not a library function).

    Returns the sum of log-probabilities a causal LM assigns to the generated
    tokens. `sequence_ids` holds prompt + response tokens (as returned by
    generate); only the tokens after the prompt are scored.
    """
    seq = sequence_ids.unsqueeze(0)
    logits = model(seq).logits[:, :-1, :]
    targets = seq[:, 1:]
    token_logprobs = torch.log_softmax(logits, dim=-1).gather(
        -1, targets.unsqueeze(-1)
    ).squeeze(-1)
    prompt_len = prompt_ids.shape[1]
    return token_logprobs[:, prompt_len - 1:].sum()

def grpo_loss(
    policy_model,
    ref_model,
    prompts,
    tokenizer,
    reward_fn,
    group_size=8,
    beta=0.1,
    clip_range=0.2,
):
    """
    Simplified GRPO training step.

    For each prompt, generates a group of responses,
    normalizes rewards within the group, and computes
    the clipped policy gradient loss.
    """
    all_losses = []
    for prompt in prompts:
        # Generate a group of responses
        input_ids = tokenizer.encode(prompt, return_tensors="pt")
        responses = []
        for _ in range(group_size):
            output = policy_model.generate(
                input_ids, max_new_tokens=512, do_sample=True,
                temperature=0.8, top_p=0.95,
            )
            responses.append(output[0])
        # Compute rewards for each response
        rewards = torch.tensor([
            reward_fn(prompt, tokenizer.decode(r)) for r in responses
        ])
        # Group-level normalization (replaces value network)
        normalized_rewards = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
        # Compute policy gradient with clipped objective
        for response, advantage in zip(responses, normalized_rewards):
            # Log probabilities under current and reference policy
            with torch.no_grad():
                ref_logprobs = compute_logprobs(ref_model, input_ids, response)
            policy_logprobs = compute_logprobs(policy_model, input_ids, response)
            # Importance ratio
            ratio = torch.exp(policy_logprobs - ref_logprobs)
            # Clipped surrogate objective (PPO-style)
            surr1 = ratio * advantage
            surr2 = torch.clamp(ratio, 1 - clip_range, 1 + clip_range) * advantage
            policy_loss = -torch.min(surr1, surr2).mean()
            # KL penalty
            kl = (ref_logprobs - policy_logprobs).mean()
            total_loss = policy_loss + beta * kl
            all_losses.append(total_loss)
    return torch.stack(all_losses).mean()
6. RLHF Infrastructure at Scale
Running RLHF at production scale is an infrastructure challenge that goes far beyond the algorithm itself. A full RLHF training run requires simultaneously managing four models: the policy model being trained, the reference model (a frozen copy), the reward model, and (in standard PPO) the value model. Compared with standard SFT, which holds a single model plus its optimizer states, this multiplies the GPU memory footprint severalfold: two of the four models are trainable (weights, gradients, and optimizer states), while the other two must at least keep their weights resident for inference.
| Component | Memory Cost | Compute Pattern |
|---|---|---|
| Policy model | Full model + optimizer states | Forward + backward pass |
| Reference model | Full model (frozen, inference only) | Forward pass only |
| Reward model | Full model (frozen, inference only) | Forward pass only |
| Value model (PPO) | Full model + optimizer states | Forward + backward pass |
| Generation buffer | KV cache for response generation | Autoregressive decoding |
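A rough back-of-envelope for these components, assuming an 8B-parameter model family, bf16 weights for the frozen models, and the commonly cited ~16 bytes per parameter for a trainable model under mixed-precision Adam (weights, gradients, and optimizer states); the exact numbers shift with sharding, offloading, and precision choices:
# Rough memory estimate for an 8B-parameter RLHF setup (illustrative assumptions)
PARAMS = 8e9
BF16_BYTES = 2          # frozen weights in bf16
TRAIN_BYTES_PER_PARAM = 16  # weights + grads + Adam states, mixed precision

frozen_model_gb = PARAMS * BF16_BYTES / 1e9              # reference or reward model
trainable_model_gb = PARAMS * TRAIN_BYTES_PER_PARAM / 1e9  # policy or value model

total_gb = 2 * trainable_model_gb + 2 * frozen_model_gb  # policy + value + ref + RM
print(f"frozen: {frozen_model_gb:.0f} GB, trainable: {trainable_model_gb:.0f} GB")
print(f"total (before KV cache and activations): {total_gb:.0f} GB")
# ~16 GB per frozen model, ~128 GB per trainable model, ~288 GB total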
Frameworks like DeepSpeed-Chat, OpenRLHF, and TRL have developed specialized strategies for managing this multi-model workload. Common optimizations include offloading frozen models to CPU during gradient computation, sharing weights between the policy and value models, and using vLLM or other optimized inference engines for the generation phase.
RLHF training is notoriously unstable. Common failure modes include reward hacking (the policy exploits reward model weaknesses), mode collapse (the policy generates near-identical responses for all prompts), and KL explosion (the policy diverges rapidly from the reference). Monitoring KL divergence, reward statistics, and generation diversity during training is essential. If mean reward increases while KL also increases rapidly, the policy is likely hacking the reward model.
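A minimal sketch of the kind of per-batch health check this implies; `rlhf_batch_stats` and its inputs (summed per-sequence log-probs collected during rollout, reward tensors, and decoded response strings) are illustrative names, not a library API:
# Per-batch RLHF health check (minimal sketch)
import torch

def rlhf_batch_stats(policy_logprobs, ref_logprobs, rewards, responses):
    """policy_logprobs / ref_logprobs: per-sequence summed log-probs of the
    generated tokens under the policy and the frozen reference."""
    kl_estimate = (policy_logprobs - ref_logprobs).mean()  # drift from the reference
    distinct_frac = len(set(responses)) / len(responses)   # crude diversity probe
    return {
        "mean_reward": torch.stack(list(rewards)).mean().item(),
        "kl_estimate": kl_estimate.item(),                  # watch for rapid growth
        "distinct_response_frac": distinct_frac,            # near 0 suggests mode collapse
    }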
✅ Key Takeaways
- RLHF transforms base models into aligned assistants through a three-stage pipeline: SFT provides the instruction-following format, the reward model captures human preferences, and PPO optimizes the policy toward higher-reward behavior.
- The Bradley-Terry preference model converts pairwise human comparisons into a scalar reward signal. It learns relative quality, not absolute quality.
- The KL divergence penalty is essential for training stability. It prevents reward hacking and preserves general model capabilities.
- Process Reward Models (PRMs) provide per-step feedback for reasoning tasks, enabling better credit assignment than outcome-only models.
- GRPO simplifies PPO by replacing the learned value function with group-level reward normalization, removing the critic (and its optimizer states) and roughly halving the trainable-model memory footprint.
- Production RLHF requires managing four models simultaneously, making infrastructure and memory management a first-class engineering concern.