Module 16 · Section 16.1

RLHF: Reinforcement Learning from Human Feedback

The three-stage pipeline that transformed base models into conversational assistants, from reward modeling to proximal policy optimization
★ Big Picture

RLHF is the technique that turned GPT-3 into ChatGPT. A pretrained language model can generate fluent text, but it has no notion of helpfulness, safety, or user intent. RLHF introduces human judgment into the training loop: annotators compare model outputs, those comparisons train a reward model, and reinforcement learning steers the policy toward higher-reward behavior. This three-stage pipeline (SFT, reward modeling, PPO) became the standard approach for aligning large language models from 2022 onward, and understanding it is essential for grasping every subsequent alignment method.

1. The Alignment Problem

A pretrained language model optimizes a single objective: predict the next token. This objective produces remarkable capabilities in text generation, translation, summarization, and reasoning. However, next-token prediction does not inherently encode any preference for helpful, harmless, or honest behavior. A base model will happily complete a request for harmful content, generate fabricated citations, or produce verbose responses when a concise answer would be more useful.

The alignment problem is the challenge of bridging this gap: how do we take a capable base model and steer its behavior to match human intentions? Supervised fine-tuning (SFT) on curated instruction-response pairs provides a partial solution, teaching the model the format of helpful responses. But SFT alone cannot capture the full spectrum of human preferences, especially for subjective qualities like tone, level of detail, safety boundaries, and response style. RLHF addresses this limitation by using human preferences as a training signal.

2. The Three-Stage RLHF Pipeline

The canonical RLHF pipeline, as described in the InstructGPT paper (Ouyang et al., 2022), consists of three sequential stages. Each stage builds on the output of the previous one, and the entire pipeline transforms a pretrained base model into an aligned assistant.

Figure 16.1: The three-stage RLHF pipeline. Stage 1 produces an SFT model from instruction data. Stage 2 trains a reward model on human preferences. Stage 3 uses PPO to optimize the policy against the reward model while staying close to the SFT distribution.

2.1 Stage 1: Supervised Fine-Tuning (SFT)

The first stage takes a pretrained base model and fine-tunes it on a curated dataset of instruction-response pairs. This step teaches the model the basic format and style of a conversational assistant. The SFT dataset typically contains thousands to tens of thousands of high-quality demonstrations written by human annotators or distilled from stronger models.

SFT alone produces a functional assistant, but its quality is bounded by the demonstration data. The model learns to imitate the average quality of the training responses, which means it cannot exceed the skill level of the annotators. RLHF addresses this ceiling by replacing imitation with optimization toward a learned preference signal.

# Stage 1: Supervised Fine-Tuning with TRL
from trl import SFTTrainer, SFTConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset

model_name = "meta-llama/Llama-3.1-8B"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # Llama tokenizers ship without a pad token

# Load instruction-following dataset
dataset = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft")

# Format conversations into text with the chat template
# (a base-model tokenizer may need a chat template assigned first)
def format_chat(example):
    return {
        "text": tokenizer.apply_chat_template(
            example["messages"], tokenize=False
        )
    }

dataset = dataset.map(format_chat)

# Configure SFT training
sft_config = SFTConfig(
    output_dir="./sft-llama-8b",
    max_seq_length=2048,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,
    learning_rate=2e-5,
    num_train_epochs=1,
    warmup_ratio=0.1,
    logging_steps=10,
    bf16=True,
)

trainer = SFTTrainer(
    model=model,
    args=sft_config,
    train_dataset=dataset,
    tokenizer=tokenizer,
)

trainer.train()
trainer.save_model("./sft-llama-8b-final")

2.2 Stage 2: Reward Model Training

The reward model is the bridge between human judgment and machine optimization. It takes a prompt and a response as input and produces a scalar score indicating how good the response is according to human preferences. Training the reward model requires a dataset of pairwise comparisons: for each prompt, human annotators rank two or more candidate responses from best to worst.

The Bradley-Terry Preference Model

The standard approach models preferences using the Bradley-Terry framework. Given a prompt x and two responses y_w (preferred) and y_l (rejected), the probability of the human preferring y_w is modeled as:

P(y_w ≻ y_l | x) = σ(r(x, y_w) − r(x, y_l))

where r(x, y) is the reward model's scalar output and σ is the sigmoid function. The reward model is trained to maximize the log-likelihood of observed human preferences:

L(r) = −E[log σ(r(x, y_w) − r(x, y_l))]

💡 Key Insight

The Bradley-Terry model only cares about the difference in rewards between two responses, not the absolute values. This means the reward model learns a relative ranking rather than an absolute quality score. A response with reward 5.0 is not inherently "good"; it is simply better than a response with reward 3.0 for the same prompt.
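This shift invariance is easy to verify numerically. A minimal sketch of the Bradley-Terry loss in PyTorch (the function name and tensor values are illustrative):

```python
import torch
import torch.nn.functional as F

def bradley_terry_loss(rewards_chosen: torch.Tensor,
                       rewards_rejected: torch.Tensor) -> torch.Tensor:
    """Negative log-likelihood of the Bradley-Terry model:
    -log sigma(r_w - r_l), averaged over the batch."""
    return -F.logsigmoid(rewards_chosen - rewards_rejected).mean()

# Toy batch of three preference pairs
r_w = torch.tensor([2.0, 1.5, 0.3])   # chosen responses
r_l = torch.tensor([1.0, 1.6, -0.5])  # rejected responses
loss = bradley_terry_loss(r_w, r_l)

# Adding a constant to every reward leaves the loss unchanged:
# only the differences matter
loss_shifted = bradley_terry_loss(r_w + 10.0, r_l + 10.0)
assert torch.allclose(loss, loss_shifted)
```

Because only the difference enters the loss, the learned rewards are identifiable only up to an additive constant.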

# Stage 2: Reward Model Training
from trl import RewardTrainer, RewardConfig
from transformers import AutoModelForSequenceClassification

# Initialize reward model from the SFT checkpoint
reward_model = AutoModelForSequenceClassification.from_pretrained(
    "./sft-llama-8b-final",
    num_labels=1,  # single scalar reward
)
reward_model.config.pad_token_id = tokenizer.pad_token_id  # needed for batched scoring

# Load preference dataset (chosen / rejected pairs)
pref_dataset = load_dataset(
    "Anthropic/hh-rlhf", split="train"
)

# The dataset provides 'chosen' and 'rejected' columns, each a full
# conversation string. Recent TRL versions tokenize these pairs
# internally; older versions expect pre-tokenized columns.
print(f"Training samples: {len(pref_dataset)}")
print(f"Example chosen:  {pref_dataset[0]['chosen'][:100]}...")
print(f"Example rejected: {pref_dataset[0]['rejected'][:100]}...")

# Configure reward model training
reward_config = RewardConfig(
    output_dir="./reward-model-llama-8b",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,
    learning_rate=1e-5,
    num_train_epochs=1,
    max_length=2048,
    logging_steps=10,
    bf16=True,
    # Reward model specific
    remove_unused_columns=False,
)

reward_trainer = RewardTrainer(
    model=reward_model,
    args=reward_config,
    train_dataset=pref_dataset,
    tokenizer=tokenizer,
)

reward_trainer.train()
reward_trainer.save_model("./reward-model-llama-8b-final")

2.3 Stage 3: PPO (Proximal Policy Optimization)

The final stage uses reinforcement learning to optimize the SFT model (the "policy") against the reward model. For each training prompt, the policy generates a response, the reward model scores it, and PPO updates the policy weights to increase the expected reward. The critical addition is a KL divergence penalty that prevents the policy from straying too far from the original SFT distribution.

Figure 16.2: The PPO training loop. The policy generates a response, the reward model scores it, a KL penalty constrains drift from the reference model, and PPO updates the policy to maximize the total reward.

The PPO Objective for Language Models

The PPO objective for RLHF combines the reward model score with a KL divergence penalty:

J(θ) = E_{x∼D, y∼π_θ(·|x)} [ r(x, y) − β · KL(π_θ(·|x) ‖ π_ref(·|x)) ]

The KL penalty serves two purposes. First, it prevents reward hacking, where the policy finds degenerate outputs that score highly on the reward model but are actually low quality (such as repeating specific phrases that the reward model happens to rate highly). Second, it preserves the general capabilities of the base model by keeping the policy close to the SFT distribution.

⚠ Warning

Without the KL penalty, PPO training almost always collapses. The policy quickly finds reward model exploits and produces repetitive, incoherent text that scores artificially high. The β coefficient must be tuned carefully: too low and the policy hacks the reward; too high and the policy barely moves from the SFT starting point.
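One common implementation computes a per-token KL estimate from the log-probability gap between policy and reference and folds it into the reward stream, with the reward model's scalar score added at the final token. A simplified sketch (the function name, shapes, and β value are illustrative):

```python
import torch

def kl_penalized_rewards(policy_logprobs: torch.Tensor,
                         ref_logprobs: torch.Tensor,
                         reward_model_score: float,
                         beta: float = 0.2) -> torch.Tensor:
    """Per-token PPO rewards: -beta * KL estimate at every token,
    with the scalar reward model score added at the final token.

    policy_logprobs, ref_logprobs: log-probs of the sampled response
    tokens under each model, shape (seq_len,).
    """
    # Per-token KL estimate: log pi_theta(y_t|...) - log pi_ref(y_t|...)
    kl_per_token = policy_logprobs - ref_logprobs
    rewards = -beta * kl_per_token
    rewards[-1] += reward_model_score  # the RM scores the full response
    return rewards

# Toy example: a 5-token response
policy_lp = torch.tensor([-1.2, -0.8, -2.1, -0.5, -1.0])
ref_lp    = torch.tensor([-1.5, -0.9, -1.8, -0.6, -1.4])
r = kl_penalized_rewards(policy_lp, ref_lp, reward_model_score=0.9)
```

Tokens where the policy is more confident than the reference receive a small negative reward, which is exactly the drag that keeps the policy near the SFT distribution.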

# Stage 3: PPO Training with TRL
from trl import PPOTrainer, PPOConfig, AutoModelForCausalLMWithValueHead
import torch

# Load the SFT model as the policy (with a value head for PPO)
policy_model = AutoModelForCausalLMWithValueHead.from_pretrained(
    "./sft-llama-8b-final"
)

# The reference model is a frozen copy of the SFT model
ref_model = AutoModelForCausalLMWithValueHead.from_pretrained(
    "./sft-llama-8b-final"
)

# Load the trained reward model
from transformers import pipeline
reward_pipe = pipeline(
    "text-classification",
    model="./reward-model-llama-8b-final",
    device_map="auto",
)

# PPO configuration
ppo_config = PPOConfig(
    output_dir="./ppo-llama-8b",
    learning_rate=1e-6,          # very small LR for stability
    batch_size=64,
    mini_batch_size=8,
    ppo_epochs=4,                # PPO epochs per batch
    kl_penalty="kl",
    init_kl_coef=0.2,           # initial beta for KL penalty
    target=6.0,                 # adaptive KL controller target
    gamma=1.0,
    lam=0.95,
    cliprange=0.2,              # PPO clipping
    log_with="wandb",
)

# Prompts for rollouts, tokenized so the dataloader yields input_ids.
# (Illustrative: in practice, extract only the human prompt from each
# conversation rather than truncating the full 'chosen' string.)
prompts_dataset = load_dataset("Anthropic/hh-rlhf", split="test")
prompts_dataset = prompts_dataset.map(
    lambda ex: {"input_ids": tokenizer.encode(ex["chosen"][:512])}
)

ppo_trainer = PPOTrainer(
    config=ppo_config,
    model=policy_model,
    ref_model=ref_model,
    tokenizer=tokenizer,
    dataset=prompts_dataset,  # backs ppo_trainer.dataloader
)

# Training loop
for batch in ppo_trainer.dataloader:
    query_tensors = batch["input_ids"]

    # Generate responses from the current policy
    response_tensors = ppo_trainer.generate(
        query_tensors,
        max_new_tokens=256,
        temperature=0.7,
        top_p=0.9,
    )

    # Score responses with the reward model
    texts = [tokenizer.decode(r) for r in response_tensors]
    rewards = [
        torch.tensor(reward_pipe(t)[0]["score"])
        for t in texts
    ]

    # PPO update step
    stats = ppo_trainer.step(query_tensors, response_tensors, rewards)
    ppo_trainer.log_stats(stats, batch, rewards)

3. Reward Model Architecture

The reward model is typically initialized from the same pretrained or SFT model. The key modification is replacing the language model head (which predicts next tokens) with a scalar head that produces a single reward value. In practice, this means adding a linear projection from the final hidden state to a single output neuron.

Design Choice | Common Approach | Notes
Initialization | From SFT checkpoint | Preserves language understanding from fine-tuning
Output head | Linear(hidden_dim, 1) | Projects last-token hidden state to a scalar
Pooling | Last token | For decoder-only models; CLS token for encoders
Training objective | Bradley-Terry pairwise loss | Log-sigmoid of reward difference
Size relative to policy | Same size or smaller | InstructGPT used a 6B RM for a 175B policy
Regularization | Dropout, weight decay, margin term | Prevents overfitting to annotator artifacts
📝 Note

The size of the reward model matters. A reward model that is too small will underfit human preferences and provide a noisy signal. One that is too large adds unnecessary compute cost. OpenAI's InstructGPT paper used a 6B-parameter reward model to train a 175B-parameter policy, demonstrating that the reward model does not need to match the policy size.
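The architectural change in the table above is small enough to sketch directly. A minimal illustration of the scalar head with last-token pooling (the class name, `hidden_dim` default, and mask-based pooling are assumptions for this sketch, not any particular library's API):

```python
import torch
import torch.nn as nn

class ScalarRewardHead(nn.Module):
    """Replaces the LM head: projects the hidden state of the last
    non-padding token to a single scalar reward."""

    def __init__(self, hidden_dim: int = 4096):
        super().__init__()
        self.score = nn.Linear(hidden_dim, 1)

    def forward(self, hidden_states: torch.Tensor,
                attention_mask: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, hidden_dim) from the base model
        # Pool at the last non-padding position of each sequence
        last_idx = attention_mask.sum(dim=1) - 1           # (batch,)
        batch_idx = torch.arange(hidden_states.size(0))
        pooled = hidden_states[batch_idx, last_idx]        # (batch, hidden_dim)
        return self.score(pooled).squeeze(-1)              # (batch,)

# Toy usage: batch of 2 sequences, one right-padded
head = ScalarRewardHead(hidden_dim=16)
h = torch.randn(2, 10, 16)
mask = torch.ones(2, 10, dtype=torch.long)
mask[1, 7:] = 0  # second sequence has 3 padding tokens
rewards = head(h, mask)  # shape (2,)
```

Pooling at the last real token (rather than a fixed position) is what makes the pad-token configuration in the Stage 2 code matter.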

4. Process vs. Outcome Reward Models

Standard reward models (Outcome Reward Models, or ORMs) score the final response as a whole. This provides a single signal for the entire generation. An alternative approach, Process Reward Models (PRMs), scores each step of the reasoning process individually.

Figure 16.3: ORM provides a single score for the final answer. PRM provides per-step scores, enabling credit assignment to individual reasoning steps.

PRMs have shown significant advantages for mathematical reasoning tasks. OpenAI's "Let's Verify Step by Step" paper demonstrated that process supervision substantially outperforms outcome supervision for math problem solving. The key advantage is credit assignment: when a multi-step solution is wrong, a PRM can identify which specific step introduced the error, enabling more targeted training signals.

# Simplified Process Reward Model scoring
import torch
import torch.nn as nn

class ProcessRewardModel(nn.Module):
    """Scores each reasoning step individually."""

    def __init__(self, base_model, hidden_dim=4096):
        super().__init__()
        self.base_model = base_model
        self.step_scorer = nn.Linear(hidden_dim, 1)
        self.step_delimiter = "\n"  # steps separated by newlines

    def forward(self, input_ids, attention_mask, step_positions):
        """
        Args:
            input_ids: tokenized prompt + response
            attention_mask: standard attention mask
            step_positions: indices of tokens where each step ends
        Returns:
            step_rewards: reward score for each reasoning step
        """
        # Get hidden states from the base model
        outputs = self.base_model(
            input_ids=input_ids,
            attention_mask=attention_mask,
            output_hidden_states=True,
        )
        hidden_states = outputs.hidden_states[-1]  # last layer

        # Extract hidden states at step boundary positions
        step_hidden = hidden_states[:, step_positions, :]

        # Score each step
        step_rewards = self.step_scorer(step_hidden).squeeze(-1)
        return torch.sigmoid(step_rewards)

    def score_solution(self, input_ids, attention_mask, step_positions):
        """Return per-step and aggregate scores."""
        step_rewards = self.forward(input_ids, attention_mask, step_positions)
        aggregate = step_rewards.min(dim=-1).values  # worst step
        return {
            "step_rewards": step_rewards,
            "aggregate_reward": aggregate,
            "weakest_step": step_rewards.argmin(dim=-1),
        }

5. GRPO: Group Relative Policy Optimization

Group Relative Policy Optimization (GRPO), introduced by DeepSeek, simplifies the RLHF pipeline by eliminating the need for a separate value model (the critic network in standard PPO). Instead of training a value network to estimate expected returns, GRPO samples a group of responses for each prompt and uses the group statistics to normalize rewards.

For each prompt, GRPO generates G responses, computes their rewards, and normalizes the rewards within the group to have zero mean and unit variance. This group-level normalization serves as the baseline that a traditional value network would provide. The savings are significant: GRPO requires roughly half the GPU memory of standard PPO because it does not maintain a critic network.

💡 Key Insight

GRPO's core idea is simple but powerful: instead of learning to predict how good a response will be (the value function), just generate several responses and compare them. If you sample 8 responses to a math problem and 3 get the right answer, you know the correct ones should get positive advantage and the wrong ones negative advantage, without needing a learned critic.

# GRPO: Group Relative Policy Optimization (simplified)
import torch

def grpo_loss(
    policy_model,
    ref_model,
    prompts,
    tokenizer,
    reward_fn,
    group_size=8,
    beta=0.1,
    clip_range=0.2,
):
    """
    Simplified GRPO training step.

    For each prompt, generates a group of responses,
    normalizes rewards within the group, and computes
    the clipped policy gradient loss.
    """
    all_losses = []

    for prompt in prompts:
        # Generate a group of responses
        input_ids = tokenizer.encode(prompt, return_tensors="pt")
        responses = []
        for _ in range(group_size):
            output = policy_model.generate(
                input_ids, max_new_tokens=512, do_sample=True,
                temperature=0.8, top_p=0.95,
            )
            responses.append(output[0])

        # Compute rewards for each response
        rewards = torch.tensor([
            reward_fn(prompt, tokenizer.decode(r)) for r in responses
        ])

        # Group-level normalization (replaces value network)
        normalized_rewards = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

        # Compute policy gradient with clipped objective
        for response, advantage in zip(responses, normalized_rewards):
            # Log probabilities under current and reference policy.
            # (compute_logprobs is an assumed helper: the summed log-probs
            # of `response` given `input_ids` under the given model.)
            with torch.no_grad():
                ref_logprobs = compute_logprobs(ref_model, input_ids, response)
            policy_logprobs = compute_logprobs(policy_model, input_ids, response)

            # Importance ratio (simplified: the frozen reference stands in
            # for the old-policy snapshot used in full PPO)
            ratio = torch.exp(policy_logprobs - ref_logprobs)

            # Clipped surrogate objective (PPO-style)
            surr1 = ratio * advantage
            surr2 = torch.clamp(ratio, 1 - clip_range, 1 + clip_range) * advantage
            policy_loss = -torch.min(surr1, surr2).mean()

            # KL penalty
            kl = (ref_logprobs - policy_logprobs).mean()
            total_loss = policy_loss + beta * kl
            all_losses.append(total_loss)

    return torch.stack(all_losses).mean()

6. RLHF Infrastructure at Scale

Running RLHF at production scale is an infrastructure challenge that goes far beyond the algorithm itself. A full RLHF training run requires simultaneously managing four models: the policy model being trained, the reference model (a frozen copy), the reward model, and (in standard PPO) the value model. This roughly quadruples the GPU memory footprint compared to standard SFT.

Component | Memory Cost | Compute Pattern
Policy model | Full model + optimizer states | Forward + backward pass
Reference model | Full model (frozen, inference only) | Forward pass only
Reward model | Full model (frozen, inference only) | Forward pass only
Value model (PPO) | Full model + optimizer states | Forward + backward pass
Generation buffer | KV cache for response generation | Autoregressive decoding
📝 Note

Frameworks like DeepSpeed-Chat, OpenRLHF, and TRL have developed specialized strategies for managing this multi-model workload. Common optimizations include offloading frozen models to CPU during gradient computation, sharing weights between the policy and value models, and using vLLM or other optimized inference engines for the generation phase.
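A back-of-the-envelope estimate makes the memory pressure concrete. A rough sketch for bf16 weights with fp32 Adam states and gradients (the byte counts are standard rules of thumb, not measurements of any specific framework, and activations and KV cache are ignored):

```python
def rlhf_memory_estimate_gb(n_params: float) -> dict:
    """Rough per-model memory footprints for a standard-PPO RLHF setup.

    Assumes bf16 weights (2 bytes/param); trained models additionally
    carry fp32 gradients and Adam moments (~12 extra bytes/param).
    """
    GB = 1024 ** 3
    weights = 2 * n_params / GB            # frozen bf16 copy
    trainable = (2 + 12) * n_params / GB   # + grads and optimizer states
    return {
        "policy (trained)": trainable,
        "value model (trained)": trainable,
        "reference (frozen)": weights,
        "reward model (frozen)": weights,
        "total": 2 * trainable + 2 * weights,
    }

est = rlhf_memory_estimate_gb(8e9)
# Four 8B models already land in the hundreds of GB before any
# activations or KV cache, which is why offloading and weight
# sharing are standard optimizations
```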

⚠ Warning

RLHF training is notoriously unstable. Common failure modes include reward hacking (the policy exploits reward model weaknesses), mode collapse (the policy generates near-identical responses for all prompts), and KL explosion (the policy diverges rapidly from the reference). Monitoring KL divergence, reward statistics, and generation diversity during training is essential. If mean reward increases while KL also increases rapidly, the policy is likely hacking the reward model.
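The monitoring advice above can be reduced to a simple guard in the training loop. A hypothetical sketch (the function name, thresholds, and window size are illustrative and would need tuning per run):

```python
def check_rlhf_health(kl_history: list, reward_history: list,
                      kl_limit: float = 10.0, window: int = 10) -> str:
    """Flag the classic failure signature: mean reward rising while
    KL divergence from the reference grows past a limit."""
    if len(kl_history) < 2 * window:
        return "ok: not enough data"
    recent_kl = sum(kl_history[-window:]) / window
    recent_r = sum(reward_history[-window:]) / window
    past_r = sum(reward_history[-2 * window:-window]) / window
    if recent_kl > kl_limit and recent_r > past_r:
        return "warning: reward rising with high KL (possible reward hacking)"
    if recent_kl > kl_limit:
        return "warning: KL above limit (consider raising beta)"
    return "ok"
```

In practice this kind of check is paired with periodic human inspection of samples, since generation diversity collapse is hard to detect from scalar statistics alone.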

📝 Section Quiz

1. What is the primary purpose of the KL divergence penalty in PPO-based RLHF?
Show Answer
The KL divergence penalty prevents the policy from diverging too far from the reference (SFT) model. This serves two purposes: (1) it prevents reward hacking, where the policy finds degenerate outputs that exploit reward model weaknesses, and (2) it preserves the general language capabilities of the base model. Without KL regularization, PPO training typically collapses.
2. Why does the Bradley-Terry model only learn relative preferences rather than absolute quality?
Show Answer
The Bradley-Terry loss function depends only on the difference in reward scores between the chosen and rejected responses: L = -log(σ(r(y_w) - r(y_l))). Adding any constant to both rewards leaves the loss unchanged. This means the model learns a preference ordering (which response is better) rather than an absolute quality measure.
3. What advantage do Process Reward Models (PRMs) have over Outcome Reward Models (ORMs)?
Show Answer
PRMs provide per-step credit assignment. When a multi-step reasoning chain produces a wrong answer, a PRM can identify which specific step introduced the error. ORMs can only provide a single score for the entire response, making it difficult for the policy to determine where it went wrong. This granular feedback is particularly valuable for mathematical reasoning.
4. How does GRPO eliminate the need for a value network?
Show Answer
GRPO generates a group of G responses for each prompt, computes their rewards, and normalizes the rewards within the group (zero mean, unit variance). This group-level normalization serves the same purpose as a learned value baseline in standard PPO: it tells the algorithm which responses are above or below average. Dropping the critic cuts the GPU memory required for training roughly in half.
5. Why is RLHF at scale an infrastructure challenge beyond the algorithm itself?
Show Answer
Full RLHF requires managing four simultaneous models: the policy (trained), reference model (frozen), reward model (frozen), and value model (trained, in PPO). This quadruples GPU memory compared to SFT. Additionally, the training loop alternates between generation (autoregressive decoding, which needs KV cache) and gradient computation, requiring careful scheduling and memory management across the cluster.

✅ Key Takeaways