1. The Reinforcement Learning Framework
Supervised learning needs labeled data: input X, correct output Y. But what if there is no single "correct" answer, only better and worse ones? What if the learner must try something, observe the consequence, and improve over time? That is reinforcement learning.
Think of training a dog. You say "sit." The dog tries something (maybe it lies down). You give a treat only when it actually sits. Over many repetitions, the dog learns which action earns the treat. This is exactly the RL loop: an agent (the dog) interacts with an environment (you and the room), takes actions (sitting, lying down), and receives rewards (treats or nothing).
Let us define each piece of the RL framework precisely:
| Concept | Definition | LLM Analogy |
|---|---|---|
| Agent | The learner and decision-maker | The language model |
| Environment | Everything the agent interacts with | The reward model + user prompt |
| State (s) | A snapshot of the current situation | The prompt + all tokens generated so far |
| Action (a) | A choice the agent makes at each step | Choosing the next token from the vocabulary |
| Reward (r) | A scalar signal indicating how good the action was | The reward model's score for the full response |
| Episode | One complete interaction from start to finish | Generating one complete response to a prompt |
2. Policies: The Agent's Strategy
A policy is the agent's strategy: a rule that maps each state to an action. It answers the question, "Given what I see right now, what should I do?"
Policies come in two flavors. A deterministic policy always picks the same action for a given state (if the dog sees a hand signal, it always sits). A stochastic policy assigns probabilities to each possible action, then samples from that distribution.
Formally, a stochastic policy is written as π(a | s), the probability of choosing action a when in state s. For an LLM, this is exactly the softmax output: the probability the model assigns to each token given the context so far. The entire goal of RL training is to adjust the parameters of π so that high-reward actions become more probable.
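To make this concrete, here is a minimal sketch of a stochastic policy: a softmax over logits, then a sample. The logits and the 5-token "vocabulary" are made-up numbers for illustration.

```python
import numpy as np

# Made-up logits for one state s, over a toy 5-token "vocabulary"
logits = np.array([2.0, 1.0, 0.5, 0.2, -1.0])
probs = np.exp(logits) / np.exp(logits).sum()  # softmax: pi(a | s)

rng = np.random.default_rng(seed=0)
action = rng.choice(len(probs), p=probs)       # sample the next "token"
print("pi(a|s) =", probs.round(3), "| sampled action:", action)
```

Sampling, rather than always taking the argmax, is what makes the policy stochastic; RL training reshapes these probabilities rather than memorizing a single fixed answer.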
3. Value Functions and the Bellman Equation
Rewards tell us how good a single step was. But we need a way to evaluate long-term prospects. Is this state a good place to be? Is this action a wise choice? Value functions answer these questions.
State-Value Function V(s)
- What: V(s) estimates the total future reward the agent expects to accumulate starting from state s and following its current policy onward.
- Why it matters: it lets the agent judge whether its current situation is promising or dire.
- How it works: think of it as a "mood meter." A high V(s) means "things are going well from here"; a low V(s) means "trouble ahead."
Action-Value Function Q(s, a)
- What: Q(s, a) estimates the total future reward if the agent takes action a in state s and then follows its policy.
- Why it matters: it lets the agent compare actions and pick the best one.
- How it works: if V(s) is the mood meter, Q(s, a) is like asking, "If I take this specific action right now, will things go better or worse than average?"
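The two functions are linked: V(s) is the average of Q(s, a), weighted by how often the policy takes each action. A tiny numeric sketch, with all values invented for illustration:

```python
# pi(a|s) and Q(s,a) for three actions in a single state (invented numbers)
pi = [0.7, 0.2, 0.1]    # policy's action probabilities
Q = [5.0, 1.0, -2.0]    # estimated value of each action

# V(s) = sum over actions of pi(a|s) * Q(s,a): the expected value of acting under pi
V = sum(p * q for p, q in zip(pi, Q))  # 0.7*5.0 + 0.2*1.0 + 0.1*(-2.0) = 3.5
print(round(V, 6))
```

Notice that the best single action (Q = 5.0) is worth more than V(s): that gap is exactly what a learning algorithm can exploit by shifting probability toward the better action.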
The Bellman Equation (Intuition)
The Bellman equation expresses a simple but powerful recursive idea: the value of a state equals the immediate reward plus the (discounted) value of the next state. In plain English: "How good is it to be here? Well, how much reward do I get right now, plus how good is the place I end up?"
This recursion is the engine behind nearly all RL algorithms. We do not need to derive it formally for our purposes; the intuition is what matters. A discount factor γ (between 0 and 1) controls how much the agent values future rewards relative to immediate ones. When γ is close to 1, the agent is patient and plans ahead. When γ is close to 0, it is greedy.
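A small numeric sketch makes the recursion tangible. With invented rewards and γ = 0.9, computing the return directly and via the backward recursion G_t = r_t + γ·G_{t+1} gives the same answer:

```python
gamma = 0.9                  # discount factor
rewards = [1.0, 0.0, 2.0]    # invented rewards for a 3-step episode

# Direct definition: G_0 = r_0 + gamma*r_1 + gamma^2*r_2
G_direct = sum(gamma**t * r for t, r in enumerate(rewards))

# Bellman-style backward recursion: G_t = r_t + gamma * G_{t+1}
G = 0.0
for r in reversed(rewards):
    G = r + gamma * G

print(round(G_direct, 6), round(G, 6))  # -> 2.62 2.62
```

Try lowering gamma to 0.1: the final reward of 2.0 contributes almost nothing, which is exactly the "greedy" behavior described above.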
4. Policy Gradients: Learning by Trial and Feedback
Now we arrive at the core question: how does the agent improve its policy? The answer is the policy gradient theorem, and the intuition is remarkably straightforward.
Imagine the dog tries ten different behaviors. Three of them earn treats. The training strategy is simple: make those three behaviors more likely in the future. Behaviors that earn nothing (or a scolding) become less likely. That is the entire idea behind policy gradients.
More precisely: the agent samples actions from its policy, observes the rewards, and then adjusts its parameters (the neural network weights) to increase the probability of actions that led to high rewards and decrease the probability of actions that led to low rewards. The magnitude of each adjustment is proportional to the reward received.
Code Example 1: A Minimal RL Environment (Grid World)
```python
import numpy as np

class SimpleGridWorld:
    """A 4x4 grid where an agent must reach the goal.

    State: (row, col) position. Actions: 0=up, 1=right, 2=down, 3=left.
    Reward: +1 for reaching the goal, -0.01 per step (to encourage speed).
    LLM analogy: each 'step' is like generating one token, and the final
    reward scores the complete trajectory (the full response)."""

    def __init__(self, size=4):
        self.size = size
        self.goal = (size - 1, size - 1)
        self.reset()

    def reset(self):
        self.pos = (0, 0)
        return self.pos

    def step(self, action):
        r, c = self.pos
        if action == 0:
            r = max(r - 1, 0)                # up
        elif action == 1:
            c = min(c + 1, self.size - 1)    # right
        elif action == 2:
            r = min(r + 1, self.size - 1)    # down
        elif action == 3:
            c = max(c - 1, 0)                # left
        self.pos = (r, c)
        done = (self.pos == self.goal)
        reward = 1.0 if done else -0.01
        return self.pos, reward, done

# Run one episode with a random policy
env = SimpleGridWorld()
state = env.reset()
total_reward = 0
for step in range(100):
    action = np.random.randint(4)  # random stochastic policy
    state, reward, done = env.step(action)
    total_reward += reward
    if done:
        break
print(f"Episode finished in {step + 1} steps, total reward: {total_reward:.2f}")
```
Code Example 2: REINFORCE Policy Gradient Sketch (PyTorch)
```python
import torch
import torch.nn as nn
import torch.optim as optim
from torch.distributions import Categorical

class PolicyNetwork(nn.Module):
    """A small neural network that maps states to action probabilities.

    In an LLM, this role is played by the entire transformer: it maps
    the token sequence (state) to a distribution over the next token (action)."""

    def __init__(self, state_dim, n_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_actions),
            nn.Softmax(dim=-1)
        )

    def forward(self, state):
        return self.net(state)

# Simplified REINFORCE training loop
policy = PolicyNetwork(state_dim=2, n_actions=4)
optimizer = optim.Adam(policy.parameters(), lr=1e-3)

def reinforce_episode(env, policy):
    """Collect one episode and update the policy.

    Core idea: increase probability of actions that led to positive reward,
    decrease probability of actions that led to negative reward."""
    state = env.reset()
    log_probs = []
    rewards = []

    # 1. Roll out an episode using the current policy
    for _ in range(200):
        state_tensor = torch.FloatTensor(state)
        probs = policy(state_tensor)
        dist = Categorical(probs)
        action = dist.sample()  # stochastic action selection
        log_probs.append(dist.log_prob(action))
        state, reward, done = env.step(action.item())
        rewards.append(reward)
        if done:
            break

    # 2. Compute discounted returns (what was the total reward from each step?)
    returns = []
    G = 0
    for r in reversed(rewards):
        G = r + 0.99 * G
        returns.insert(0, G)
    returns = torch.FloatTensor(returns)
    returns = (returns - returns.mean()) / (returns.std() + 1e-8)

    # 3. Policy gradient: nudge probabilities toward high-reward actions
    loss = -sum(lp * R for lp, R in zip(log_probs, returns))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return sum(rewards)
```
Code Example 3: Complete Runnable REINFORCE on a Simple Task
```python
import torch
import torch.nn as nn
import torch.optim as optim
from torch.distributions import Categorical
import random

# Simple environment: agent must learn to pick action 2 (out of 0,1,2,3)
# Reward = +1 if action == 2, else -0.1
class SimpleEnv:
    def reset(self):
        return [random.random(), random.random()]  # random 2D state

    def step(self, action):
        reward = 1.0 if action == 2 else -0.1
        done = True  # single-step episodes for clarity
        return self.reset(), reward, done

policy = nn.Sequential(nn.Linear(2, 32), nn.ReLU(), nn.Linear(32, 4), nn.Softmax(dim=-1))
optimizer = optim.Adam(policy.parameters(), lr=3e-3)
env = SimpleEnv()

for episode in range(1, 1001):
    state = torch.FloatTensor(env.reset())
    probs = policy(state)
    dist = Categorical(probs)
    action = dist.sample()
    log_prob = dist.log_prob(action)
    _, reward, _ = env.step(action.item())
    loss = -log_prob * reward
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if episode % 200 == 0:
        # Evaluate: check how often the policy picks action 2
        correct = sum(1 for _ in range(100)
                      if policy(torch.FloatTensor(env.reset())).argmax().item() == 2)
        print(f"Episode {episode:4d} | Last reward: {reward:+.1f} | Action 2 rate: {correct}%")
```
Exercises:
- Change the target action from 2 to 0. Does the policy learn equally fast?
- Reduce the learning rate to `1e-4`. How many episodes does it take to reach 90%?
- Change the reward for wrong actions from `-0.1` to `0.0`. What happens and why?
5. PPO: Stable Policy Updates
Proximal Policy Optimization (PPO) solves a critical problem with basic policy gradients: if a single update changes the policy too drastically, performance can collapse and never recover. PPO prevents this by clipping the update, ensuring each step is small and safe.
Here is the intuition. Imagine you are adjusting a recipe. You taste the soup, decide it needs more salt, and add some. With vanilla policy gradients, nothing stops you from dumping in the entire salt shaker. PPO is the rule that says: "Never add more than one pinch at a time." You can always taste again and add another pinch, but you cannot ruin the soup with a single reckless change.
Technically, PPO computes a ratio between the new policy's probability of an action and the old policy's probability. If this ratio drifts too far from 1.0 (typically outside the range 0.8 to 1.2), the objective is clipped, so pushing the policy further in that direction yields no gradient. The policy still improves, but it cannot make a catastrophically large jump in a single step.
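The clipping mechanism can be sketched in a few lines. This is a minimal illustration for a single action, assuming a clip parameter of 0.2 (which gives the 0.8 to 1.2 band) and made-up log-probabilities and advantage:

```python
import torch

def ppo_clip_loss(new_log_prob, old_log_prob, advantage, eps=0.2):
    """PPO clipped surrogate loss for one action (illustrative sketch)."""
    ratio = torch.exp(new_log_prob - old_log_prob)           # pi_new(a|s) / pi_old(a|s)
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * advantage
    return -torch.min(unclipped, clipped)                    # negate: optimizers minimize

# The new policy is much more confident than the old one (ratio ~ 2.46 > 1.2),
# so the clipped branch caps the objective at 1.2 * advantage
loss = ppo_clip_loss(torch.tensor(-0.1), torch.tensor(-1.0), torch.tensor(2.0))
print(round(loss.item(), 4))  # -> -2.4, not -4.92: the extra confidence earns nothing
```

Because the clipped branch is flat in the ratio, gradient descent gains nothing from moving the policy further outside the band: this is the "one pinch at a time" rule in code.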
6. From RL to LLM Training: The Complete Picture
We have built up the vocabulary piece by piece. Now let us assemble the full picture of how reinforcement learning powers LLM alignment. The mapping from RL concepts to LLM training is direct and concrete.
The RLHF training pipeline works as follows:
- Supervised fine-tuning (SFT): First, the LLM is fine-tuned on high-quality human demonstrations. This gives the policy a good starting point.
- Reward model training: Human annotators rank multiple model outputs for the same prompt. A separate neural network (the reward model) is trained to predict these human preferences.
- RL optimization with PPO: The LLM generates responses to prompts. The reward model scores each response. PPO uses these scores to update the LLM's weights, nudging the probability distribution toward responses that score highly.
A critical detail: during the PPO phase, a KL penalty prevents the RL-trained model from drifting too far from the SFT model. Without this constraint, the model might find degenerate responses that exploit the reward model (for example, generating repetitive flattery that scores highly but is useless). The KL penalty keeps the model close to its pre-trained language abilities.
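Here is a sketch of how that penalized reward might be assembled for one token position. The logits, KL coefficient, and reward-model score are all invented for illustration:

```python
import torch
import torch.nn.functional as F

# Next-token logits from the RL policy and from the frozen SFT reference
policy_logits = torch.tensor([2.0, 0.5, -1.0])  # invented values
ref_logits = torch.tensor([1.5, 0.7, -0.5])     # invented values

log_p = F.log_softmax(policy_logits, dim=-1)
log_q = F.log_softmax(ref_logits, dim=-1)
kl = (log_p.exp() * (log_p - log_q)).sum()      # KL(policy || reference), always >= 0

beta = 0.1       # KL coefficient (assumed)
rm_score = 0.8   # reward model's score for the response (assumed)
shaped_reward = rm_score - beta * kl.item()     # the signal PPO actually optimizes
print(f"KL = {kl.item():.4f}, shaped reward = {shaped_reward:.4f}")
```

The further the policy drifts from the reference distribution, the larger the KL term and the smaller the shaped reward, so drifting is only worthwhile when the reward model's score rises enough to pay for it.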
Recent research has explored an alternative to training a separate reward model: using verifiable rewards instead. In RLVR, the reward signal comes from an objective, automated check rather than a learned model. For example, in math reasoning tasks, the reward is simply whether the model's final answer matches the known correct answer. In coding tasks, the reward is whether the generated code passes a test suite.
RLVR eliminates the reward hacking problem entirely for domains where correctness can be verified. DeepSeek-R1 (2025) demonstrated that RL with verifiable rewards could dramatically improve mathematical reasoning capabilities. The limitation is that RLVR only works when you can automatically verify the output, which is not possible for open-ended tasks like creative writing or nuanced conversation. We will explore both RLHF and RLVR in detail in Module 16.
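For domains like math, the verifier can be strikingly simple. A sketch (the function name and its exact-match rule are illustrative; real verifiers normalize answers far more carefully):

```python
def verifiable_reward(model_answer: str, correct_answer: str) -> float:
    """Reward 1.0 iff the final answer matches exactly (after trimming whitespace)."""
    return 1.0 if model_answer.strip() == correct_answer.strip() else 0.0

print(verifiable_reward("42", "42"))  # -> 1.0
print(verifiable_reward("41", "42"))  # -> 0.0
```

Because this check cannot be flattered or gamed the way a learned reward model can, the only way for the policy to score well is to actually produce correct answers.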
7. Putting It All Together
Let us trace the full journey from RL foundations to LLM training. An LLM begins as a pretrained model that can generate fluent text but may produce harmful, incorrect, or unhelpful responses. We want to steer it toward better behavior.
We frame this as an RL problem. The LLM is the agent (the policy). Each prompt starts a new episode. At each time step, the model picks a token (an action) based on its current context (the state). When the response is complete, the reward model assigns a score (the reward). PPO adjusts the model's weights so that high-scoring response patterns become more probable, while the clipping mechanism and KL penalty ensure the model does not lose its fluency or degenerate into reward hacking.
The dog analogy still holds at this scale. The dog does not understand English; it learns by trying actions and receiving treats. The LLM does not "understand" human values; it learns which outputs are valued by observing reward signals. The elegance of RL is that this simple loop of action, feedback, and adjustment can produce remarkably sophisticated behavior.
Check Your Understanding
1. In the RLHF framework for LLM training, what plays the role of the RL "action"?
2. Why does PPO use clipping in its objective function?
3. What is "reward hacking" in the context of RLHF?
Key Takeaways
- RL is a learning paradigm where an agent improves through trial and feedback, not labeled examples. The core loop is: observe state, take action, receive reward, update policy.
- An LLM is a policy that maps a context (state) to a probability distribution over tokens (actions). RLHF uses this mapping directly.
- Value functions (V and Q) estimate long-term expected reward. The Bellman equation gives them a recursive structure.
- Policy gradients adjust the policy to make high-reward actions more probable. This is the mechanism that steers LLM outputs toward human preferences.
- PPO stabilizes training by clipping updates, preventing catastrophic changes. It is the standard RL optimizer for RLHF.
- Reward hacking is the central risk: the LLM may exploit an imperfect reward model. KL penalties and verifiable rewards (RLVR) are mitigations.
- These foundations are prerequisites for Module 16, where we will implement RLHF, DPO, and RLVR for real language models.
What Comes Next
With ML fundamentals, deep learning, PyTorch, and reinforcement learning now in your toolkit, you have completed Module 00. In Module 01: Foundations of NLP and Text Representation, we shift from general machine learning to the specific domain of language. You will learn how text is converted into numbers (tokenization and embeddings), how classical NLP techniques work, and why representing words as vectors was one of the most transformative ideas in the field. These representations are the raw material that transformers (Module 04) will learn to process.