RLVR removes humans from the reward loop entirely by using objectively verifiable correctness signals. For domains like mathematics, programming, and formal proofs, the correctness of an answer can be checked automatically: a math solution is either right or wrong, code either passes tests or fails, and a proof either verifies or does not. RLVR exploits this property to train reasoning models at massive scale without any human preference data. This paradigm powered DeepSeek-R1 and has sparked an open ecosystem of reasoning models that achieve frontier-level performance on mathematical and coding benchmarks.
1. The Verifiable Reward Paradigm
Standard RLHF relies on a learned reward model trained on human preferences. This reward model is imperfect: it can be gamed, it introduces noise, and it caps the alignment quality at the level of annotator agreement. For domains where correctness is objectively verifiable, we can bypass the reward model entirely and use the ground truth as the reward signal.
The key insight behind RLVR is simple: if you can write a function that checks whether an answer is correct, you have a reliable reward signal. No reward model training, no human annotation, no preference noise. The reward is binary (correct or incorrect) or graded (partially correct), and unlike a learned reward model it cannot drift or be gamed, though answer extraction can still introduce some noise in practice.
1.1 Types of Verifiable Rewards
| Domain | Verification Method | Reward Type | Dataset Examples |
|---|---|---|---|
| Mathematics | Compare to known answer; symbolic checking | Binary (correct/wrong) | GSM8K, MATH, AIME |
| Code generation | Execute against test suite | Graded (fraction of tests passed) | HumanEval, MBPP, SWE-bench |
| Formal proofs | Lean/Coq/Isabelle type checker | Binary (verifies/fails) | miniF2F, ProofNet |
| Logic puzzles | Constraint satisfaction check | Binary (satisfies/violates) | FOLIO, ProntoQA |
| Format compliance | Regex, JSON schema validation | Binary (valid/invalid) | Custom structured output tasks |
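The last row of the table, format compliance, is the easiest verifier to build in practice. A minimal sketch (the function name `format_reward` and the required-keys rule are illustrative, not tied to any specific dataset):

```python
import json

def format_reward(model_output: str, required_keys: set) -> float:
    """Binary reward: 1.0 if the output is a JSON object with all required keys."""
    try:
        parsed = json.loads(model_output)
    except json.JSONDecodeError:
        return 0.0
    if not isinstance(parsed, dict):
        return 0.0
    # dict.keys() is a set-like view, so subset comparison works directly
    return 1.0 if required_keys <= parsed.keys() else 0.0

print(format_reward('{"name": "a", "age": 3}', {"name", "age"}))  # → 1.0
print(format_reward('not json at all', {"name"}))                 # → 0.0
```

Because the check is cheap and deterministic, it can be run on every sampled completion during RL with negligible overhead.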
```python
# Verifiable reward functions for different domains
import subprocess
import re
from typing import Optional


def math_reward(
    model_answer: str,
    ground_truth: str,
    tolerance: float = 1e-6,
) -> float:
    """Binary reward for math: 1.0 if correct, 0.0 otherwise."""
    # Extract the final answer from the model output
    extracted = extract_final_answer(model_answer)
    if extracted is None:
        return 0.0
    try:
        model_val = float(extracted)
        truth_val = float(ground_truth)
        return 1.0 if abs(model_val - truth_val) < tolerance else 0.0
    except ValueError:
        # Fall back to string comparison for symbolic answers
        return 1.0 if extracted.strip() == ground_truth.strip() else 0.0


def code_reward(
    generated_code: str,
    test_cases: list,
    timeout: int = 10,
) -> float:
    """Graded reward for code: fraction of tests passed.

    Note: this runs model-generated code directly; in production it
    should execute inside a sandbox.
    """
    passed = 0
    for test in test_cases:
        full_code = generated_code + "\n" + test["test_code"]
        try:
            result = subprocess.run(
                ["python", "-c", full_code],
                capture_output=True,
                timeout=timeout,
                text=True,
            )
            if result.returncode == 0:
                passed += 1
        except subprocess.TimeoutExpired:
            continue
    return passed / len(test_cases) if test_cases else 0.0


def proof_reward(
    proof_text: str,
    theorem_statement: str,
    lean_project_path: str,
) -> float:
    """Binary reward for Lean 4 proofs."""
    # Write the candidate proof to a temporary Lean file
    lean_code = f"{theorem_statement}\n{proof_text}"
    with open(f"{lean_project_path}/Temp.lean", "w") as f:
        f.write(lean_code)
    try:
        result = subprocess.run(
            ["lake", "build", "Temp"],
            capture_output=True,
            cwd=lean_project_path,
            timeout=60,
        )
    except subprocess.TimeoutExpired:
        return 0.0
    return 1.0 if result.returncode == 0 else 0.0


def extract_final_answer(text: str) -> Optional[str]:
    """Extract the final answer from a math response."""
    # Look for \boxed{...} first, then "the answer is ..." phrasing
    boxed = re.findall(r"\\boxed\{([^}]+)\}", text)
    if boxed:
        return boxed[-1]
    answer_pattern = re.findall(
        r"(?:the answer is|therefore|thus)[:\s]+([^\n.]+)",
        text, re.IGNORECASE,
    )
    if answer_pattern:
        return answer_pattern[-1].strip()
    return None
```
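The `\boxed{...}` extraction deserves a sanity check, since a failed extraction silently becomes a zero reward. A self-contained demo with the same regex inlined (`extract_boxed` is a local name for this sketch):

```python
import re

def extract_boxed(text: str):
    """Return the payload of the last \\boxed{...}, or None if absent."""
    matches = re.findall(r"\\boxed\{([^}]+)\}", text)
    return matches[-1] if matches else None

# Taking the LAST box matters: reasoning models often revise an earlier answer
response = r"First attempt gives \boxed{47}. Correcting the arithmetic: \boxed{48}."
print(extract_boxed(response))                 # → 48
print(extract_boxed("no boxed answer here"))   # → None
```

Models that phrase a correct answer in an unexpected format are penalized as if wrong, which is one source of noise in otherwise "perfect" verifiable rewards.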
2. The GRPO Algorithm for Reasoning
Group Relative Policy Optimization (GRPO), introduced in Section 16.1, becomes especially powerful when combined with verifiable rewards. For math problems, GRPO generates a group of solutions, checks each one against the ground truth, and uses the binary outcomes to compute advantages. Solutions that reach the correct answer get positive advantage; those that fail get negative advantage. No reward model is needed.
The combination of GRPO + verifiable rewards creates a powerful self-improvement loop. The model learns which reasoning patterns lead to correct answers simply by generating many attempts and observing which ones succeed. This is analogous to how a student improves at math by solving many problems and checking the answer key, without needing a teacher to grade the work.
```python
# GRPO with Verifiable Rewards for Math Reasoning
import torch
from typing import List


def grpo_math_training_step(
    policy_model,
    ref_model,
    tokenizer,
    math_problems: List[dict],  # {"prompt": ..., "answer": ...}
    group_size: int = 16,
    beta: float = 0.04,
    clip_range: float = 0.2,
):
    """
    One GRPO training step using verifiable math rewards.

    For each problem, generate group_size solutions,
    check correctness, normalize rewards within the group,
    and compute the policy gradient.
    """
    total_loss = 0.0
    total_correct = 0
    total_generated = 0
    for problem in math_problems:
        prompt = problem["prompt"]
        ground_truth = problem["answer"]
        input_ids = tokenizer.encode(prompt, return_tensors="pt")

        # Generate a group of solutions
        solutions = []
        for _ in range(group_size):
            output = policy_model.generate(
                input_ids,
                max_new_tokens=1024,
                do_sample=True,
                temperature=0.7,
                top_p=0.95,
            )
            solutions.append(output[0])

        # Compute verifiable rewards (binary: correct or not)
        rewards = torch.tensor([
            math_reward(tokenizer.decode(s), ground_truth)
            for s in solutions
        ])
        total_correct += rewards.sum().item()
        total_generated += group_size

        # Skip if all correct or all wrong (no learning signal)
        if rewards.sum() == 0 or rewards.sum() == group_size:
            continue

        # Group normalization (the GRPO core idea)
        advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

        # Clipped policy-gradient loss for each solution.
        # compute_logprobs is an assumed helper returning per-token
        # log-probabilities of `solution` under the given model.
        for solution, advantage in zip(solutions, advantages):
            policy_logprobs = compute_logprobs(
                policy_model, input_ids, solution
            )
            with torch.no_grad():
                ref_logprobs = compute_logprobs(
                    ref_model, input_ids, solution
                )
            # PPO-style clipped objective (simplified here: the frozen
            # reference model stands in for the behavior policy)
            ratio = torch.exp(policy_logprobs - ref_logprobs)
            surr1 = ratio * advantage
            surr2 = torch.clamp(
                ratio, 1 - clip_range, 1 + clip_range
            ) * advantage
            policy_loss = -torch.min(surr1, surr2).mean()
            # KL penalty keeps the policy close to the reference model
            kl = (ref_logprobs - policy_logprobs).mean()
            total_loss += policy_loss + beta * kl
    accuracy = total_correct / total_generated if total_generated > 0 else 0.0
    return total_loss / len(math_problems), accuracy
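The group normalization at the heart of this step is easy to verify in isolation. With binary rewards, the few correct solutions in a mostly-wrong group receive a large positive advantage while the wrong ones share a small negative one. A self-contained numeric sketch using plain Python (`statistics.stdev` matches `torch.std`'s default sample standard deviation):

```python
from statistics import mean, stdev

# A group of 8 sampled solutions, 2 of which verified as correct
rewards = [1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0]
mu, sigma = mean(rewards), stdev(rewards)
advantages = [(r - mu) / (sigma + 1e-8) for r in rewards]
print(advantages)
# All correct solutions share one positive advantage, all wrong ones
# share one negative advantage, and the advantages sum to ~0: the
# gradient pushes probability mass from failed toward successful chains.
```

This is also why the training step skips all-correct and all-wrong groups: with zero within-group variance there is no relative signal to learn from.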
3. The DeepSeek-R1 Training Pipeline
DeepSeek-R1 demonstrated that RLVR can produce frontier-level reasoning capabilities. The training pipeline has four stages, each building on the previous one. The most remarkable finding, shown by the companion model DeepSeek-R1-Zero, was that RL training alone (without any supervised reasoning data) could induce chain-of-thought behavior emergently.
The "aha moment" in this training refers to the point where the model spontaneously began producing extended reasoning chains, including self-correction and backtracking, despite never being trained on chain-of-thought examples. The RL objective (get the right answer) provided sufficient signal for the model to discover that explicit reasoning improves accuracy. This emergent behavior is one of the most striking results in recent AI research.
```python
# Simplified DeepSeek-R1 style training pipeline
from dataclasses import dataclass


@dataclass
class RLVRPipelineConfig:
    """Configuration for a full RLVR training pipeline."""
    base_model: str = "deepseek-ai/DeepSeek-V3-Base"
    cold_start_data: str = "path/to/reasoning_examples.json"
    math_dataset: str = "hendrycks/MATH"
    code_dataset: str = "openai/humaneval"
    # GRPO hyperparameters
    group_size: int = 16
    beta: float = 0.04
    clip_range: float = 0.2
    learning_rate: float = 1e-6
    # Training schedule
    cold_start_epochs: int = 2
    rl_steps: int = 10000
    rejection_sampling_n: int = 64


def run_rlvr_pipeline(config: RLVRPipelineConfig):
    """Run the full RLVR pipeline (simplified).

    Helpers such as load_json, sft_train, load_dataset,
    sample_mixed_batch, load_diverse_prompts, score_response, and
    final_alignment_rl, along with ref_model and tokenizer, are
    placeholders for infrastructure a real pipeline must provide.
    """
    # Stage 1: Cold-start SFT
    print("Stage 1: Cold-start SFT on reasoning examples")
    cold_start_data = load_json(config.cold_start_data)
    model = sft_train(
        config.base_model,
        cold_start_data,
        epochs=config.cold_start_epochs,
    )

    # Stage 2: RLVR with GRPO
    print("Stage 2: RLVR training with verifiable rewards")
    math_problems = load_dataset(config.math_dataset)
    code_problems = load_dataset(config.code_dataset)
    for step in range(config.rl_steps):
        # Sample a batch of problems (mix of math and code)
        batch = sample_mixed_batch(math_problems, code_problems)
        # GRPO step with verifiable rewards
        loss, accuracy = grpo_math_training_step(
            policy_model=model,
            ref_model=ref_model,
            tokenizer=tokenizer,
            math_problems=batch,
            group_size=config.group_size,
            beta=config.beta,
        )
        if step % 100 == 0:
            print(f"  Step {step}: loss={loss:.4f}, acc={accuracy:.3f}")

    # Stage 3: Rejection sampling + SFT
    print("Stage 3: Rejection sampling on diverse prompts")
    diverse_prompts = load_diverse_prompts()
    best_outputs = []
    for prompt in diverse_prompts:
        candidates = [
            model.generate(prompt) for _ in range(config.rejection_sampling_n)
        ]
        # Score every candidate and keep only the best one
        scored = [(c, score_response(c)) for c in candidates]
        best = max(scored, key=lambda x: x[1])
        best_outputs.append({"prompt": prompt, "response": best[0]})
    model = sft_train(model, best_outputs, epochs=1)

    # Stage 4: Final alignment RL
    print("Stage 4: General alignment with helpfulness/safety rewards")
    # Mix verifiable rewards with general preference rewards
    model = final_alignment_rl(model)
    return model
```
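Stage 3's best-of-N selection is worth seeing on its own: with a verifiable scorer standing in for `score_response`, rejection sampling reduces to "keep only what verifies." A self-contained sketch (the candidate strings, the toy verifier, and the shortest-accepted tie-breaker are all illustrative choices):

```python
def best_of_n(candidates, verifier):
    """Keep candidates the verifier accepts; return the shortest one
    (a common tie-breaker that favors concise solutions)."""
    accepted = [c for c in candidates if verifier(c) == 1.0]
    return min(accepted, key=len) if accepted else None

# Toy verifier: accepts responses whose text ends with the ground truth "42"
verifier = lambda text: 1.0 if text.strip().endswith("42") else 0.0

candidates = [
    "After a long derivation, the result is 41",
    "Step 1... Step 2... the answer is 42",
    "The answer is 42",
]
print(best_of_n(candidates, verifier))  # → The answer is 42
```

Fine-tuning on these filtered outputs distills the model's own best behavior back into its weights, which is why rejection sampling plus SFT works as an amplification step.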
4. Extensions Beyond Math and Code
While RLVR has been most successful in mathematics and coding, researchers are actively exploring extensions to other domains. The key challenge is constructing reliable verifiers for domains where correctness is less clearly defined.
Extending RLVR beyond math and code is challenging because most real-world tasks lack clean verifiable signals. A customer service response cannot be automatically graded as "correct" or "incorrect" in the same way a math solution can. Hybrid approaches that combine verifiable rewards for structured components (factual claims, format compliance) with learned rewards for subjective quality are a promising research direction.
| Domain | Verification Approach | Feasibility | Challenge |
|---|---|---|---|
| Fact-checking | Cross-reference with knowledge base | Medium | Knowledge base completeness |
| Translation | Round-trip translation consistency | Medium | Many valid translations exist |
| Structured output | Schema validation (JSON, XML) | High | Format correctness is not content correctness |
| Tool use | Execute tool call, check result | High | Side effects, environment setup |
| Scientific reasoning | Dimensional analysis, unit checking | Medium | Many steps resist automation |
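One way to realize the hybrid approach described above is a weighted combination of a verifiable component and a learned quality score, with the verifiable check acting as a gate. A minimal sketch, where the weights, the toy format check, and the length-based stand-in for a learned scorer are all assumptions:

```python
def hybrid_reward(
    response: str,
    format_check,      # verifiable component: returns 1.0 or 0.0
    quality_score,     # learned/heuristic component: returns a score in [0, 1]
    w_format: float = 0.5,
    w_quality: float = 0.5,
) -> float:
    """Gate on the verifiable component: a malformed response scores 0
    regardless of fluency, so the learned reward cannot be gamed by
    well-written but invalid outputs."""
    fmt = format_check(response)
    if fmt == 0.0:
        return 0.0
    return w_format * fmt + w_quality * quality_score(response)

# Toy components for illustration only
is_jsonish = lambda s: 1.0 if s.strip().startswith("{") and s.strip().endswith("}") else 0.0
length_quality = lambda s: min(len(s) / 100, 1.0)  # stand-in for a learned scorer

print(hybrid_reward('{"claim": "..."}', is_jsonish, length_quality))
print(hybrid_reward("eloquent but free-form text", is_jsonish, length_quality))  # → 0.0
```

The gating design is the important choice: multiplicative or gated combinations prevent the subjective component from compensating for objective failures.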
5. The Open Reasoning Ecosystem
DeepSeek-R1's success with RLVR sparked rapid reproduction efforts across the open-source community. Several projects have demonstrated that the core approach can be replicated at smaller scales with impressive results.
```python
# Using open reasoning models for inference
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load an open reasoning model
model_name = "Qwen/QwQ-32B"  # Open reasoning model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype="bfloat16", device_map="auto"
)

# Reasoning models produce extended thinking chains
prompt = """Solve this step by step:
A train travels from City A to City B at 60 km/h. The return trip
is made at 40 km/h. What is the average speed for the round trip?"""

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=2048,
    do_sample=True,  # sampling must be enabled for temperature/top_p to apply
    temperature=0.6,
    top_p=0.95,
)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)

# The model will produce a detailed reasoning chain along these lines:
# "Let me think about this carefully...
#  Let the distance be d km.
#  Time for A to B: d/60 hours
#  Time for B to A: d/40 hours
#  Total distance: 2d
#  Total time: d/60 + d/40 = d(2+3)/120 = 5d/120 = d/24
#  Average speed: 2d / (d/24) = 48 km/h
#  Wait, let me verify: ..."
```
The open reasoning ecosystem demonstrates a remarkable pattern: once a training recipe is published, the community can reproduce and extend it rapidly. Sky-T1 (from the NovaSky team) reached strong reasoning performance for under $500 of training compute. Open-source reasoning models now match or exceed GPT-4o on mathematical reasoning benchmarks, showing that RLVR combined with sufficient base model quality and training data can achieve frontier performance without proprietary infrastructure.
5.1 Open Reasoning Models Comparison
| Model | Base Size | Training Method | MATH-500 | AIME 2024 |
|---|---|---|---|---|
| DeepSeek-R1 | 671B MoE | GRPO + RLVR (4 stages) | 97.3% | 79.8% |
| QwQ-32B | 32B | RLVR + SFT | 90.6% | 50.0% |
| DeepSeek-R1-Distill-32B | 32B (distilled) | SFT from R1 outputs | 94.3% | 72.6% |
| Sky-T1-32B | 32B | SFT on distilled reasoning traces | 82.4% | 43.3% |
| OpenAI o1 | Unknown | Proprietary RL | 96.4% | 83.3% |
Distillation from reasoning models (like DeepSeek-R1-Distill) offers a practical shortcut: instead of running the full RLVR pipeline, you can fine-tune a smaller model on the reasoning traces produced by a larger one. This approach achieves strong results at a fraction of the training cost, though the distilled model may struggle on problems significantly harder than those in the training set.
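Preparing a distillation dataset is mostly data plumbing: collect the teacher's reasoning traces, keep only those whose final answer verifies, and format the survivors as ordinary SFT pairs. A hedged sketch (the record fields and the toy verifier are illustrative, not any project's actual schema):

```python
def build_distillation_set(teacher_outputs, verifier):
    """Keep only traces whose final answer verifies, then format
    them as (prompt, completion) pairs for standard SFT."""
    sft_examples = []
    for record in teacher_outputs:
        if verifier(record["trace"], record["answer"]) == 1.0:
            sft_examples.append({
                "prompt": record["problem"],
                "completion": record["trace"],
            })
    return sft_examples

# Toy data: one trace ends in the right answer, one does not
toy_verifier = lambda trace, ans: 1.0 if trace.endswith(ans) else 0.0
teacher_outputs = [
    {"problem": "2+2?", "trace": "2+2 = 4", "answer": "4"},
    {"problem": "3+3?", "trace": "3+3 = 7", "answer": "6"},
]
print(build_distillation_set(teacher_outputs, toy_verifier))
# Only the verified trace survives the filter
```

Filtering on verifiability matters here too: training the student on the teacher's incorrect traces would distill the mistakes along with the reasoning style.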
✅ Key Takeaways
- RLVR uses automatically verifiable correctness signals (math answers, code tests, proof checkers) as rewards, eliminating the need for human annotation or learned reward models.
- GRPO combined with verifiable rewards creates an efficient self-improvement loop where the model learns which reasoning patterns lead to correct answers.
- DeepSeek-R1 demonstrated that RLVR can induce emergent chain-of-thought reasoning, including self-correction and backtracking, without any supervised reasoning examples.
- The four-stage pipeline (cold-start SFT, RLVR, rejection sampling, final alignment) has become the standard recipe for training reasoning models.
- Extending RLVR beyond math and code requires constructing reliable verifiers, which remains an open challenge for most real-world tasks.
- The open reasoning ecosystem (QwQ, Sky-T1, R1-Distill) shows that RLVR results can be reproduced at modest scale, democratizing access to reasoning capabilities.