Module 16 · Section 16.4

RLVR: Reinforcement Learning with Verifiable Rewards

Training reasoning models with automated, verifiable reward signals from math, code, and formal proofs
★ Big Picture

RLVR removes humans from the reward loop entirely by using objectively verifiable correctness signals. For domains like mathematics, programming, and formal proofs, the correctness of an answer can be checked automatically: a math solution is either right or wrong, code either passes tests or fails, and a proof either verifies or does not. RLVR exploits this property to train reasoning models at massive scale without any human preference data. This paradigm powered DeepSeek-R1 and has sparked an open ecosystem of reasoning models that achieve frontier-level performance on mathematical and coding benchmarks.

1. The Verifiable Reward Paradigm

Standard RLHF relies on a learned reward model trained on human preferences. This reward model is imperfect: it can be gamed, it introduces noise, and it caps the alignment quality at the level of annotator agreement. For domains where correctness is objectively verifiable, we can bypass the reward model entirely and use the ground truth as the reward signal.

The key insight behind RLVR is simple: if you can write a function that checks whether an answer is correct, you have a perfect reward signal. No reward model training, no human annotation, no preference noise. The reward is binary (correct or incorrect) or graded (partially correct), and it is always accurate.

[Figure: side-by-side comparison. RLHF pipeline: LLM generates → learned, noisy reward model → approximate reward (e.g., r = 0.73). RLVR pipeline: LLM generates → verifier (test suite / checker) → exact reward (r = 1.0). Verifiable domains: math (check final answer), code (run test suite), proofs (formal verification), logic (truth table validation).]
Figure 16.9: RLHF uses a learned (noisy) reward model. RLVR uses verifiable correctness checks, producing exact reward signals without human annotation.

1.1 Types of Verifiable Rewards

| Domain | Verification Method | Reward Type | Dataset Examples |
| --- | --- | --- | --- |
| Mathematics | Compare to known answer; symbolic checking | Binary (correct/wrong) | GSM8K, MATH, AIME |
| Code generation | Execute against test suite | Graded (fraction of tests passed) | HumanEval, MBPP, SWE-bench |
| Formal proofs | Lean/Coq/Isabelle type checker | Binary (verifies/fails) | miniF2F, ProofNet |
| Logic puzzles | Constraint satisfaction check | Binary (satisfies/violates) | FOLIO, ProntoQA |
| Format compliance | Regex, JSON schema validation | Binary (valid/invalid) | Custom structured output tasks |
# Verifiable reward functions for different domains
import subprocess
import json
import re
from typing import Optional

def math_reward(
    model_answer: str,
    ground_truth: str,
    tolerance: float = 1e-6,
) -> float:
    """Binary reward for math: 1.0 if correct, 0.0 otherwise."""
    # Extract numerical answer from model output
    extracted = extract_final_answer(model_answer)
    if extracted is None:
        return 0.0

    try:
        model_val = float(extracted)
        truth_val = float(ground_truth)
        return 1.0 if abs(model_val - truth_val) < tolerance else 0.0
    except ValueError:
        # String comparison for symbolic answers
        return 1.0 if extracted.strip() == ground_truth.strip() else 0.0


def code_reward(
    generated_code: str,
    test_cases: list,
    timeout: int = 10,
) -> float:
    """Graded reward for code: fraction of tests passed."""
    passed = 0
    for test in test_cases:
        full_code = generated_code + "\n" + test["test_code"]
        try:
            result = subprocess.run(
                ["python", "-c", full_code],
                capture_output=True,
                timeout=timeout,
                text=True,
            )
            if result.returncode == 0:
                passed += 1
        except subprocess.TimeoutExpired:
            continue

    return passed / len(test_cases) if test_cases else 0.0


def proof_reward(
    proof_text: str,
    theorem_statement: str,
    lean_project_path: str,
) -> float:
    """Binary reward for Lean 4 proofs."""
    # Write proof to a temporary Lean file
    lean_code = f"{theorem_statement}\n{proof_text}"
    with open(f"{lean_project_path}/Temp.lean", "w") as f:
        f.write(lean_code)

    try:
        result = subprocess.run(
            ["lake", "build", "Temp"],
            capture_output=True,
            cwd=lean_project_path,
            timeout=60,
        )
    except subprocess.TimeoutExpired:
        # A hung build counts as a failed proof
        return 0.0
    return 1.0 if result.returncode == 0 else 0.0


def extract_final_answer(text: str) -> Optional[str]:
    """Extract boxed answer from math response."""
    # Look for \boxed{...} or "The answer is ..."
    boxed = re.findall(r"\\boxed\{([^}]+)\}", text)
    if boxed:
        return boxed[-1]

    answer_pattern = re.findall(
        r"(?:the answer is|therefore|thus)[:\s]+([^\n.]+)",
        text, re.IGNORECASE,
    )
    if answer_pattern:
        return answer_pattern[-1].strip()

    return None

2. The GRPO Algorithm for Reasoning

Group Relative Policy Optimization (GRPO), introduced in Section 16.1, becomes especially powerful when combined with verifiable rewards. For math problems, GRPO generates a group of solutions, checks each one against the ground truth, and uses the binary outcomes to compute advantages. Solutions that reach the correct answer get positive advantage; those that fail get negative advantage. No reward model is needed.

💡 Key Insight

The combination of GRPO + verifiable rewards creates a powerful self-improvement loop. The model learns which reasoning patterns lead to correct answers simply by generating many attempts and observing which ones succeed. This is analogous to how a student improves at math by solving many problems and checking the answer key, without needing a teacher to grade the work.
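To make the group-normalization step concrete, here is a minimal numeric sketch; the group size of 8 and the count of 2 verified-correct solutions are illustrative assumptions:

```python
import torch

# A group of 8 sampled solutions: 2 verified correct, 6 verified wrong
rewards = torch.tensor([1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])

# GRPO's core step: normalize rewards within the group
advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

# Correct solutions receive a positive advantage, incorrect ones a
# negative advantage, and the group advantages sum to roughly zero:
# the group mean acts as a baseline with no learned value network.
print(advantages)
```

Note that if all 8 solutions were correct (or all wrong), the numerator would be zero for every solution, which is why the training step below skips such groups.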

# GRPO with Verifiable Rewards for Math Reasoning
import torch
from typing import List, Callable

def grpo_math_training_step(
    policy_model,
    ref_model,
    tokenizer,
    math_problems: List[dict],  # {"prompt": ..., "answer": ...}
    group_size: int = 16,
    beta: float = 0.04,
    clip_range: float = 0.2,
):
    """
    One GRPO training step using verifiable math rewards.

    For each problem, generate group_size solutions,
    check correctness, normalize rewards within the group,
    and compute the policy gradient.
    """
    total_loss = 0.0
    total_correct = 0
    total_generated = 0

    for problem in math_problems:
        prompt = problem["prompt"]
        ground_truth = problem["answer"]
        input_ids = tokenizer.encode(prompt, return_tensors="pt")

        # Generate a group of solutions
        solutions = []
        for _ in range(group_size):
            output = policy_model.generate(
                input_ids,
                max_new_tokens=1024,
                do_sample=True,
                temperature=0.7,
                top_p=0.95,
            )
            solutions.append(output[0])

        # Compute verifiable rewards (binary: correct or not)
        rewards = torch.tensor([
            math_reward(tokenizer.decode(s), ground_truth)
            for s in solutions
        ])

        total_correct += rewards.sum().item()
        total_generated += group_size

        # Skip if all correct or all wrong (no learning signal)
        if rewards.sum() == 0 or rewards.sum() == group_size:
            continue

        # Group normalization (GRPO core idea)
        advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

        # Compute policy gradient for each solution
        for solution, advantage in zip(solutions, advantages):
            policy_logprobs = compute_logprobs(
                policy_model, input_ids, solution
            )
            with torch.no_grad():
                ref_logprobs = compute_logprobs(
                    ref_model, input_ids, solution
                )

            # PPO-style clipped objective. Simplification: the frozen
            # reference model doubles as the old-policy snapshot; a full
            # implementation caches log-probs from the sampling policy.
            ratio = torch.exp(policy_logprobs - ref_logprobs)
            surr1 = ratio * advantage
            surr2 = torch.clamp(
                ratio, 1 - clip_range, 1 + clip_range
            ) * advantage
            policy_loss = -torch.min(surr1, surr2).mean()

            # KL penalty against the reference model (crude estimator;
            # GRPO proper uses the unbiased k3 form)
            kl = (ref_logprobs - policy_logprobs).mean()
            total_loss += policy_loss + beta * kl

    accuracy = total_correct / total_generated if total_generated > 0 else 0
    return total_loss / len(math_problems), accuracy

3. The DeepSeek-R1 Training Pipeline

DeepSeek-R1 demonstrated that RLVR can produce frontier-level reasoning capabilities. The training pipeline has four stages, each building on the previous one. The most remarkable finding was that RL training alone (without any supervised reasoning data) could induce chain-of-thought behavior emergently.

[Figure: the four-stage DeepSeek-R1 pipeline. Stage 1: cold-start SFT on a small set of reasoning examples (thousands). Stage 2: RLVR with GRPO on math and code rewards, producing emergent CoT. Stage 3: rejection sampling plus SFT on the best outputs of the Stage 2 model. Stage 4: final RL with general alignment (helpfulness/safety). Key finding: chain-of-thought, self-verification, and backtracking emerged spontaneously from RL on math/code rewards; the "aha moment" was the model learning to re-examine its reasoning mid-generation.]
Figure 16.10: The DeepSeek-R1 four-stage training pipeline. RLVR with GRPO (Stage 2) induces emergent chain-of-thought reasoning. Subsequent stages refine and generalize this capability.
📝 Note

The "aha moment" in DeepSeek-R1 training refers to the point where the model spontaneously began producing extended reasoning chains, including self-correction and backtracking, despite never being trained on chain-of-thought examples. The RL objective (get the right answer) provided sufficient signal for the model to discover that explicit reasoning improves accuracy. This emergent behavior is one of the most striking results in recent AI research.

# Simplified DeepSeek-R1 style training pipeline
from dataclasses import dataclass
from typing import List, Dict
import json

@dataclass
class RLVRPipelineConfig:
    """Configuration for a full RLVR training pipeline."""
    base_model: str = "deepseek-ai/DeepSeek-V3-Base"
    cold_start_data: str = "path/to/reasoning_examples.json"
    math_dataset: str = "hendrycks/MATH"
    code_dataset: str = "openai/humaneval"

    # GRPO hyperparameters
    group_size: int = 16
    beta: float = 0.04
    clip_range: float = 0.2
    learning_rate: float = 1e-6

    # Training schedule
    cold_start_epochs: int = 2
    rl_steps: int = 10000
    rejection_sampling_n: int = 64

def run_rlvr_pipeline(config: RLVRPipelineConfig):
    """Run the full RLVR pipeline (simplified)."""

    # Stage 1: Cold-start SFT
    print("Stage 1: Cold-start SFT on reasoning examples")
    cold_start_data = load_json(config.cold_start_data)
    model = sft_train(
        config.base_model,
        cold_start_data,
        epochs=config.cold_start_epochs,
    )

    # Stage 2: RLVR with GRPO
    print("Stage 2: RLVR training with verifiable rewards")
    math_problems = load_dataset(config.math_dataset)
    code_problems = load_dataset(config.code_dataset)

    for step in range(config.rl_steps):
        # Sample a batch of problems (mix of math and code; code
        # problems would use code_reward in place of math_reward)
        batch = sample_mixed_batch(math_problems, code_problems)

        # GRPO step with verifiable rewards
        loss, accuracy = grpo_math_training_step(
            policy_model=model,
            ref_model=ref_model,
            tokenizer=tokenizer,
            math_problems=batch,
            group_size=config.group_size,
            beta=config.beta,
        )

        if step % 100 == 0:
            print(f"  Step {step}: loss={loss:.4f}, acc={accuracy:.3f}")

    # Stage 3: Rejection sampling + SFT
    print("Stage 3: Rejection sampling on diverse prompts")
    diverse_prompts = load_diverse_prompts()
    best_outputs = []
    for prompt in diverse_prompts:
        candidates = [
            model.generate(prompt) for _ in range(config.rejection_sampling_n)
        ]
        # Score and keep the best
        scored = [(c, score_response(c)) for c in candidates]
        best = max(scored, key=lambda x: x[1])
        best_outputs.append({"prompt": prompt, "response": best[0]})

    model = sft_train(model, best_outputs, epochs=1)

    # Stage 4: Final alignment RL
    print("Stage 4: General alignment with helpfulness/safety rewards")
    # Mix verifiable rewards with general preference rewards
    model = final_alignment_rl(model)

    return model

4. Extensions Beyond Math and Code

While RLVR has been most successful in mathematics and coding, researchers are actively exploring extensions to other domains. The key challenge is constructing reliable verifiers for domains where correctness is less clearly defined.

⚠ Warning

Extending RLVR beyond math and code is challenging because most real-world tasks lack clean verifiable signals. A customer service response cannot be automatically graded as "correct" or "incorrect" in the same way a math solution can. Hybrid approaches that combine verifiable rewards for structured components (factual claims, format compliance) with learned rewards for subjective quality are a promising research direction.

| Domain | Verification Approach | Feasibility | Challenge |
| --- | --- | --- | --- |
| Fact-checking | Cross-reference with knowledge base | Medium | Knowledge base completeness |
| Translation | Round-trip translation consistency | Medium | Many valid translations exist |
| Structured output | Schema validation (JSON, XML) | High | Format correctness is not content correctness |
| Tool use | Execute tool call, check result | High | Side effects, environment setup |
| Scientific reasoning | Dimensional analysis, unit checking | Medium | Many steps resist automation |
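As a concrete instance of the "Structured output" row, and of the hybrid idea in the warning above, format compliance can be verified exactly even when content quality cannot. The sketch below is illustrative: `REQUIRED_KEYS` is an assumed schema, and `quality_score` is a stand-in for a learned reward model, not a real API.

```python
import json

REQUIRED_KEYS = {"name", "price", "in_stock"}  # assumed schema for illustration

def format_reward(output: str) -> float:
    """Binary verifiable reward: 1.0 iff output is JSON with the required keys."""
    try:
        obj = json.loads(output)
    except json.JSONDecodeError:
        return 0.0
    return 1.0 if isinstance(obj, dict) and REQUIRED_KEYS <= obj.keys() else 0.0

def hybrid_reward(output: str, quality_score: float, w_format: float = 0.5) -> float:
    """Mix the exact format check with a learned (subjective) quality score."""
    return w_format * format_reward(output) + (1 - w_format) * quality_score

good = '{"name": "widget", "price": 9.99, "in_stock": true}'
bad = '{"name": "widget"}'
print(format_reward(good))  # 1.0
print(format_reward(bad))   # 0.0
```

The verifiable component anchors the reward (a malformed output can never score well), while the learned component differentiates among outputs that all pass the check.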

5. The Open Reasoning Ecosystem

DeepSeek-R1's success with RLVR sparked rapid reproduction efforts across the open-source community. Several projects have demonstrated that the core approach can be replicated at smaller scales with impressive results.

# Using open reasoning models for inference
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load an open reasoning model
model_name = "Qwen/QwQ-32B"  # Open reasoning model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype="bfloat16", device_map="auto"
)

# Reasoning models produce extended thinking chains
prompt = """Solve this step by step:
A train travels from City A to City B at 60 km/h. The return trip
is made at 40 km/h. What is the average speed for the round trip?"""

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=2048,
    temperature=0.6,
    top_p=0.95,
)

response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
# The model will produce a detailed reasoning chain:
# "Let me think about this carefully...
#  Let the distance be d km.
#  Time for A to B: d/60 hours
#  Time for B to A: d/40 hours
#  Total distance: 2d
#  Total time: d/60 + d/40 = d(2+3)/120 = 5d/120 = d/24
#  Average speed: 2d / (d/24) = 48 km/h
#  Wait, let me verify: ..."
💡 Key Insight

The open reasoning ecosystem demonstrates a remarkable pattern: once a training recipe is published, the community can reproduce and extend it rapidly. Sky-T1 (from NovaSky) replicated core R1 results in under $500 of compute. Open-source reasoning models now match or exceed GPT-4o on mathematical reasoning benchmarks, showing that RLVR combined with sufficient base model quality and training data can achieve frontier performance without proprietary infrastructure.

5.1 Open Reasoning Models Comparison

| Model | Base Size | Training Method | MATH Score | AIME 2024 |
| --- | --- | --- | --- | --- |
| DeepSeek-R1 | 671B MoE | GRPO + RLVR (4 stages) | 97.3% | 79.8% |
| QwQ-32B | 32B | RLVR + SFT | 90.6% | 50.0% |
| DeepSeek-R1-Distill-32B | 32B (distilled) | SFT from R1 outputs | 94.3% | 72.6% |
| Sky-T1-32B | 32B | GRPO reproduction | 82.4% | 43.3% |
| OpenAI o1 | Unknown | Proprietary RL | 96.4% | 83.3% |
📝 Note

Distillation from reasoning models (like DeepSeek-R1-Distill) offers a practical shortcut: instead of running the full RLVR pipeline, you can fine-tune a smaller model on the reasoning traces produced by a larger one. This approach achieves strong results at a fraction of the training cost, though the distilled model may struggle on problems significantly harder than those in the training set.
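The data-construction step behind such distillation can be sketched as follows; the trace records and the `\boxed{}` answer convention are illustrative assumptions, not the actual DeepSeek recipe:

```python
import re

def extract_boxed(text: str):
    """Pull the last \\boxed{...} answer out of a reasoning trace."""
    matches = re.findall(r"\\boxed\{([^}]+)\}", text)
    return matches[-1] if matches else None

def build_distill_dataset(traces):
    """Keep only teacher traces whose final answer verifies against ground truth."""
    dataset = []
    for t in traces:
        answer = extract_boxed(t["trace"])
        if answer is not None and answer.strip() == t["ground_truth"].strip():
            dataset.append({"prompt": t["prompt"], "response": t["trace"]})
    return dataset

# Toy teacher outputs (hypothetical)
traces = [
    {"prompt": "2+2?", "trace": "2+2=4, so \\boxed{4}", "ground_truth": "4"},
    {"prompt": "2+3?", "trace": "2+3=6, so \\boxed{6}", "ground_truth": "5"},
]
data = build_distill_dataset(traces)
print(len(data))  # 1: only the verified-correct trace survives
```

The surviving pairs are then used for plain SFT on the smaller student model, so the verifier still gates data quality even though no RL is run.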

📝 Section Quiz

1. What makes RLVR fundamentally different from RLHF?
Answer:
RLVR uses objectively verifiable correctness signals (math answer checking, code test execution, proof verification) instead of a learned reward model trained on human preferences. This eliminates reward model noise, removes the need for human annotation, and provides a perfect reward signal for domains where correctness can be automatically checked.
2. Why is GRPO particularly well-suited for RLVR?
Answer:
GRPO generates a group of responses and normalizes rewards within the group, eliminating the need for a learned value network. With verifiable rewards, the group contains solutions that are definitively correct or incorrect. The group statistics provide a natural baseline: in a group of 16 math solutions where 4 are correct, the correct ones get positive advantage and incorrect ones get negative advantage, without any learned approximation.
3. What was the "aha moment" in DeepSeek-R1 training?
Answer:
During RLVR training, the model spontaneously began producing extended chain-of-thought reasoning, including self-correction and backtracking, despite never being trained on chain-of-thought examples. The RL objective (maximize correctness) provided sufficient signal for the model to discover that explicit step-by-step reasoning improves accuracy. This emergent behavior was not programmed but arose naturally from the optimization pressure.
4. Why is extending RLVR to open-ended tasks difficult?
Answer:
Most real-world tasks (customer service, creative writing, general conversation) lack objectively verifiable correctness signals. You cannot write a function that definitively grades whether a customer service response is "correct." Hybrid approaches combining verifiable components (fact-checking, format compliance) with learned reward models for subjective quality are a promising research direction.
5. How does distillation from reasoning models compare to full RLVR training?
Answer:
Distillation fine-tunes a smaller model on the reasoning traces of a larger RLVR-trained model. It achieves strong results (e.g., DeepSeek-R1-Distill-32B reaches 94.3% on MATH) at a fraction of the training cost. However, distilled models may generalize less well to problems harder than those in the training set, since they learn to imitate reasoning patterns rather than discovering them through RL optimization pressure.

✅ Key Takeaways