Thinking longer before answering is advice we give to students and, apparently, also to language models. The difference is that when a model thinks harder, it costs you actual money per second.
A Contemplative Inference Budget

The paradigm shift from train-time to test-time compute. For years, the dominant strategy for improving LLM performance was to scale up training: more parameters, more data, more GPU hours. A new paradigm has emerged that invests additional compute at inference time instead. Rather than building a larger model, you let the same model "think longer" on each problem. OpenAI's o1/o3, Google's Gemini "thinking" mode, and DeepSeek-R1 all embody this approach. The results are striking: on challenging math and coding benchmarks, test-time compute scaling can outperform models that are orders of magnitude larger. This section explores the techniques that make this possible and the economic trade-offs that determine when to think harder versus when to use a bigger model.
This section assumes familiarity with scaling laws from Section 6.3 and RLHF/alignment concepts from Section 6.1. Understanding of decoding strategies from Section 5.2 (sampling, temperature) helps with the inference-time scaling discussion.
1. Inference-Time Scaling: A New Scaling Law
The classical scaling laws for language models (Kaplan et al. 2020, Chinchilla 2022) describe how performance improves as you increase model parameters and training data. These laws operate at train time: you invest compute once during training and then amortize that cost across every inference request. The model's capabilities are fixed after training. (For the full derivation of compute-optimal scaling, see Section 6.3: Scaling Laws and Compute-Optimal Training.)
Inference-time scaling introduces a complementary dimension. Instead of a single forward pass per query, the model performs multiple reasoning steps, generates several candidate solutions, or searches through a tree of possibilities. The key insight is that some problems benefit far more from additional thinking than from additional parameters.
Consider a math competition problem. A 7B model with 100 attempts and a good verifier can outperform a 70B model with a single attempt. The 7B model's "effective capability" scales with the compute invested at inference time. This creates a new trade-off axis: for the same total compute budget, you can choose between a large model with few attempts or a small model with many attempts.
Research from DeepMind and others has formalized this as a compute-optimal inference problem. Given a fixed inference compute budget C, what combination of model size N and test-time compute T maximizes performance? The answer depends on task difficulty: easy tasks favor a single pass through a large model, while hard tasks favor repeated attempts with verification.
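To make the trade-off concrete, the sketch below applies the common rule of thumb that a forward pass costs roughly 2 × parameters FLOPs per generated token. The model sizes, token counts, and attempt counts are illustrative assumptions, not measurements; the point is that one pass of a large model and many passes of a small model can sit at the same compute budget, and which wins depends on task difficulty and the verifier.

# Back-of-the-envelope inference cost: ~2 * params FLOPs per generated token
def inference_flops(params: float, tokens_generated: int, attempts: int) -> float:
    return 2 * params * tokens_generated * attempts

# One attempt from a 70B model vs. ten attempts from a 7B model,
# each attempt generating ~1,000 tokens (illustrative numbers)
large_single = inference_flops(params=70e9, tokens_generated=1_000, attempts=1)
small_many = inference_flops(params=7e9, tokens_generated=1_000, attempts=10)
print(f"70B x 1 attempt:   {large_single:.2e} FLOPs")
print(f"7B  x 10 attempts: {small_many:.2e} FLOPs")  # same total budget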
2. Chain-of-Thought at Scale: o1/o3 and DeepSeek-R1
Chain-of-thought (CoT) prompting, introduced by Wei et al. (2022), showed that asking a model to "think step by step" improves performance on reasoning tasks. The reasoning model paradigm takes this idea to its extreme: instead of relying on a prompt to elicit reasoning, the model is trained to produce extended chains of thought as a fundamental part of its generation process.
2.1 OpenAI o1 and o3
OpenAI's o1 (released September 2024) was the first production reasoning model. Rather than immediately generating an answer, o1 produces a long internal chain of thought that can span hundreds or thousands of tokens. This chain of thought is hidden from the user; only the final answer is displayed. The key elements of the o1 approach include:
- Reinforcement learning for reasoning: The model is trained with RL to produce reasoning chains that lead to correct answers. The reward signal comes from verifying final answers against known solutions (for math and code) or from learned reward models (for open-ended tasks).
- Reasoning tokens: The model generates special "thinking" tokens that are not shown to the user. These tokens allow the model to explore different approaches, catch its own mistakes, and verify intermediate steps before committing to an answer.
- Adaptive compute: Harder problems naturally elicit longer chains of thought. A simple arithmetic question might require 50 reasoning tokens, while a complex proof might use 10,000+. The model allocates compute proportional to difficulty without explicit programming.
The successor model, o3 (released early 2025), demonstrated even more striking performance. On the ARC-AGI benchmark (a test of novel abstract reasoning), o3 achieved scores previously thought to require fundamental architectural breakthroughs, using only test-time compute scaling on top of a transformer backbone.
2.2 DeepSeek-R1: Open Reasoning
DeepSeek-R1, released in January 2025, brought open-weight reasoning models to the community. The training pipeline is notable for its simplicity and its revelations about emergent reasoning behavior:
- Pure RL on the base model (R1-Zero): DeepSeek first trained a model called R1-Zero using only reinforcement learning on the base DeepSeek-V3 model, with no supervised fine-tuning on reasoning traces. The model was given math problems and rewarded based on answer correctness. Remarkably, the model spontaneously developed reasoning behaviors: it learned to break problems into steps, verify its own work, and explore alternative approaches.
- Supervised fine-tuning on curated reasoning: To improve readability and consistency, DeepSeek then collected high-quality reasoning traces (both from R1-Zero and from human annotators) and performed supervised fine-tuning.
- Final RL stage: A second round of reinforcement learning further improved accuracy and alignment.
Key Insight: Reasoning from pure RL, zero human examples. DeepSeek-R1-Zero demonstrates that a language model trained with only reinforcement learning (no supervised chain-of-thought examples) spontaneously discovers structured reasoning behaviors, including self-verification and backtracking. This is arguably the most scientifically surprising finding in recent LLM research: complex reasoning can emerge from reward signals alone.
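The reward signal behind this emergence was rule-based rather than a learned reward model: an accuracy term (is the final answer correct?) plus a format term (is the reasoning wrapped in the expected thinking tags?). The sketch below illustrates the idea; the tag names, weights, and answer-matching logic are simplified assumptions, not DeepSeek's actual implementation.

# Illustrative rule-based reward in the spirit of R1-Zero (weights assumed)
import re

def rule_based_reward(response: str, ground_truth: str) -> float:
    # Format term: reasoning should appear inside <think>...</think> tags
    format_ok = bool(re.search(r"<think>.*?</think>", response, re.DOTALL))
    # Accuracy term: check the visible answer (outside the thinking block)
    answer_part = re.sub(r"<think>.*?</think>", "", response, flags=re.DOTALL)
    answer_ok = ground_truth.strip() in answer_part
    return 1.0 * answer_ok + 0.1 * format_ok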
2.2.1 GRPO: The Algorithm Behind R1
The reinforcement learning algorithm that enabled R1's emergent reasoning is Group Relative Policy Optimization (GRPO). Unlike standard PPO, which trains a separate value (critic) model, GRPO eliminates the critic entirely. Instead, for each prompt, the model generates a group of G candidate responses. Each response receives a reward score (e.g., whether the math answer is correct). The policy gradient is then computed using a relative advantage: responses that score above the group mean receive positive updates, and those below receive negative updates.
Formally, for a group of G responses to the same prompt with rewards r_1, ..., r_G, GRPO computes the advantage for each response i as:

A_i = (r_i - mean(r_1, ..., r_G)) / std(r_1, ..., r_G)
This approach has two key advantages: (1) it eliminates the need to train and maintain a separate value model, which for large LLMs would require significant additional memory and compute, and (2) the relative comparison within a group naturally normalizes the reward signal, making training more stable across problems of varying difficulty. GRPO is a key reason DeepSeek was able to train R1 at a fraction of the cost that standard PPO with a full critic model would require.
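A minimal NumPy sketch of the group-relative advantage computation described above (the small epsilon for numerical stability is an added assumption):

# Group-relative advantages: reward z-scores within one prompt's group
import numpy as np

def grpo_advantages(rewards, eps=1e-8):
    rewards = np.asarray(rewards, dtype=np.float64)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: 6 sampled responses to one prompt, reward 1 if correct else 0
print(grpo_advantages([1, 0, 0, 1, 1, 0]))
# Correct responses get positive advantages, incorrect ones negative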
The DeepSeek R1 paper is notable for both its technical contributions and its transparency. It demonstrates that (1) reasoning emerges from pure RL without supervised examples, (2) GRPO provides an efficient alternative to PPO for LLM training, and (3) reasoning can be distilled into models as small as 1.5B parameters. The distilled model family provides an unprecedented opportunity: reasoning-capable models that run on consumer hardware, trained using methods that are fully documented and reproducible.
2.3 Distilling Reasoning into Smaller Models
One of the most practical contributions of DeepSeek-R1 was demonstrating that reasoning capabilities can be distilled into much smaller models. By using R1's reasoning traces as training data, the team created distilled versions at 1.5B, 7B, 8B, 14B, 32B, and 70B parameters. Even the 7B distilled model showed substantial improvements on math reasoning benchmarks compared to the base model of the same size. This makes reasoning-enhanced models practical for local deployment.
Resource requirements: The DeepSeek-R1-Distill-Qwen-7B model requires approximately 14 GB of disk space (BF16 weights) and at least 16 GB of GPU memory for inference. On first run, the model will be downloaded from Hugging Face Hub.
# Using a distilled reasoning model via Hugging Face
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
model_name = "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
model_name,
torch_dtype=torch.bfloat16,
device_map="auto",
)
# The model naturally produces chain-of-thought reasoning
prompt = """A farmer has 17 sheep. All but 9 run away. How many sheep
does the farmer have left? Let me think through this carefully."""
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(
**inputs,
max_new_tokens=512,
temperature=0.6,
do_sample=True,
)
print(tokenizer.decode(output[0], skip_special_tokens=True))
3. Process Reward Models vs. Outcome Reward Models
A critical component of test-time compute scaling is the reward model that evaluates generated solutions. There are two fundamentally different approaches to providing reward signal for reasoning tasks.
3.1 Outcome Reward Models (ORMs)
An ORM evaluates only the final answer. Given a problem and a complete solution, the ORM assigns a score based on whether the answer is correct. ORMs are simple to train (you only need problem-answer pairs) but provide sparse feedback. A long chain of reasoning that makes one error in step 3 of 20 receives the same low score as a chain that is wrong from the first step. The model cannot learn which specific step went wrong.
3.2 Process Reward Models (PRMs)
A PRM evaluates each step of a reasoning chain individually. After each step, the PRM assigns a score indicating whether that step is logically valid, regardless of the final answer. This provides dense, step-level feedback that enables much more efficient search and training.
ORM: R(problem, full_solution) → a single score for the complete solution
PRM: R(problem, step_1, step_2, ..., step_k) → score_k for each k
OpenAI's research on PRMs (published in "Let's Verify Step by Step," 2023) demonstrated that process supervision significantly outperforms outcome supervision for math reasoning. When used to select among multiple solution attempts (best-of-N), a PRM-guided approach solves substantially more problems than an ORM-guided approach with the same compute budget.
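A practical detail is how per-step PRM scores become a single number for ranking complete solutions. The sketch below aggregates with the minimum step score (a solution is only as good as its weakest step); taking the product of step scores is a common alternative. The prm_score_step callable is an assumed placeholder for a trained PRM, not a specific library API.

# PRM-guided best-of-N: rank candidates by their weakest reasoning step
def prm_solution_score(prm_score_step, problem, steps):
    step_scores = [prm_score_step(problem, steps[:k + 1]) for k in range(len(steps))]
    return min(step_scores)

def prm_best_of_n(prm_score_step, problem, candidate_solutions):
    # Each candidate is a list of reasoning-step strings
    return max(
        candidate_solutions,
        key=lambda steps: prm_solution_score(prm_score_step, problem, steps),
    )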
3.3 Training a Process Reward Model
Training a PRM requires step-level labels, which are more expensive to collect than outcome labels. There are several strategies:
- Human annotation: Mathematicians label each reasoning step as correct or incorrect (expensive but high quality)
- Monte Carlo estimation: For each partial solution up to step k, sample many completions and estimate step correctness based on the fraction that lead to a correct final answer
- Automated verification: For domains with formal verification (math, code), steps can be checked automatically using symbolic solvers or test suites
# Conceptual: Monte Carlo estimation for PRM training data
import numpy as np
def estimate_step_correctness(model, problem, partial_solution, n_completions=64):
"""
Estimate correctness of a partial solution by sampling completions.
If many completions from this point reach the right answer,
the steps so far are likely correct.
"""
correct_count = 0
for _ in range(n_completions):
# Sample a completion from this partial solution
completion = model.generate(
prompt=problem + partial_solution,
temperature=0.8,
max_tokens=512,
)
# Check if the final answer is correct
if verify_answer(problem, completion):
correct_count += 1
# Step correctness ~ fraction of completions that succeed
return correct_count / n_completions
# Build PRM training data for a multi-step solution
def build_prm_labels(model, problem, solution_steps, ground_truth):
labels = []
for k in range(1, len(solution_steps) + 1):
partial = "\n".join(solution_steps[:k])
score = estimate_step_correctness(model, problem, partial)
labels.append({
"step": k,
"text": solution_steps[k - 1],
"correctness_score": score,
})
return labels
4. Best-of-N Sampling with Reward-Guided Selection
The simplest form of test-time compute scaling is best-of-N sampling: generate N independent solutions to a problem, score each with a reward model, and return the highest-scoring solution. Despite its simplicity, this approach is surprisingly effective.
4.1 The Mechanics
Given a problem x, a language model M, and a reward model R:
- Sample N candidate solutions: y_1, y_2, ..., y_N ~ M(x)
- Score each candidate: r_i = R(x, y_i)
- Return the best: y* = argmax_i r_i
The performance of best-of-N scales logarithmically with N. Doubling the number of samples provides a diminishing but consistent improvement. For math reasoning with a good PRM, going from N=1 to N=64 can improve accuracy from roughly 50% to 80% on challenging benchmarks.
# Best-of-N sampling with reward model scoring
import torch
from transformers import AutoModelForSequenceClassification
def best_of_n(
generator_model,
reward_model,
tokenizer,
problem: str,
n: int = 16,
temperature: float = 0.7,
):
"""
Generate N candidate solutions, score with reward model,
return the highest-scoring one.
"""
candidates = []
# Step 1: Generate N diverse solutions
for i in range(n):
inputs = tokenizer(problem, return_tensors="pt").to(
generator_model.device
)
output = generator_model.generate(
**inputs,
max_new_tokens=1024,
temperature=temperature,
do_sample=True,
top_p=0.95,
)
solution = tokenizer.decode(
output[0][inputs["input_ids"].shape[1]:],
skip_special_tokens=True,
)
candidates.append(solution)
# Step 2: Score each candidate with the reward model
scores = []
for solution in candidates:
reward_input = tokenizer(
f"Problem: {problem}\nSolution: {solution}",
return_tensors="pt",
truncation=True,
max_length=2048,
).to(reward_model.device)
with torch.no_grad():
reward = reward_model(**reward_input).logits.squeeze()
scores.append(reward.item())
# Step 3: Return the best candidate
best_idx = max(range(n), key=lambda i: scores[i])
return candidates[best_idx], scores[best_idx]
# Example usage
# best_solution, score = best_of_n(
# generator, reward_model, tokenizer,
# "Prove that the square root of 2 is irrational.",
# n=32
# )
Best-of-N is embarrassingly parallel: all N candidates can be generated simultaneously if you have the GPU capacity. However, the total compute scales linearly with N while the benefit scales logarithmically. For production systems, values of N between 8 and 64 typically provide the best cost-benefit ratio. Beyond N=256, the marginal improvement rarely justifies the cost.
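As a practical aside, the generation loop in best_of_n above can usually be collapsed into a single batched call: Hugging Face's generate accepts num_return_sequences, which draws several samples at once. A minimal sketch, reusing the generator_model, tokenizer, and problem variables from the function above:

# Draw all N candidates in one batched generate call instead of a Python loop
inputs = tokenizer(problem, return_tensors="pt").to(generator_model.device)
outputs = generator_model.generate(
    **inputs,
    max_new_tokens=1024,
    do_sample=True,
    temperature=0.7,
    top_p=0.95,
    num_return_sequences=16,  # N candidates from a single call
)
candidates = [
    tokenizer.decode(seq[inputs["input_ids"].shape[1]:], skip_special_tokens=True)
    for seq in outputs
]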
5. Monte Carlo Tree Search for Language
While best-of-N treats each solution as an independent sample, tree search methods structure the generation process as an explicit search over a tree of partial solutions. Each node represents a partial solution (reasoning up to step k), and edges represent possible next steps. This allows the search to prune unpromising branches early and focus compute on the most promising directions.
5.1 MCTS Fundamentals for Language
Monte Carlo Tree Search (MCTS), famously used in AlphaGo, operates in four phases that repeat iteratively:
- Selection: Starting from the root, traverse the tree by selecting the child node with the highest UCB (Upper Confidence Bound) score, which balances exploitation (high reward) with exploration (less visited nodes)
- Expansion: At a leaf node, generate one or more new child nodes by sampling possible next reasoning steps from the language model
- Simulation (rollout): From the new node, complete the solution to the end (using fast, possibly lower-quality generation) and evaluate the outcome
- Backpropagation: Update the visit counts and value estimates for all nodes along the path from root to the expanded node
The selection step uses the standard UCT rule:

UCB(node) = V / N + c * sqrt(ln(N_parent) / N)

where V is the node's cumulative value, N is its visit count, N_parent is the parent's visit count, and c is an exploration constant.
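A minimal sketch of one MCTS iteration over reasoning steps is shown below. The propose_steps callable (an LLM proposing candidate next steps) and rollout_value callable (a cheap completion plus outcome scoring) are assumed placeholders rather than a specific framework's API.

# Skeleton of one MCTS iteration for step-by-step reasoning
import math
import random

class Node:
    def __init__(self, state, parent=None):
        self.state = state        # reasoning text so far
        self.parent = parent
        self.children = []
        self.visits = 0
        self.value_sum = 0.0

    def ucb(self, c=1.4):
        if self.visits == 0:
            return float("inf")   # always try unvisited children first
        exploit = self.value_sum / self.visits
        explore = c * math.sqrt(math.log(self.parent.visits) / self.visits)
        return exploit + explore

def mcts_iteration(root, propose_steps, rollout_value):
    # 1. Selection: descend by highest UCB until reaching a leaf
    node = root
    while node.children:
        node = max(node.children, key=lambda n: n.ucb())
    # 2. Expansion: sample candidate next reasoning steps from the LLM
    for step in propose_steps(node.state):
        node.children.append(Node(node.state + "\n" + step, parent=node))
    child = random.choice(node.children) if node.children else node
    # 3. Simulation: complete the solution cheaply and score the outcome
    value = rollout_value(child.state)
    # 4. Backpropagation: update statistics along the path back to the root
    while child is not None:
        child.visits += 1
        child.value_sum += value
        child = child.parent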
5.2 LATS: Language Agent Tree Search
LATS (Language Agent Tree Search) adapts MCTS specifically for LLM reasoning. The key modifications for the language domain include:
- LLM as policy and value function: Instead of a separately trained value network (as in AlphaGo), LATS uses the LLM itself to both generate candidate steps (policy) and evaluate partial solutions (value). This avoids the need for separate value model training.
- Natural language state representation: Each tree node contains the reasoning text so far, making it possible to leverage the full context window of the LLM for each expansion.
- Reflection-based backpropagation: When a rollout fails, the LLM generates a reflection on what went wrong, and this information is stored in the tree to guide future expansions away from similar errors.
5.3 AlphaProof and Mathematical Reasoning
Google DeepMind's AlphaProof (2024) applied tree search to formal mathematical theorem proving. The system combines a language model (to propose proof steps in Lean 4) with MCTS (to search over the proof tree). Together with its companion system AlphaGeometry 2, it solved four of six problems from the 2024 International Mathematical Olympiad, reaching the silver-medal threshold. This is among the strongest results to date in automated mathematical reasoning and demonstrates the power of combining language models with structured search.
6. Compute-Optimal Inference
The central practical question for test-time compute is: when should you think longer versus use a bigger model? Research from Snell et al. (2024) provides a framework for answering this question.
| Task Difficulty | Optimal Strategy | Rationale |
|---|---|---|
| Easy (model already >90% accurate) | Single pass, larger model | Test-time compute has diminishing returns; most samples are already correct |
| Medium (50-90% accuracy) | Best-of-N with moderate N | Sampling diversity helps; N=8 to 32 is typically sufficient |
| Hard (<50% accuracy) | Tree search or heavy best-of-N | Structured search can find solutions in a vast space; N=64+ or MCTS |
| Beyond model capability | Larger model required | No amount of test-time compute helps if the model lacks necessary knowledge |
A smaller model with test-time compute scaling can match the performance of a 14x larger model on hard reasoning tasks. For example, a Llama 8B with best-of-N (N=256) and a PRM verifier can match Llama 70B on GSM8K math problems, while using similar total FLOPs. The key requirement is a good verifier: without a reliable reward model, additional sampling does not help because you cannot distinguish good solutions from bad ones.
Extended chain-of-thought reasoning has a dual relationship with hallucination. On one hand, visible reasoning traces make errors more transparent: a user can inspect the chain and identify where the logic went wrong. On the other hand, long reasoning chains can make hallucinations more convincing: a model that produces a coherent, step-by-step argument for an incorrect conclusion is harder to fact-check than one that simply states a wrong answer. This is sometimes called "faithful but wrong" reasoning. Practitioners should treat extended reasoning traces as evidence that requires verification, not as proof of correctness.
# Compute-optimal inference: choosing strategy based on difficulty
def compute_optimal_inference(
problem,
easy_model, # Small, fast model (e.g., 8B)
hard_model, # Large, expensive model (e.g., 70B)
reward_model, # For scoring candidate solutions
difficulty_estimator, # Predicts problem difficulty
):
"""
Route to different inference strategies based on
estimated problem difficulty.
"""
difficulty = difficulty_estimator(problem)
if difficulty < 0.3:
# Easy: single pass with the small model is sufficient
return generate_single(easy_model, problem)
elif difficulty < 0.7:
# Medium: best-of-N with small model
candidates = [
generate_single(easy_model, problem)
for _ in range(16)
]
scores = [reward_model.score(problem, c) for c in candidates]
return candidates[scores.index(max(scores))]
elif difficulty < 0.9:
# Hard: best-of-N with large model
candidates = [
generate_single(hard_model, problem)
for _ in range(8)
]
scores = [reward_model.score(problem, c) for c in candidates]
return candidates[scores.index(max(scores))]
else:
# Very hard: tree search with large model
return mcts_search(hard_model, reward_model, problem,
n_iterations=200)
7. Lab: Implementing Best-of-N with a Reward Model
In this lab exercise, we implement best-of-N sampling for math reasoning and measure how accuracy scales with the number of samples.
# Lab: Measure accuracy vs. N on GSM8K math problems
import json
import re
import random
def extract_answer(text):
"""Extract the final numerical answer from a solution."""
# Look for patterns like "the answer is 42" or "#### 42"
patterns = [
r"####\s*(-?[\d,]+)",
r"answer\s+is\s+(-?[\d,]+)",
r"=\s*(-?[\d,]+)\s*$",
]
for pattern in patterns:
match = re.search(pattern, text, re.IGNORECASE | re.MULTILINE)
if match:
return match.group(1).replace(",", "")
return None
def evaluate_best_of_n(model, tokenizer, problems, ns=[1, 4, 8, 16, 32, 64]):
"""
Evaluate best-of-N at various N values.
Uses majority voting as a simple reward signal.
"""
results = {n: 0 for n in ns}
max_n = max(ns)
for problem in problems:
ground_truth = problem["answer"]
# Generate max_n solutions
solutions = []
for _ in range(max_n):
prompt = f"Solve step by step:\n{problem['question']}\n\n"
output = model.generate(prompt, temperature=0.7)
answer = extract_answer(output)
solutions.append(answer)
# Evaluate for each N using majority voting
for n in ns:
subset = solutions[:n]
# Majority vote among the N samples
vote_counts = {}
for ans in subset:
if ans is not None:
vote_counts[ans] = vote_counts.get(ans, 0) + 1
if vote_counts:
majority = max(vote_counts, key=vote_counts.get)
if majority == ground_truth:
results[n] += 1
# Convert to accuracy
total = len(problems)
for n in ns:
accuracy = results[n] / total * 100
print(f"N={n:3d}: {accuracy:.1f}% accuracy")
return results
# Expected output (illustrative):
# N= 1: 52.3% accuracy
# N= 4: 64.1% accuracy
# N= 8: 70.5% accuracy
# N= 16: 75.8% accuracy
# N= 32: 79.2% accuracy
# N= 64: 81.6% accuracy
The lab above uses majority voting (selecting the most common answer among N candidates) as a simple verification strategy. This works well when the model is more likely to produce the correct answer than any specific wrong answer. For harder problems, a trained reward model (ORM or PRM) provides significantly better selection. The accuracy gains from N=1 to N=64 would be even larger with PRM-guided selection.
Key Takeaways
- Test-time compute scaling is a new paradigm complementing train-time scaling. Instead of building larger models, you let the same model think longer on harder problems, allocating compute proportional to difficulty.
- Reasoning models (o1/o3, DeepSeek-R1) are trained with reinforcement learning to produce extended chains of thought. DeepSeek-R1-Zero showed that this reasoning behavior can emerge from pure RL without any supervised reasoning examples.
- Process Reward Models (PRMs) provide step-level feedback that significantly outperforms outcome-only reward models (ORMs) for guiding search and selection in reasoning tasks.
- Best-of-N sampling is the simplest test-time scaling method: generate N candidates and select the best. Performance scales logarithmically with N, and the approach is embarrassingly parallel.
- Monte Carlo Tree Search (MCTS) brings structured search to language generation, enabling focused exploration of promising reasoning paths while pruning dead ends. LATS and AlphaProof demonstrate the power of this approach.
- Compute-optimal inference depends on task difficulty. Easy tasks favor a single pass through a large model; hard tasks favor many attempts with a smaller model and a good verifier. A smaller model with test-time compute can match a 14x larger model on hard reasoning tasks.
What comes next: Having surveyed the cutting-edge reasoning capabilities of modern LLMs, we now ask a critical question: who benefits from these advances? Section 7.4 examines the multilingual landscape, revealing how tokenizer design, training data composition, and evaluation methodology determine whether 6.5 billion non-English speakers receive equitable access to language model capabilities.