Module 07 · Section 7.3

Reasoning Models & Test-Time Compute

o1/o3, DeepSeek-R1, best-of-N sampling, MCTS for language, and the economics of thinking longer

Thinking longer before answering is advice we give to students and, apparently, also to language models. The difference is that when a model thinks harder, it costs you actual money per second.

A Contemplative Inference Budget
★ Big Picture

The paradigm shift from train-time to test-time compute. For years, the dominant strategy for improving LLM performance was to scale up training: more parameters, more data, more GPU hours. A new paradigm has emerged that invests additional compute at inference time instead. Rather than building a larger model, you let the same model "think longer" on each problem. OpenAI's o1/o3, Google's Gemini "thinking" mode, and DeepSeek-R1 all embody this approach. The results are striking: on challenging math and coding benchmarks, test-time compute scaling can outperform models that are orders of magnitude larger. This section explores the techniques that make this possible and the economic trade-offs that determine when to think harder versus when to use a bigger model.

⚙ Prerequisites

This section assumes familiarity with scaling laws from Section 6.3 and RLHF/alignment concepts from Section 6.1. Understanding of decoding strategies from Section 5.2 (sampling, temperature) helps with the inference-time scaling discussion.

1. Inference-Time Scaling: A New Scaling Law

The classical scaling laws for language models (Kaplan et al. 2020, Chinchilla 2022) describe how performance improves as you increase model parameters and training data. These laws operate at train time: you invest compute once during training and then amortize that cost across every inference request. The model's capabilities are fixed after training. (For the full derivation of compute-optimal scaling, see Section 6.3: Scaling Laws and Compute-Optimal Training.)

Inference-time scaling introduces a complementary dimension. Instead of a single forward pass per query, the model performs multiple reasoning steps, generates several candidate solutions, or searches through a tree of possibilities. The key insight is that some problems benefit far more from additional thinking than from additional parameters.

Key Insight

Consider a math competition problem. A 7B model with 100 attempts and a good verifier can outperform a 70B model with a single attempt. The 7B model's "effective capability" scales with the compute invested at inference time. This creates a new trade-off axis: for the same total compute budget, you can choose between a large model with few attempts or a small model with many attempts.
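The arithmetic behind this claim is easy to check. If each independent attempt succeeds with probability p and a verifier can reliably pick out a correct attempt, the chance that at least one of N attempts succeeds is 1 − (1 − p)^N. A minimal sketch (idealized: real attempts are correlated and real verifiers are imperfect):

```python
# Coverage of best-of-N under a perfect verifier: if a single attempt
# succeeds with probability p, at least one of N independent attempts
# succeeds with probability 1 - (1 - p)^N.
def coverage(p: float, n: int) -> float:
    return 1.0 - (1.0 - p) ** n

# A weak model (p = 0.05 per attempt) given 100 attempts...
print(f"small model, 100 attempts: {coverage(0.05, 100):.3f}")  # 0.994
# ...can exceed a stronger model (p = 0.60) given one attempt.
print(f"large model, 1 attempt:    {coverage(0.60, 1):.3f}")    # 0.600
```

The per-attempt probabilities here are hypothetical, chosen only to illustrate why many cheap attempts plus a good verifier can beat one expensive attempt.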

Research from DeepMind and others has formalized this as a compute-optimal inference problem. Given a fixed inference compute budget C, what combination of model size N and test-time compute T maximizes performance? The answer depends on task difficulty: easy tasks favor a single pass through a large model, while hard tasks favor repeated attempts with verification.

Two Dimensions of Scaling (contents of Figure 7.3.1). Train-time scaling: more parameters (7B → 70B → 405B); more training data (1T → 15T tokens); more training compute (GPUs × time); fixed cost amortized over all queries; capabilities locked after training; diminishing returns at the frontier (Kaplan 2020, Chinchilla 2022). Test-time scaling: chain-of-thought reasoning (o1, R1); best-of-N sampling + verification; tree search (MCTS, beam search); per-query cost that scales with difficulty; adaptive, so easy tasks use less compute; strong gains on hard reasoning tasks (o1 2024, DeepSeek-R1 2025).
Figure 7.3.1: Train-time scaling invests compute once during training; test-time scaling invests additional compute per query at inference.

2. Chain-of-Thought at Scale: o1/o3 and DeepSeek-R1

Chain-of-thought (CoT) prompting, introduced by Wei et al. (2022), showed that asking a model to "think step by step" improves performance on reasoning tasks. The reasoning model paradigm takes this idea to its extreme: instead of relying on a prompt to elicit reasoning, the model is trained to produce extended chains of thought as a fundamental part of its generation process.

2.1 OpenAI o1 and o3

OpenAI's o1 (released September 2024) was the first production reasoning model. Rather than immediately generating an answer, o1 produces a long internal chain of thought that can span hundreds or thousands of tokens. This chain of thought is hidden from the user; only the final answer (and, in some interfaces, a summary of the reasoning) is displayed. OpenAI has published few training details, but its reported results show two consistent trends: accuracy improves both with more reinforcement learning on reasoning (train-time compute) and with more tokens spent thinking per query (test-time compute).

The successor model, o3 (released early 2025), demonstrated even more striking performance. On the ARC-AGI benchmark (a test of novel abstract reasoning), o3 achieved scores previously thought to require fundamental architectural breakthroughs, using only test-time compute scaling on top of a transformer backbone.

2.2 DeepSeek-R1: Open Reasoning

DeepSeek-R1, released in January 2025, brought open-weight reasoning models to the community. The training pipeline is notable for its simplicity and its revelations about emergent reasoning behavior:

  1. Pure RL from the base model (R1-Zero): DeepSeek first trained a model called R1-Zero using only reinforcement learning on the base DeepSeek-V3 model, with no supervised fine-tuning on reasoning traces. The model was given math problems and rewarded based on answer correctness. Remarkably, the model spontaneously developed reasoning behaviors: it learned to break problems into steps, verify its own work, and explore alternative approaches.
  2. Supervised fine-tuning on curated reasoning: To improve readability and consistency, DeepSeek then collected high-quality reasoning traces (both from R1-Zero and from human annotators) and performed supervised fine-tuning.
  3. Final RL stage: A second round of reinforcement learning further improved accuracy and alignment.
⚡ Key Insight

Reasoning from pure RL, zero human examples. DeepSeek-R1-Zero demonstrates that a language model trained with only reinforcement learning (no supervised chain-of-thought examples) spontaneously discovers structured reasoning behaviors, including self-verification and backtracking. This is arguably the most scientifically surprising finding in recent LLM research: complex reasoning can emerge from reward signals alone.

2.2.1 GRPO: The Algorithm Behind R1

The reinforcement learning algorithm that enabled R1's emergent reasoning is Group Relative Policy Optimization (GRPO). Unlike standard PPO, which trains a separate value (critic) model, GRPO eliminates the critic entirely. Instead, for each prompt, the model generates a group of G candidate responses. Each response receives a reward score (e.g., whether the math answer is correct). The policy gradient is then computed using relative advantage: responses that score above the group mean receive positive updates, and those below receive negative updates.

Formally, for a group of G responses to the same prompt, GRPO computes the advantage for each response i as:

A_i = (r_i − mean(r_1, ..., r_G)) / std(r_1, ..., r_G)

This approach has two key advantages: (1) it eliminates the need to train and maintain a separate value model, which for large LLMs would require significant additional memory and compute, and (2) the relative comparison within a group naturally normalizes the reward signal, making training more stable across problems of varying difficulty. GRPO is the reason DeepSeek was able to train R1 at a fraction of the cost that would be required for standard PPO with a full critic model.
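The advantage computation itself is only a few lines. A minimal sketch (illustrative, not DeepSeek's implementation; it uses the population standard deviation over the group):

```python
from statistics import mean, pstdev

def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages: z-score each reward within its group.
    Responses above the group mean get positive advantages (reinforced),
    those below get negative ones (discouraged). No critic model needed."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # spread of rewards within this group
    return [(r - mu) / (sigma + eps) for r in rewards]

# G = 4 sampled responses to one prompt; reward 1.0 = correct answer
adv = grpo_advantages([1.0, 0.0, 0.0, 1.0])
print(adv)  # approximately [1.0, -1.0, -1.0, 1.0]
```

Note how the normalization makes the update scale comparable whether the prompt was easy (many correct responses) or hard (few correct responses), which is the stability property described above.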

📰 Paper Spotlight: DeepSeek R1 (DeepSeek-AI, 2025)

The DeepSeek R1 paper is notable for both its technical contributions and its transparency. It demonstrates that (1) reasoning emerges from pure RL without supervised examples, (2) GRPO provides an efficient alternative to PPO for LLM training, and (3) reasoning can be distilled into models as small as 1.5B parameters. The distilled model family provides an unprecedented opportunity: reasoning-capable models that run on consumer hardware, trained using methods that are fully documented and reproducible.

2.3 Distilling Reasoning into Smaller Models

One of the most practical contributions of DeepSeek-R1 was demonstrating that reasoning capabilities can be distilled into much smaller models. By using R1's reasoning traces as training data, the team created distilled versions at 1.5B, 7B, 8B, 14B, 32B, and 70B parameters. Even the 7B distilled model showed substantial improvements on math reasoning benchmarks compared to the base model of the same size. This makes reasoning-enhanced models practical for local deployment.

ⓘ Note

Resource requirements: The DeepSeek-R1-Distill-Qwen-7B model requires approximately 14 GB of disk space (BF16 weights) and at least 16 GB of GPU memory for inference. On first run, the model will be downloaded from Hugging Face Hub.

# Using a distilled reasoning model via Hugging Face
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_name = "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# The model naturally produces chain-of-thought reasoning
prompt = """A farmer has 17 sheep. All but 9 run away. How many sheep
does the farmer have left? Let me think through this carefully."""

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(
    **inputs,
    max_new_tokens=512,
    temperature=0.6,
    do_sample=True,
)
print(tokenizer.decode(output[0], skip_special_tokens=True))

3. Process Reward Models vs. Outcome Reward Models

A critical component of test-time compute scaling is the reward model that evaluates generated solutions. There are two fundamentally different approaches to providing reward signal for reasoning tasks.

3.1 Outcome Reward Models (ORMs)

An ORM evaluates only the final answer. Given a problem and a complete solution, the ORM assigns a score based on whether the answer is correct. ORMs are simple to train (you only need problem-answer pairs) but provide sparse feedback. A long chain of reasoning that makes one error in step 3 of 20 receives the same low score as a chain that is wrong from the first step. The model cannot learn which specific step went wrong.

3.2 Process Reward Models (PRMs)

A PRM evaluates each step of a reasoning chain individually. After each step, the PRM assigns a score indicating whether that step is logically valid, regardless of the final answer. This provides dense, step-level feedback that enables much more efficient search and training.

ORM: R(problem, full_solution) → score
PRM: R(problem, step_1, step_2, ..., step_k) → score_k for each k
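At ranking time, a PRM's per-step scores must still be reduced to one solution-level score. Two reductions that appear in the PRM literature are the minimum over steps (a chain is only as sound as its weakest step) and the product of step scores. A minimal sketch, using hypothetical step scores:

```python
import math

def orm_score(final_answer_correct: bool) -> float:
    # ORM: one sparse signal for the entire solution
    return 1.0 if final_answer_correct else 0.0

def prm_score(step_scores, reduction="min"):
    # PRM: dense per-step scores, reduced to a solution-level score.
    # "min" says a chain is only as good as its weakest step;
    # "prod" treats steps like independent correctness probabilities.
    if reduction == "min":
        return min(step_scores)
    return math.prod(step_scores)

# A chain with a flawed third step (scores are illustrative)
steps = [0.95, 0.91, 0.12]
print(prm_score(steps))          # 0.12 -> the weak step dominates
print(prm_score(steps, "prod"))  # ~0.104
```

Either reduction drags down any solution containing a bad step, which is exactly the signal an ORM cannot provide.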

OpenAI's research on PRMs (published in "Let's Verify Step by Step," 2023) demonstrated that process supervision significantly outperforms outcome supervision for math reasoning. When used to select among multiple solution attempts (best-of-N), a PRM-guided approach solves substantially more problems than an ORM-guided approach with the same compute budget.

Outcome vs. Process Reward Models (contents of Figure 7.3.2). With an ORM, a chain whose first two steps are valid but whose Step 3 is wrong propagates the error silently and receives a single score of 0.0, with no step-level signal. With a PRM, the same chain scores +0.95 and +0.91 on Steps 1 and 2 but only +0.12 on Step 3: the error is detected immediately, the model can backtrack and retry, and the dense signal enables targeted correction. In best-of-N selection for math reasoning, an ORM selects by final answer only while a PRM selects by reasoning quality; PRM-guided selection solves roughly 15% more problems with the same compute budget.
Figure 7.3.2: ORMs provide only a binary signal on the final answer; PRMs score each reasoning step, enabling targeted backtracking and better solution selection.

3.3 Training a Process Reward Model

Training a PRM requires step-level labels, which are more expensive to collect than outcome labels. There are two main strategies: human annotation of individual steps (the approach behind OpenAI's PRM800K dataset, where annotators labeled every step of candidate math solutions) and automatic labeling via Monte Carlo rollouts, where a step is scored by how often completions sampled from that point reach the correct final answer. The conceptual code below sketches the Monte Carlo approach:

# Conceptual: Monte Carlo estimation for PRM training data
# (assumes `model.generate` and `verify_answer` helpers are defined)

def estimate_step_correctness(model, problem, partial_solution,
                              ground_truth, n_completions=64):
    """
    Estimate correctness of a partial solution by sampling completions.
    If many completions from this point reach the right answer,
    the steps so far are likely correct.
    """
    correct_count = 0
    for _ in range(n_completions):
        # Sample a completion from this partial solution
        completion = model.generate(
            prompt=problem + partial_solution,
            temperature=0.8,
            max_tokens=512,
        )
        # Check the completed solution against the known answer
        if verify_answer(completion, ground_truth):
            correct_count += 1

    # Step correctness ~ fraction of completions that succeed
    return correct_count / n_completions

# Build PRM training data for a multi-step solution
def build_prm_labels(model, problem, solution_steps, ground_truth):
    labels = []
    for k in range(1, len(solution_steps) + 1):
        partial = "\n".join(solution_steps[:k])
        score = estimate_step_correctness(model, problem, partial,
                                          ground_truth)
        labels.append({
            "step": k,
            "text": solution_steps[k - 1],
            "correctness_score": score,
        })
    return labels

4. Best-of-N Sampling with Reward-Guided Selection

The simplest form of test-time compute scaling is best-of-N sampling: generate N independent solutions to a problem, score each with a reward model, and return the highest-scoring solution. Despite its simplicity, this approach is surprisingly effective.

4.1 The Mechanics

Given a problem x, a language model M, and a reward model R:

  1. Sample N candidate solutions: y_1, y_2, ..., y_N ~ M(x)
  2. Score each candidate: r_i = R(x, y_i)
  3. Return the best: y* = argmax_i r_i

The performance of best-of-N scales roughly logarithmically with N. Doubling the number of samples provides a diminishing but consistent improvement. For math reasoning with a good PRM, going from N=1 to N=64 can improve accuracy from roughly 50% to 80% on challenging benchmarks.

# Best-of-N sampling with reward model scoring
# (the reward model is assumed to be a sequence classifier with a scalar head)
import torch

def best_of_n(
    generator_model,
    reward_model,
    tokenizer,
    problem: str,
    n: int = 16,
    temperature: float = 0.7,
):
    """
    Generate N candidate solutions, score with reward model,
    return the highest-scoring one.
    """
    candidates = []

    # Step 1: Generate N diverse solutions
    for i in range(n):
        inputs = tokenizer(problem, return_tensors="pt").to(
            generator_model.device
        )
        output = generator_model.generate(
            **inputs,
            max_new_tokens=1024,
            temperature=temperature,
            do_sample=True,
            top_p=0.95,
        )
        solution = tokenizer.decode(
            output[0][inputs["input_ids"].shape[1]:],
            skip_special_tokens=True,
        )
        candidates.append(solution)

    # Step 2: Score each candidate with the reward model
    scores = []
    for solution in candidates:
        reward_input = tokenizer(
            f"Problem: {problem}\nSolution: {solution}",
            return_tensors="pt",
            truncation=True,
            max_length=2048,
        ).to(reward_model.device)

        with torch.no_grad():
            reward = reward_model(**reward_input).logits.squeeze()
        scores.append(reward.item())

    # Step 3: Return the best candidate
    best_idx = max(range(n), key=lambda i: scores[i])
    return candidates[best_idx], scores[best_idx]

# Example usage
# best_solution, score = best_of_n(
#     generator, reward_model, tokenizer,
#     "Prove that the square root of 2 is irrational.",
#     n=32
# )
⚠ Practical Consideration

Best-of-N is embarrassingly parallel: all N candidates can be generated simultaneously if you have the GPU capacity. However, the total compute scales linearly with N while the benefit scales logarithmically. For production systems, values of N between 8 and 64 typically provide the best cost-benefit ratio. Beyond N=256, the marginal improvement rarely justifies the cost.
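That parallelism is easy to exploit in Hugging Face transformers: the generation loop in the best_of_n listing above can be collapsed into a single batched call using num_return_sequences. A sketch, under the same model/tokenizer assumptions as the earlier listing:

```python
def generate_candidates_batched(model, tokenizer, problem, n=16,
                                temperature=0.7):
    """Generate N candidate solutions in one batched call instead of
    a Python loop, letting the GPU process all candidates together."""
    inputs = tokenizer(problem, return_tensors="pt").to(model.device)
    outputs = model.generate(
        **inputs,
        max_new_tokens=1024,
        do_sample=True,
        temperature=temperature,
        top_p=0.95,
        num_return_sequences=n,   # all N candidates in one pass
    )
    prompt_len = inputs["input_ids"].shape[1]
    return [
        tokenizer.decode(seq[prompt_len:], skip_special_tokens=True)
        for seq in outputs
    ]

# Usage (with the generator model and tokenizer loaded earlier):
# candidates = generate_candidates_batched(
#     generator_model, tokenizer,
#     "Prove that the square root of 2 is irrational.", n=16)
```

Batched generation trades GPU memory for wall-clock time; with large N you may need to split the batch into chunks that fit your hardware.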

5. Monte Carlo Tree Search for Language

While best-of-N treats each solution as an independent sample, tree search methods structure the generation process as an explicit search over a tree of partial solutions. Each node represents a partial solution (reasoning up to step k), and edges represent possible next steps. This allows the search to prune unpromising branches early and focus compute on the most promising directions.

5.1 MCTS Fundamentals for Language

Monte Carlo Tree Search (MCTS), famously used in AlphaGo, operates in four phases that repeat iteratively:

  1. Selection: Starting from the root, traverse the tree by selecting the child node with the highest UCB (Upper Confidence Bound) score, which balances exploitation (high reward) with exploration (less visited nodes)
  2. Expansion: At a leaf node, generate one or more new child nodes by sampling possible next reasoning steps from the language model
  3. Simulation (rollout): From the new node, complete the solution to the end (using fast, possibly lower-quality generation) and evaluate the outcome
  4. Backpropagation: Update the visit counts and value estimates for all nodes along the path from root to the expanded node
UCB(node) = V(node) / N(node) + c · √(ln N(parent) / N(node))

where V is the cumulative value, N is the visit count, and c is an exploration constant.
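To make the selection rule concrete, here is a toy sketch of UCB-based child selection only (not a full MCTS implementation; the node statistics are hypothetical and mirror the values in Figure 7.3.3):

```python
import math

class Node:
    def __init__(self, step=""):
        self.step = step      # the reasoning step this node adds
        self.value = 0.0      # cumulative value V from rollouts
        self.visits = 0       # visit count N
        self.children = []

def ucb(node, parent_visits, c=1.41):
    if node.visits == 0:
        return float("inf")   # always try unvisited children first
    exploit = node.value / node.visits
    explore = c * math.sqrt(math.log(parent_visits) / node.visits)
    return exploit + explore

def select_child(parent):
    # Selection phase: pick the child maximizing the UCB score
    return max(parent.children, key=lambda ch: ucb(ch, parent.visits))

# Three candidate first steps with value/visit statistics
root = Node()
for step, v, n in [("Step 1a", 12, 15), ("Step 1b", 5, 10), ("Step 1c", 1, 8)]:
    child = Node(step)
    child.value, child.visits = v, n
    root.children.append(child)
root.visits = 33

print(select_child(root).step)  # Step 1a: best exploit/explore balance
```

Step 1a wins because its high success rate (12/15) outweighs the larger exploration bonuses of the less-visited siblings; with more visits and no value growth, its bonus would shrink and the search would revisit the alternatives.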

MCTS applied to multi-step reasoning (contents of Figure 7.3.3). From the problem root, three candidate first steps carry value/visit statistics: Step 1a (12/15), Step 1b (5/10), Step 1c (1/8). The promising Step 1a branch is expanded further (8/9, 3/5, 1/4) until a verified answer is reached, while the low-success-rate branch is pruned. The cycle repeats: 1. Select (follow UCB to a promising leaf); 2. Expand (sample a next step from the LM); 3. Simulate (complete to get an outcome); 4. Backpropagate (update all ancestors).
Figure 7.3.3: MCTS for language reasoning. Numbers show value/visits. The search focuses compute on the most promising reasoning paths (left branch) while pruning low-value branches (right).

5.2 LATS: Language Agent Tree Search

LATS (Language Agent Tree Search) adapts MCTS specifically for LLM reasoning and agent tasks. The key modifications for the language domain include: using the language model itself, rather than a learned value network, to evaluate states; incorporating feedback from an external environment (such as code execution results or API observations) into node values; and adding a self-reflection step in which the model analyzes failed trajectories and carries those lessons into subsequent search iterations.

5.3 AlphaProof and Mathematical Reasoning

Google DeepMind's AlphaProof (2024) applied tree search to formal mathematical theorem proving. The system combines a language model (to propose proof steps in Lean 4) with search over the resulting proof tree. Together with AlphaGeometry 2, it solved four of six problems from the 2024 International Mathematical Olympiad, a silver-medal-equivalent score. This represents the strongest result to date in automated mathematical reasoning and demonstrates the power of combining language models with structured search.

6. Compute-Optimal Inference

The central practical question for test-time compute is: when should you think longer versus use a bigger model? Research from Snell et al. (2024) provides a framework for answering this question.

| Task Difficulty | Optimal Strategy | Rationale |
| --- | --- | --- |
| Easy (model already >90% accurate) | Single pass, larger model | Test-time compute has diminishing returns; most samples are already correct |
| Medium (50–90% accuracy) | Best-of-N with moderate N | Sampling diversity helps; N=8 to 32 is typically sufficient |
| Hard (<50% accuracy) | Tree search or heavy best-of-N | Structured search can find solutions in a vast space; N=64+ or MCTS |
| Beyond model capability | Larger model required | No amount of test-time compute helps if the model lacks necessary knowledge |
Key Insight

A smaller model with test-time compute scaling can match the performance of a 14x larger model on hard reasoning tasks. For example, a Llama 8B with best-of-N (N=256) and a PRM verifier can match Llama 70B on GSM8K math problems, while using similar total FLOPs. The key requirement is a good verifier: without a reliable reward model, additional sampling does not help because you cannot distinguish good solutions from bad ones.

⚠ Reasoning Models and Hallucination

Extended chain-of-thought reasoning has a dual relationship with hallucination. On one hand, visible reasoning traces make errors more transparent: a user can inspect the chain and identify where the logic went wrong. On the other hand, long reasoning chains can make hallucinations more convincing: a model that produces a coherent, step-by-step argument for an incorrect conclusion is harder to fact-check than one that simply states a wrong answer. This is sometimes called "faithful but wrong" reasoning. Practitioners should treat extended reasoning traces as evidence that requires verification, not as proof of correctness.

# Compute-optimal inference: choosing strategy based on difficulty
def compute_optimal_inference(
    problem,
    easy_model,       # Small, fast model (e.g., 8B)
    hard_model,       # Large, expensive model (e.g., 70B)
    reward_model,     # For scoring candidate solutions
    difficulty_estimator,  # Predicts problem difficulty
):
    """
    Route to different inference strategies based on
    estimated problem difficulty.
    """
    difficulty = difficulty_estimator(problem)

    if difficulty < 0.3:
        # Easy: single pass with the small model is sufficient
        return generate_single(easy_model, problem)

    elif difficulty < 0.7:
        # Medium: best-of-N with small model
        candidates = [
            generate_single(easy_model, problem)
            for _ in range(16)
        ]
        scores = [reward_model.score(problem, c) for c in candidates]
        return candidates[scores.index(max(scores))]

    elif difficulty < 0.9:
        # Hard: best-of-N with large model
        candidates = [
            generate_single(hard_model, problem)
            for _ in range(8)
        ]
        scores = [reward_model.score(problem, c) for c in candidates]
        return candidates[scores.index(max(scores))]

    else:
        # Very hard: tree search with large model
        return mcts_search(hard_model, reward_model, problem,
                           n_iterations=200)

7. Lab: Implementing Best-of-N with a Reward Model

In this lab exercise, we implement best-of-N sampling for math reasoning and measure how accuracy scales with the number of samples.

# Lab: Measure accuracy vs. N on GSM8K math problems
import re

def extract_answer(text):
    """Extract the final numerical answer from a solution."""
    # Look for patterns like "the answer is 42" or "#### 42"
    patterns = [
        r"####\s*(-?[\d,]+)",
        r"answer\s+is\s+(-?[\d,]+)",
        r"=\s*(-?[\d,]+)\s*$",
    ]
    for pattern in patterns:
        match = re.search(pattern, text, re.IGNORECASE | re.MULTILINE)
        if match:
            return match.group(1).replace(",", "")
    return None

def evaluate_best_of_n(model, tokenizer, problems, ns=(1, 4, 8, 16, 32, 64)):
    """
    Evaluate best-of-N at various N values.
    Uses majority voting as a simple reward signal.
    """
    results = {n: 0 for n in ns}
    max_n = max(ns)

    for problem in problems:
        ground_truth = problem["answer"]

        # Generate max_n solutions
        solutions = []
        for _ in range(max_n):
            prompt = f"Solve step by step:\n{problem['question']}\n\n"
            output = model.generate(prompt, temperature=0.7)
            answer = extract_answer(output)
            solutions.append(answer)

        # Evaluate for each N using majority voting
        for n in ns:
            subset = solutions[:n]
            # Majority vote among the N samples
            vote_counts = {}
            for ans in subset:
                if ans is not None:
                    vote_counts[ans] = vote_counts.get(ans, 0) + 1

            if vote_counts:
                majority = max(vote_counts, key=vote_counts.get)
                if majority == ground_truth:
                    results[n] += 1

    # Convert to accuracy
    total = len(problems)
    for n in ns:
        accuracy = results[n] / total * 100
        print(f"N={n:3d}: {accuracy:.1f}% accuracy")

    return results

# Expected output (illustrative):
# N=  1: 52.3% accuracy
# N=  4: 64.1% accuracy
# N=  8: 70.5% accuracy
# N= 16: 75.8% accuracy
# N= 32: 79.2% accuracy
# N= 64: 81.6% accuracy
📝 Note: Majority Voting vs. Reward Model

The lab above uses majority voting (selecting the most common answer among N candidates) as a simple verification strategy. This works well when the model is more likely to produce the correct answer than any specific wrong answer. For harder problems, a trained reward model (ORM or PRM) provides significantly better selection. The accuracy gains from N=1 to N=64 would be even larger with PRM-guided selection.

Section 7.3 Quiz

1. What is the fundamental difference between train-time scaling and test-time scaling?
Reveal Answer
Train-time scaling invests compute once during training (more parameters, more data, more GPU hours) to produce a model whose capabilities are fixed. Test-time scaling invests additional compute per query at inference time (multiple samples, chain-of-thought, tree search), allowing the model to adaptively allocate more thinking to harder problems. The two approaches are complementary and can be combined.
2. Why did DeepSeek-R1-Zero's emergent reasoning behavior surprise researchers?
Reveal Answer
R1-Zero was trained using pure reinforcement learning with only answer-correctness rewards, without any supervised fine-tuning on human-written reasoning traces. Despite never seeing examples of step-by-step reasoning, the model spontaneously learned to break problems into steps, verify its own work, and self-correct with phrases like "wait, let me reconsider." This demonstrates that extended chain-of-thought reasoning is a natural emergent behavior when optimizing for accuracy with sufficient model capacity, rather than something that must be explicitly taught.
3. How does a Process Reward Model (PRM) differ from an Outcome Reward Model (ORM), and why does this matter for best-of-N selection?
Reveal Answer
An ORM evaluates only the final answer, providing a single score per complete solution. A PRM evaluates each reasoning step individually, providing dense, step-level feedback. For best-of-N selection, PRMs are superior because they can identify solutions where every step is logically sound, rather than solutions that happen to arrive at the right answer through faulty reasoning. PRMs solve roughly 15% more problems than ORMs with the same compute budget, because they select solutions based on reasoning quality rather than just answer correctness.
4. Explain the four phases of Monte Carlo Tree Search (MCTS) as applied to language reasoning.
Reveal Answer
Selection: Starting from the root, follow the UCB (Upper Confidence Bound) formula to traverse the tree to a promising leaf node, balancing exploitation of high-reward paths with exploration of less-visited ones. Expansion: At the leaf, use the language model to generate one or more possible next reasoning steps, creating new child nodes. Simulation (rollout): From the new node, complete the solution to the end using fast generation and evaluate the outcome. Backpropagation: Update the visit counts and value estimates for all ancestor nodes along the path from root to the expanded node, so future selections are informed by this new result.
5. When is test-time compute scaling most and least effective?
Reveal Answer
Test-time compute is most effective on problems of medium to high difficulty where the model has the underlying knowledge but struggles to find the right reasoning path in a single attempt. It is least effective in two regimes: (1) easy problems where the model already achieves high accuracy (additional samples are wasteful since most are already correct), and (2) problems that are fundamentally beyond the model's knowledge or capability (no amount of additional sampling will help if the model lacks the necessary understanding). A good verifier or reward model is also essential; without reliable scoring, you cannot select good solutions from among the candidates.

Key Takeaways

  - Test-time compute is a second scaling axis: the same model can trade inference cost for accuracy through longer chains of thought, repeated sampling, or search.
  - DeepSeek-R1-Zero showed that structured reasoning (self-verification, backtracking) can emerge from pure RL with answer-correctness rewards, trained efficiently with GRPO, and that these capabilities can be distilled into small models.
  - Process reward models outperform outcome reward models for selecting among candidate solutions because they provide dense, step-level feedback.
  - Best-of-N is simple and embarrassingly parallel but yields roughly logarithmic returns; tree search (MCTS, LATS, AlphaProof) focuses compute on promising reasoning paths.
  - The compute-optimal strategy depends on task difficulty, and none of it works without a reliable verifier.

ⓘ Note

What comes next: Having surveyed the cutting-edge reasoning capabilities of modern LLMs, we now ask a critical question: who benefits from these advances? Section 7.4 examines the multilingual landscape, revealing how tokenizer design, training data composition, and evaluation methodology determine whether 6.5 billion non-English speakers receive equitable access to language model capabilities.