Module 06 · Section 6.3

Scaling Laws & Compute-Optimal Training

How model performance relates to size, data, and compute, and what it means for practical training decisions

Scaling laws are the rare case where "just make it bigger" turned out to be rigorous science. The math says your model is too small, your data is too little, and your budget is never enough. At least now you can quantify the despair.

A Compute-Optimal Accountant
★ Big Picture

Why do scaling laws matter? Training a large language model costs millions of dollars. Scaling laws provide a mathematical framework for predicting a model's performance before committing those resources. They answer critical questions: How big should the model be? How much data does it need? What loss can we expect for a given compute budget? The Kaplan and Chinchilla scaling laws have fundamentally reshaped how the industry trains models, and understanding them is essential for anyone working with LLMs at scale.

⚙ Prerequisites

This section assumes familiarity with the landmark models from Section 6.1 and pre-training objectives from Section 6.2. Understanding of logarithmic relationships and basic calculus (derivatives for optimization) helps with the mathematical content. The scaling laws discussed here connect forward to inference-time scaling in Section 7.3.

1. The Power Law Foundation

The remarkable empirical discovery underlying all scaling laws is that language model loss follows a power-law relationship with model size, dataset size, and compute. As you increase any of these quantities, loss decreases predictably, tracing a straight line on a log-log plot (once the irreducible loss is subtracted). Formally, the relationship takes the form:

L(x) = a · x^(−α) + L_∞

Here, x is the quantity being scaled (parameters, tokens, or FLOPs), a is a constant, α is the scaling exponent (typically between 0.05 and 0.10), and L_∞ is the irreducible loss (the entropy of natural language itself). The irreducible loss represents a theoretical floor: no model, regardless of size, can predict language perfectly, because language is inherently stochastic.

This relationship holds across many orders of magnitude, which is what makes it practically useful. You can train a series of small models, fit a power law curve, and then extrapolate to predict the loss of a much larger model.

2. Kaplan Scaling Laws (2020)

The foundational work by Kaplan et al. at OpenAI established three key relationships. First, loss scales as a power law with model parameters N (number of non-embedding parameters):

L(N) ≈ (N_c / N)^(α_N),   α_N ≈ 0.076

Second, loss scales as a power law with dataset size D (number of tokens):

L(D) ≈ (D_c / D)^(α_D),   α_D ≈ 0.095

Third, loss scales as a power law with compute budget C, measured in FLOPs (floating-point operations, a count, not to be confused with FLOPS, operations per second):

L(C) ≈ (C_c / C)^(α_C),   α_C ≈ 0.050

The Kaplan Compute-Optimal Recipe

A critical conclusion from the Kaplan analysis was that, given a fixed compute budget, you should prioritize increasing model size over increasing the number of training tokens. Specifically, Kaplan found that as compute increases by 10x, you should scale model size by roughly 5x but only increase data by about 2x. This led to a generation of very large models trained on relatively modest amounts of data, exemplified by GPT-3 (175B parameters trained on 300B tokens).
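This allocation rule can be sketched numerically. The exponents below (0.7 for parameters, 0.3 for data) are illustrative round numbers chosen so that 10x compute yields roughly 5x parameters and 2x data; Kaplan et al.'s fitted exponents differ slightly:

```python
def kaplan_allocation(c_mult, n_exp=0.7, d_exp=0.3):
    """Scale factors for model size and data when compute grows by c_mult.

    Illustrative exponents: 10**0.7 ≈ 5x params, 10**0.3 ≈ 2x data per 10x compute.
    """
    return c_mult ** n_exp, c_mult ** d_exp

n_mult, d_mult = kaplan_allocation(10)
print(f"10x compute -> {n_mult:.1f}x parameters, {d_mult:.1f}x data")
# -> 10x compute -> 5.0x parameters, 2.0x data
```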

⚠ Important Nuance

Kaplan's experiments did not train models to convergence. The largest models were stopped early, which biased the results toward favoring larger models. The Chinchilla work later corrected this methodological issue.

3. Chinchilla Scaling Laws (2022)

Hoffmann et al. at DeepMind revisited scaling with more careful experimental design, training over 400 models ranging from 70M to 16B parameters. Their key methodological improvement was training each model to near-convergence on its dataset, eliminating the early-stopping bias in the Kaplan analysis.

The Chinchilla result was striking: for a compute-optimal training run, the number of parameters and the number of training tokens should scale equally. The combined loss is modeled as:

L(N, D) = E + A / N^α + B / D^β

where α ≈ 0.34, β ≈ 0.28, E ≈ 1.69 (the irreducible entropy), and A, B are constants. Minimizing this loss subject to a compute constraint C ≈ 6ND yields the compute-optimal allocation:

N_opt ∝ C^0.50,   D_opt ∝ C^0.50

This means parameters and tokens should be scaled at roughly the same rate. The practical implication is that a 70B model should be trained on approximately 1.4 trillion tokens (a ratio of about 20 tokens per parameter).
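As a quick sanity check, the 20-tokens-per-parameter rule combined with the C ≈ 6ND approximation pins down the compute for a Chinchilla-sized run (back-of-the-envelope numbers, not the paper's exact accounting):

```python
N = 70e9       # parameters
D = 20 * N     # ~20 tokens per parameter -> 1.4e12 tokens
C = 6 * N * D  # approximate training FLOPs

print(f"D = {D:.2e} tokens, C = {C:.2e} FLOPs")
# -> D = 1.40e+12 tokens, C = 5.88e+23 FLOPs
```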

[Figure: Kaplan vs. Chinchilla compute-budget allocation, side by side. Kaplan (2020): per 10x compute, model size grows ~5x and training data ~2x; example GPT-3, 175B params on 300B tokens (~1.7 tokens/param). Chinchilla (2022): model size and data scale equally; example Chinchilla, 70B params on 1.4T tokens (~20 tokens/param).]
Figure 6.3.1: The Kaplan approach favors larger models with less data, while the Chinchilla approach recommends equal scaling of parameters and tokens.

Why Chinchilla Changed Everything

The Chinchilla result implied that many existing models were significantly undertrained. Gopher (280B parameters trained on 300B tokens) was revealed to be suboptimal: a 70B model trained on 1.4T tokens (Chinchilla) matched or exceeded Gopher on nearly every benchmark, while being 4x smaller and therefore 4x cheaper to serve at inference time. This triggered a major shift in the industry. Post-Chinchilla models like LLaMA were designed with much larger data-to-parameter ratios.

4. Beyond Chinchilla: Over-Training for Inference

While Chinchilla defines the compute-optimal point for a single training run, real-world deployments face a different optimization problem. A model is trained once but serves millions of inference requests. From this total cost perspective, it can be economical to train a smaller model on far more data than is compute-optimal, paying more in training compute to reduce inference cost per query.

The LLaMA family exemplifies this strategy. LLaMA-1 7B was trained on 1 trillion tokens, giving a ratio of approximately 143 tokens per parameter, roughly 7x beyond the Chinchilla-optimal ratio. LLaMA-2 was trained on 2 trillion tokens. The rationale: the additional training cost is paid once, but the smaller model saves compute on every single inference call.
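The arithmetic behind these ratios is simple to reproduce. The LLaMA-2 line below assumes the 7B variant also saw the full 2T tokens:

```python
chinchilla_ratio = 20  # Chinchilla-optimal tokens per parameter

for name, params, tokens in [
    ("LLaMA-1 7B", 7e9, 1e12),
    ("LLaMA-2 7B", 7e9, 2e12),  # assumes the 7B model saw all 2T tokens
]:
    ratio = tokens / params
    print(f"{name}: {ratio:.0f} tokens/param "
          f"(~{ratio / chinchilla_ratio:.0f}x Chinchilla-optimal)")
# -> LLaMA-1 7B: 143 tokens/param (~7x Chinchilla-optimal)
# -> LLaMA-2 7B: 286 tokens/param (~14x Chinchilla-optimal)
```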

⚡ Key Insight

Chinchilla-optimal is not deployment-optimal. If you plan to serve a model to millions of users, you should train a smaller model for longer. The key metric shifts from "minimize training FLOPs for a given loss" to "minimize total cost of ownership (training + inference) for a given loss."

ⓘ Terminology: FLOPs vs. FLOPS

FLOPs (floating-point operations, lowercase 's') counts the total number of arithmetic operations performed. FLOPS (floating-point operations per second, uppercase 'S') measures throughput. When we say "a training run used 10^24 FLOPs," we mean total operations. When we say "an H100 delivers 989 TFLOPS," we mean operations per second. Confusing the two is a common source of errors in compute budget calculations.
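The distinction matters for schedule estimates: divide total FLOPs by delivered FLOPS to get seconds. The cluster size and the 40% utilization (MFU) below are illustrative assumptions, not measured values:

```python
total_flops = 1e24           # training budget: FLOPs, a count
peak_per_gpu = 989e12        # H100 peak throughput: FLOPS, per second
mfu = 0.40                   # assumed model FLOPs utilization
n_gpus = 1024                # assumed cluster size

seconds = total_flops / (peak_per_gpu * mfu * n_gpus)
print(f"~{seconds / 86400:.1f} days")  # roughly a month on this cluster
```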

🔮 Where This Leads Next: Inference-Time Scaling

The scaling laws discussed so far govern train-time compute: investing more resources during training to improve the model. A complementary paradigm, inference-time scaling, invests additional compute during each inference request to improve output quality. Rather than building a larger model, you let the same model "think longer." This approach, embodied by OpenAI's o1/o3 and DeepSeek-R1, creates an entirely new scaling law. See Section 7.3 for the full treatment.

5. Data-Constrained Scaling

A growing concern in the LLM community is the potential exhaustion of high-quality training data. Muennighoff et al. (2023) studied what happens when the Chinchilla-optimal token count exceeds available data. Their findings suggest that repeating data up to 4 epochs causes minimal degradation in performance, but beyond that, the value of additional repetitions diminishes rapidly. For a given compute budget C with a data budget Dmax, the effective token count follows:

D_eff ≈ D_max · (1 − e^(−R))

where R = D_total / D_max is the number of epochs. This diminishing-returns formula implies that once you have exhausted your data budget, the marginal benefit of additional epochs decays exponentially.
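Plugging epoch counts into this formula shows how quickly repetition saturates. This is a direct sketch of the simplified decay above, not Muennighoff et al.'s full fitted model:

```python
import math

def effective_tokens(d_max, epochs):
    # D_eff ≈ D_max * (1 - e^(-R)): simplified diminishing-returns formula
    return d_max * (1 - math.exp(-epochs))

d_max = 1e12  # hypothetical data budget of 1T unique tokens
for r in [1, 2, 4, 8]:
    frac = effective_tokens(d_max, r) / d_max
    print(f"{r} epochs -> {frac:.1%} of the maximum effective data")
# By 4 epochs the formula is already ~98% saturated.
```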

6. Emergent Capabilities and Phase Transitions

One of the most debated phenomena in LLM scaling is emergence: the apparent sudden appearance of new capabilities at certain model sizes. Tasks like arithmetic, chain-of-thought reasoning, and multi-step logic appear to be absent in small models and then abruptly appear in larger ones. Wei et al. (2022) catalogued over 100 such emergent tasks across the BIG-Bench benchmark suite.

The Metric Mirage Hypothesis

Schaeffer et al. (2023) challenged the notion of sharp emergence. Their key argument: whether a capability appears "emergent" depends heavily on the choice of evaluation metric. With discrete metrics like exact-match accuracy, performance looks flat at zero until a threshold is crossed, creating the illusion of a sudden phase transition. When the same tasks are measured with continuous metrics (like token-level log-likelihood), performance improves smoothly and predictably. The capability was always improving; the metric just could not detect the gradual progress.

[Figure: two panels plotting performance against model size (log scale). Left, exact-match accuracy: flat near zero, then a sharp jump ("looks emergent!"). Right, token-level log-likelihood: smooth, gradual improvement ("smooth scaling").]
Figure 6.3.2: The same underlying capability can appear emergent or smoothly scaling depending on the evaluation metric chosen.
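The mirage is easy to reproduce in a few lines. Below, per-token accuracy improves smoothly with log model size, but exact-match over a 10-token answer (p^10) stays near zero until p is high, making progress look sudden. The logistic curve is a made-up stand-in for real scaling data:

```python
import math

def per_token_acc(log_n):
    # Smooth, hypothetical improvement of per-token accuracy with log10(params)
    return 1 / (1 + math.exp(-(log_n - 8)))

for log_n in [6, 7, 8, 9, 10]:
    p = per_token_acc(log_n)
    exact_match = p ** 10  # all 10 answer tokens must be correct
    print(f"10^{log_n} params: per-token {p:.2f}, exact-match {exact_match:.3f}")
# Per-token accuracy climbs smoothly (0.12 -> 0.88), while exact-match
# sits near zero and only "appears" at the largest sizes.
```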

7. Multi-Token Prediction and Scaling

Multi-token prediction (MTP), introduced in Section 6.2, has interesting implications for scaling. By training the model to predict not just the next token but several future tokens simultaneously, MTP provides richer gradient signals per training step. Research from Meta (2024) showed that the benefits of MTP become more pronounced at larger model scales: while small models see modest improvements, models beyond 7B parameters show consistently better sample efficiency and downstream performance.

From a scaling law perspective, MTP effectively shifts the loss curve downward, achieving the same loss at lower compute. This is not a change to the scaling exponent but rather to the constant factor, suggesting that MTP models are more efficient per FLOP.
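A constant-factor shift compounds through the power law. If MTP lowered the constant a by 5% at a fixed exponent (an illustrative number, not Meta's measurement), the same loss would be reached with roughly half the parameters under the Kaplan exponent α ≈ 0.076:

```python
alpha = 0.076  # Kaplan parameter-scaling exponent
shift = 0.95   # hypothetical 5% reduction of the constant a

# Solve a' * N'^(-alpha) = a * N^(-alpha) for N'/N:
size_ratio = shift ** (1 / alpha)
print(f"Equivalent model-size ratio: {size_ratio:.2f}")
# A small constant-factor gain translates into a large compute saving
# because it is raised to the power 1/alpha ≈ 13.
```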

8. Practical Lab: Fitting Scaling Law Curves

The following code demonstrates how to fit a scaling law from empirical training runs and extrapolate predictions for larger models.

import numpy as np
from scipy.optimize import curve_fit

# Empirical data: (parameters, final_loss) from small training runs
params = np.array([1e6, 5e6, 2e7, 5e7, 1e8, 5e8])
losses = np.array([4.20, 3.75, 3.35, 3.15, 2.98, 2.70])

# Power law model: L(N) = a * N^(-alpha) + L_inf
def scaling_law(N, a, alpha, L_inf):
    return a * N ** (-alpha) + L_inf

# Fit the curve
popt, pcov = curve_fit(
    scaling_law, params, losses,
    p0=[100, 0.07, 1.5],     # initial guesses
    bounds=([0, 0, 0], [1e6, 1.0, 5.0])
)
a_fit, alpha_fit, L_inf_fit = popt
print(f"Fitted: a={a_fit:.2f}, alpha={alpha_fit:.4f}, L_inf={L_inf_fit:.3f}")

# Predict loss for larger model sizes
target_sizes = [1e9, 7e9, 70e9]
for size in target_sizes:
    predicted = scaling_law(size, *popt)
    print(f"  {size/1e9:.0f}B params => predicted loss: {predicted:.3f}")
Approximate output (the exact fitted constants depend on the optimizer):

Fitted: a=16.58, alpha=0.1320, L_inf=1.524
  1B params => predicted loss: 2.600
  7B params => predicted loss: 2.356
  70B params => predicted loss: 2.142

Computing the Chinchilla-Optimal Allocation

def chinchilla_optimal(compute_budget_flops):
    """
    Given a FLOPs budget, compute the Chinchilla-optimal
    model size (N) and token count (D).

    Uses the approximation: C = 6 * N * D
    Chinchilla ratio: D = 20 * N
    Therefore: C = 6 * N * 20 * N = 120 * N^2
    """
    N_opt = (compute_budget_flops / 120) ** 0.5
    D_opt = 20 * N_opt
    return N_opt, D_opt

# Example compute budgets
budgets = {
    "Small (1e19 FLOPs)":  1e19,
    "Medium (1e21 FLOPs)": 1e21,
    "Large (1e23 FLOPs)":  1e23,
    "GPT-4 scale (1e25)":  1e25,
}

for name, budget in budgets.items():
    N, D = chinchilla_optimal(budget)
    print(f"{name}:")
    print(f"  Optimal model size: {N/1e9:.1f}B parameters")
    print(f"  Optimal data:       {D/1e9:.0f}B tokens")
    print()
Small (1e19 FLOPs):
  Optimal model size: 0.3B parameters
  Optimal data:       6B tokens

Medium (1e21 FLOPs):
  Optimal model size: 2.9B parameters
  Optimal data:       58B tokens

Large (1e23 FLOPs):
  Optimal model size: 28.9B parameters
  Optimal data:       577B tokens

GPT-4 scale (1e25):
  Optimal model size: 288.7B parameters
  Optimal data:       5774B tokens

9. Summary Table: Scaling Regimes

Approach Tokens/Param Ratio Priority Example
Kaplan ~2 Maximize model size GPT-3 (175B, 300B tok)
Chinchilla ~20 Balance N and D equally Chinchilla (70B, 1.4T tok)
Over-training 50-200+ Minimize inference cost LLaMA-1 7B (1T tok)
Data-constrained Limited by data Use repeats + augmentation Low-resource languages

Check Your Understanding

1. Why did Chinchilla outperform Gopher despite being 4x smaller?
Show Answer
Gopher (280B parameters) was trained on only 300B tokens, giving a ratio of roughly 1 token per parameter. The Chinchilla scaling laws show this is far from optimal: the model was severely undertrained. Chinchilla (70B) was trained on 1.4T tokens (20 tokens per parameter), which is much closer to compute-optimal. The extra data compensated for the smaller model size and even surpassed the larger model's performance, because the undertrained large model was effectively wasting its parameter capacity.
2. When would you intentionally deviate from the Chinchilla-optimal ratio?
Show Answer
You would over-train a smaller model (train well beyond the Chinchilla ratio) when you expect high inference volume. The additional training cost is a one-time expense, while the smaller model saves on every inference call. LLaMA trained a 7B model on 1T tokens (143 tokens per parameter). You would also deviate when data is scarce (you cannot reach the optimal token count) or when you have regulatory constraints on model size for deployment.
3. Explain how the choice of evaluation metric can create or dissolve the appearance of emergent capabilities.
Show Answer
Discrete metrics like exact-match accuracy require the model to produce a fully correct answer. Below a certain capability threshold, even partial correctness scores zero, making the performance curve look flat. When the model crosses the threshold, accuracy jumps sharply, creating the illusion of emergence. Continuous metrics like per-token log-likelihood capture the gradual improvement in the model's probability distribution over answers. Under these metrics, the same task shows smooth, predictable improvement with scale, consistent with the power-law behavior of scaling laws.
4. What does the "6ND" approximation represent in the context of compute budgets?
Show Answer
The total training FLOPs for a transformer can be approximated as C ≈ 6ND, where N is the number of model parameters and D is the number of training tokens. The factor of 6 comes from roughly 2 FLOPs per parameter per token in the forward pass (a multiply and an add for each parameter), plus about twice that in the backward pass (gradients with respect to both activations and weights), giving 2 + 4 = 6 FLOPs in total. This approximation is widely used for back-of-the-envelope compute budgeting.
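For instance, the approximation puts GPT-3's training budget at:

```python
N = 175e9   # GPT-3 parameters
D = 300e9   # GPT-3 training tokens
C = 6 * N * D
print(f"C ≈ {C:.2e} FLOPs")  # -> C ≈ 3.15e+23 FLOPs
```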

Key Takeaways

ⓘ Note

Where this leads next: The scaling laws in this section govern training-time compute allocation. But scaling laws also apply at inference time: spending more compute during generation (via search, verification, and chain-of-thought) can dramatically improve output quality. We explore this frontier in Section 7.3 (Reasoning Models and Test-Time Compute).