Module 05 · Section 5.4

Diffusion-Based Language Models

Beyond autoregressive generation: discrete diffusion, parallel decoding, and the frontier of text generation

Autoregressive models generate left to right. Diffusion models generate everywhere at once, like a toddler with finger paint.

Diffusion Dan, a parallel-generation pioneer
★ Big Picture

Every technique we have studied so far in this chapter generates text one token at a time, left to right. This autoregressive constraint means that generating a 1,000-token response requires 1,000 sequential forward passes, and no amount of parallelism can change that fundamental bottleneck. Diffusion-based language models break this constraint entirely. Inspired by the diffusion models that revolutionized image generation (DALL-E, Stable Diffusion, Midjourney), these models generate all tokens simultaneously through a process of iterative denoising. This section explores how discrete diffusion works for text, the key models pushing this frontier, and the advantages and limitations of this paradigm.

🔬 Research Frontier

This section covers an active and rapidly evolving research area. The models and results discussed here represent the state of the art as of early 2026, but the field is moving quickly. Treat this material as a snapshot of a fast-developing landscape rather than a settled body of knowledge.

1. From Continuous to Discrete Diffusion

A Quick Review: Diffusion in Images

In image diffusion models, the forward process gradually adds Gaussian noise to an image over many steps until it becomes pure noise. The reverse process learns to denoise step by step, recovering the original image. At generation time, you start from random noise and iteratively denoise to produce a new image. This works beautifully because pixel values are continuous, and Gaussian noise is a natural perturbation in continuous space.

Text, however, is discrete: each position holds a token from a finite vocabulary (e.g., 50,000 words). You cannot add Gaussian noise to the word "cat" in any meaningful sense. This fundamental difference requires a completely different formulation of the diffusion process.

Discrete Diffusion for Text

Instead of adding continuous noise, discrete diffusion corrupts tokens. The most common approach replaces tokens with a special [MASK] token (absorbing diffusion) or with random tokens from the vocabulary (uniform diffusion). Over many forward steps, a clean sentence becomes a sequence of masked or random tokens. The reverse process learns to recover the original tokens from this corrupted state.
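The two corruption schemes can be sketched in a few lines. This is an illustrative sketch, not any particular paper's implementation: the token IDs are made up, and the convention of reserving one extra ID for [MASK] is an assumption.

```python
# Sketch of absorbing vs. uniform corruption for discrete diffusion.
# Token IDs and the [MASK] convention are illustrative assumptions.
import torch

def corrupt(tokens: torch.Tensor, t: float, vocab_size: int,
            mode: str = "absorbing") -> torch.Tensor:
    """Corrupt each position independently with probability t
    (t = 0: clean sequence, t = 1: fully noised)."""
    MASK_TOKEN = vocab_size  # reserve one extra ID for [MASK]
    noised = tokens.clone()
    hit = torch.rand(tokens.shape) < t        # positions to corrupt
    if mode == "absorbing":
        noised[hit] = MASK_TOKEN              # replace with [MASK]
    else:  # "uniform"
        rand_tok = torch.randint(0, vocab_size, tokens.shape)
        noised[hit] = rand_tok[hit]           # replace with random vocab token
    return noised

tokens = torch.tensor([17, 4021, 933, 88])   # "The cat sat on" (made-up IDs)
torch.manual_seed(0)
print(corrupt(tokens, t=0.5, vocab_size=50000, mode="absorbing"))
print(corrupt(tokens, t=0.5, vocab_size=50000, mode="uniform"))
```

At `t = 1` the absorbing process always yields the all-[MASK] sequence, which is exactly the starting state for reverse-process generation.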

Discrete Diffusion: Forward (Corrupt) and Reverse (Denoise) Process

Forward process (training: add noise)
  t = 0 (clean)         The  cat  sat  on
  t = 1                 The  [M]  sat  on
  t = 2                 The  [M]  [M]  on
  ...
  t = T (fully masked)  [M]  [M]  [M]  [M]

Reverse process (generation: denoise)
  Start (all masked)    [M]  [M]  [M]  [M]
  Denoise step          [M]  cat  [M]  [M]
  Denoise step          The  cat  [M]  on
  Final output          The  cat  sat  on

At each reverse step, the model predicts ALL masked tokens in parallel, not one at a time.
Figure 5.7: Discrete diffusion for text. The forward process masks tokens; the reverse process recovers them. Multiple tokens can be unmasked per step.

2. Key Models and Approaches

📚 Paper Spotlight: MDLM

Masked Diffusion Language Model (MDLM) by Sahoo et al. (2024) provides a clean theoretical framework connecting masked language modeling (like BERT) with continuous-time diffusion. The key insight is that BERT-style masked prediction can be viewed as a single-step diffusion process, and extending it to multiple steps with a proper noise schedule yields a full generative model. MDLM achieves competitive perplexity with autoregressive models on standard benchmarks while enabling parallel generation.
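The BERT-to-diffusion connection can be sketched as a masked cross-entropy loss weighted by the noise level. This is a simplified, hedged sketch, not MDLM's exact continuous-time objective: the 1/t weighting is a simplification, and the toy embedding-plus-linear "model" is a stand-in for any network that maps token IDs to per-position vocabulary logits.

```python
# Hedged sketch of a masked-diffusion training step in the spirit of MDLM:
# sample a corruption level t, mask that fraction of tokens, and train the
# model to recover the originals at the masked positions. The 1/t weight
# is a simplification of the continuous-time objective.
import torch
import torch.nn.functional as F

def masked_diffusion_loss(model, tokens, vocab_size):
    MASK_TOKEN = vocab_size
    t = torch.rand(1).clamp(min=0.05).item()   # corruption level in (0, 1]
    mask = torch.rand(tokens.shape) < t        # positions to corrupt
    noised = tokens.masked_fill(mask, MASK_TOKEN)
    logits = model(noised)                     # (seq_len, vocab_size + 1)
    if mask.sum() == 0:                        # nothing masked this draw
        return logits.sum() * 0.0
    # Cross-entropy only on masked positions, weighted by 1/t so that all
    # noise levels contribute comparably in expectation.
    return F.cross_entropy(logits[mask], tokens[mask]) / t

# Usage with a toy "model": an embedding plus a linear head.
vocab_size = 100
toy = torch.nn.Sequential(
    torch.nn.Embedding(vocab_size + 1, 32),
    torch.nn.Linear(32, vocab_size + 1),
)
tokens = torch.randint(0, vocab_size, (16,))
print(masked_diffusion_loss(toy, tokens, vocab_size))
```

Setting t = 1 everywhere recovers something close to one-shot BERT-style masked prediction; sampling t across the full range is what turns the objective into a multi-step generative model.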

📚 Paper Spotlight: SEDD

Score Entropy Discrete Diffusion (SEDD) by Lou et al. (2024) introduces a score-based framework for discrete diffusion. Instead of directly predicting denoised tokens, SEDD learns a "score function" that describes how the probability of each token changes as noise is added. This is analogous to score-matching in continuous diffusion and provides a principled training objective. SEDD achieves strong results on text generation and demonstrates the viability of score-based approaches in discrete spaces.

📚 Paper Spotlight: LLaDA and Dream

LLaDA (Large Language Diffusion with mAsking, 2025) scales masked diffusion to 8 billion parameters, showing that discrete diffusion models can match autoregressive models in instruction following and reasoning tasks when trained at sufficient scale. Dream (Diffusion Reasoning Model, 2025) extends this with a planner-guided denoising approach that improves coherence for long-form generation. Both models demonstrate that the quality gap between diffusion and autoregressive text generation is closing rapidly.

Comparison of Approaches

Model   Noise Type         Key Innovation                                       Scale
MDLM    Absorbing (mask)   Continuous-time diffusion with BERT-like training    Up to 1.1B
SEDD    Absorbing (mask)   Score-matching objective for discrete tokens         Up to 1.1B
LLaDA   Absorbing (mask)   Scaling to 8B with instruction tuning                8B
Dream   Absorbing (mask)   Planner-guided coherent denoising                    7B

3. Parallel Generation: The Speed Advantage

The most exciting property of diffusion language models is parallel token generation. In autoregressive models, generating N tokens requires N sequential forward passes. Each pass depends on the output of the previous one, creating an inherent sequential bottleneck that limits throughput regardless of hardware.

Diffusion models sidestep this entirely. At each denoising step, the model predicts all positions simultaneously. A 1,000-token output might require only 20 to 50 denoising steps, where each step processes the entire sequence in parallel. On modern GPU hardware with massive parallel processing capability, this can translate to order-of-magnitude latency reductions for long outputs.

# Conceptual comparison of generation steps
import numpy as np

def compare_generation_steps(sequence_lengths, diffusion_steps=30):
    """Compare sequential steps needed for AR vs diffusion generation."""
    print(f"{' Length':>8s} | {'AR Steps':>10s} | {'Diffusion Steps':>16s} | {'Speedup':>8s}")
    print("-" * 52)
    for length in sequence_lengths:
        ar_steps = length  # one forward pass per token
        diff_steps = diffusion_steps  # fixed number of denoising steps
        speedup = ar_steps / diff_steps
        print(f"{length:>8d} | {ar_steps:>10d} | {diff_steps:>16d} | {speedup:>7.1f}x")

compare_generation_steps([50, 200, 500, 1000, 4000])
  Length |   AR Steps |  Diffusion Steps |  Speedup
----------------------------------------------------
      50 |         50 |               30 |     1.7x
     200 |        200 |               30 |     6.7x
     500 |        500 |               30 |    16.7x
    1000 |       1000 |               30 |    33.3x
    4000 |       4000 |               30 |   133.3x

The speedup grows linearly with output length because autoregressive models scale linearly while diffusion models use a fixed number of steps. For a 4,000-token output, the theoretical speedup is over 100x in sequential steps. Of course, each diffusion step processes the entire sequence (which is more expensive per step than a single autoregressive step), so the wall-clock speedup is smaller, typically 3x to 10x for long sequences. Nevertheless, this represents a fundamental architectural advantage.
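The step counts above ignore that a diffusion step is more expensive than an autoregressive one. A back-of-envelope cost model makes the wall-clock picture concrete. All the millisecond constants here are illustrative assumptions, not measurements: we assume a fixed per-token latency for autoregressive decoding (it is latency-bound, not compute-bound), and a diffusion step cost with a fixed overhead plus a small per-token compute term.

```python
# Back-of-envelope wall-clock model. The per-step costs are illustrative
# assumptions: an AR step with a KV cache processes one token at fixed
# latency, while each diffusion step runs a forward pass over the whole
# sequence (fixed overhead + per-token compute).

def wall_clock_speedup(seq_len, diffusion_steps=30,
                       ar_step_ms=25.0, diff_overhead_ms=10.0,
                       diff_ms_per_token=0.1):
    ar_ms = seq_len * ar_step_ms                          # one pass per token
    step_ms = diff_overhead_ms + seq_len * diff_ms_per_token
    diff_ms = diffusion_steps * step_ms                   # full seq per step
    return ar_ms / diff_ms

for n in [50, 200, 1000, 4000]:
    print(f"{n:>5d} tokens: ~{wall_clock_speedup(n):.1f}x wall-clock speedup")
# Under these assumed constants: 2.8x, 5.6x, 7.6x, 8.1x
```

Under these assumed constants the wall-clock speedup lands in the 3x-to-10x range quoted above, well below the 100x-plus reduction in sequential step count, because the diffusion step's per-token compute grows with sequence length even though its step count does not.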

# Simplified discrete diffusion process (conceptual)
import torch

def simulate_diffusion_generation(seq_length, vocab_size, num_steps=20):
    """
    Simulate the discrete diffusion generation process.
    At each step, the model predicts all masked positions simultaneously.
    We unmask a fraction of positions per step.
    """
    MASK_TOKEN = vocab_size  # special mask token ID

    # Start fully masked
    sequence = torch.full((seq_length,), MASK_TOKEN)
    tokens_per_step = max(1, seq_length // num_steps)

    print(f"Generating {seq_length} tokens in {num_steps} parallel steps\n")

    for step in range(num_steps):
        # Find masked positions
        masked_positions = (sequence == MASK_TOKEN).nonzero(as_tuple=True)[0]
        if len(masked_positions) == 0:
            break

        # "Model prediction": in reality, a neural network predicts all positions
        # Here we simulate by filling with random tokens
        n_to_unmask = min(tokens_per_step, len(masked_positions))
        # Choose positions to unmask uniformly at random; a real sampler
        # would typically unmask the model's most confident predictions
        chosen = masked_positions[torch.randperm(len(masked_positions))[:n_to_unmask]]
        sequence[chosen] = torch.randint(0, vocab_size, (n_to_unmask,))

        n_remaining = (sequence == MASK_TOKEN).sum().item()
        pct_done = 100 * (1 - n_remaining / seq_length)
        print(f"Step {step+1:2d}: unmasked {n_to_unmask:3d} tokens | {pct_done:5.1f}% complete | {n_remaining:3d} masked")

    return sequence

result = simulate_diffusion_generation(seq_length=100, vocab_size=50000, num_steps=10)
Generating 100 tokens in 10 parallel steps

Step  1: unmasked  10 tokens |  10.0% complete |  90 masked
Step  2: unmasked  10 tokens |  20.0% complete |  80 masked
Step  3: unmasked  10 tokens |  30.0% complete |  70 masked
Step  4: unmasked  10 tokens |  40.0% complete |  60 masked
Step  5: unmasked  10 tokens |  50.0% complete |  50 masked
Step  6: unmasked  10 tokens |  60.0% complete |  40 masked
Step  7: unmasked  10 tokens |  70.0% complete |  30 masked
Step  8: unmasked  10 tokens |  80.0% complete |  20 masked
Step  9: unmasked  10 tokens |  90.0% complete |  10 masked
Step 10: unmasked  10 tokens | 100.0% complete |   0 masked
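The simulation above unmasks random positions. Practical samplers typically unmask the positions where the model is most confident at each step, which lets easy tokens commit early and hard tokens wait for more context. A sketch of one such step, with random logits standing in for real model predictions (all names here are illustrative):

```python
# Confidence-based unmasking: fill the n masked positions where the
# model's top prediction has the highest probability. Random logits
# stand in for a real model's output.
import torch

def confidence_unmask_step(sequence, logits, mask_token, n_to_unmask):
    """Unmask the most confident masked positions given per-position logits."""
    probs = torch.softmax(logits, dim=-1)             # (seq_len, vocab_size)
    confidence, predictions = probs.max(dim=-1)       # best token per position
    masked = sequence == mask_token
    confidence = confidence.masked_fill(~masked, -1.0)  # ignore filled slots
    n = min(n_to_unmask, int(masked.sum()))
    chosen = confidence.topk(n).indices               # most confident positions
    out = sequence.clone()
    out[chosen] = predictions[chosen]
    return out

vocab_size, seq_len = 50, 12
MASK = vocab_size
seq = torch.full((seq_len,), MASK)
torch.manual_seed(0)
logits = torch.randn(seq_len, vocab_size)             # stand-in for the model
seq = confidence_unmask_step(seq, logits, MASK, n_to_unmask=4)
print((seq == MASK).sum().item(), "positions still masked")  # 8 remain
```

Iterating this step until no masks remain is the full reverse process; a fresh forward pass recomputes the logits each iteration so later decisions can condition on earlier commitments.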

4. Gemini Diffusion

📚 Paper Spotlight: Gemini Diffusion

Google DeepMind's Gemini Diffusion paradigm applies discrete diffusion at the scale of frontier models. While full architectural details remain proprietary, the announced approach combines discrete diffusion with several innovations: adaptive step scheduling (using more denoising steps for complex passages and fewer for simple ones), hierarchical denoising (coarse structure first, fine details later), and integration with the Gemini model family's multimodal capabilities. Early benchmarks suggest latency reductions of 5x to 10x for long-form generation compared to autoregressive Gemini models, with quality approaching (but not yet matching) the autoregressive versions on reasoning-heavy tasks.

The Gemini Diffusion approach represents the first serious industrial investment in diffusion-based text generation at frontier scale.

5. Advantages and Limitations

Autoregressive vs. Diffusion: Tradeoff Landscape

Autoregressive (GPT, Llama)
  ✔ Superior reasoning quality
  ✔ Mature ecosystem and tooling
  ✔ Well-understood training
  ✔ Easy to apply RLHF
  ✘ Sequential bottleneck (slow)
  ✘ Left-to-right only
  ✘ Cannot revise earlier tokens
  ✘ Latency scales with output length

Diffusion (MDLM, LLaDA)
  ✔ Parallel generation (fast)
  ✔ Bidirectional context
  ✔ Can revise any position
  ✔ Fixed-step latency
  ✘ Quality gap on reasoning
  ✘ Immature tooling
  ✘ RL post-training is harder
  ✘ Less interpretable generation
Figure 5.8: Comparing the strengths and weaknesses of autoregressive and diffusion-based text generation paradigms.

The Quality Gap for Reasoning

The most significant current limitation of diffusion language models is their performance on tasks requiring complex multi-step reasoning. Autoregressive models naturally "think step by step" because each token is generated in sequence, allowing each step to build on the previous. Diffusion models generate all positions in parallel, which makes it harder to enforce logical dependencies between distant parts of the output. For tasks like mathematical proofs, code generation, and chain-of-thought reasoning, autoregressive models still hold a significant advantage.

💡 Key Insight

The quality gap is not fundamental but practical. Autoregressive models have had years of optimization, scaling, and alignment research (RLHF, DPO, constitutional AI). Diffusion language models are still in their early stages. The trajectory of improvement suggests that much of the gap may close as researchers develop diffusion-specific alignment techniques, better noise schedules, and larger-scale training.

6. TraceRL: Reinforcement Learning for Diffusion LLMs

📚 Paper Spotlight: TraceRL (ICLR 2026)

TraceRL addresses one of the biggest open problems in diffusion language models: how to apply reinforcement learning from human feedback (RLHF) or other reward-based training. In autoregressive models, RLHF is straightforward because each token is a discrete action in a sequential decision process. In diffusion models, the "action" at each step is a parallel update to all positions, making standard RL algorithms inapplicable. TraceRL introduces a "trace-based" reward attribution that propagates reward signals back through the denoising trajectory, enabling effective RL post-training for diffusion language models. Results show significant improvements in instruction following and helpfulness, narrowing the gap with RLHF-trained autoregressive models.

The TraceRL approach works by treating the entire denoising trajectory as a sequence of decisions:

  1. Generate a complete denoising trajectory: from fully masked to final output, recording the decisions at each step
  2. Score the final output using a reward model (the same kind used in autoregressive RLHF)
  3. Attribute the reward back to each denoising step using a credit assignment mechanism inspired by policy gradient methods
  4. Update the model to increase the probability of denoising trajectories that led to high-reward outputs

This is conceptually similar to how REINFORCE or PPO work in autoregressive models, but adapted for the parallel, iterative structure of diffusion. The "trace" in TraceRL refers to the recorded sequence of denoising steps, which serves as the equivalent of a token-by-token trajectory in autoregressive generation.

# Conceptual pseudocode for TraceRL training loop
# (simplified for illustrative purposes)

def trace_rl_training_step(diffusion_model, reward_model, prompt):
    """One step of TraceRL training (conceptual).

    The helpers (initialize_fully_masked, apply_predictions,
    attribute_reward) and globals (T, optimizer) are placeholders,
    not real APIs.
    """

    # 1. Generate a denoising trajectory
    trajectory = []
    x_t = initialize_fully_masked(prompt)

    for t in reversed(range(T)):
        # Model predicts which tokens to unmask and what they should be
        predictions, log_probs = diffusion_model.denoise_step(x_t, t)
        trajectory.append({"step": t, "log_probs": log_probs, "state": x_t})
        x_t = apply_predictions(x_t, predictions)

    # 2. Score the final output
    final_text = x_t
    reward = reward_model(prompt, final_text)

    # 3. Compute policy gradient with reward attribution
    loss = 0
    for step_info in trajectory:
        # Attribute reward to each step (with discount)
        step_reward = attribute_reward(reward, step_info["step"], T)
        loss -= step_reward * step_info["log_probs"].mean()

    # 4. Update model (clearing stale gradients before the backward pass)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    return reward

# This enables RLHF-style training for diffusion language models,
# previously an unsolved problem.
print("TraceRL: enables reward-based training for diffusion LLMs")
print("Key contribution: credit assignment across parallel denoising steps")
TraceRL: enables reward-based training for diffusion LLMs
Key contribution: credit assignment across parallel denoising steps
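The training loop above leaves `attribute_reward` undefined. One simple credit-assignment scheme, shown here as an illustrative assumption rather than TraceRL's actual mechanism, discounts the final reward so that later denoising steps (those closer to the finished text) receive more credit. Recall that the loop counts steps down from T-1 to 0, so step 0 is the final denoising step.

```python
# A plausible stand-in for the undefined attribute_reward helper: an
# exponential discount that gives the final denoising step (step 0) full
# credit and earlier steps progressively less. This is an assumption for
# illustration, not necessarily TraceRL's actual scheme.

def attribute_reward(reward: float, step: int, total_steps: int,
                     gamma: float = 0.95) -> float:
    """Discounted share of the final reward for denoising step `step`.
    Steps count down from total_steps - 1 to 0, so step 0 is the last."""
    return reward * gamma ** step

total = 20
for step in [19, 10, 0]:
    print(f"step {step:2d}: credited reward = "
          f"{attribute_reward(1.0, step, total):.3f}")
# step 19: 0.377, step 10: 0.599, step 0: 1.000
```

Uniform attribution (every step gets reward / total_steps) is the other obvious baseline; the discounted version encodes the intuition that late steps, which commit the final tokens, are most responsible for output quality.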

7. The Road Ahead

Diffusion-based language models are at a similar stage to where autoregressive transformers were around 2018 to 2019: the fundamental ideas are in place, early results are promising, but enormous scaling and engineering work remains. Open questions around scaling behavior, reasoning quality, and post-training alignment will determine whether diffusion models become a mainstream alternative to autoregressive generation.

📝 For the Curious Reader

If this section has piqued your interest, we recommend reading the MDLM and SEDD papers for theoretical foundations, the LLaDA paper for practical scaling, and the TraceRL paper for the alignment frontier. The field is moving fast, and new results appear monthly. Follow the proceedings of NeurIPS, ICML, ICLR, and the ArXiv cs.CL category for the latest developments.

❓ Section Quiz

1. Why can we not simply add Gaussian noise to text tokens in the same way diffusion works for images?

Show Answer
Text tokens are discrete (they come from a finite vocabulary), not continuous. Gaussian noise applies to continuous values (like pixel intensities) where small perturbations produce slightly different but meaningful values. Adding Gaussian noise to a token ID (an integer) produces nonsensical results because token IDs have no inherent ordering or distance. Instead, discrete diffusion corrupts tokens by replacing them with a [MASK] token (absorbing diffusion) or with random vocabulary tokens (uniform diffusion).

2. What is the fundamental speed advantage of diffusion language models over autoregressive models?

Show Answer
Autoregressive models generate tokens sequentially, requiring N forward passes for N tokens (linear scaling). Diffusion models generate all tokens in parallel through a fixed number of denoising steps (typically 20 to 50), regardless of output length. For long outputs, this means the number of sequential computation steps is constant rather than proportional to length. The speedup grows linearly with output length, reaching 10x or more for sequences of a few hundred tokens.

3. Why is RLHF harder to apply to diffusion language models than to autoregressive models?

Show Answer
In autoregressive models, each token is a discrete action in a sequential Markov decision process, making it straightforward to define policies, compute action probabilities, and apply policy gradient methods. In diffusion models, the "action" at each step is a parallel update to all positions simultaneously, which does not fit the standard RL framework. The challenge is credit assignment: how to determine which denoising steps contributed most to the final output quality. TraceRL addresses this with trace-based reward attribution.

4. What is the current main weakness of diffusion language models compared to autoregressive models?

Show Answer
The main weakness is performance on complex multi-step reasoning tasks (math, code, logical deduction). Autoregressive models naturally produce coherent chain-of-thought reasoning because each token can depend on all previous tokens. Diffusion models generate all positions in parallel, making it harder to enforce logical dependencies between distant parts of the output. This is a practical rather than fundamental limitation, and recent work (LLaDA, Dream, TraceRL) is actively narrowing this gap.

📌 Key Takeaways

  1. Text is discrete, so diffusion for language corrupts tokens by masking or random replacement rather than adding Gaussian noise.
  2. Generation denoises all positions in parallel, so the number of sequential steps is fixed (typically 20 to 50) instead of growing with output length.
  3. MDLM, SEDD, LLaDA, and Dream show the quality gap with autoregressive models narrowing, though complex multi-step reasoning remains a weakness.
  4. TraceRL enables RLHF-style post-training for diffusion LLMs through trace-based reward attribution across denoising steps.

Where This Leads: From Foundations to Real LLMs

Congratulations: you have completed Part I. You now understand the full pipeline from raw text to generated output: tokenization (Module 02), embeddings (Module 01), sequence modeling (Module 03), the Transformer architecture (Module 04), and decoding algorithms (Module 05). In Part II: Understanding LLMs, you will see how these foundations scale. Module 06 covers pretraining and scaling laws: how models like GPT-3 are trained on trillions of tokens and why larger models exhibit emergent capabilities. Module 07 explores fine-tuning and transfer learning. Together, Parts I and II give you the complete picture needed for the applied modules in Parts III and IV.