Module 05 · Section 5.1

Deterministic Decoding Strategies

Greedy search, beam search, and the art of choosing the most likely sequence

Greedy decoding always picks the most probable next token. I do the same with snacks, and the results are equally predictable.

Argmax Andy, a deterministic decision maker
★ Big Picture

Why does decoding matter? A language model assigns probabilities to every possible next token. But probabilities are not text. To actually produce a sentence, you need a decoding algorithm that walks through the probability landscape and selects tokens one by one. The simplest approach, greedy decoding, just picks the highest-probability token at each step. It is fast, but it often misses better sequences that require a locally suboptimal choice early on. Beam search addresses this by exploring multiple candidates in parallel. Understanding these deterministic strategies is essential because they form the foundation for every more sophisticated method we will encounter later in this chapter.

1. The Decoding Problem

In Module 04, you built a Transformer that outputs logits at each position. Now we learn what to do with those logits. After a language model completes its forward pass, the final layer produces a vector of logits, one value per token in the vocabulary. Applying softmax converts these logits into a probability distribution over the vocabulary. The decoding algorithm then decides which token to emit and append to the context for the next step.

Formally, given a prompt x_1, ..., x_t, the model computes:

P(x_{t+1} | x_1, ..., x_t) = softmax(logits_{t+1})

The question is: how do we select x_{t+1} from this distribution? And then x_{t+2}, x_{t+3}, and so on until we reach an end-of-sequence token or a maximum length? This is the decoding problem, and there is no single correct answer. Different applications call for different strategies.
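Concretely, the logits-to-distribution step is just a softmax. Here is a minimal sketch in plain Python with a toy four-token vocabulary (the logit values are illustrative, not from a real model):

```python
import math

def softmax(logits):
    """Convert raw logits into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy logits over a 4-token vocabulary, e.g. ["cat", "dog", "sat", "the"]
logits = [2.0, 1.0, 0.5, -1.0]
probs = softmax(logits)
print([round(p, 3) for p in probs])
# The probabilities sum to 1; the decoding algorithm's job is to pick
# the next token from this distribution.
```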

📝 Terminology Note

Deterministic decoding means the same input always produces the same output (no randomness). Stochastic decoding (Section 5.2) introduces randomness by sampling from the distribution. Deterministic methods are preferred when you need reproducibility, such as in machine translation, summarization, or structured data extraction.

2. Greedy Decoding

The Algorithm

Greedy decoding is the simplest possible strategy: at each time step, select the token with the highest probability. No lookahead, no alternatives, no backtracking.

x_{t+1} = argmax_v P(v | x_1, ..., x_t)

This is computationally cheap (one forward pass per token, one argmax per step) and easy to implement. For many straightforward tasks, it works surprisingly well. But it has a fundamental flaw: it is locally optimal, not globally optimal. The highest-probability token at step t might lead to a worse sequence overall than a slightly less probable choice that opens up a much better continuation.

[Figure: greedy vs. alternative decoding paths. Greedy decoding picks "cat" (p=0.6) over "old" (p=0.3) at step 1, producing "The cat sat on" with joint probability 0.6 × 0.4 × 0.5 = 0.12. The path through "old" produces "The old cat purred" with 0.3 × 0.9 × 0.8 = 0.216. The locally best choice was globally suboptimal.]
Figure 5.1: Greedy decoding selects the highest-probability token at each step, but can miss higher-probability complete sequences.
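The arithmetic behind Figure 5.1 is worth doing once by hand. Practical decoders compare sequences by summing log-probabilities rather than multiplying probabilities (long products underflow), and the ordering is the same either way. A quick check using the figure's values:

```python
import math

greedy_path = [0.6, 0.4, 0.5]   # "The cat sat on"
better_path = [0.3, 0.9, 0.8]   # "The old cat purred"

def joint(probs):
    """Joint probability of a path: product of per-step probabilities."""
    p = 1.0
    for x in probs:
        p *= x
    return p

print(round(joint(greedy_path), 3))  # 0.12
print(round(joint(better_path), 3))  # 0.216

# Same comparison in log space: sums of logs preserve the ordering
log_greedy = sum(math.log(p) for p in greedy_path)
log_better = sum(math.log(p) for p in better_path)
assert log_better > log_greedy
```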

Implementation

import torch
import torch.nn.functional as F

def greedy_decode(model, input_ids, max_new_tokens=50, eos_token_id=None):
    """Generate tokens using greedy decoding."""
    generated = input_ids.clone()

    for _ in range(max_new_tokens):
        # Forward pass: get logits for next token
        with torch.no_grad():
            outputs = model(generated)
            next_token_logits = outputs.logits[:, -1, :]  # (batch, vocab)

        # Greedy: pick the token with highest probability
        next_token = torch.argmax(next_token_logits, dim=-1, keepdim=True)

        # Append to sequence
        generated = torch.cat([generated, next_token], dim=-1)

        # Stop if we hit EOS (this check assumes batch size 1)
        if eos_token_id is not None and next_token.item() == eos_token_id:
            break

    return generated

# Example usage with GPT-2
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

prompt = "The future of artificial intelligence"
input_ids = tokenizer.encode(prompt, return_tensors="pt")

output = greedy_decode(model, input_ids, max_new_tokens=30)
print(tokenizer.decode(output[0]))
The future of artificial intelligence is not just about the ability to create machines that can think and act like humans. It is about the ability to create machines that can

Notice how the output is coherent but somewhat generic and repetitive. The phrase "the ability to create machines that can" appears twice. This is a well-known problem with greedy decoding: because it always picks the most likely token, it tends to fall into repetitive loops and produces bland, "safe" text.

⚠ Common Pitfall

Greedy decoding often produces degenerate repetition, particularly for open-ended generation. The model finds a high-probability pattern and gets stuck in a loop. This is less of an issue for constrained tasks (translation, summarization) where the output space is narrower.
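A common mitigation is n-gram blocking, exposed in HuggingFace's generate as no_repeat_ngram_size (used later in this section): any token that would recreate an n-gram already present in the output is forbidden. Here is a minimal sketch of the banned-token computation over plain token-id lists; the function name and structure are illustrative, not the library's implementation:

```python
def banned_tokens(generated, n=2):
    """Return the token ids that would repeat an n-gram already in `generated`."""
    if len(generated) < n:
        return set()
    prefix = tuple(generated[-(n - 1):])  # the n-1 tokens just produced
    banned = set()
    # Find earlier occurrences of this prefix; whatever token followed it
    # before would complete a repeated n-gram now, so ban it.
    for i in range(len(generated) - n + 1):
        if tuple(generated[i:i + n - 1]) == prefix:
            banned.add(generated[i + n - 1])
    return banned

# After [10, 11, 12, 13, 10, 11], emitting 12 would repeat the
# bigram (11, 12), so 12 is banned:
print(banned_tokens([10, 11, 12, 13, 10, 11], n=2))  # {12}
```

In a decode loop, you would set the logits of the banned ids to -inf before taking the argmax, so the blocked tokens can never be selected.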

3. Beam Search

The Intuition

Beam search addresses the myopia of greedy decoding by maintaining multiple candidate sequences (called beams) at each time step. Instead of committing to a single best token, it keeps the top k best partial sequences and expands all of them in parallel. The parameter k is called the beam width.

At each step, beam search expands every beam by considering all vocabulary tokens, scores the resulting candidates by their cumulative log-probability, and keeps only the top k. This allows it to discover sequences where an early suboptimal choice leads to a much better overall result.
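A single expand-score-prune step can be sketched with hand-picked toy numbers (the beam scores and next-token distributions below are illustrative, not model outputs):

```python
import math

# Two live beams with cumulative log-probabilities (toy values)
beams = [("The cat", -0.51), ("The old", -1.20)]

# Toy next-token distributions for each beam
next_probs = {
    "The cat": {"sat": 0.4, "is": 0.3, "ran": 0.3},
    "The old": {"cat": 0.9, "man": 0.1},
}

k = 2
candidates = []
for text, score in beams:
    for token, p in next_probs[text].items():
        # Candidate score = parent score + log P(token)
        candidates.append((f"{text} {token}", score + math.log(p)))

# Prune: keep only the k highest-scoring candidates
candidates.sort(key=lambda c: c[1], reverse=True)
beams = candidates[:k]
for text, score in beams:
    print(f"{text}: {score:.2f}")
# "The old cat" now outscores every continuation of "The cat", even though
# "The cat" was the better beam one step earlier.
```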

[Figure: beam search tree with beam width k=2 over three steps. From "The", the surviving beams are "The cat" (-1.2) and "The old" (-1.5); then "The cat sat" (-2.1) and "The old cat" (-1.9), with "The cat is" (-2.8) and "The old man" (-3.1) pruned; finally "The old cat purred" (-2.3) beats "The cat sat on" (-2.6). Scores are cumulative log-probabilities (lower magnitude = higher probability); at each step only the top k=2 beams survive.]
Figure 5.2: Beam search (k=2) tracks multiple hypotheses. The best final sequence may not start with the greediest first token.

The Algorithm in Detail

  1. Initialize: Start with a single beam containing the prompt. Its score is 0 (log-probability).
  2. Expand: For each active beam, compute the probability distribution over the next token. This gives k × V candidates (where V is vocabulary size).
  3. Score: Each candidate's score is the parent beam's cumulative log-probability plus the log-probability of the new token.
  4. Prune: Keep only the top k candidates across all expansions.
  5. Terminate: If a beam produces an EOS token, move it to a "completed" list. Continue until all beams are complete or a maximum length is reached.
  6. Select: Return the completed beam with the highest score (optionally with length normalization).
import torch
import torch.nn.functional as F

def beam_search(model, input_ids, beam_width=4, max_new_tokens=50,
               eos_token_id=None, length_penalty=1.0):
    """Beam search decoding with length normalization."""

    # Each beam: (sequence_tensor, cumulative_log_prob)
    beams = [(input_ids[0], 0.0)]
    completed = []

    for step in range(max_new_tokens):
        all_candidates = []

        for seq, score in beams:
        if eos_token_id is not None and seq[-1].item() == eos_token_id:
                completed.append((seq, score))
                continue

            # Get next token probabilities
            with torch.no_grad():
                outputs = model(seq.unsqueeze(0))
                log_probs = F.log_softmax(outputs.logits[0, -1, :], dim=-1)

            # Get top-k tokens for this beam
            top_log_probs, top_indices = log_probs.topk(beam_width)

            for i in range(beam_width):
                new_seq = torch.cat([seq, top_indices[i].unsqueeze(0)])
                new_score = score + top_log_probs[i].item()
                all_candidates.append((new_seq, new_score))

        if not all_candidates:
            break

        # Keep top beam_width candidates (with length normalization)
        all_candidates.sort(
            key=lambda x: x[1] / (len(x[0]) ** length_penalty),
            reverse=True
        )
        beams = all_candidates[:beam_width]

    # Return best completed beam, or best active beam
    all_final = completed + beams
    best = max(all_final,
               key=lambda x: x[1] / (len(x[0]) ** length_penalty))
    return best[0].unsqueeze(0)

# Compare greedy vs. beam search
prompt = "The future of artificial intelligence"
input_ids = tokenizer.encode(prompt, return_tensors="pt")

greedy_out = greedy_decode(model, input_ids, max_new_tokens=20)
beam_out = beam_search(model, input_ids, beam_width=5, max_new_tokens=20)

print("Greedy:", tokenizer.decode(greedy_out[0]))
print("Beam-5:", tokenizer.decode(beam_out[0]))
Greedy: The future of artificial intelligence is not just about the ability to create machines that can think and act like humans.
Beam-5: The future of artificial intelligence is a topic that has been discussed for decades, and it is one that has been
💡 Key Insight

Beam search does not guarantee finding the globally optimal sequence. It is a heuristic: with beam width k, it explores k hypotheses, not the full exponential space of all possible sequences. Increasing k improves coverage but increases computation linearly. In practice, beam widths of 4 to 10 cover most use cases. Very large beam widths often degrade output quality for open-ended generation (a phenomenon sometimes called the beam search curse).

4. Length Normalization and Length Penalty

Raw beam search has a systematic bias: it prefers shorter sequences. Why? Because each additional token multiplies a probability less than 1 (or adds a negative log-probability), so longer sequences always have lower cumulative scores. This means beam search will gravitate toward short, generic completions.

Length Normalization

The standard fix is to normalize the score by the sequence length. Instead of ranking beams by their raw log-probability, we divide by length raised to a power α:

score(y) = log P(y_1, ..., y_T) / T^α

The parameter α (alpha) controls the strength of normalization: α = 0 leaves scores unnormalized, α = 1 divides by the full length (i.e., ranks beams by average log-probability per token), and intermediate values interpolate between the two.

Why Length Normalization Is Necessary

Log-probabilities are negative (since probabilities are less than 1). Summing more negative terms produces a more negative total. Without normalization, a 5-token sequence will almost always score higher (less negative) than a 20-token sequence, even if the 20-token sequence is more informative. Length normalization divides the total log-probability by sequence length, correcting this bias toward shorter outputs.

# Demonstrating the effect of length normalization
import numpy as np

# Two hypothetical beam candidates
short_seq_logprob = -8.5   # 5 tokens
long_seq_logprob  = -15.2  # 12 tokens

for alpha in [0.0, 0.5, 0.7, 1.0]:
    short_score = short_seq_logprob / (5 ** alpha)
    long_score  = long_seq_logprob / (12 ** alpha)
    winner = "Short" if short_score > long_score else "Long"
    print(f"alpha={alpha:.1f}: short={short_score:.3f}, long={long_score:.3f} => {winner} wins")
alpha=0.0: short=-8.500, long=-15.200 => Short wins
alpha=0.5: short=-3.801, long=-4.388 => Short wins
alpha=0.7: short=-2.755, long=-2.669 => Long wins
alpha=1.0: short=-1.700, long=-1.267 => Long wins

In this example, the longer sequence becomes the winner once α exceeds roughly 0.66. The Google Neural Machine Translation paper (Wu et al., 2016) found that α = 0.6 to 0.8 works well for translation tasks.
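The crossover value of α can be derived analytically: the two scores tie when -8.5 / 5^α = -15.2 / 12^α, i.e. (12/5)^α = 15.2 / 8.5, so:

```python
import math

# Solve (12/5)**alpha == 15.2/8.5 for alpha
crossover = math.log(15.2 / 8.5) / math.log(12 / 5)
print(f"{crossover:.3f}")  # 0.664
```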

5. Constrained Beam Search

Standard beam search explores freely across the entire vocabulary. But many real-world applications require the output to include specific tokens or phrases. For example, a translation may need to preserve a product name verbatim, a summary may need to include a mandated term, or a data-to-text system may need to reproduce values from a database record exactly.

Constrained beam search modifies the standard algorithm so that beams are only considered "complete" if they satisfy a set of lexical constraints. The technique, introduced by Anderson et al. (2017) and refined by Post and Vilar (2018), organizes beams into groups based on how many constraints they have fulfilled. At each step, some beams are allowed to advance toward fulfilling a constraint (by being nudged toward the required tokens), while others continue generating freely.

# Using HuggingFace's built-in constrained beam search
from transformers import GPT2LMHeadModel, GPT2Tokenizer
from transformers import PhrasalConstraint

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

prompt = "The research team announced"
input_ids = tokenizer.encode(prompt, return_tensors="pt")

# Force the output to include "breakthrough" and "neural network"
constraints = [
    PhrasalConstraint(tokenizer.encode("breakthrough", add_special_tokens=False)),
    PhrasalConstraint(tokenizer.encode("neural network", add_special_tokens=False)),
]

output = model.generate(
    input_ids,
    constraints=constraints,
    num_beams=10,
    max_new_tokens=30,
    no_repeat_ngram_size=2,
)
print(tokenizer.decode(output[0], skip_special_tokens=True))
The research team announced a breakthrough in the field of neural network research, which could lead to new ways of understanding the brain.

Constrained beam search is especially powerful for controlled generation pipelines where the output must satisfy verifiable requirements. The computational cost is higher than standard beam search because the algorithm must track constraint satisfaction state for each beam, but the guarantee of constraint fulfillment often justifies the overhead.

6. When to Use Deterministic Decoding

Method           | Best For                                                        | Weaknesses                                               | Typical Settings
Greedy           | Quick prototyping, classification-like tasks, short completions | Repetitive, misses better sequences                      | N/A (no hyperparameters)
Beam Search      | Translation, summarization, structured output                   | Bland output, short-sequence bias without length penalty | k=4 to 8, α=0.6 to 0.8
Constrained Beam | Controlled generation, data-to-text, forced terminology         | Higher compute, constraint specification can be complex  | k=8 to 15, constraints as needed
📝 Practical Guidance

For open-ended generation (chatbots, creative writing, brainstorming), deterministic decoding is usually a poor choice. The repetitive, generic outputs it produces feel lifeless to users. Section 5.2 introduces stochastic sampling, which injects controlled randomness to produce more diverse, human-like text. Deterministic methods shine when there is a single "correct" output (or a narrow set of correct outputs), as in translation and factual summarization.

7. Lab: Greedy vs. Beam Search on Summarization

In this lab, we compare greedy decoding and beam search on a summarization task using a pretrained model. The goal is to observe how beam search produces more fluent and complete summaries.

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load a small summarization model
model_name = "facebook/bart-large-cnn"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

article = """
Scientists at MIT have developed a new method for training neural networks
that reduces energy consumption by up to 70 percent. The technique, called
sparse adaptive training, selectively updates only the most important
parameters during each training step. Initial experiments on language models
show that the approach maintains accuracy while dramatically cutting
computational costs. The team plans to release their code as open source.
"""

inputs = tokenizer(article.strip(), return_tensors="pt", max_length=512, truncation=True)

# Greedy decoding
greedy_ids = model.generate(**inputs, max_new_tokens=60, num_beams=1)
print("=== Greedy ===")
print(tokenizer.decode(greedy_ids[0], skip_special_tokens=True))

# Beam search (k=5, length_penalty=1.0)
beam_ids = model.generate(**inputs, max_new_tokens=60, num_beams=5, length_penalty=1.0)
print("\n=== Beam Search (k=5) ===")
print(tokenizer.decode(beam_ids[0], skip_special_tokens=True))

# Beam search with stronger length penalty
beam_ids_lp = model.generate(**inputs, max_new_tokens=60, num_beams=5, length_penalty=2.0)
print("\n=== Beam Search (k=5, length_penalty=2.0) ===")
print(tokenizer.decode(beam_ids_lp[0], skip_special_tokens=True))
=== Greedy ===
MIT scientists have developed a new method for training neural networks that reduces energy consumption by up to 70 percent.

=== Beam Search (k=5) ===
Scientists at MIT have developed a new method for training neural networks. The technique reduces energy consumption by up to 70 percent. The team plans to release their code as open source.

=== Beam Search (k=5, length_penalty=2.0) ===
Scientists at MIT have developed a new method for training neural networks that reduces energy consumption by up to 70 percent. The technique selectively updates only the most important parameters during each training step. The team plans to release their code as open source.

Observe how beam search with length penalty 2.0 produces the most complete summary, capturing all key details from the article. The greedy summary is accurate but omits the open-source release and the technique name. This illustrates the practical value of beam search and length normalization for tasks where completeness matters.

❓ Section Quiz

1. Why does greedy decoding sometimes produce worse overall sequences than beam search?

Show Answer
Greedy decoding makes locally optimal choices at each step, picking the single highest-probability token. However, this can lead the model down a path where subsequent tokens have low probability. Beam search avoids this by maintaining multiple candidate sequences in parallel, allowing it to discover paths where an early "second-best" choice leads to a much higher joint probability overall.

2. What happens to beam search output quality as beam width increases beyond 10 or 20 for open-ended generation?

Show Answer
Counterintuitively, very large beam widths often degrade output quality for open-ended generation. The algorithm converges on extremely high-probability but generic and repetitive sequences. This is sometimes called the "beam search curse." The optimal beam width depends on the task; for translation, beam widths of 4 to 8 work well, while for creative generation, stochastic methods (Section 5.2) are preferred.

3. A beam search without length normalization tends to produce shorter outputs. Why?

Show Answer
Each token added to a sequence contributes a negative log-probability (since all probabilities are between 0 and 1). Longer sequences accumulate more negative terms, so their total score is always lower than shorter sequences. Without length normalization (dividing by Tα), the algorithm systematically favors shorter sequences that have fewer negative terms, regardless of their per-token quality.

4. Give an example of a task where constrained beam search would be more appropriate than standard beam search.

Show Answer
Data-to-text generation is a strong example. Suppose you are generating a product description from a structured database record containing attributes like brand name, price, and color. Constrained beam search can force all of these values to appear verbatim in the output, ensuring factual accuracy. Standard beam search might omit some attributes if the model considers them unlikely in context.

📌 Key Takeaways