Section 5.2: Stochastic Sampling Methods

★ Big Picture

Why add randomness? Deterministic decoding (Section 5.1) produces the same output every time, which is great for translation but terrible for creative writing, conversation, and brainstorming. Human language is inherently varied: ask ten people to complete the same sentence, and you will get ten different answers. Stochastic sampling introduces controlled randomness into the decoding process, producing diverse, interesting, human-like text. The challenge is finding the right balance: too little randomness yields repetitive, robotic text; too much yields incoherent gibberish. This section covers every major technique for controlling that balance.

📝 Decision Framework: Choosing a Decoding Method

Use this table as a quick reference when configuring generation parameters.

Method	What It Controls	Typical Range	Best For
Temperature	Sharpness of the probability distribution	0.1 to 1.5	Global creativity dial; use lower for factual tasks, higher for brainstorming
Top-k	Hard cap on number of candidate tokens	10 to 100	Simple truncation; good baseline for casual generation
Top-p (Nucleus)	Cumulative probability threshold for candidates	0.8 to 0.99	Adaptive truncation; adapts to each token's confidence level
Min-p	Minimum probability relative to the top token	0.01 to 0.1	Pruning junk tokens; pairs well with temperature
Repetition Penalty	Penalty for tokens already generated	1.0 to 1.3	Reducing loops and repetitive phrases in long outputs
Typical Sampling	Filters by information content (surprisal)	0.8 to 0.99 (mass param)	Producing text that matches human-like entropy patterns

1. Pure Random Sampling

The most direct form of stochastic decoding is ancestral sampling: at each step, sample the next token from the full probability distribution. If the model says "the" has probability 0.15, "a" has 0.10, "quantum" has 0.0001, and so on across the entire 50,000-token vocabulary, you sample according to those exact probabilities.

This produces maximally diverse output, but the quality is often poor. The long tail of the vocabulary contains thousands of tokens that are individually very unlikely but collectively hold significant probability mass. Even if each improbable token has only a 0.001% chance, with 50,000 tokens in the vocabulary, sampling from the full distribution occasionally draws rare and contextually inappropriate words, derailing the generation.

2. Temperature Scaling

Temperature is the most fundamental control knob for stochastic sampling. Before applying softmax, we divide the logits by a temperature parameter T:

P(x i) = exp(z i / T) / Σ j exp(z j / T)

The effect is intuitive:

T = 1.0: The original distribution (no modification)
T < 1.0: Sharpens the distribution, making high-probability tokens even more dominant. At T → 0, sampling becomes greedy decoding.
T > 1.0: Flattens the distribution, giving low-probability tokens a better chance. At T → ∞, all tokens become equally likely (uniform sampling).

Figure 5.3: Temperature controls the "peakiness" of the distribution. Lower temperatures concentrate probability on top tokens; higher temperatures spread it more evenly.

import torch
import torch.nn.functional as F

# Simulating temperature effect on a small vocabulary
logits = torch.tensor([5.0, 3.5, 2.0, 1.0, 0.5, 0.1, -1.0, -2.0])
tokens = ["the", "cat", "dog", "it", "my", "old", "an", "..."]

for temp in [0.3, 0.7, 1.0, 1.5, 2.0]:
    probs = F.softmax(logits / temp, dim=-1)
    top_prob = probs[0].item()
    entropy = -(probs * probs.log()).sum().item()
    print(f"T={temp:.1f} | P('the')={top_prob:.3f} | entropy={entropy:.3f} | dist={[f'{p:.3f}' for p in probs.tolist()]}")

📝 Practical Guidance

Common temperature ranges: 0.1 to 0.4 for factual Q&A and code generation (favoring accuracy); 0.6 to 0.8 for general conversation; 0.9 to 1.2 for creative writing and brainstorming. Temperatures above 1.5 are rarely useful in production. Most API providers (OpenAI, Anthropic, Google) expose temperature as a parameter, and it is typically the first knob users should tune.

3. Top-k Sampling

Top-k sampling (Fan et al., 2018) restricts sampling to the k most probable tokens at each step. All other tokens have their probability set to zero, and the remaining probabilities are renormalized to sum to 1.

P'(x i) = P(x i) / Σ j \in top-k P(x j) if x i is in the top-k, else 0

This eliminates the long tail problem: no matter how flat the distribution is, only k tokens are ever considered. However, top-k has a significant limitation: the optimal value of k varies depending on the context. When the model is very confident (e.g., after "The capital of France is"), even k=10 might include irrelevant tokens. When the model is uncertain (e.g., after "I enjoy"), k=10 might be too restrictive, cutting off perfectly valid continuations.

def top_k_sampling(logits, k=50, temperature=1.0):
    """Apply top-k filtering then sample from the result."""
    # Apply temperature
    scaled_logits = logits / temperature

    # Find the k-th largest value as threshold
    top_k_values, _ = torch.topk(scaled_logits, k)
    threshold = top_k_values[..., -1, None]

    # Zero out everything below threshold
    filtered = scaled_logits.masked_fill(scaled_logits < threshold, float('-inf'))

    # Convert to probabilities and sample
    probs = F.softmax(filtered, dim=-1)
    return torch.multinomial(probs, num_samples=1)

# Example: sampling with different k values
logits = torch.tensor([5.0, 3.5, 2.0, 1.0, 0.5, 0.1, -1.0, -2.0])
tokens = ["the", "cat", "dog", "it", "my", "old", "an", "..."]

for k in [2, 4, 6]:
    filtered = logits.clone()
    threshold = torch.topk(filtered, k).values[-1]
    filtered[filtered < threshold] = float('-inf')
    probs = F.softmax(filtered, dim=-1)
    active = [f"{tokens[i]}({probs[i]:.3f})" for i in range(len(tokens)) if probs[i] > 0]
    print(f"k={k}: {', '.join(active)}")

k=2: the(0.818), cat(0.182) k=4: the(0.596), cat(0.218), dog(0.099), it(0.049) k=6: the(0.526), cat(0.193), dog(0.087), it(0.043), my(0.031), old(0.024)

4. Nucleus (Top-p) Sampling

Nucleus sampling (Holtzman et al., 2020) addresses top-k's fixed-size problem with an elegant idea: instead of keeping a fixed number of tokens, keep the smallest set of tokens whose cumulative probability exceeds a threshold p. This adapts automatically to the shape of the distribution.

V p = smallest set such that Σ x \in V p P(x) \geq p

When the model is confident, the nucleus might contain only 2 or 3 tokens. When the model is uncertain, it might contain 100 or more. This adaptivity is what makes top-p the most widely used sampling method in production systems.

Figure 5.4: Top-p sampling adapts the number of candidate tokens to model confidence. When confident, few tokens suffice; when uncertain, the nucleus expands.

def top_p_sampling(logits, p=0.9, temperature=1.0):
    """Apply nucleus (top-p) filtering then sample."""
    scaled_logits = logits / temperature
    probs = F.softmax(scaled_logits, dim=-1)

    # Sort probabilities in descending order
    sorted_probs, sorted_indices = torch.sort(probs, descending=True)
    cumulative_probs = torch.cumsum(sorted_probs, dim=-1)

    # Find the cutoff: first index where cumulative prob exceeds p
    # We keep tokens up to (but not including) this cutoff
    sorted_mask = cumulative_probs - sorted_probs > p
    sorted_probs[sorted_mask] = 0.0

    # Renormalize
    sorted_probs /= sorted_probs.sum()

    # Sample from filtered distribution
    sampled_index = torch.multinomial(sorted_probs, num_samples=1)
    return sorted_indices[sampled_index]

# Demonstrate adaptive behavior
confident_logits = torch.tensor([8.0, 4.0, 1.0, 0.5, 0.1, -1.0, -2.0, -3.0])
uncertain_logits = torch.tensor([2.0, 1.8, 1.6, 1.4, 1.2, 1.0, 0.8, 0.5])

for name, logits in [("Confident", confident_logits), ("Uncertain", uncertain_logits)]:
    probs = F.softmax(logits, dim=-1)
    sorted_probs, _ = torch.sort(probs, descending=True)
    cumsum = torch.cumsum(sorted_probs, dim=-1)
    nucleus_size = (cumsum < 0.9).sum().item() + 1
    print(f"{name}: nucleus size = {nucleus_size} tokens for p=0.9")
    print(f"  Probs: {[f'{p:.3f}' for p in sorted_probs.tolist()]}")
    print(f"  Cumsum: {[f'{c:.3f}' for c in cumsum.tolist()]}\n")

Confident: nucleus size = 2 tokens for p=0.9 Probs: ['0.935', '0.054', '0.003', '0.002', '0.001', '0.000', '0.000', '0.000'] Cumsum: ['0.935', '0.989', '0.991', '0.993', '0.994', '0.996', '0.998', '1.000'] Uncertain: nucleus size = 6 tokens for p=0.9 Probs: ['0.181', '0.167', '0.153', '0.141', '0.129', '0.119', '0.073', '0.038'] Cumsum: ['0.181', '0.348', '0.501', '0.642', '0.771', '0.890', '0.962', '1.000']

Common Misconception: Temperature and Top-p Are Not Redundant

Temperature reshapes the entire probability distribution (sharper or flatter). Top-p then truncates the reshaped distribution by removing the tail. Setting temperature=0.1 with top-p=0.9 is almost identical to temperature=0.1 alone, because the distribution is already so peaked that the nucleus contains only 1 to 2 tokens. To see top-p's effect, you need moderate temperature (0.7 to 1.0).

5. Min-p Sampling

Min-p sampling is a newer technique that takes a different approach to adaptive filtering. Instead of specifying a cumulative probability threshold, min-p sets a minimum relative probability: any token whose probability is less than min_p × max_probability is discarded.

Keep token x i if P(x i) \geq min_p \times max j P(x j)

This is conceptually simple and has appealing properties. When the model is very confident (top token at 0.95), even a min_p of 0.1 only keeps tokens above 0.095, resulting in a tiny nucleus. When the model is uncertain (top token at 0.05), the threshold drops to 0.005, allowing many tokens through. The behavior adapts naturally without the cumulative probability bookkeeping of top-p.

def min_p_sampling(logits, min_p=0.1, temperature=1.0):
    """Apply min-p filtering then sample."""
    scaled_logits = logits / temperature
    probs = F.softmax(scaled_logits, dim=-1)

    # Threshold: min_p * max probability
    max_prob = probs.max()
    threshold = min_p * max_prob

    # Zero out tokens below threshold
    filtered_probs = probs.clone()
    filtered_probs[probs < threshold] = 0.0

    # Renormalize and sample
    filtered_probs /= filtered_probs.sum()
    return torch.multinomial(filtered_probs, num_samples=1)

# Compare min-p behavior
for name, logits in [("Confident", confident_logits), ("Uncertain", uncertain_logits)]:
    probs = F.softmax(logits, dim=-1)
    max_p = probs.max().item()
    threshold = 0.1 * max_p
    kept = (probs >= threshold).sum().item()
    print(f"{name}: max_p={max_p:.3f}, threshold={threshold:.4f}, kept={kept} tokens")

Confident: max_p=0.935, threshold=0.0935, kept=1 tokens Uncertain: max_p=0.181, threshold=0.0181, kept=8 tokens

6. Typical Sampling

Typical sampling (Meister et al., 2023) takes an information-theoretic approach. The idea is that humans tend to produce words that are neither too predictable nor too surprising. Formally, typical sampling keeps tokens whose information content (negative log-probability) is close to the entropy of the distribution (the expected information content).

A token with probability 0.9 carries very little surprise (low information). A token with probability 0.0001 carries enormous surprise. Typical sampling favors the middle ground: tokens that are about as surprising as you would expect on average. This tends to produce text that feels natural and avoids both boring and incoherent extremes.

💡 Key Insight

Typical sampling reframes the generation question: instead of asking "which tokens are most probable?" it asks "which tokens are most typical given the model's uncertainty?" This is a subtle but important distinction. In a high-entropy context, typical tokens might have relatively low individual probability, while in a low-entropy context, only the top 1 or 2 tokens are typical.

7. Repetition Penalty, Frequency Penalty, Presence Penalty

Even with good sampling strategies, language models tend to repeat themselves. Several penalty mechanisms address this:

Repetition Penalty

Introduced by Keskar et al. (2019), repetition penalty directly modifies the logits of tokens that have already appeared in the generated text:

z' i = z i / θ if token i has appeared before (and z i > 0) z' i = z i \times θ if token i has appeared before (and z i < 0)

Here θ > 1 reduces the probability of repeated tokens. A value of θ = 1.0 means no penalty; values of 1.1 to 1.3 are common.

Frequency and Presence Penalties

Popularized by the OpenAI API, these work by subtracting from logits based on token counts:

Frequency penalty: Subtracts α × count(token) from the logit. Penalizes tokens proportionally to how often they have appeared. Good for reducing word-level repetition.
Presence penalty: Subtracts β from the logit if the token has appeared at all (regardless of count). Encourages the model to explore new topics.

def apply_repetition_penalty(logits, generated_ids, penalty=1.2):
    """Apply repetition penalty to logits for already-generated tokens."""
    for token_id in set(generated_ids.tolist()):
        if logits[token_id] > 0:
            logits[token_id] /= penalty
        else:
            logits[token_id] *= penalty
    return logits

def apply_frequency_presence_penalty(logits, token_counts,
                                     freq_penalty=0.5, presence_penalty=0.5):
    """Apply OpenAI-style frequency and presence penalties."""
    for token_id, count in token_counts.items():
        logits[token_id] -= freq_penalty * count
        logits[token_id] -= presence_penalty  # flat penalty if present
    return logits

# Demonstration
logits = torch.tensor([4.0, 3.0, 2.5, 2.0, 1.5])
tokens = ["the", "cat", "sat", "on", "mat"]
generated = torch.tensor([0, 1, 2])  # "the", "cat", "sat" already generated

original_probs = F.softmax(logits, dim=-1)
penalized = apply_repetition_penalty(logits.clone(), generated, penalty=1.3)
penalized_probs = F.softmax(penalized, dim=-1)

print("Token     | Original | Penalized")
for i, t in enumerate(tokens):
    marker = " *" if i in generated.tolist() else ""
    print(f"{t:9s} | {original_probs[i]:.4f}   | {penalized_probs[i]:.4f}{marker}")

Token | Original | Penalized the | 0.3792 | 0.2010 * cat | 0.1395 | 0.0893 * sat | 0.0847 | 0.0580 * on | 0.0514 | 0.1001 mat | 0.0312 | 0.0608

Notice how the penalty shifts probability mass from already-generated tokens ("the," "cat," "sat") toward new tokens ("on," "mat"), encouraging the model to avoid repetition.

8. Combining Sampling Methods

In practice, these methods are often combined. A typical pipeline might apply transformations in this order:

Repetition penalty on the raw logits
Temperature scaling
Top-k filtering (if used)
Top-p filtering
Sample from the remaining distribution

⚠ Common Pitfall

Applying both top-k and top-p simultaneously can produce unexpected behavior. If k=50 but p=0.9 only covers 5 tokens, the effective filter is top-p (more restrictive). If k=5 but p=0.99 covers 200 tokens, the effective filter is top-k. Be intentional about which filter is the binding constraint, and consider using only one at a time unless you have a specific reason to combine them.

9. Lab: Visualizing Sampling Distributions

import torch
import torch.nn.functional as F

# Simulate a realistic token distribution from a language model
torch.manual_seed(42)
logits = torch.randn(100)  # 100 tokens for visualization
logits[0] = 5.0   # make a few tokens clearly dominant
logits[1] = 3.5
logits[2] = 3.0

methods = {
    "Original (T=1.0)": F.softmax(logits, dim=-1),
    "T=0.5": F.softmax(logits / 0.5, dim=-1),
    "T=1.5": F.softmax(logits / 1.5, dim=-1),
}

# Top-k=10
top_k_logits = logits.clone()
threshold = torch.topk(top_k_logits, 10).values[-1]
top_k_logits[top_k_logits < threshold] = float('-inf')
methods["Top-k=10"] = F.softmax(top_k_logits, dim=-1)

# Top-p=0.9
probs = F.softmax(logits, dim=-1)
sorted_p, sorted_i = torch.sort(probs, descending=True)
cumsum = torch.cumsum(sorted_p, dim=-1)
mask = cumsum - sorted_p > 0.9
sorted_p[mask] = 0
sorted_p /= sorted_p.sum()
top_p_probs = torch.zeros_like(probs)
top_p_probs.scatter_(0, sorted_i, sorted_p)
methods["Top-p=0.9"] = top_p_probs

for name, probs in methods.items():
    nonzero = (probs > 1e-6).sum().item()
    top1 = probs.max().item()
    entropy = -(probs[probs > 0] * probs[probs > 0].log()).sum().item()
    print(f"{name:20s} | active tokens: {nonzero:3d} | top-1 prob: {top1:.4f} | entropy: {entropy:.3f}")

This output reveals the key differences. Low temperature (T=0.5) makes the distribution very peaked, with the top token getting 81% of the probability mass. Top-p=0.9 is the most restrictive here, keeping only 5 tokens and achieving the lowest entropy. These numbers help you develop intuition for how each method reshapes the probability landscape.

Modify and Observe

Change the temperature from 0.5 to 0.01. How many "active tokens" effectively remain? What happens to the entropy?
Try combining top-k=10 with temperature=0.5. Which constraint is the binding one? Is the result different from using top-k=10 alone?
Set top-p to 0.99 vs. 0.5. How does the number of active tokens change? At what p value do you start losing important candidates?

❓ Section Quiz

1. What is the key advantage of top-p sampling over top-k sampling?

Show Answer

Top-p (nucleus) sampling adapts the number of candidate tokens to the model's confidence at each step. When the model is confident, the nucleus is small; when uncertain, it expands. Top-k always keeps exactly k tokens regardless of the distribution shape, which can be too many for confident predictions or too few for uncertain ones.

2. If you set temperature to 0.0 (or very close to 0), what decoding strategy does sampling become equivalent to?

Show Answer

As temperature approaches 0, the softmax distribution becomes infinitely peaked on the highest-logit token, making sampling equivalent to greedy decoding. All probability mass concentrates on a single token, so sampling always selects that token.

3. Why might you use a frequency penalty instead of a repetition penalty?

Show Answer

Frequency penalty scales linearly with how many times a token has appeared (count-based), while repetition penalty applies the same multiplicative factor regardless of count. Frequency penalty is better suited for cases where occasional repetition of a word is acceptable, but excessive repetition (e.g., repeating "the" 15 times) should be strongly penalized. Repetition penalty treats the first repetition the same as the tenth.

4. A user wants creative, diverse story generation. Which combination of parameters would you recommend: (a) T=0.3, top-p=0.5 or (b) T=0.9, top-p=0.95, presence_penalty=0.6?

Show Answer

Option (b) is far better for creative generation. T=0.9 keeps the distribution relatively broad, top-p=0.95 allows many tokens to be considered, and the presence penalty encourages the model to explore new vocabulary and topics. Option (a) would produce very focused, conservative text (low temperature and tight nucleus), which is better suited for factual tasks where creativity is undesirable.

📌 Key Takeaways

Temperature is the most fundamental knob: it scales logits before softmax, controlling how peaked or flat the distribution is. Lower values favor focus; higher values favor diversity.
Top-k restricts sampling to a fixed number of tokens. Simple but not adaptive to context.
Top-p (nucleus) keeps the smallest set of tokens exceeding a cumulative probability threshold, adapting naturally to model confidence. It is the most widely used method in production.
Min-p filters by a minimum relative probability, offering an alternative adaptive approach with a simpler conceptual model.
Typical sampling selects tokens whose information content is close to the distribution entropy, producing naturally "surprising" but not shocking text.
Repetition/frequency/presence penalties combat the tendency of language models to repeat themselves. Choose based on whether you want count-proportional or binary penalization.
In practice, these methods are combined: temperature + top-p + repetition penalty is a common default configuration.