Module 05 · Section 5.2

Stochastic Sampling Methods

Temperature, top-k, nucleus sampling, min-p, and the art of controlled randomness

At temperature 0.0, I am a boring but reliable narrator. At temperature 2.0, I am a jazz musician who has lost the sheet music.

Stochastic Steve, a sampling method with range
★ Big Picture

Why add randomness? Deterministic decoding (Section 5.1) produces the same output every time, which is great for translation but terrible for creative writing, conversation, and brainstorming. Human language is inherently varied: ask ten people to complete the same sentence, and you will get ten different answers. Stochastic sampling introduces controlled randomness into the decoding process, producing diverse, interesting, human-like text. The challenge is finding the right balance: too little randomness yields repetitive, robotic text; too much yields incoherent gibberish. This section covers every major technique for controlling that balance.

📝 Decision Framework: Choosing a Decoding Method

Use this table as a quick reference when configuring generation parameters.

| Method | What It Controls | Typical Range | Best For |
| --- | --- | --- | --- |
| Temperature | Sharpness of the probability distribution | 0.1 to 1.5 | Global creativity dial; lower for factual tasks, higher for brainstorming |
| Top-k | Hard cap on the number of candidate tokens | 10 to 100 | Simple truncation; good baseline for casual generation |
| Top-p (Nucleus) | Cumulative probability threshold for candidates | 0.8 to 0.99 | Adaptive truncation that tracks the model's confidence at each step |
| Min-p | Minimum probability relative to the top token | 0.01 to 0.1 | Pruning junk tokens; pairs well with temperature |
| Repetition Penalty | Penalty for tokens already generated | 1.0 to 1.3 | Reducing loops and repeated phrases in long outputs |
| Typical Sampling | Filters by information content (surprisal) | 0.8 to 0.99 (mass parameter) | Producing text that matches human-like entropy patterns |

1. Pure Random Sampling

The most direct form of stochastic decoding is ancestral sampling: at each step, sample the next token from the full probability distribution. If the model says "the" has probability 0.15, "a" has 0.10, "quantum" has 0.0001, and so on across the entire 50,000-token vocabulary, you sample according to those exact probabilities.

This produces maximally diverse output, but the quality is often poor. The long tail of the vocabulary contains thousands of tokens that are individually very unlikely but collectively hold significant probability mass. Even if each improbable token has only a 0.001% chance, with 50,000 tokens in the vocabulary, sampling from the full distribution occasionally draws rare and contextually inappropriate words, derailing the generation.

2. Temperature Scaling

Temperature is the most fundamental control knob for stochastic sampling. Before applying softmax, we divide the logits by a temperature parameter T:

P(xi) = exp(zi / T) / Σj exp(zj / T)

The effect is intuitive:

[Figure: bar chart of token probabilities ("the", "cat", "dog", "it", "my", "old", "an", ...), sorted by probability, under T = 0.3 (sharp), T = 1.0 (original), and T = 2.0 (flat).]
Figure 5.3: Temperature controls the "peakiness" of the distribution. Lower temperatures concentrate probability on top tokens; higher temperatures spread it more evenly.
import torch
import torch.nn.functional as F

# Simulating temperature effect on a small vocabulary
logits = torch.tensor([5.0, 3.5, 2.0, 1.0, 0.5, 0.1, -1.0, -2.0])
tokens = ["the", "cat", "dog", "it", "my", "old", "an", "..."]

for temp in [0.3, 0.7, 1.0, 1.5, 2.0]:
    probs = F.softmax(logits / temp, dim=-1)
    top_prob = probs[0].item()
    entropy = -(probs * probs.log()).sum().item()
    print(f"T={temp:.1f} | P('the')={top_prob:.3f} | entropy={entropy:.3f} | dist={[f'{p:.3f}' for p in probs.tolist()]}")
T=0.3 | P('the')=0.993 | entropy=0.041 | dist=['0.993', '0.007', '0.000', '0.000', '0.000', '0.000', '0.000', '0.000']
T=0.7 | P('the')=0.879 | entropy=0.435 | dist=['0.879', '0.103', '0.012', '0.003', '0.001', '0.001', '0.000', '0.000']
T=1.0 | P('the')=0.762 | entropy=0.779 | dist=['0.762', '0.170', '0.038', '0.014', '0.008', '0.006', '0.002', '0.001']
T=1.5 | P('the')=0.592 | entropy=1.243 | dist=['0.592', '0.218', '0.080', '0.041', '0.029', '0.023', '0.011', '0.006']
T=2.0 | P('the')=0.476 | entropy=1.534 | dist=['0.476', '0.225', '0.106', '0.064', '0.050', '0.041', '0.024', '0.014']
📝 Practical Guidance

Common temperature ranges: 0.1 to 0.4 for factual Q&A and code generation (favoring accuracy); 0.6 to 0.8 for general conversation; 0.9 to 1.2 for creative writing and brainstorming. Temperatures above 1.5 are rarely useful in production. Most API providers (OpenAI, Anthropic, Google) expose temperature as a parameter, and it is typically the first knob users should tune.

3. Top-k Sampling

Top-k sampling (Fan et al., 2018) restricts sampling to the k most probable tokens at each step. All other tokens have their probability set to zero, and the remaining probabilities are renormalized to sum to 1.

P'(xi) = P(xi) / Σj ∈ top-k P(xj)    if xi is in the top-k, else 0

This eliminates the long tail problem: no matter how flat the distribution is, only k tokens are ever considered. However, top-k has a significant limitation: the optimal value of k varies depending on the context. When the model is very confident (e.g., after "The capital of France is"), even k=10 might include irrelevant tokens. When the model is uncertain (e.g., after "I enjoy"), k=10 might be too restrictive, cutting off perfectly valid continuations.

def top_k_sampling(logits, k=50, temperature=1.0):
    """Apply top-k filtering then sample from the result."""
    # Apply temperature
    scaled_logits = logits / temperature

    # Find the k-th largest value as threshold
    top_k_values, _ = torch.topk(scaled_logits, k)
    threshold = top_k_values[..., -1, None]

    # Zero out everything below threshold
    filtered = scaled_logits.masked_fill(scaled_logits < threshold, float('-inf'))

    # Convert to probabilities and sample
    probs = F.softmax(filtered, dim=-1)
    return torch.multinomial(probs, num_samples=1)

# Example: sampling with different k values
logits = torch.tensor([5.0, 3.5, 2.0, 1.0, 0.5, 0.1, -1.0, -2.0])
tokens = ["the", "cat", "dog", "it", "my", "old", "an", "..."]

for k in [2, 4, 6]:
    filtered = logits.clone()
    threshold = torch.topk(filtered, k).values[-1]
    filtered[filtered < threshold] = float('-inf')
    probs = F.softmax(filtered, dim=-1)
    active = [f"{tokens[i]}({probs[i]:.3f})" for i in range(len(tokens)) if probs[i] > 0]
    print(f"k={k}: {', '.join(active)}")
k=2: the(0.818), cat(0.182)
k=4: the(0.774), cat(0.173), dog(0.039), it(0.014)
k=6: the(0.763), cat(0.170), dog(0.038), it(0.014), my(0.008), old(0.006)

4. Nucleus (Top-p) Sampling

Nucleus sampling (Holtzman et al., 2020) addresses top-k's fixed-size problem with an elegant idea: instead of keeping a fixed number of tokens, keep the smallest set of tokens whose cumulative probability exceeds a threshold p. This adapts automatically to the shape of the distribution.

Vp = smallest set such that Σx ∈ Vp P(x) ≥ p

When the model is confident, the nucleus might contain only 2 or 3 tokens. When the model is uncertain, it might contain 100 or more. This adaptivity is what makes top-p the most widely used sampling method in production systems.

[Figure: two panels. Confident (peaked): Paris 0.70, Lyon 0.22, Rome, ... → nucleus (p=0.9) = 2 tokens. Uncertain (flat): eat 0.22, run 0.19, play 0.17, read 0.15, swim 0.12, ... → nucleus (p=0.9) = 5 tokens.]
Figure 5.4: Top-p sampling adapts the number of candidate tokens to model confidence. When confident, few tokens suffice; when uncertain, the nucleus expands.
def top_p_sampling(logits, p=0.9, temperature=1.0):
    """Apply nucleus (top-p) filtering then sample."""
    scaled_logits = logits / temperature
    probs = F.softmax(scaled_logits, dim=-1)

    # Sort probabilities in descending order
    sorted_probs, sorted_indices = torch.sort(probs, descending=True)
    cumulative_probs = torch.cumsum(sorted_probs, dim=-1)

    # Mask tokens whose cumulative probability *before* including them
    # already exceeds p; this keeps the first token that crosses the threshold
    sorted_mask = cumulative_probs - sorted_probs > p
    sorted_probs[sorted_mask] = 0.0

    # Renormalize
    sorted_probs /= sorted_probs.sum()

    # Sample from filtered distribution
    sampled_index = torch.multinomial(sorted_probs, num_samples=1)
    return sorted_indices[sampled_index]

# Demonstrate adaptive behavior
confident_logits = torch.tensor([8.0, 4.0, 1.0, 0.5, 0.1, -1.0, -2.0, -3.0])
uncertain_logits = torch.tensor([2.0, 1.8, 1.6, 1.4, 1.2, 1.0, 0.8, 0.5])

for name, logits in [("Confident", confident_logits), ("Uncertain", uncertain_logits)]:
    probs = F.softmax(logits, dim=-1)
    sorted_probs, _ = torch.sort(probs, descending=True)
    cumsum = torch.cumsum(sorted_probs, dim=-1)
    nucleus_size = (cumsum < 0.9).sum().item() + 1
    print(f"{name}: nucleus size = {nucleus_size} tokens for p=0.9")
    print(f"  Probs: {[f'{p:.3f}' for p in sorted_probs.tolist()]}")
    print(f"  Cumsum: {[f'{c:.3f}' for c in cumsum.tolist()]}\n")
Confident: nucleus size = 1 tokens for p=0.9
  Probs: ['0.980', '0.018', '0.001', '0.001', '0.000', '0.000', '0.000', '0.000']
  Cumsum: ['0.980', '0.998', '0.999', '0.999', '1.000', '1.000', '1.000', '1.000']
Uncertain: nucleus size = 7 tokens for p=0.9
  Probs: ['0.228', '0.187', '0.153', '0.125', '0.103', '0.084', '0.069', '0.051']
  Cumsum: ['0.228', '0.415', '0.568', '0.694', '0.796', '0.880', '0.949', '1.000']
Common Misconception: Temperature and Top-p Are Not Redundant

Temperature reshapes the entire probability distribution (sharper or flatter). Top-p then truncates the reshaped distribution by removing the tail. Setting temperature=0.1 with top-p=0.9 is almost identical to temperature=0.1 alone, because the distribution is already so peaked that the nucleus contains only 1 to 2 tokens. To see top-p's effect, you need moderate temperature (0.7 to 1.0).
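This interaction is easy to verify with the same toy logits used in the temperature demo earlier in this section:

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([5.0, 3.5, 2.0, 1.0, 0.5, 0.1, -1.0, -2.0])

def nucleus_size(logits, temp, p=0.9):
    """Number of tokens the top-p nucleus keeps after temperature scaling."""
    probs = F.softmax(logits / temp, dim=-1)
    sorted_probs, _ = torch.sort(probs, descending=True)
    cumsum = torch.cumsum(sorted_probs, dim=-1)
    return int((cumsum < p).sum().item()) + 1

for temp in [0.1, 1.0, 2.0]:
    print(f"T={temp}: nucleus size = {nucleus_size(logits, temp)}")
```

At T=0.1 the nucleus collapses to a single token, so top-p has nothing left to trim; only at moderate-to-high temperatures does the p=0.9 threshold operate on a multi-token candidate set.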

5. Min-p Sampling

Min-p sampling is a newer technique that takes a different approach to adaptive filtering. Instead of specifying a cumulative probability threshold, min-p sets a minimum relative probability: any token whose probability is less than min_p × max_probability is discarded.

Keep token xi if P(xi) ≥ min_p × maxj P(xj)

This is conceptually simple and has appealing properties. When the model is very confident (top token at 0.95), even a min_p of 0.1 only keeps tokens above 0.095, resulting in a tiny nucleus. When the model is uncertain (top token at 0.05), the threshold drops to 0.005, allowing many tokens through. The behavior adapts naturally without the cumulative probability bookkeeping of top-p.

def min_p_sampling(logits, min_p=0.1, temperature=1.0):
    """Apply min-p filtering then sample."""
    scaled_logits = logits / temperature
    probs = F.softmax(scaled_logits, dim=-1)

    # Threshold: min_p * max probability
    max_prob = probs.max()
    threshold = min_p * max_prob

    # Zero out tokens below threshold
    filtered_probs = probs.clone()
    filtered_probs[probs < threshold] = 0.0

    # Renormalize and sample
    filtered_probs /= filtered_probs.sum()
    return torch.multinomial(filtered_probs, num_samples=1)

# Compare min-p behavior
for name, logits in [("Confident", confident_logits), ("Uncertain", uncertain_logits)]:
    probs = F.softmax(logits, dim=-1)
    max_p = probs.max().item()
    threshold = 0.1 * max_p
    kept = (probs >= threshold).sum().item()
    print(f"{name}: max_p={max_p:.3f}, threshold={threshold:.4f}, kept={kept} tokens")
Confident: max_p=0.980, threshold=0.0980, kept=1 tokens
Uncertain: max_p=0.228, threshold=0.0228, kept=8 tokens

6. Typical Sampling

Typical sampling (Meister et al., 2023) takes an information-theoretic approach. The idea is that humans tend to produce words that are neither too predictable nor too surprising. Formally, typical sampling keeps tokens whose information content (negative log-probability) is close to the entropy of the distribution (the expected information content).

A token with probability 0.9 carries very little surprise (low information). A token with probability 0.0001 carries enormous surprise. Typical sampling favors the middle ground: tokens that are about as surprising as you would expect on average. This tends to produce text that feels natural and avoids both boring and incoherent extremes.

💡 Key Insight

Typical sampling reframes the generation question: instead of asking "which tokens are most probable?" it asks "which tokens are most typical given the model's uncertainty?" This is a subtle but important distinction. In a high-entropy context, typical tokens might have relatively low individual probability, while in a low-entropy context, only the top 1 or 2 tokens are typical.
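The published algorithm has additional details, but a minimal sketch of the core idea, with illustrative parameter names, looks like this:

```python
import torch
import torch.nn.functional as F

def typical_sampling(logits, mass=0.9, temperature=1.0):
    """Minimal sketch of locally typical sampling: rank tokens by how close
    their surprisal (-log p) is to the distribution's entropy, then keep the
    most typical tokens until their cumulative probability reaches `mass`."""
    probs = F.softmax(logits / temperature, dim=-1)
    log_probs = probs.log()
    entropy = -(probs * log_probs).sum()

    # Deviation of each token's surprisal from the expected surprisal (entropy)
    deviation = (entropy + log_probs).abs()

    # Sort by typicality: smallest deviation first
    _, sorted_idx = torch.sort(deviation)
    sorted_probs = probs[sorted_idx]
    cumsum = torch.cumsum(sorted_probs, dim=-1)

    # Smallest typical set whose probability mass reaches the threshold
    cutoff = int((cumsum < mass).sum().item()) + 1
    keep = sorted_idx[:cutoff]

    filtered = torch.zeros_like(probs)
    filtered[keep] = probs[keep]
    filtered /= filtered.sum()
    return torch.multinomial(filtered, num_samples=1)

torch.manual_seed(0)
uncertain = torch.tensor([2.0, 1.8, 1.6, 1.4, 1.2, 1.0, 0.8, 0.5])
print(typical_sampling(uncertain, mass=0.9).item())
```

Note the ranking criterion: tokens are sorted by closeness to the entropy, not by raw probability, which is exactly the distinction described above.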

7. Repetition Penalty, Frequency Penalty, Presence Penalty

Even with good sampling strategies, language models tend to repeat themselves. Several penalty mechanisms address this:

Repetition Penalty

Introduced by Keskar et al. (2019), repetition penalty directly modifies the logits of tokens that have already appeared in the generated text:

z'i = zi / θ   if token i has appeared before (and zi > 0)
z'i = zi × θ   if token i has appeared before (and zi < 0)

Here θ > 1 reduces the probability of repeated tokens. A value of θ = 1.0 means no penalty; values of 1.1 to 1.3 are common.

Frequency and Presence Penalties

Popularized by the OpenAI API, these subtract from each token's logit based on how often it has appeared: z'i = zi − α × count(i) − β × 1[count(i) > 0], where α is the frequency penalty (scales with the count) and β is the presence penalty (a flat, one-time penalty for any token that has appeared at all):

def apply_repetition_penalty(logits, generated_ids, penalty=1.2):
    """Apply repetition penalty to logits for already-generated tokens."""
    for token_id in set(generated_ids.tolist()):
        if logits[token_id] > 0:
            logits[token_id] /= penalty
        else:
            logits[token_id] *= penalty
    return logits

def apply_frequency_presence_penalty(logits, token_counts,
                                     freq_penalty=0.5, presence_penalty=0.5):
    """Apply OpenAI-style frequency and presence penalties."""
    for token_id, count in token_counts.items():
        logits[token_id] -= freq_penalty * count
        logits[token_id] -= presence_penalty  # flat penalty if present
    return logits

# Demonstration
logits = torch.tensor([4.0, 3.0, 2.5, 2.0, 1.5])
tokens = ["the", "cat", "sat", "on", "mat"]
generated = torch.tensor([0, 1, 2])  # "the", "cat", "sat" already generated

original_probs = F.softmax(logits, dim=-1)
penalized = apply_repetition_penalty(logits.clone(), generated, penalty=1.3)
penalized_probs = F.softmax(penalized, dim=-1)

print("Token     | Original | Penalized")
for i, t in enumerate(tokens):
    marker = " *" if i in generated.tolist() else ""
    print(f"{t:9s} | {original_probs[i]:.4f}   | {penalized_probs[i]:.4f}{marker}")
Token     | Original | Penalized
the       | 0.5530   | 0.4299 *
cat       | 0.2034   | 0.1992 *
sat       | 0.1234   | 0.1356 *
on        | 0.0748   | 0.1464
mat       | 0.0454   | 0.0888

Notice how the penalty shifts probability mass toward the unseen tokens: "on" and "mat" roughly double their probabilities, while the dominant repeated token "the" drops sharply. Interestingly, "sat" ends slightly higher despite being penalized, because the large penalty on "the" frees more probability mass during renormalization than the penalty on "sat" removes. Multiplicative logit penalties redistribute mass in sometimes unintuitive ways.

8. Combining Sampling Methods

In practice, these methods are often combined. A typical pipeline might apply transformations in this order:

  1. Repetition penalty on the raw logits
  2. Temperature scaling
  3. Top-k filtering (if used)
  4. Top-p filtering
  5. Sample from the remaining distribution
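One way to wire those five steps together is sketched below; the ordering follows the list above, while the default parameter values are merely illustrative:

```python
import torch
import torch.nn.functional as F

def sample_next_token(logits, generated_ids=None, repetition_penalty=1.2,
                      temperature=0.8, top_k=50, top_p=0.95):
    """Sketch of a combined sampling pipeline:
    repetition penalty -> temperature -> top-k -> top-p -> sample."""
    logits = logits.clone()

    # 1. Repetition penalty on the raw logits
    if generated_ids is not None:
        for tid in set(generated_ids.tolist()):
            if logits[tid] > 0:
                logits[tid] = logits[tid] / repetition_penalty
            else:
                logits[tid] = logits[tid] * repetition_penalty

    # 2. Temperature scaling
    logits = logits / temperature

    # 3. Top-k filtering (skipped when k covers the whole vocabulary)
    if top_k is not None and top_k < logits.numel():
        kth = torch.topk(logits, top_k).values[-1]
        logits[logits < kth] = float('-inf')

    # 4. Top-p filtering
    probs = F.softmax(logits, dim=-1)
    sorted_probs, sorted_idx = torch.sort(probs, descending=True)
    cumsum = torch.cumsum(sorted_probs, dim=-1)
    sorted_probs[cumsum - sorted_probs > top_p] = 0.0
    probs = torch.zeros_like(probs).scatter_(0, sorted_idx, sorted_probs)
    probs /= probs.sum()

    # 5. Sample from what remains
    return torch.multinomial(probs, num_samples=1)

torch.manual_seed(0)
logits = torch.tensor([5.0, 3.5, 2.0, 1.0, 0.5, 0.1, -1.0, -2.0])
print(sample_next_token(logits, generated_ids=torch.tensor([0])).item())
```

Ordering matters: the repetition penalty must see raw logits (its sign test breaks after temperature scaling), and truncation filters must run after temperature so they operate on the reshaped distribution.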
⚠ Common Pitfall

Applying both top-k and top-p simultaneously can produce unexpected behavior. If k=50 but p=0.9 only covers 5 tokens, the effective filter is top-p (more restrictive). If k=5 but p=0.99 covers 200 tokens, the effective filter is top-k. Be intentional about which filter is the binding constraint, and consider using only one at a time unless you have a specific reason to combine them.
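To see which filter binds in practice, you can count survivors directly, reusing the confident and uncertain toy distributions from earlier (the helper below is illustrative):

```python
import torch
import torch.nn.functional as F

def survivors(logits, k=None, p=None):
    """Count how many tokens survive top-k and/or top-p filtering."""
    probs = F.softmax(logits, dim=-1)
    sorted_probs, _ = torch.sort(probs, descending=True)
    keep = logits.numel()
    if k is not None:
        keep = min(keep, k)
    if p is not None:
        cumsum = torch.cumsum(sorted_probs, dim=-1)
        keep = min(keep, int((cumsum < p).sum().item()) + 1)
    return keep

peaked = torch.tensor([8.0, 4.0, 1.0, 0.5, 0.1, -1.0, -2.0, -3.0])
flat = torch.tensor([2.0, 1.8, 1.6, 1.4, 1.2, 1.0, 0.8, 0.5])

print(survivors(peaked, k=5, p=0.9))   # p=0.9 nucleus is smaller than k: top-p binds
print(survivors(flat, k=5, p=0.99))    # p=0.99 nucleus covers all 8 tokens: top-k binds
```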

9. Lab: Visualizing Sampling Distributions

import torch
import torch.nn.functional as F

# Simulate a realistic token distribution from a language model
torch.manual_seed(42)
logits = torch.randn(100)  # 100 tokens for visualization
logits[0] = 5.0   # make a few tokens clearly dominant
logits[1] = 3.5
logits[2] = 3.0

methods = {
    "Original (T=1.0)": F.softmax(logits, dim=-1),
    "T=0.5": F.softmax(logits / 0.5, dim=-1),
    "T=1.5": F.softmax(logits / 1.5, dim=-1),
}

# Top-k=10
top_k_logits = logits.clone()
threshold = torch.topk(top_k_logits, 10).values[-1]
top_k_logits[top_k_logits < threshold] = float('-inf')
methods["Top-k=10"] = F.softmax(top_k_logits, dim=-1)

# Top-p=0.9
probs = F.softmax(logits, dim=-1)
sorted_p, sorted_i = torch.sort(probs, descending=True)
cumsum = torch.cumsum(sorted_p, dim=-1)
mask = cumsum - sorted_p > 0.9
sorted_p[mask] = 0
sorted_p /= sorted_p.sum()
top_p_probs = torch.zeros_like(probs)
top_p_probs.scatter_(0, sorted_i, sorted_p)
methods["Top-p=0.9"] = top_p_probs

for name, probs in methods.items():
    nonzero = (probs > 1e-6).sum().item()
    top1 = probs.max().item()
    entropy = -(probs[probs > 0] * probs[probs > 0].log()).sum().item()
    print(f"{name:20s} | active tokens: {nonzero:3d} | top-1 prob: {top1:.4f} | entropy: {entropy:.3f}")
Original (T=1.0)     | active tokens: 100 | top-1 prob: 0.4228 | entropy: 2.541
T=0.5                | active tokens: 100 | top-1 prob: 0.8106 | entropy: 0.889
T=1.5                | active tokens: 100 | top-1 prob: 0.2408 | entropy: 3.378
Top-k=10             | active tokens:  10 | top-1 prob: 0.5497 | entropy: 1.557
Top-p=0.9            | active tokens:   5 | top-1 prob: 0.4698 | entropy: 1.269

This output reveals the key differences. Low temperature (T=0.5) makes the distribution very peaked, with the top token getting 81% of the probability mass. Top-p=0.9 is the most restrictive here, keeping only 5 tokens and achieving the lowest entropy. These numbers help you develop intuition for how each method reshapes the probability landscape.


❓ Section Quiz

1. What is the key advantage of top-p sampling over top-k sampling?

Answer:
Top-p (nucleus) sampling adapts the number of candidate tokens to the model's confidence at each step. When the model is confident, the nucleus is small; when uncertain, it expands. Top-k always keeps exactly k tokens regardless of the distribution shape, which can be too many for confident predictions or too few for uncertain ones.

2. If you set temperature to 0.0 (or very close to 0), what decoding strategy does sampling become equivalent to?

Answer:
As temperature approaches 0, the softmax distribution becomes infinitely peaked on the highest-logit token, making sampling equivalent to greedy decoding. All probability mass concentrates on a single token, so sampling always selects that token.

3. Why might you use a frequency penalty instead of a repetition penalty?

Answer:
Frequency penalty scales linearly with how many times a token has appeared (count-based), while repetition penalty applies the same multiplicative factor regardless of count. Frequency penalty is better suited for cases where occasional repetition of a word is acceptable, but excessive repetition (e.g., repeating "the" 15 times) should be strongly penalized. Repetition penalty treats the first repetition the same as the tenth.

4. A user wants creative, diverse story generation. Which combination of parameters would you recommend: (a) T=0.3, top-p=0.5 or (b) T=0.9, top-p=0.95, presence_penalty=0.6?

Answer:
Option (b) is far better for creative generation. T=0.9 keeps the distribution relatively broad, top-p=0.95 allows many tokens to be considered, and the presence penalty encourages the model to explore new vocabulary and topics. Option (a) would produce very focused, conservative text (low temperature and tight nucleus), which is better suited for factual tasks where creativity is undesirable.

📌 Key Takeaways