Why add randomness? Deterministic decoding (Section 5.1) produces the same output every time, which is great for translation but terrible for creative writing, conversation, and brainstorming. Human language is inherently varied: ask ten people to complete the same sentence, and you will get ten different answers. Stochastic sampling introduces controlled randomness into the decoding process, producing diverse, interesting, human-like text. The challenge is finding the right balance: too little randomness yields repetitive, robotic text; too much yields incoherent gibberish. This section covers every major technique for controlling that balance.
Use this table as a quick reference when configuring generation parameters.
| Method | What It Controls | Typical Range | Best For |
|---|---|---|---|
| Temperature | Sharpness of the probability distribution | 0.1 to 1.5 | Global creativity dial; use lower for factual tasks, higher for brainstorming |
| Top-k | Hard cap on number of candidate tokens | 10 to 100 | Simple truncation; good baseline for casual generation |
| Top-p (Nucleus) | Cumulative probability threshold for candidates | 0.8 to 0.99 | Adaptive truncation that scales with each token's confidence level |
| Min-p | Minimum probability relative to the top token | 0.01 to 0.1 | Pruning junk tokens; pairs well with temperature |
| Repetition Penalty | Penalty for tokens already generated | 1.0 to 1.3 | Reducing loops and repetitive phrases in long outputs |
| Typical Sampling | Filters by information content (surprisal) | 0.8 to 0.99 (mass param) | Producing text that matches human-like entropy patterns |
1. Pure Random Sampling
The most direct form of stochastic decoding is ancestral sampling: at each step, sample the next token from the full probability distribution. If the model says "the" has probability 0.15, "a" has 0.10, "quantum" has 0.0001, and so on across the entire 50,000-token vocabulary, you sample according to those exact probabilities.
This produces maximally diverse output, but the quality is often poor. The long tail of the vocabulary contains thousands of tokens that are individually very unlikely but collectively hold significant probability mass. Even if each improbable token has only a 0.001% chance, with 50,000 tokens in the vocabulary, sampling from the full distribution occasionally draws rare and contextually inappropriate words, derailing the generation.
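A minimal sketch of ancestral sampling on a toy distribution (the logits and token names here are illustrative, not from a real model). With a seeded draw you can see how often the low-probability tail appears:

```python
import torch
import torch.nn.functional as F

# Toy logits over a tiny vocabulary (values are illustrative, not from a real model)
logits = torch.tensor([5.0, 3.5, 2.0, 1.0, 0.5, 0.1, -1.0, -2.0])
tokens = ["the", "cat", "dog", "it", "my", "old", "an", "..."]

probs = F.softmax(logits, dim=-1)

# Ancestral sampling: draw directly from the full distribution at each step
torch.manual_seed(0)
samples = torch.multinomial(probs, num_samples=20, replacement=True)
print([tokens[i] for i in samples.tolist()])
```

Even in this tiny vocabulary, the tail tokens occasionally appear; with 50,000 tokens instead of 8, those rare draws become a real quality problem.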
2. Temperature Scaling
Temperature is the most fundamental control knob for stochastic sampling. Before applying softmax, we divide the logits z by a temperature parameter T:

p_i = exp(z_i / T) / Σ_j exp(z_j / T)
The effect is intuitive:
- T = 1.0: The original distribution (no modification)
- T < 1.0: Sharpens the distribution, making high-probability tokens even more dominant. At T → 0, sampling becomes greedy decoding.
- T > 1.0: Flattens the distribution, giving low-probability tokens a better chance. At T → ∞, all tokens become equally likely (uniform sampling).
```python
import torch
import torch.nn.functional as F

# Simulating temperature effect on a small vocabulary
logits = torch.tensor([5.0, 3.5, 2.0, 1.0, 0.5, 0.1, -1.0, -2.0])
tokens = ["the", "cat", "dog", "it", "my", "old", "an", "..."]

for temp in [0.3, 0.7, 1.0, 1.5, 2.0]:
    probs = F.softmax(logits / temp, dim=-1)
    top_prob = probs[0].item()
    entropy = -(probs * probs.log()).sum().item()
    print(f"T={temp:.1f} | P('the')={top_prob:.3f} | entropy={entropy:.3f} | dist={[f'{p:.3f}' for p in probs.tolist()]}")
```
Common temperature ranges: 0.1 to 0.4 for factual Q&A and code generation (favoring accuracy); 0.6 to 0.8 for general conversation; 0.9 to 1.2 for creative writing and brainstorming. Temperatures above 1.5 are rarely useful in production. Most API providers (OpenAI, Anthropic, Google) expose temperature as a parameter, and it is typically the first knob users should tune.
3. Top-k Sampling
Top-k sampling (Fan et al., 2018) restricts sampling to the k most probable tokens at each step. All other tokens have their probability set to zero, and the remaining probabilities are renormalized to sum to 1.
This eliminates the long tail problem: no matter how flat the distribution is, only k tokens are ever considered. However, top-k has a significant limitation: the optimal value of k varies depending on the context. When the model is very confident (e.g., after "The capital of France is"), even k=10 might include irrelevant tokens. When the model is uncertain (e.g., after "I enjoy"), k=10 might be too restrictive, cutting off perfectly valid continuations.
```python
def top_k_sampling(logits, k=50, temperature=1.0):
    """Apply top-k filtering then sample from the result."""
    # Apply temperature
    scaled_logits = logits / temperature
    # Find the k-th largest value as threshold
    top_k_values, _ = torch.topk(scaled_logits, k)
    threshold = top_k_values[..., -1, None]
    # Zero out everything below threshold
    filtered = scaled_logits.masked_fill(scaled_logits < threshold, float('-inf'))
    # Convert to probabilities and sample
    probs = F.softmax(filtered, dim=-1)
    return torch.multinomial(probs, num_samples=1)

# Example: sampling with different k values
logits = torch.tensor([5.0, 3.5, 2.0, 1.0, 0.5, 0.1, -1.0, -2.0])
tokens = ["the", "cat", "dog", "it", "my", "old", "an", "..."]

for k in [2, 4, 6]:
    filtered = logits.clone()
    threshold = torch.topk(filtered, k).values[-1]
    filtered[filtered < threshold] = float('-inf')
    probs = F.softmax(filtered, dim=-1)
    active = [f"{tokens[i]}({probs[i]:.3f})" for i in range(len(tokens)) if probs[i] > 0]
    print(f"k={k}: {', '.join(active)}")
```
4. Nucleus (Top-p) Sampling
Nucleus sampling (Holtzman et al., 2020) addresses top-k's fixed-size problem with an elegant idea: instead of keeping a fixed number of tokens, keep the smallest set of tokens whose cumulative probability exceeds a threshold p. This adapts automatically to the shape of the distribution.
When the model is confident, the nucleus might contain only 2 or 3 tokens. When the model is uncertain, it might contain 100 or more. This adaptivity is what makes top-p the most widely used sampling method in production systems.
```python
def top_p_sampling(logits, p=0.9, temperature=1.0):
    """Apply nucleus (top-p) filtering then sample."""
    scaled_logits = logits / temperature
    probs = F.softmax(scaled_logits, dim=-1)
    # Sort probabilities in descending order
    sorted_probs, sorted_indices = torch.sort(probs, descending=True)
    cumulative_probs = torch.cumsum(sorted_probs, dim=-1)
    # Mask tokens once the cumulative probability excluding the current token
    # already exceeds p, so the token that crosses the threshold is kept
    sorted_mask = cumulative_probs - sorted_probs > p
    sorted_probs[sorted_mask] = 0.0
    # Renormalize
    sorted_probs /= sorted_probs.sum()
    # Sample from filtered distribution
    sampled_index = torch.multinomial(sorted_probs, num_samples=1)
    return sorted_indices[sampled_index]

# Demonstrate adaptive behavior
confident_logits = torch.tensor([8.0, 4.0, 1.0, 0.5, 0.1, -1.0, -2.0, -3.0])
uncertain_logits = torch.tensor([2.0, 1.8, 1.6, 1.4, 1.2, 1.0, 0.8, 0.5])

for name, logits in [("Confident", confident_logits), ("Uncertain", uncertain_logits)]:
    probs = F.softmax(logits, dim=-1)
    sorted_probs, _ = torch.sort(probs, descending=True)
    cumsum = torch.cumsum(sorted_probs, dim=-1)
    nucleus_size = (cumsum < 0.9).sum().item() + 1
    print(f"{name}: nucleus size = {nucleus_size} tokens for p=0.9")
    print(f"  Probs: {[f'{p:.3f}' for p in sorted_probs.tolist()]}")
    print(f"  Cumsum: {[f'{c:.3f}' for c in cumsum.tolist()]}\n")
```
Temperature reshapes the entire probability distribution (sharper or flatter). Top-p then truncates the reshaped distribution by removing the tail. Setting temperature=0.1 with top-p=0.9 is almost identical to temperature=0.1 alone, because the distribution is already so peaked that the nucleus contains only 1 to 2 tokens. To see top-p's effect, you need moderate temperature (0.7 to 1.0).
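A quick numeric check of this interaction, using an illustrative fairly flat distribution: as temperature drops, the number of tokens needed to reach a cumulative probability of 0.9 shrinks until the top-p filter has almost nothing left to do.

```python
import torch
import torch.nn.functional as F

# A fairly flat toy distribution (illustrative values)
logits = torch.tensor([2.0, 1.8, 1.6, 1.4, 1.2, 1.0, 0.8, 0.5])

sizes = {}
for temp in [1.0, 0.7, 0.1]:
    probs = F.softmax(logits / temp, dim=-1)
    sorted_probs, _ = torch.sort(probs, descending=True)
    cumsum = torch.cumsum(sorted_probs, dim=-1)
    # Number of tokens needed to reach cumulative probability 0.9
    sizes[temp] = (cumsum < 0.9).sum().item() + 1
    print(f"T={temp:.1f}: nucleus size for p=0.9 = {sizes[temp]}")
```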
5. Min-p Sampling
Min-p sampling is a newer technique that takes a different approach to adaptive filtering. Instead of specifying a cumulative probability threshold, min-p sets a minimum relative probability: any token whose probability is less than min_p × max_probability is discarded.
This is conceptually simple and has appealing properties. When the model is very confident (top token at 0.95), even a min_p of 0.1 only keeps tokens above 0.095, resulting in a tiny nucleus. When the model is uncertain (top token at 0.05), the threshold drops to 0.005, allowing many tokens through. The behavior adapts naturally without the cumulative probability bookkeeping of top-p.
```python
def min_p_sampling(logits, min_p=0.1, temperature=1.0):
    """Apply min-p filtering then sample."""
    scaled_logits = logits / temperature
    probs = F.softmax(scaled_logits, dim=-1)
    # Threshold: min_p * max probability
    max_prob = probs.max()
    threshold = min_p * max_prob
    # Zero out tokens below threshold
    filtered_probs = probs.clone()
    filtered_probs[probs < threshold] = 0.0
    # Renormalize and sample
    filtered_probs /= filtered_probs.sum()
    return torch.multinomial(filtered_probs, num_samples=1)

# Compare min-p behavior on the distributions from the top-p example
for name, logits in [("Confident", confident_logits), ("Uncertain", uncertain_logits)]:
    probs = F.softmax(logits, dim=-1)
    max_p = probs.max().item()
    threshold = 0.1 * max_p
    kept = (probs >= threshold).sum().item()
    print(f"{name}: max_p={max_p:.3f}, threshold={threshold:.4f}, kept={kept} tokens")
```
6. Typical Sampling
Typical sampling (Meister et al., 2023) takes an information-theoretic approach. The idea is that humans tend to produce words that are neither too predictable nor too surprising. Formally, typical sampling keeps tokens whose information content (negative log-probability) is close to the entropy of the distribution (the expected information content).
A token with probability 0.9 carries very little surprise (low information). A token with probability 0.0001 carries enormous surprise. Typical sampling favors the middle ground: tokens that are about as surprising as you would expect on average. This tends to produce text that feels natural and avoids both boring and incoherent extremes.
Typical sampling reframes the generation question: instead of asking "which tokens are most probable?" it asks "which tokens are most typical given the model's uncertainty?" This is a subtle but important distinction. In a high-entropy context, typical tokens might have relatively low individual probability, while in a low-entropy context, only the top 1 or 2 tokens are typical.
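Typical sampling has no reference code above, so here is a minimal sketch following the description: rank tokens by how close their surprisal is to the distribution's entropy, then keep the smallest such set whose cumulative probability reaches a mass parameter. The function name and the mass=0.9 default are illustrative choices, not from the paper.

```python
import torch
import torch.nn.functional as F

def typical_sampling(logits, mass=0.9, temperature=1.0):
    """Keep tokens whose surprisal is closest to the entropy, then sample."""
    probs = F.softmax(logits / temperature, dim=-1)
    surprisal = -torch.log(probs)           # information content of each token
    entropy = (probs * surprisal).sum()     # expected information content
    deviation = (surprisal - entropy).abs() # distance from "typical" surprisal
    # Sort tokens by typicality (smallest deviation first)
    sorted_dev, sorted_indices = torch.sort(deviation)
    sorted_probs = probs[sorted_indices]
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    # Keep the smallest typical set whose cumulative probability reaches `mass`
    last_kept = (cumulative < mass).sum().item() + 1
    kept_probs = sorted_probs[:last_kept]
    kept_probs = kept_probs / kept_probs.sum()
    choice = torch.multinomial(kept_probs, num_samples=1)
    return sorted_indices[choice]

logits = torch.tensor([2.0, 1.8, 1.6, 1.4, 1.2, 1.0, 0.8, 0.5])
torch.manual_seed(0)
print(typical_sampling(logits, mass=0.9).item())
```

Note that unlike top-p, the kept set is ordered by typicality rather than raw probability, so a mid-probability token can survive while a very high-probability (low-surprisal) one is dropped in a high-entropy context.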
7. Repetition Penalty, Frequency Penalty, Presence Penalty
Even with good sampling strategies, language models tend to repeat themselves. Several penalty mechanisms address this:
Repetition Penalty
Introduced by Keskar et al. (2019), repetition penalty directly modifies the logits of tokens that have already appeared in the generated text:
z'_i = z_i / θ if z_i > 0, and z'_i = z_i × θ if z_i < 0, for any token i that has already appeared.

Here θ > 1 reduces the probability of repeated tokens: dividing a positive logit shrinks it, while multiplying a negative logit pushes it further down. A value of θ = 1.0 means no penalty; values of 1.1 to 1.3 are common.
Frequency and Presence Penalties
Popularized by the OpenAI API, these work by subtracting from logits based on token counts:
- Frequency penalty: Subtracts α × count(token) from the logit. Penalizes tokens proportionally to how often they have appeared. Good for reducing word-level repetition.
- Presence penalty: Subtracts β from the logit if the token has appeared at all (regardless of count). Encourages the model to explore new topics.
```python
def apply_repetition_penalty(logits, generated_ids, penalty=1.2):
    """Apply repetition penalty to logits for already-generated tokens."""
    for token_id in set(generated_ids.tolist()):
        if logits[token_id] > 0:
            logits[token_id] /= penalty
        else:
            logits[token_id] *= penalty
    return logits

def apply_frequency_presence_penalty(logits, token_counts, freq_penalty=0.5, presence_penalty=0.5):
    """Apply OpenAI-style frequency and presence penalties."""
    for token_id, count in token_counts.items():
        logits[token_id] -= freq_penalty * count
        logits[token_id] -= presence_penalty  # flat penalty if present
    return logits

# Demonstration
logits = torch.tensor([4.0, 3.0, 2.5, 2.0, 1.5])
tokens = ["the", "cat", "sat", "on", "mat"]
generated = torch.tensor([0, 1, 2])  # "the", "cat", "sat" already generated

original_probs = F.softmax(logits, dim=-1)
penalized = apply_repetition_penalty(logits.clone(), generated, penalty=1.3)
penalized_probs = F.softmax(penalized, dim=-1)

print("Token     | Original | Penalized")
for i, t in enumerate(tokens):
    marker = " *" if i in generated.tolist() else ""
    print(f"{t:9s} | {original_probs[i]:.4f} | {penalized_probs[i]:.4f}{marker}")
```
Notice how the penalty shifts probability mass from already-generated tokens ("the," "cat," "sat") toward new tokens ("on," "mat"), encouraging the model to avoid repetition.
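The frequency/presence helper defined above is never exercised in the demonstration. A small self-contained demo of the same logic (reimplemented here so the snippet runs on its own; the parameter values are illustrative):

```python
import torch
import torch.nn.functional as F
from collections import Counter

def freq_presence_penalty(logits, token_counts, freq_penalty=0.5, presence_penalty=0.5):
    """OpenAI-style: subtract alpha * count plus a flat beta for any seen token."""
    logits = logits.clone()
    for token_id, count in token_counts.items():
        logits[token_id] -= freq_penalty * count + presence_penalty
    return logits

logits = torch.tensor([4.0, 3.0, 2.5, 2.0, 1.5])
counts = Counter([0, 0, 0, 1])  # token 0 appeared three times, token 1 once
penalized = freq_presence_penalty(logits, counts)

print("Original probs :", F.softmax(logits, dim=-1).tolist())
print("Penalized probs:", F.softmax(penalized, dim=-1).tolist())
```

Because the frequency component scales with count, token 0 (seen three times) is penalized three times as hard as token 1 on that component, while both pay the same flat presence penalty.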
8. Combining Sampling Methods
In practice, these methods are often combined. A typical pipeline might apply transformations in this order:
- Repetition penalty on the raw logits
- Temperature scaling
- Top-k filtering (if used)
- Top-p filtering
- Sample from the remaining distribution
Applying both top-k and top-p simultaneously can produce unexpected behavior. If k=50 but p=0.9 only covers 5 tokens, the effective filter is top-p (more restrictive). If k=5 but p=0.99 covers 200 tokens, the effective filter is top-k. Be intentional about which filter is the binding constraint, and consider using only one at a time unless you have a specific reason to combine them.
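Putting the pieces together, a sketch of one decoding step applying the transformations in the order listed above. The function name and parameter defaults are illustrative, not a recommended configuration:

```python
import torch
import torch.nn.functional as F

def generate_step(logits, generated_ids, repetition_penalty=1.2,
                  temperature=0.8, top_k=50, top_p=0.95):
    """One decoding step: repetition penalty -> temperature -> top-k -> top-p -> sample."""
    logits = logits.clone()
    # 1. Repetition penalty on the raw logits
    for token_id in set(generated_ids):
        if logits[token_id] > 0:
            logits[token_id] /= repetition_penalty
        else:
            logits[token_id] *= repetition_penalty
    # 2. Temperature scaling
    logits = logits / temperature
    # 3. Top-k filtering
    if top_k is not None and top_k < logits.numel():
        threshold = torch.topk(logits, top_k).values[-1]
        logits[logits < threshold] = float('-inf')
    # 4. Top-p filtering on the surviving tokens
    probs = F.softmax(logits, dim=-1)
    sorted_probs, sorted_indices = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    sorted_probs[cumulative - sorted_probs > top_p] = 0.0
    sorted_probs /= sorted_probs.sum()
    # 5. Sample from the remaining distribution
    choice = torch.multinomial(sorted_probs, num_samples=1)
    return sorted_indices[choice].item()

torch.manual_seed(0)
logits = torch.randn(100)
print(generate_step(logits, generated_ids=[3, 7, 7]))
```

In a real generation loop, this step would be called once per token, with `generated_ids` growing as output accumulates.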
9. Lab: Visualizing Sampling Distributions
```python
import torch
import torch.nn.functional as F

# Simulate a realistic token distribution from a language model
torch.manual_seed(42)
logits = torch.randn(100)  # 100 tokens for visualization
logits[0] = 5.0  # make a few tokens clearly dominant
logits[1] = 3.5
logits[2] = 3.0

methods = {
    "Original (T=1.0)": F.softmax(logits, dim=-1),
    "T=0.5": F.softmax(logits / 0.5, dim=-1),
    "T=1.5": F.softmax(logits / 1.5, dim=-1),
}

# Top-k=10
top_k_logits = logits.clone()
threshold = torch.topk(top_k_logits, 10).values[-1]
top_k_logits[top_k_logits < threshold] = float('-inf')
methods["Top-k=10"] = F.softmax(top_k_logits, dim=-1)

# Top-p=0.9
probs = F.softmax(logits, dim=-1)
sorted_p, sorted_i = torch.sort(probs, descending=True)
cumsum = torch.cumsum(sorted_p, dim=-1)
mask = cumsum - sorted_p > 0.9
sorted_p[mask] = 0
sorted_p /= sorted_p.sum()
top_p_probs = torch.zeros_like(probs)
top_p_probs.scatter_(0, sorted_i, sorted_p)
methods["Top-p=0.9"] = top_p_probs

for name, probs in methods.items():
    nonzero = (probs > 1e-6).sum().item()
    top1 = probs.max().item()
    entropy = -(probs[probs > 0] * probs[probs > 0].log()).sum().item()
    print(f"{name:20s} | active tokens: {nonzero:3d} | top-1 prob: {top1:.4f} | entropy: {entropy:.3f}")
```
This output reveals the key differences. Low temperature (T=0.5) makes the distribution very peaked, with the top token getting 81% of the probability mass. Top-p=0.9 is the most restrictive here, keeping only 5 tokens and achieving the lowest entropy. These numbers help you develop intuition for how each method reshapes the probability landscape.
- Change the temperature from 0.5 to 0.01. How many "active tokens" effectively remain? What happens to the entropy?
- Try combining top-k=10 with temperature=0.5. Which constraint is the binding one? Is the result different from using top-k=10 alone?
- Set top-p to 0.99 vs. 0.5. How does the number of active tokens change? At what p value do you start losing important candidates?
❓ Section Quiz
1. What is the key advantage of top-p sampling over top-k sampling?
2. If you set temperature to 0.0 (or very close to 0), what decoding strategy does sampling become equivalent to?
3. Why might you use a frequency penalty instead of a repetition penalty?
4. A user wants creative, diverse story generation. Which combination of parameters would you recommend: (a) T=0.3, top-p=0.5 or (b) T=0.9, top-p=0.95, presence_penalty=0.6?
📌 Key Takeaways
- Temperature is the most fundamental knob: it scales logits before softmax, controlling how peaked or flat the distribution is. Lower values favor focus; higher values favor diversity.
- Top-k restricts sampling to a fixed number of tokens. Simple but not adaptive to context.
- Top-p (nucleus) keeps the smallest set of tokens exceeding a cumulative probability threshold, adapting naturally to model confidence. It is the most widely used method in production.
- Min-p filters by a minimum relative probability, offering an alternative adaptive approach with a simpler conceptual model.
- Typical sampling selects tokens whose information content is close to the distribution entropy, producing naturally "surprising" but not shocking text.
- Repetition/frequency/presence penalties combat the tendency of language models to repeat themselves. Choose based on whether you want count-proportional or binary penalization.
- In practice, these methods are combined: temperature + top-p + repetition penalty is a common default configuration.