Claiming that Model A outperforms Model B on a benchmark means nothing without statistical evidence. LLM evaluations are inherently noisy: outputs vary with random seeds, prompt phrasing, and sampling temperature. A 2% accuracy difference on 100 test examples could easily be due to chance. This section teaches you to design experiments that produce reliable, reproducible conclusions by applying bootstrap confidence intervals, paired statistical tests, effect size reporting, and principled ablation study design. It also covers the critical problem of benchmark contamination and how to detect it.
1. Why Statistical Rigor Matters for LLM Evaluation
Traditional software testing is deterministic: a function either returns the correct value or it does not. LLM evaluation is fundamentally different because the same prompt can produce different outputs across runs (when temperature > 0), different models may perform differently on different subsets of a benchmark, and small benchmark sizes amplify sampling noise. Without statistical rigor, you risk deploying a model based on a performance difference that was nothing more than random fluctuation.
The core principle is straightforward: every evaluation result should come with a measure of uncertainty. A point estimate like "Model A achieves 78.3% accuracy" is incomplete without a confidence interval such as "78.3% ± 2.1% (95% CI)." When comparing two systems, you need a formal test of whether the observed difference is statistically significant.
Common Statistical Mistakes in LLM Papers
- No confidence intervals: Reporting a single accuracy number without any uncertainty quantification
- Inappropriate independence assumptions: Using unpaired tests when both models were evaluated on the same examples
- Multiple comparisons without correction: Comparing many model variants without adjusting for the increased chance of false positives
- Cherry-picked seeds: Running experiments with many random seeds and reporting only the best result
- Ignoring effect size: Claiming a "significant" improvement that is too small to matter in practice
2. Bootstrap Confidence Intervals
The bootstrap is a resampling method that estimates the sampling distribution of a statistic by repeatedly drawing samples (with replacement) from the observed data. It requires no distributional assumptions, which makes it ideal for LLM evaluation metrics that may have unusual distributions (skewed, bounded, multimodal).
The procedure is simple: given n evaluation results, draw B bootstrap samples of size n (with replacement), compute the metric on each sample, and use the distribution of bootstrapped metrics to construct a confidence interval. The percentile method takes the 2.5th and 97.5th percentiles for a 95% confidence interval.
```python
import numpy as np
from typing import Callable

def bootstrap_ci(
    scores: list[float],
    metric_fn: Callable = np.mean,
    n_bootstrap: int = 10_000,
    confidence: float = 0.95,
    seed: int = 42
) -> dict:
    """Compute bootstrap confidence interval for any metric.

    Args:
        scores: Per-example scores (e.g., accuracy per question)
        metric_fn: Aggregation function (mean, median, etc.)
        n_bootstrap: Number of bootstrap resamples
        confidence: Confidence level (e.g., 0.95 for 95% CI)
        seed: Random seed for reproducibility
    """
    rng = np.random.default_rng(seed)
    scores = np.array(scores)
    n = len(scores)

    # Generate all bootstrap samples at once for efficiency
    boot_indices = rng.integers(0, n, size=(n_bootstrap, n))
    boot_stats = np.array([metric_fn(scores[idx]) for idx in boot_indices])

    alpha = 1 - confidence
    lower = np.percentile(boot_stats, 100 * alpha / 2)
    upper = np.percentile(boot_stats, 100 * (1 - alpha / 2))

    return {
        "point_estimate": float(metric_fn(scores)),
        "ci_lower": float(lower),
        "ci_upper": float(upper),
        "confidence": confidence,
        "n_bootstrap": n_bootstrap,
        "std_error": float(np.std(boot_stats))
    }

# Example: binary accuracy scores from a benchmark
accuracy_scores = [1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1,
                   1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1]

result = bootstrap_ci(accuracy_scores)
print(f"Accuracy: {result['point_estimate']:.3f}")
print(f"95% CI: [{result['ci_lower']:.3f}, {result['ci_upper']:.3f}]")
print(f"Standard Error: {result['std_error']:.4f}")
```
For standard confidence intervals, 10,000 bootstrap resamples are typically sufficient. For more precise p-values or very narrow confidence intervals, consider 50,000 or 100,000 resamples. The computation is fast because each resample only involves indexing into an array, not re-running the model.
3. Paired Statistical Tests
When comparing two models, the key question is whether the observed difference in performance is statistically significant. Because both models are evaluated on the same test examples, you should use paired tests that account for the correlation between results on each example. Paired tests are more powerful than unpaired tests because they eliminate example-level variance.
McNemar's Test for Binary Outcomes
McNemar's test is the standard choice when evaluation outcomes are binary (correct or incorrect). It focuses on the discordant pairs: examples where the two models disagree. If Model A gets 20 examples right that Model B gets wrong, and Model B gets 8 examples right that Model A gets wrong, McNemar's test determines whether this 20 vs. 8 imbalance is statistically significant.
```python
from scipy import stats
import numpy as np

def mcnemar_test(model_a_correct: list[bool],
                 model_b_correct: list[bool]) -> dict:
    """McNemar's test for comparing two models on the same examples.

    Focuses on discordant pairs (examples where models disagree).
    """
    a = np.array(model_a_correct)
    b = np.array(model_b_correct)

    # Count the four outcome types
    both_correct = np.sum(a & b)
    both_wrong = np.sum(~a & ~b)
    a_only = np.sum(a & ~b)   # A correct, B wrong
    b_only = np.sum(~a & b)   # B correct, A wrong

    # McNemar's test with continuity correction
    if a_only + b_only == 0:
        return {"message": "No discordant pairs; models agree on all examples"}

    chi2 = (abs(a_only - b_only) - 1) ** 2 / (a_only + b_only)
    p_value = 1 - stats.chi2.cdf(chi2, df=1)

    return {
        "contingency": {
            "both_correct": int(both_correct),
            "a_only_correct": int(a_only),
            "b_only_correct": int(b_only),
            "both_wrong": int(both_wrong)
        },
        "chi2_statistic": round(chi2, 4),
        "p_value": round(p_value, 6),
        "significant_at_005": p_value < 0.05,
        "model_a_accuracy": round(np.mean(a), 4),
        "model_b_accuracy": round(np.mean(b), 4)
    }

# Simulate evaluation results for 200 test examples
np.random.seed(42)
model_a = np.random.binomial(1, 0.82, size=200).astype(bool)
model_b = np.random.binomial(1, 0.75, size=200).astype(bool)

result = mcnemar_test(model_a.tolist(), model_b.tolist())
for key, val in result.items():
    print(f"  {key}: {val}")
```
Paired Bootstrap Test
For continuous scores (such as BERTScore, ROUGE F1, or judge ratings), the paired bootstrap test computes a confidence interval on the difference in metric values. By resampling paired differences, you directly estimate whether the gap between two systems is reliably different from zero.
```python
import numpy as np

def paired_bootstrap_test(
    scores_a: list[float],
    scores_b: list[float],
    n_bootstrap: int = 10_000,
    seed: int = 42
) -> dict:
    """Paired bootstrap test for comparing two systems.

    Tests whether the difference in mean scores is significant.
    """
    rng = np.random.default_rng(seed)
    a = np.array(scores_a)
    b = np.array(scores_b)
    diffs = a - b  # per-example differences
    n = len(diffs)
    observed_diff = np.mean(diffs)

    # Bootstrap the mean difference
    boot_diffs = []
    for _ in range(n_bootstrap):
        sample = rng.choice(diffs, size=n, replace=True)
        boot_diffs.append(np.mean(sample))
    boot_diffs = np.array(boot_diffs)

    # Two-sided p-value: fraction of bootstrap diffs on the wrong side of zero
    if observed_diff >= 0:
        p_value = 2 * np.mean(boot_diffs <= 0)
    else:
        p_value = 2 * np.mean(boot_diffs >= 0)

    ci_lower = np.percentile(boot_diffs, 2.5)
    ci_upper = np.percentile(boot_diffs, 97.5)

    # Cohen's d effect size (sample standard deviation of the differences)
    cohens_d = observed_diff / np.std(diffs, ddof=1)

    return {
        "mean_diff": round(observed_diff, 4),
        "ci_95": (round(ci_lower, 4), round(ci_upper, 4)),
        "p_value": round(p_value, 4),
        "cohens_d": round(cohens_d, 4),
        "significant": p_value < 0.05
    }

# Example: BERTScore F1 values for two models on the same 50 examples
np.random.seed(42)
scores_a = np.random.normal(0.88, 0.05, 50)
scores_b = np.random.normal(0.85, 0.06, 50)

result = paired_bootstrap_test(scores_a.tolist(), scores_b.tolist())
print(f"Mean difference (A - B): {result['mean_diff']}")
print(f"95% CI: {result['ci_95']}")
print(f"p-value: {result['p_value']}")
print(f"Cohen's d: {result['cohens_d']}")
print(f"Significant at p < 0.05: {result['significant']}")
```
Choosing the Right Test
| Scenario | Test | When to Use | Key Assumption |
|---|---|---|---|
| Binary outcomes | McNemar's test | Correct/incorrect classification | Same test set for both models |
| Continuous scores, paired | Paired bootstrap | BERTScore, ROUGE, judge ratings | Same test set for both models |
| Continuous scores, paired, normal | Paired t-test | Large sample sizes (n > 30) | Approximately normal differences |
| Multiple system comparison | Friedman test + post-hoc | Ranking 3+ models | Same test set for all models |
| Ordinal ratings | Wilcoxon signed-rank | Human ratings on Likert scale | Symmetric difference distribution |
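For the ordinal-ratings row, SciPy's `stats.wilcoxon` implements the signed-rank test directly. A minimal sketch with illustrative Likert ratings:

```python
import numpy as np
from scipy import stats

# Paired human ratings (1-5 Likert scale) for the same 12 prompts
ratings_a = np.array([4, 5, 3, 4, 4, 5, 3, 4, 5, 4, 3, 4])
ratings_b = np.array([3, 4, 3, 3, 4, 4, 2, 3, 4, 4, 3, 3])

# Wilcoxon signed-rank test on the paired differences;
# pairs with zero difference are discarded under the default settings
stat, p_value = stats.wilcoxon(ratings_a, ratings_b)
print(f"W = {stat}, p = {p_value:.4f}")
```

With heavily tied ratings, SciPy falls back to a normal approximation and may warn; for small samples with few ties, it computes an exact p-value.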
When comparing k model variants, the probability of at least one false positive grows rapidly. With 10 comparisons at α = 0.05, you have a 40% chance of at least one false positive. Apply the Bonferroni correction (divide α by the number of comparisons) or, better, use the Holm-Bonferroni method, which is less conservative while still controlling the family-wise error rate.
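The Holm-Bonferroni step-down procedure can be sketched in a few lines (the function name and example p-values are illustrative):

```python
def holm_bonferroni(p_values: list[float], alpha: float = 0.05) -> list[bool]:
    """Holm-Bonferroni step-down procedure.

    Returns a reject decision for each hypothesis, controlling the
    family-wise error rate at alpha.
    """
    m = len(p_values)
    # Sort p-values ascending, remembering original positions
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        # Compare the k-th smallest p-value against alpha / (m - k)
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # once one test fails, all larger p-values fail too
    return reject

# Example: p-values from five model comparisons
p_vals = [0.001, 0.01, 0.02, 0.04, 0.30]
print(holm_bonferroni(p_vals))  # [True, True, False, False, False]
```

Note that plain Bonferroni at α/5 = 0.01 would make the same or fewer rejections; Holm-Bonferroni is uniformly at least as powerful while offering the same family-wise guarantee.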
4. Effect Sizes
Statistical significance tells you whether a difference exists; effect size tells you whether it matters. A p-value of 0.001 with an effect size of 0.02 means the difference is real but trivially small. Always report both significance and effect size.
Cohen's d is the most common effect size measure. It expresses the mean difference in units of standard deviation. The conventional interpretation is: d = 0.2 (small), d = 0.5 (medium), d = 0.8 (large). For LLM evaluation, consider the practical significance in your specific context: a 0.5% accuracy improvement may be significant for a medical diagnosis system but irrelevant for a chatbot.
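The computation and the conventional thresholds can be sketched as follows (helper names are illustrative; note that the paired form divides by the standard deviation of the per-example differences):

```python
import numpy as np

def cohens_d_paired(scores_a: list[float], scores_b: list[float]) -> float:
    """Cohen's d for paired samples: mean difference divided by the
    sample standard deviation of the per-example differences."""
    diffs = np.asarray(scores_a) - np.asarray(scores_b)
    return float(np.mean(diffs) / np.std(diffs, ddof=1))

def interpret_d(d: float) -> str:
    """Conventional thresholds: 0.2 small, 0.5 medium, 0.8 large."""
    magnitude = abs(d)
    if magnitude < 0.2:
        return "negligible"
    if magnitude < 0.5:
        return "small"
    if magnitude < 0.8:
        return "medium"
    return "large"

d = cohens_d_paired([0.90, 0.85, 0.88, 0.92, 0.87],
                    [0.86, 0.84, 0.85, 0.88, 0.86])
print(f"d = {d:.2f} ({interpret_d(d)})")
```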
5. Seed Management and Reproducibility
Random seeds affect LLM evaluation in multiple ways: data shuffling, few-shot example selection, sampling temperature, dropout at inference (for some models), and bootstrap resampling. Proper seed management ensures that results can be reproduced exactly and that reported variance reflects true model uncertainty rather than implementation randomness.
```python
import random
import numpy as np
from dataclasses import dataclass, asdict

@dataclass
class ExperimentConfig:
    """Configuration for a reproducible LLM experiment."""
    model_name: str
    prompt_template: str
    temperature: float = 0.0
    max_tokens: int = 512
    eval_seed: int = 42        # seed for data shuffling/sampling
    bootstrap_seed: int = 123  # seed for statistical analysis
    num_eval_seeds: int = 5    # number of seeds for variance estimation

    def get_eval_seeds(self) -> list[int]:
        """Generate deterministic evaluation seeds."""
        rng = np.random.default_rng(self.eval_seed)
        return rng.integers(0, 2**31, size=self.num_eval_seeds).tolist()

def run_multi_seed_evaluation(config: ExperimentConfig, eval_fn) -> dict:
    """Run evaluation across multiple seeds and report aggregated results."""
    seeds = config.get_eval_seeds()
    all_results = []

    for seed in seeds:
        random.seed(seed)
        np.random.seed(seed)
        result = eval_fn(config, seed=seed)
        all_results.append(result)
        print(f"  Seed {seed}: accuracy = {result['accuracy']:.4f}")

    accuracies = [r["accuracy"] for r in all_results]
    return {
        "mean_accuracy": round(np.mean(accuracies), 4),
        "std_accuracy": round(np.std(accuracies), 4),
        "min_accuracy": round(min(accuracies), 4),
        "max_accuracy": round(max(accuracies), 4),
        "seeds_used": seeds,
        "config": asdict(config)
    }
```
When using temperature = 0 with the OpenAI API, outputs are mostly deterministic but not perfectly so due to floating-point non-determinism in GPU computations. For truly reproducible results, also set seed in the API request and check the system_fingerprint in the response to verify that the same backend was used. Even then, provider updates can change behavior silently.
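As a sketch, the request parameters for a best-effort reproducible call might look like this (assumes a chat-completions-style interface; field names are illustrative and may differ across providers, and the request is only constructed here, not sent):

```python
def build_reproducible_request(prompt: str, seed: int = 42) -> dict:
    """Assemble request parameters for a best-effort deterministic call."""
    return {
        "model": "gpt-4o",
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.0,  # greedy decoding
        "seed": seed,        # best-effort determinism on supported backends
    }

params = build_reproducible_request("Summarize the findings in one sentence.")
# After sending the request, log response.system_fingerprint alongside the
# result so each run can be matched to the backend that produced it.
```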
6. Ablation Study Design
An ablation study systematically removes or modifies components of a system to measure their individual contribution. In LLM applications, ablation targets include prompt components (system message, few-shot examples, chain-of-thought instructions), retrieval parameters (chunk size, top-k, reranking), model settings (temperature, max tokens), and architectural choices (model size, fine-tuning vs. prompting).
The key principle is to change exactly one variable at a time while holding everything else constant. This isolation is what distinguishes a rigorous ablation from casual experimentation.
```python
from dataclasses import dataclass

@dataclass
class AblationConfig:
    """Configuration for one ablation variant."""
    name: str
    use_system_prompt: bool = True
    use_few_shot: bool = True
    use_cot: bool = True
    use_reranker: bool = True
    model: str = "gpt-4o"

def run_ablation_study(eval_fn, test_cases: list, seed: int = 42) -> list[dict]:
    """Run a structured ablation study with one variable removed at a time."""
    configs = [
        AblationConfig(name="Full System (Baseline)"),
        AblationConfig(name="No System Prompt", use_system_prompt=False),
        AblationConfig(name="No Few-Shot", use_few_shot=False),
        AblationConfig(name="No Chain-of-Thought", use_cot=False),
        AblationConfig(name="No Reranker", use_reranker=False),
        AblationConfig(name="Smaller Model", model="gpt-4o-mini"),
    ]

    baseline_score = None
    results = []

    for config in configs:
        scores = eval_fn(config, test_cases, seed=seed)
        mean_score = sum(scores) / len(scores)
        if baseline_score is None:
            baseline_score = mean_score

        # bootstrap_ci is defined in the confidence-interval section above
        ci = bootstrap_ci(scores, seed=seed)
        delta = mean_score - baseline_score
        delta_pct = (delta / baseline_score) * 100 if baseline_score else 0

        results.append({
            "variant": config.name,
            "score": round(mean_score, 4),
            "ci_95": (ci["ci_lower"], ci["ci_upper"]),
            "delta": round(delta, 4),
            "delta_pct": round(delta_pct, 2)
        })

    return results
```
7. Benchmark Contamination Detection
Benchmark contamination occurs when evaluation data leaks into a model's training corpus. Because LLMs are trained on massive web scrapes, popular benchmarks like MMLU, GSM8K, and HumanEval have a high risk of appearing in training data. When a model has memorized the answers, its benchmark score dramatically overestimates its true capability on novel problems.
Detection Strategies
- N-gram overlap analysis: Check if verbatim benchmark questions appear in known training data dumps
- Perturbation testing: Slightly rephrase benchmark questions and check if accuracy drops sharply (a sign of memorization)
- Membership inference: Use the model's log-probabilities to determine if it assigns unusually high probability to benchmark text
- Canary strings: Insert unique, identifiable strings into evaluation data and check if models reproduce them
```python
from typing import Callable

def perturbation_contamination_test(
    model_fn: Callable,
    original_questions: list[dict],
    perturbed_questions: list[dict],
    threshold: float = 0.15
) -> dict:
    """Detect benchmark contamination via perturbation analysis.

    If accuracy drops sharply on minor rephrasing, the model may have
    memorized the original questions rather than learned the skill.

    Args:
        model_fn: callable that takes a question and returns an answer
        original_questions: list of {'question': str, 'answer': str}
        perturbed_questions: rephrased versions with same answers
        threshold: max acceptable accuracy drop (larger drops = contamination)
    """
    def evaluate_set(questions):
        correct = 0
        for q in questions:
            prediction = model_fn(q["question"])
            if prediction.strip().lower() == q["answer"].strip().lower():
                correct += 1
        return correct / len(questions)

    orig_acc = evaluate_set(original_questions)
    pert_acc = evaluate_set(perturbed_questions)
    drop = orig_acc - pert_acc

    return {
        "original_accuracy": round(orig_acc, 4),
        "perturbed_accuracy": round(pert_acc, 4),
        "accuracy_drop": round(drop, 4),
        "contamination_suspected": drop > threshold,
        "message": (
            "Contamination likely: large accuracy drop on minor rephrasing"
            if drop > threshold
            else "No strong contamination signal detected"
        )
    }
```
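The first strategy in the list above, n-gram overlap analysis, can be sketched as a verbatim n-gram check (the helper name and demo corpus are illustrative; audits of real training corpora typically use long n-grams, e.g. 13 tokens, so chance matches are rare):

```python
def ngram_overlap_rate(benchmark_texts: list[str], corpus: str, n: int = 13) -> float:
    """Fraction of benchmark items sharing at least one verbatim
    word n-gram with the corpus."""
    corpus_tokens = corpus.lower().split()
    corpus_ngrams = {
        tuple(corpus_tokens[i:i + n])
        for i in range(len(corpus_tokens) - n + 1)
    }
    contaminated = 0
    for text in benchmark_texts:
        tokens = text.lower().split()
        grams = (tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
        if any(g in corpus_ngrams for g in grams):
            contaminated += 1
    return contaminated / len(benchmark_texts) if benchmark_texts else 0.0

# Tiny demo with 3-grams so the match is visible at a glance
corpus = "the quick brown fox jumps over the lazy dog near the river bank"
questions = ["quick brown fox jumps", "completely novel question text here"]
print(ngram_overlap_rate(questions, corpus, n=3))  # 0.5
```

In practice the corpus side would be a large training-data dump scanned with hashed n-gram sets or a Bloom filter rather than an in-memory set.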
Research has shown that many popular benchmarks appear in the training data of major LLMs. The problem is especially severe for benchmarks released before 2023, which have had years to propagate across the web. When evaluating models for deployment decisions, always supplement standard benchmarks with custom evaluation sets that you know are not publicly available.
Key Takeaways
- Always report confidence intervals. A point estimate without uncertainty quantification is incomplete. Use bootstrap confidence intervals when you cannot assume a normal distribution.
- Use paired tests for model comparison. When both models are evaluated on the same examples, paired tests (McNemar's for binary, paired bootstrap for continuous) are more powerful and more appropriate than unpaired alternatives.
- Report effect sizes alongside p-values. Statistical significance tells you a difference exists; effect size (Cohen's d) tells you whether it matters in practice. Both are necessary for informed decisions.
- Manage seeds systematically. Fix seeds for reproducibility, run multiple seeds for variance estimation, and never cherry-pick the best seed. Document all seeds in your experiment configuration.
- Design ablations with single-variable isolation. Change one component at a time to measure its individual contribution. Use the same test set, seeds, and evaluation protocol across all ablation variants.
- Assume contamination until proven otherwise. Supplement standard benchmarks with custom, private evaluation sets. Use perturbation testing to detect memorization in public benchmarks.