You cannot improve what you cannot measure, and measuring LLM quality is surprisingly hard. Unlike classification tasks where accuracy tells the whole story, LLM outputs are open-ended, subjective, and context-dependent. A single question can have many correct answers, each with different levels of helpfulness, style, and factual precision. This section builds your evaluation toolkit from low-level language modeling metrics (perplexity, bits-per-byte) through reference-based text metrics (BLEU, ROUGE, BERTScore), to modern approaches like LLM-as-Judge and structured human evaluation. It also covers the major benchmarks that the research community uses to track model capability over time.
1. Intrinsic Language Modeling Metrics
Intrinsic metrics measure how well a model captures the statistical properties of language itself. These metrics are computed directly from the model's probability distributions over tokens, without requiring any downstream task. They are most useful for comparing base (pretrained) models and for monitoring training progress.
Perplexity
Perplexity measures how "surprised" a model is by a sequence of text. Formally, it is the exponentiated average negative log-likelihood per token. A model that assigns high probability to the correct next token at every position will have low perplexity. Lower perplexity means the model is a better predictor of natural language.
For a sequence of N tokens, perplexity is defined as:
PPL = exp( -(1/N) · Σᵢ log P(tᵢ | t₁, …, tᵢ₋₁) )
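As a quick sanity check on this definition: a model that spreads probability uniformly over a vocabulary of V tokens has perplexity exactly V, since every per-token log-probability is log(1/V). A minimal sketch:

```python
import math

def perplexity(log_probs):
    """Perplexity = exp of the average negative log-likelihood per token."""
    n = len(log_probs)
    return math.exp(-sum(log_probs) / n)

V = 50257  # e.g., GPT-2's vocabulary size
# A "know-nothing" model: every next-token probability is 1/V
uniform_log_probs = [math.log(1.0 / V)] * 100
print(perplexity(uniform_log_probs))  # ≈ 50257.0
```

This gives an intuitive reading of the number: a perplexity of 20 means the model is, on average, as uncertain as if it were choosing uniformly among 20 tokens at each step.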
Perplexity depends heavily on the tokenizer. A model that uses byte-pair encoding with a 50K vocabulary will have a different perplexity from one using a 100K vocabulary, even if both models are equally capable. This makes direct perplexity comparisons across model families unreliable.
Bits-per-Byte (BPB)
Bits-per-byte normalizes the language modeling loss by the number of UTF-8 bytes rather than tokens, making it comparable across tokenizers and vocabularies. BPB measures how many bits the model needs, on average, to encode each byte of the original text. This is the preferred metric when comparing models with different tokenization schemes.
```python
import math

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def compute_perplexity_and_bpb(model_name: str, text: str):
    """Compute perplexity and bits-per-byte for a text sample."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(
        model_name, torch_dtype=torch.float16, device_map="auto"
    )

    encodings = tokenizer(text, return_tensors="pt")
    input_ids = encodings.input_ids.to(model.device)

    with torch.no_grad():
        outputs = model(input_ids, labels=input_ids)

    # With labels=input_ids, HF shifts internally and averages the loss
    # over the predicted positions (all tokens except the first).
    neg_log_likelihood = outputs.loss.item()  # average NLL per token, in nats
    num_tokens = input_ids.size(1)
    num_predicted = num_tokens - 1

    perplexity = math.exp(neg_log_likelihood)

    # Bits-per-byte: convert nats to bits, normalize by UTF-8 byte count
    total_nll_bits = neg_log_likelihood * num_predicted / math.log(2)
    num_bytes = len(text.encode("utf-8"))
    bpb = total_nll_bits / num_bytes

    return {
        "perplexity": round(perplexity, 2),
        "bits_per_byte": round(bpb, 4),
        "num_tokens": num_tokens,
        "num_bytes": num_bytes,
    }

# Example usage
text = "The transformer architecture revolutionized natural language processing."
result = compute_perplexity_and_bpb("gpt2", text)
print(result)
```
Perplexity and BPB only measure how well a model predicts text, not how useful or safe its generations are. A model with excellent perplexity may still produce hallucinations, refuse safe requests, or generate harmful content. These metrics are necessary but far from sufficient for evaluating instruction-tuned or chat models.
2. Reference-Based Text Metrics
Reference-based metrics compare model output against one or more "gold standard" reference texts. They originated in machine translation and summarization, where human-written references provide a ground truth. These metrics are fast, deterministic, and cheap to compute, but they share a fundamental limitation: they assume the reference text captures the space of acceptable answers, which is rarely true for open-ended generation.
BLEU, ROUGE, and BERTScore
| Metric | What It Measures | Best For | Limitation |
|---|---|---|---|
| BLEU | N-gram precision (how many n-grams in the output appear in the reference) | Machine translation | Ignores recall; penalizes valid paraphrases |
| ROUGE-L | Longest common subsequence between output and reference | Summarization | Surface-level overlap; misses semantic equivalence |
| ROUGE-N | N-gram recall (how many reference n-grams appear in the output) | Summarization | Same surface-level limitations as ROUGE-L |
| BERTScore | Cosine similarity of contextualized token embeddings | Any text comparison | Computationally expensive; correlation varies by task |
| METEOR | Unigram matching with stemming, synonyms, and paraphrase support | Machine translation | Complex; still surface-level for LLM outputs |
```python
from rouge_score import rouge_scorer
from bert_score import score as bert_score
import evaluate

# Reference and candidate texts
reference = "The cat sat on the mat and watched the birds outside."
candidate = "A cat was sitting on a mat, observing birds through the window."

# ROUGE scores
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
rouge_results = scorer.score(reference, candidate)
for key, value in rouge_results.items():
    print(
        f"{key}: precision={value.precision:.3f}, "
        f"recall={value.recall:.3f}, f1={value.fmeasure:.3f}"
    )

# BLEU score
bleu = evaluate.load("bleu")
bleu_result = bleu.compute(predictions=[candidate], references=[[reference]])
print(f"BLEU: {bleu_result['bleu']:.4f}")

# BERTScore
P, R, F1 = bert_score([candidate], [reference], lang="en", verbose=False)
print(f"BERTScore: precision={P[0]:.4f}, recall={R[0]:.4f}, f1={F1[0]:.4f}")
```
Notice how BERTScore (0.94 F1) captures the semantic similarity between these paraphrases far better than BLEU (0.12) or ROUGE-2 (0.11). When outputs are valid paraphrases of references, embedding-based metrics like BERTScore provide much more meaningful signal. However, no single metric tells the full story. Use multiple metrics together and always validate against human judgment.
3. LLM-as-Judge
The LLM-as-Judge paradigm uses a strong language model (typically GPT-4, Claude, or a specialized judge model) to evaluate the outputs of other models. This approach has rapidly become the most popular evaluation method for instruction-tuned models because it can assess open-ended qualities like helpfulness, safety, and reasoning quality that reference-based metrics cannot capture.
Judge Prompt Design
The quality of LLM-as-Judge evaluation depends entirely on the judge prompt. A well-designed prompt specifies clear evaluation criteria, provides a rubric with concrete examples at each score level, and asks the judge to produce a structured output (typically a score plus reasoning). The reasoning-before-score pattern (where the judge explains its rationale before assigning a numeric score) has been shown to produce more reliable evaluations.
```python
import json

from openai import OpenAI

client = OpenAI()

def llm_judge_evaluate(question: str, answer: str, criteria: str) -> dict:
    """Use an LLM as a judge to evaluate answer quality."""
    judge_prompt = f"""You are an expert evaluator. Assess the following answer to a question.

QUESTION: {question}

ANSWER: {answer}

EVALUATION CRITERIA: {criteria}

RUBRIC:
5 = Excellent: Fully addresses the question with accurate, comprehensive, well-organized content
4 = Good: Addresses the question well with minor gaps or imprecisions
3 = Adequate: Partially addresses the question but has notable gaps
2 = Poor: Mostly fails to address the question or contains significant errors
1 = Very Poor: Completely off-topic, factually wrong, or incoherent

Provide your evaluation as JSON with these fields:
- "reasoning": Your step-by-step analysis (2-3 sentences)
- "score": Integer from 1 to 5
- "strengths": List of strengths
- "weaknesses": List of weaknesses"""

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": judge_prompt}],
        response_format={"type": "json_object"},
        temperature=0.0,
    )
    return json.loads(response.choices[0].message.content)

# Example evaluation
result = llm_judge_evaluate(
    question="What causes seasons on Earth?",
    answer=(
        "Seasons happen because Earth's axis is tilted at 23.5 degrees relative "
        "to its orbital plane. This means different hemispheres receive more "
        "direct sunlight at different times of year."
    ),
    criteria="Factual accuracy, completeness, and clarity",
)
print(json.dumps(result, indent=2))
```
Common LLM-as-Judge Biases
LLM judges are powerful but carry systematic biases that evaluators must account for. Understanding these biases is essential for interpreting judge results correctly.
- Position bias: Judges tend to prefer the first option in pairwise comparisons. Mitigation: randomize order and average scores across both orderings.
- Verbosity bias: Judges often favor longer, more detailed answers even when the extra content adds no value. Mitigation: include "conciseness is a virtue" in the rubric.
- Self-preference bias: Models tend to rate their own outputs higher than outputs from other models. Mitigation: use a judge model different from the models being evaluated.
- Authority bias: Confident, well-formatted answers receive higher scores even when factually wrong. Mitigation: explicitly instruct the judge to verify factual claims.
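The position-bias mitigation above can be sketched concretely. The helper below is an illustrative sketch, not a fixed API: the `judge_fn` callable and its "A"/"B" verdict format are assumptions. It queries the judge twice with the answers in both orders, un-swaps the second verdict, and only declares a winner when the two verdicts agree:

```python
def debiased_pairwise_judgment(judge_fn, question, answer_a, answer_b):
    """Run a pairwise judge in both orders to cancel out position bias.

    judge_fn(question, first, second) must return "A" if it prefers the
    first answer shown, "B" if it prefers the second (hypothetical interface).
    """
    # Order 1: answer_a shown first
    verdict_1 = judge_fn(question, answer_a, answer_b)
    # Order 2: answer_b shown first, so the judge's "A" really means answer_b
    verdict_2 = judge_fn(question, answer_b, answer_a)
    verdict_2_unswapped = "A" if verdict_2 == "B" else "B"

    if verdict_1 == verdict_2_unswapped:
        return verdict_1  # consistent preference across both orderings
    return "tie"          # order-dependent verdict => treat as a tie

# Demo with a stub judge exhibiting pure position bias (always picks
# whichever answer is shown first):
biased_judge = lambda q, first, second: "A"
print(debiased_pairwise_judgment(biased_judge, "Q?", "ans1", "ans2"))  # tie
```

A judge whose preference flips when the order flips is expressing position bias rather than a genuine preference, so mapping that disagreement to a tie is the conservative choice; averaging scores across both orderings is the analogous fix for rubric-based (non-pairwise) judging.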
4. Human Evaluation
Human evaluation remains the gold standard for assessing LLM quality, especially for subjective dimensions like helpfulness, creativity, and conversational naturalness. However, human evaluation is expensive, slow, and introduces its own variability. The key challenge is designing evaluation protocols that are reliable (different annotators agree) and valid (they measure what you care about).
Evaluation Protocol Design
Effective human evaluation protocols share several characteristics. First, they define clear, specific criteria with concrete examples at each rating level. Second, they include calibration rounds where annotators evaluate the same examples and discuss disagreements before the main annotation begins. Third, they compute inter-annotator agreement metrics (like Cohen's kappa or Krippendorff's alpha) to verify that the task is well-defined enough for consistent human judgment.
```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

def compute_inter_annotator_agreement(annotations: dict) -> dict:
    """Compute pairwise Cohen's kappa between annotators.

    Args:
        annotations: dict mapping annotator_id to list of scores
    """
    annotator_ids = list(annotations.keys())
    kappas = {}
    for i in range(len(annotator_ids)):
        for j in range(i + 1, len(annotator_ids)):
            a1, a2 = annotator_ids[i], annotator_ids[j]
            kappa = cohen_kappa_score(
                annotations[a1], annotations[a2], weights="quadratic"
            )
            kappas[f"{a1}_vs_{a2}"] = round(kappa, 3)

    avg_kappa = np.mean(list(kappas.values()))
    return {"pairwise_kappas": kappas, "mean_kappa": round(avg_kappa, 3)}

# Three annotators rating 10 examples on a 1-5 scale
annotations = {
    "annotator_A": [4, 3, 5, 2, 4, 3, 5, 4, 3, 4],
    "annotator_B": [4, 3, 4, 2, 5, 3, 5, 3, 3, 4],
    "annotator_C": [5, 3, 5, 3, 4, 2, 4, 4, 3, 5],
}
agreement = compute_inter_annotator_agreement(annotations)
print(f"Pairwise kappas: {agreement['pairwise_kappas']}")
print(f"Mean kappa: {agreement['mean_kappa']}")
# Interpretation: >0.8 = excellent, 0.6-0.8 = good, 0.4-0.6 = moderate
```
A mean kappa of 0.67 indicates "good" agreement, but the wide range (0.57 to 0.81) across annotator pairs suggests that annotator C may be interpreting the rubric differently. Before proceeding with annotation, run additional calibration rounds and review disagreement examples with all annotators to align their understanding of the scoring criteria.
5. Standard Benchmarks
Benchmarks provide standardized test sets that enable comparison across models. The LLM community has developed dozens of benchmarks, each targeting different capabilities. Understanding what each benchmark measures (and what it does not measure) is essential for interpreting model comparison tables.
Benchmark Comparison
| Benchmark | Tasks | Metric | Format | Notes |
|---|---|---|---|---|
| MMLU | 57 subjects (STEM, humanities, social science) | Accuracy | Multiple choice | Most widely cited; saturation concern |
| HumanEval | 164 Python coding problems | pass@k | Code generation | Tests functional correctness via unit tests |
| MT-Bench | 80 multi-turn questions across 8 categories | LLM-as-Judge (1-10) | Open-ended | Tests multi-turn instruction following |
| Chatbot Arena | Crowdsourced pairwise comparisons | Elo rating | Side-by-side | Most ecologically valid; slow to update |
| GSM8K | 1,319 grade-school math word problems | Exact match | Free-form numeric | Nearly saturated by frontier models |
| SWE-Bench | Real GitHub issues requiring code fixes | % resolved | Repository-level code | Tests real-world software engineering |
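HumanEval's pass@k metric (second row above) is usually reported with the unbiased estimator introduced in the original HumanEval paper: generate n samples per problem, count the c that pass the unit tests, and estimate the probability that at least one of k randomly drawn samples passes. A minimal sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: n samples generated, c of them correct."""
    if n - c < k:
        return 1.0  # fewer than k failures, so any k draws include a success
    # 1 - P(all k drawn samples are failures)
    return 1.0 - comb(n - c, k) / comb(n, k)

# 10 samples per problem, 3 passed the unit tests
print(round(pass_at_k(n=10, c=3, k=1), 4))  # 0.3
print(round(pass_at_k(n=10, c=3, k=5), 4))  # 0.9167
```

This formulation avoids the high variance of literally sampling k outputs and checking them, which matters when comparing models whose per-sample pass rates are close.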
```python
from datasets import load_dataset

def run_mmlu_sample(model_fn, num_samples: int = 20):
    """Evaluate a model on a sample of MMLU questions.

    Args:
        model_fn: callable that takes a prompt and returns A/B/C/D
        num_samples: number of questions to evaluate
    """
    dataset = load_dataset("cais/mmlu", "all", split="test")
    dataset = dataset.shuffle(seed=42).select(range(num_samples))

    correct = 0
    results_by_subject = {}

    for example in dataset:
        choices = example["choices"]
        prompt = f"""Question: {example['question']}
A) {choices[0]}
B) {choices[1]}
C) {choices[2]}
D) {choices[3]}
Answer with just the letter (A, B, C, or D):"""

        prediction = model_fn(prompt).strip().upper()
        answer_map = {0: "A", 1: "B", 2: "C", 3: "D"}
        correct_letter = answer_map[example["answer"]]
        is_correct = prediction[0] == correct_letter if prediction else False

        subject = example["subject"]
        if subject not in results_by_subject:
            results_by_subject[subject] = {"correct": 0, "total": 0}
        results_by_subject[subject]["total"] += 1
        if is_correct:
            results_by_subject[subject]["correct"] += 1
            correct += 1

    accuracy = correct / num_samples
    return {
        "overall_accuracy": round(accuracy, 3),
        "num_correct": correct,
        "num_total": num_samples,
        "by_subject": results_by_subject,
    }
```
A growing concern with standard benchmarks is data contamination: the possibility that benchmark questions appeared in the model's training data. When a model has "seen" the test questions during training, its benchmark score overestimates true capability. Always check whether a model's technical report discusses contamination analysis, and consider using dynamic benchmarks (like LiveCodeBench or Chatbot Arena) that continuously add new questions.
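As background on the Chatbot Arena row in the table above: its leaderboard is built from pairwise outcomes using Elo-style ratings. The sketch below shows the standard Elo update those systems build on; the K-factor of 32 is an illustrative choice, not Arena's actual configuration:

```python
def elo_update(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """One Elo update. score_a is 1.0 if A wins, 0.0 if A loses, 0.5 for a tie."""
    # Expected score of A given the current rating gap
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
    delta = k * (score_a - expected_a)
    return rating_a + delta, rating_b - delta

# Two models start at 1000; model A wins one head-to-head comparison
a, b = elo_update(1000.0, 1000.0, score_a=1.0)
print(round(a, 1), round(b, 1))  # 1016.0 984.0
```

Note how upsets move ratings more than expected wins: a win against a much higher-rated model yields a large delta, while a win against a much lower-rated one yields almost none.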
6. Building an Evaluation Harness
In practice, you rarely evaluate with a single metric. A well-designed evaluation harness combines multiple metrics, runs them across a curated test set, and produces a structured report that tracks performance over time. The following example shows how to build a simple but extensible evaluation framework.
```python
import time
from dataclasses import dataclass, field
from typing import Callable

import numpy as np

@dataclass
class EvalCase:
    """A single evaluation test case."""
    question: str
    reference: str = ""
    metadata: dict = field(default_factory=dict)

@dataclass
class EvalResult:
    """Result of evaluating one test case."""
    case: EvalCase
    model_output: str
    scores: dict
    latency_ms: float

class EvalHarness:
    """Lightweight evaluation harness for LLM applications."""

    def __init__(self, model_fn: Callable, scorers: dict[str, Callable]):
        self.model_fn = model_fn  # callable: prompt -> response
        self.scorers = scorers    # dict of name -> scoring function

    def evaluate(self, cases: list[EvalCase]) -> list[EvalResult]:
        """Run all test cases through the model and score them."""
        results = []
        for case in cases:
            start = time.time()
            output = self.model_fn(case.question)
            latency = (time.time() - start) * 1000

            scores = {}
            for name, scorer in self.scorers.items():
                scores[name] = scorer(
                    question=case.question,
                    output=output,
                    reference=case.reference,
                )

            results.append(EvalResult(
                case=case,
                model_output=output,
                scores=scores,
                latency_ms=round(latency, 1),
            ))
        return results

    def summary(self, results: list[EvalResult]) -> dict:
        """Aggregate results into a summary report."""
        metric_names = list(results[0].scores.keys())
        summary = {}
        for metric in metric_names:
            values = [r.scores[metric] for r in results]
            summary[metric] = {
                "mean": round(np.mean(values), 3),
                "std": round(np.std(values), 3),
                "min": round(min(values), 3),
                "max": round(max(values), 3),
            }
        summary["mean_latency_ms"] = round(
            np.mean([r.latency_ms for r in results]), 1
        )
        return summary
```
The best evaluation harnesses separate three concerns: (1) test case management (what to evaluate), (2) model invocation (how to get outputs), and (3) scoring (how to measure quality). This separation makes it easy to swap models, add new metrics, or reuse test cases across experiments. Start simple, then extend as your evaluation needs grow.
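As a concrete example of the scoring concern, here are two self-contained scorers that match the `scorer(question=..., output=..., reference=...)` keyword signature used by the harness above. Both are illustrative sketches written for this section, not standard library functions:

```python
def exact_match_scorer(question: str, output: str, reference: str) -> float:
    """1.0 if output matches the reference after light normalization, else 0.0."""
    return float(output.strip().lower() == reference.strip().lower())

def token_f1_scorer(question: str, output: str, reference: str) -> float:
    """Token-level F1 (SQuAD-style) between output and reference."""
    out_tokens = output.lower().split()
    ref_tokens = reference.lower().split()
    if not out_tokens or not ref_tokens:
        return 0.0
    # Count overlapping tokens, respecting multiplicity
    ref_counts = {}
    for t in ref_tokens:
        ref_counts[t] = ref_counts.get(t, 0) + 1
    common = 0
    for t in out_tokens:
        if ref_counts.get(t, 0) > 0:
            common += 1
            ref_counts[t] -= 1
    if common == 0:
        return 0.0
    precision = common / len(out_tokens)
    recall = common / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

print(exact_match_scorer(question="q", output="Paris ", reference="paris"))  # 1.0
print(round(token_f1_scorer(question="q", output="the capital is Paris",
                            reference="Paris is the capital of France"), 3))  # 0.8
```

Because every scorer takes the same keyword arguments, a harness can treat exact match, token F1, and an LLM-as-Judge wrapper interchangeably: add a new metric by adding one entry to the `scorers` dict.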
Key Takeaways
- Perplexity measures prediction, not usefulness. Low perplexity is necessary but not sufficient for a good LLM. Instruction-tuned models may have higher perplexity than base models while being far more useful in practice.
- Reference-based metrics fail on open-ended tasks. BLEU and ROUGE penalize valid paraphrases. BERTScore captures semantics better but still cannot assess helpfulness, safety, or reasoning quality.
- LLM-as-Judge is powerful but biased. Account for position bias, verbosity bias, and self-preference bias through careful prompt design and evaluation protocol (order randomization, multi-judge consensus).
- Human evaluation requires rigorous protocol design. Define clear rubrics with concrete examples, run calibration rounds, and measure inter-annotator agreement before trusting human ratings.
- No single benchmark tells the full story. Evaluate across multiple dimensions (knowledge, reasoning, code, chat quality) and prefer dynamic benchmarks that resist contamination.
- Build evaluation harnesses early. Separating test cases, model invocation, and scoring into modular components makes your evaluation infrastructure reusable and extensible as your project evolves.