Module 26 · Section 26.6

Hallucination & Reliability

Types of hallucination, detection methods, mitigation strategies, constrained generation, confidence calibration, and abstention
★ Big Picture

Hallucination is the single biggest obstacle to LLM reliability in production. Models generate fluent, confident text that is factually wrong, internally inconsistent, or unsupported by any source. This section covers the taxonomy of hallucination types, practical detection techniques (self-consistency, citation verification, natural language inference), and mitigation strategies ranging from RAG grounding to constrained generation and calibrated abstention.

1. Hallucination Taxonomy

| Type | Description | Example |
| --- | --- | --- |
| Factual fabrication | Inventing facts that sound plausible | Citing a non-existent paper |
| Intrinsic hallucination | Contradicting the provided source text | Summarizing a document with wrong numbers |
| Extrinsic hallucination | Adding information not in the source | Introducing claims absent from context |
| Self-contradiction | Making inconsistent statements within one response | Saying "X is true" then "X is false" |
| Outdated knowledge | Stating facts that were true at training time but not now | Reporting a CEO who has since been replaced |
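For programmatic use, such as tagging evaluation results, the taxonomy can be encoded as an enum. A minimal sketch; the label strings are illustrative, not a standard:

```python
from enum import Enum

class HallucinationType(Enum):
    """Labels for the taxonomy above (illustrative names)."""
    FACTUAL_FABRICATION = "factual_fabrication"  # invented but plausible facts
    INTRINSIC = "intrinsic"                      # contradicts the provided source
    EXTRINSIC = "extrinsic"                      # adds claims absent from the source
    SELF_CONTRADICTION = "self_contradiction"    # inconsistent within one response
    OUTDATED_KNOWLEDGE = "outdated_knowledge"    # stale parametric knowledge

# Round-trip from a stored label string back to the enum member
assert HallucinationType("intrinsic") is HallucinationType.INTRINSIC
```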

2. Self-Consistency Detection

from openai import OpenAI

client = OpenAI()

def self_consistency_check(question: str, n_samples: int = 5, temperature: float = 0.8):
    """Generate multiple answers and check agreement."""
    responses = []
    for _ in range(n_samples):
        r = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": question}],
            temperature=temperature,
        )
        responses.append(r.choices[0].message.content.strip())

    # Use LLM to check consistency across responses
    check_prompt = f"""Given these {n_samples} answers to the same question, are they
consistent with each other? Report any contradictions.

Question: {question}

Answers:
""" + "\n".join(f"{i+1}. {r}" for i, r in enumerate(responses))

    verdict = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": check_prompt}],
        temperature=0.0,
    )
    return {"responses": responses, "consistency_check": verdict.choices[0].message.content}
[Figure: three detection methods. Self-consistency: sample N responses at high temperature and check agreement; low agreement suggests hallucination. Citation verification: extract cited claims from the output and verify each against the source documents; unsupported claims indicate extrinsic hallucination. NLI-based: use an NLI model with the source as premise and the output as hypothesis; a contradiction label indicates intrinsic hallucination.]
Figure 26.6.1: Three complementary approaches detect different types of hallucination at different stages.
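As a cheaper alternative to using an LLM judge for the consistency check, agreement can be approximated by normalized majority voting over the sampled answers. A minimal sketch; it works best for short, factoid-style answers:

```python
from collections import Counter

def agreement_score(answers: list[str]) -> float:
    """Fraction of samples that match the most common normalized answer."""
    normalized = [a.strip().lower().rstrip(".") for a in answers]
    top_count = Counter(normalized).most_common(1)[0][1]
    return top_count / len(normalized)

# A low score suggests the samples diverge, i.e. possible hallucination.
print(agreement_score(["Paris", "paris.", "Paris", "Lyon", "Paris"]))  # → 0.8
```

This avoids a second model call per check, at the cost of missing paraphrases that an LLM judge would recognize as equivalent.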

NLI-Based Hallucination Detection

from transformers import pipeline

nli = pipeline("text-classification", model="facebook/bart-large-mnli")

def check_faithfulness(source: str, claim: str) -> dict:
    """Check if a claim is supported by the source using NLI.
    The source text is the premise; the claim is the hypothesis."""
    result = nli({"text": source, "text_pair": claim})[0]
    return {
        "claim": claim,
        "supported": result["label"] == "entailment",
        "label": result["label"],
        "confidence": round(result["score"], 3),
    }

source = "The Eiffel Tower was completed in 1889 and is 330 meters tall."
print(check_faithfulness(source, "The Eiffel Tower is 330 meters tall."))
print(check_faithfulness(source, "The Eiffel Tower was built in 1920."))
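LLM outputs usually contain several claims, so the check is typically applied per sentence. A sketch that takes the checker as a parameter (e.g. the `check_faithfulness` function above); the sentence splitter is a simple regex, not a full segmenter:

```python
import re
from typing import Callable

def check_output_faithfulness(source: str, output: str,
                              checker: Callable[[str, str], dict]) -> list[dict]:
    """Split the model output into sentences and run a faithfulness
    check on each one against the source."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", output) if s.strip()]
    return [checker(source, s) for s in sentences]
```

Aggregating the per-sentence labels (e.g. flagging any "contradiction") gives a document-level faithfulness verdict.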

3. Mitigation Strategies

import re

from openai import OpenAI

client = OpenAI()

def calibrated_abstention(question: str, context: str, threshold: float = 0.7):
    """Generate answer with confidence score; abstain if below threshold."""
    prompt = f"""Based ONLY on the context below, answer the question.
After your answer, rate your confidence from 0.0 to 1.0.

Context: {context}
Question: {question}

Format:
Answer: [your answer]
Confidence: [0.0 to 1.0]"""

    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,
    )
    text = response.choices[0].message.content

    # Parse the self-reported confidence
    conf_match = re.search(r"Confidence:\s*([\d.]+)", text)
    confidence = float(conf_match.group(1)) if conf_match else 0.0

    if confidence < threshold:
        return {"answer": "I don't have enough information to answer confidently.",
                "confidence": confidence, "abstained": True}
    return {"answer": text, "confidence": confidence, "abstained": False}
[Figure: the mitigation spectrum, from lower effort / partial coverage to higher effort / better coverage: RAG grounding (anchor to source docs) → constrained generation (JSON schema, regex) → confidence calibration (score + threshold) → abstention ("I don't know").]
Figure 26.6.2: Mitigation strategies range from RAG grounding (simple but partial) to calibrated abstention (most reliable for high-stakes use cases).
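Choosing the threshold is the "calibrated" part: sweep it over a labeled evaluation set and inspect the coverage/accuracy trade-off. A sketch, assuming hypothetical records that pair each question's confidence score with whether the answer was correct:

```python
def coverage_at_threshold(records: list[dict], threshold: float) -> dict:
    """Given records with 'confidence' and 'correct' fields, compute the
    fraction of questions answered (coverage) and the accuracy among them."""
    answered = [r for r in records if r["confidence"] >= threshold]
    if not answered:
        return {"coverage": 0.0, "accuracy": None}
    accuracy = sum(r["correct"] for r in answered) / len(answered)
    return {"coverage": len(answered) / len(records), "accuracy": accuracy}
```

Raising the threshold trades coverage for accuracy; pick the lowest threshold that meets the accuracy bar your application requires.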
⚠ Warning

LLM self-reported confidence scores are not well calibrated. Models tend to express high confidence even when wrong. Use self-consistency (agreement across multiple samples) as a more reliable confidence proxy than asking the model to rate its own certainty.

📝 Note

RAG reduces but does not eliminate hallucination. Models can hallucinate even with perfect context if the answer requires reasoning that the model performs incorrectly, or if the model ignores the context and relies on parametric knowledge instead. Always verify RAG outputs against their cited sources.
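Part of that verification can be automated with the citation-verification approach from Figure 26.6.1. A minimal sketch that checks bracketed citation markers against the retrieved document ids — the `[doc_id]` citation format is an assumption, not a standard, and this only catches citations of documents that were never retrieved, not misreadings of retrieved ones:

```python
import re

def verify_citations(answer: str, retrieved_ids: set[str]) -> dict:
    """Check that every bracketed citation like [doc1] in the answer
    refers to a document that was actually retrieved."""
    cited = set(re.findall(r"\[(\w+)\]", answer))
    unverified = cited - retrieved_ids
    return {"cited": sorted(cited), "unverified": sorted(unverified),
            "all_resolved": not unverified}

print(verify_citations("Revenue rose 12% [doc1], margins fell [doc9].",
                       {"doc1", "doc2"}))
```

Claims whose citations resolve should still be passed through an NLI faithfulness check against the cited passage.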

★ Key Insight

For high-stakes applications (medical, legal, financial), the best strategy is calibrated abstention: the system should refuse to answer when confidence is low rather than risk a convincing but wrong response. Users prefer "I don't know" over a confident hallucination that could lead to harmful decisions.

Knowledge Check

1. What is the difference between intrinsic and extrinsic hallucination?

Show Answer
Intrinsic hallucination contradicts the provided source text (e.g., stating wrong numbers from a document). Extrinsic hallucination introduces information not present in any source (e.g., adding fabricated claims). Intrinsic hallucinations are easier to detect because they conflict with available evidence.

2. How does self-consistency detect hallucination?

Show Answer
Self-consistency generates multiple responses to the same question at high temperature and checks whether they agree. When the model is confident and correct, different samples tend to converge on the same answer. When the model is hallucinating, samples diverge because there is no grounded answer to converge on. Low agreement signals potential hallucination.

3. Why is NLI useful for hallucination detection in RAG systems?

Show Answer
NLI (Natural Language Inference) models classify whether a hypothesis is entailed by, contradicts, or is neutral to a premise. In RAG, the retrieved context is the premise and each claim in the LLM output is a hypothesis. Claims labeled as "contradiction" indicate intrinsic hallucination; "neutral" claims may indicate extrinsic hallucination.

4. What is calibrated abstention and when should it be used?

Show Answer
Calibrated abstention is a strategy where the system refuses to answer when its confidence falls below a threshold, responding with "I don't have enough information" instead. It should be used in high-stakes domains (medical, legal, financial) where a confident but wrong answer could cause real harm, and where users would rather receive no answer than an unreliable one.

5. Why doesn't RAG completely solve the hallucination problem?

Show Answer
RAG provides relevant context, but the model can still hallucinate for several reasons: it may perform incorrect reasoning over the context, ignore the context and rely on parametric knowledge, combine information from multiple passages in invalid ways, or extrapolate beyond what the context actually states. RAG reduces but does not eliminate the fundamental tendency to generate unsupported claims.

Key Takeaways