RAG systems and agents introduce evaluation challenges that go far beyond standard LLM metrics. A RAG pipeline can fail at retrieval (wrong documents), at generation (hallucinating beyond retrieved context), or at both. An agent can select the wrong tool, call tools in the wrong order, or produce a correct final answer through an unsafe trajectory. This section covers specialized evaluation metrics and frameworks for these compound systems, including RAGAS metrics (faithfulness, answer relevancy, context precision, context recall), agent task completion and trajectory evaluation, and practical frameworks like DeepEval, Ragas, and Phoenix.
1. Why RAG and Agent Evaluation Is Different
Standard LLM evaluation treats the model as a black box: question in, answer out. RAG and agent systems are multi-component pipelines where failures can occur at any stage. Evaluating only the final answer misses critical information about where the system failed and how to fix it. Component-level evaluation isolates retrieval quality from generation quality, enabling targeted improvements.
For RAG systems, the two fundamental questions are: (1) Did the retriever find the right information? (2) Did the generator use that information faithfully? For agents, the questions expand to: Did the agent choose the right tools? Did it call them with correct parameters? Did it follow a safe and efficient trajectory? Was the final answer correct?
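The retrieval half of that split can be checked with plain set arithmetic, before any LLM judge is involved. A minimal sketch, assuming you have labeled the relevant chunk IDs for each question (the document IDs here are hypothetical):

```python
def retrieval_scores(retrieved_ids: list[str], relevant_ids: set[str]) -> dict:
    """Set-based retrieval metrics for a single query."""
    hits = [doc_id for doc_id in retrieved_ids if doc_id in relevant_ids]
    return {
        # Of what we retrieved, how much was relevant?
        "precision": len(hits) / len(retrieved_ids) if retrieved_ids else 0.0,
        # Of what was relevant, how much did we retrieve?
        "recall": len(hits) / len(relevant_ids) if relevant_ids else 0.0,
        # Did we retrieve at least one relevant chunk?
        "hit_rate": 1.0 if hits else 0.0,
    }

scores = retrieval_scores(
    retrieved_ids=["doc_12", "doc_40", "doc_7"],
    relevant_ids={"doc_12", "doc_33"},
)
print(scores)  # precision 1/3, recall 0.5, hit_rate 1.0
```

These are the same quantities the RAGAS context precision and context recall metrics estimate, except RAGAS uses an LLM judge to decide relevance instead of requiring chunk-level labels.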
2. RAGAS Metrics
RAGAS (Retrieval Augmented Generation Assessment) is a framework that decomposes RAG evaluation into component-level metrics. Each metric isolates a specific aspect of the pipeline, making it possible to diagnose whether failures originate in retrieval, generation, or both.
Core RAGAS Metrics
| Metric | What It Measures | Requires Ground Truth? | Score Range |
|---|---|---|---|
| Faithfulness | Whether the answer is supported by the retrieved context | No (uses context only) | 0 to 1 |
| Answer Relevancy | Whether the answer addresses the question | No | 0 to 1 |
| Context Precision | Whether retrieved chunks are relevant (not noisy) | Yes | 0 to 1 |
| Context Recall | Whether all necessary information was retrieved | Yes | 0 to 1 |
| Answer Correctness | Factual accuracy of the final answer | Yes | 0 to 1 |
```python
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
)
from datasets import Dataset

# Prepare evaluation dataset with required columns
eval_data = {
    "question": [
        "What is the capital of France?",
        "How does photosynthesis work?",
        "What causes tides?",
    ],
    "answer": [
        "The capital of France is Paris, which is also its largest city.",
        "Photosynthesis converts CO2 and water into glucose using sunlight.",
        "Tides are caused by gravitational pull of the moon and sun.",
    ],
    "contexts": [
        ["Paris is the capital and most populous city of France."],
        ["Photosynthesis is a process by which plants convert light energy into chemical energy."],
        ["Ocean tides are caused by the gravitational forces of the moon."],
    ],
    "ground_truth": [
        "Paris is the capital of France.",
        "Photosynthesis converts light energy, CO2, and water into glucose and oxygen.",
        "Tides are primarily caused by the moon's gravitational pull on Earth's oceans.",
    ],
}
dataset = Dataset.from_dict(eval_data)

# Run RAGAS evaluation
results = evaluate(
    dataset,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)
print(results)

print("\nPer-question breakdown:")
df = results.to_pandas()
print(df.to_string(index=False))
```

Faithfulness measures whether the answer is supported by the retrieved context, while correctness measures whether it matches the ground truth. An answer can be faithful (derived only from context) but incorrect (the context itself was wrong or incomplete). Conversely, an answer can be correct but unfaithful (the model "knew" the answer and ignored the context). Both metrics are needed to fully diagnose RAG failures.
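This two-by-two distinction can be made operational with a small diagnostic that maps a (faithfulness, correctness) score pair to a likely failure site. The 0.7 threshold below is an illustrative choice, not a RAGAS default:

```python
def diagnose_rag_failure(faithfulness: float, correctness: float,
                         threshold: float = 0.7) -> str:
    """Map a (faithfulness, correctness) score pair to a likely failure site."""
    faithful = faithfulness >= threshold
    correct = correctness >= threshold
    if faithful and correct:
        return "healthy"
    if faithful and not correct:
        # Grounded in context, but wrong: the context itself was bad
        return "retrieval failure: answer grounded in wrong or incomplete context"
    if not faithful and correct:
        # Right answer, but not from the context
        return "generation bypassed context: answer likely from parametric memory"
    return "compound failure: inspect both retrieval and generation"

print(diagnose_rag_failure(0.9, 0.2))
```

In practice you would run this over a whole evaluation set and aggregate the labels, which tells you whether to invest in the retriever or the generator first.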
Custom Faithfulness Scorer
```python
from openai import OpenAI
import json

client = OpenAI()

def score_faithfulness(question: str, answer: str, contexts: list[str]) -> dict:
    """Score faithfulness: is the answer grounded in the provided context?

    Uses an LLM judge to decompose the answer into claims and verify
    each claim against the context.
    """
    context_text = "\n\n".join(f"Context {i+1}: {c}" for i, c in enumerate(contexts))
    prompt = f"""Evaluate the faithfulness of an answer to the provided context.

QUESTION: {question}

CONTEXT:
{context_text}

ANSWER: {answer}

Instructions:
1. Decompose the answer into individual factual claims
2. For each claim, determine if it is SUPPORTED or NOT SUPPORTED by the context
3. Return JSON with:
   - "claims": list of {{"claim": str, "supported": bool, "evidence": str}}
   - "faithfulness_score": fraction of supported claims (0.0 to 1.0)"""

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
        temperature=0,
    )
    return json.loads(response.choices[0].message.content)

# Example usage
result = score_faithfulness(
    question="What is the capital of France?",
    answer="Paris is the capital of France with a population of 2.1 million.",
    contexts=["Paris is the capital and most populous city of France."],
)
print(json.dumps(result, indent=2))
```
3. Agent Evaluation
Evaluating agents is fundamentally harder than evaluating simple question-answering systems because agents take multi-step actions with real-world side effects. A correct final answer does not mean the agent followed a safe or efficient path to get there. Agent evaluation therefore requires assessing multiple dimensions: task completion, tool selection accuracy, parameter correctness, trajectory efficiency, and safety.
Evaluation Dimensions for Agents
- Task completion: Did the agent achieve the stated goal?
- Tool accuracy: Did the agent select the correct tools for each step?
- Parameter correctness: Were tool calls made with valid and appropriate parameters?
- Trajectory efficiency: Did the agent take unnecessary steps or redundant tool calls?
- Safety: Did the agent avoid dangerous or unauthorized actions?
- Cost efficiency: How many LLM calls and tokens were consumed?
```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ToolCall:
    """A single tool call in an agent trajectory."""
    tool_name: str
    parameters: dict
    result: Optional[str] = None
    is_correct_tool: Optional[bool] = None

@dataclass
class AgentTrajectory:
    """Complete record of an agent's execution."""
    task: str
    tool_calls: list[ToolCall]
    final_answer: str
    total_tokens: int = 0
    total_latency_ms: float = 0

def evaluate_agent_trajectory(
    trajectory: AgentTrajectory,
    ideal_tools: list[str],
    expected_answer: str,
    answer_checker=None,
) -> dict:
    """Evaluate an agent trajectory against an ideal reference.

    Args:
        trajectory: The actual execution trajectory
        ideal_tools: Ordered list of expected tool names
        expected_answer: The ground truth answer
        answer_checker: Optional function for fuzzy answer matching
    """
    actual_tools = [tc.tool_name for tc in trajectory.tool_calls]

    # Task completion: did the agent get the right answer?
    if answer_checker:
        task_complete = answer_checker(trajectory.final_answer, expected_answer)
    else:
        task_complete = (
            trajectory.final_answer.strip().lower() == expected_answer.strip().lower()
        )

    # Tool accuracy: fraction of calls using correct tools
    correct_tools = sum(1 for t in actual_tools if t in ideal_tools)
    tool_accuracy = correct_tools / len(actual_tools) if actual_tools else 0

    # Trajectory efficiency: ideal steps / actual steps, capped at 1.0
    efficiency = min(len(ideal_tools) / len(actual_tools), 1.0) if actual_tools else 0

    # Redundancy: duplicate consecutive tool calls
    redundant = sum(
        1 for i in range(1, len(actual_tools))
        if actual_tools[i] == actual_tools[i - 1]
    )

    return {
        "task_completed": task_complete,
        "tool_accuracy": round(tool_accuracy, 3),
        "trajectory_efficiency": round(efficiency, 3),
        "num_steps": len(actual_tools),
        "ideal_steps": len(ideal_tools),
        "redundant_calls": redundant,
        "total_tokens": trajectory.total_tokens,
    }
```
Agent evaluation should weight task completion most heavily, since a correct answer through an inefficient path is better than an efficient trajectory with a wrong answer. However, trajectory quality matters for cost, latency, and safety. In production, an agent that consistently takes extra steps will cost more and may expose the system to more failure points. Evaluate both dimensions and set acceptable thresholds for each.
4. Evaluation Frameworks Comparison
| Framework | Focus | Key Metrics | Strengths | Limitations |
|---|---|---|---|---|
| Ragas | RAG evaluation | Faithfulness, relevancy, context precision/recall | Most comprehensive RAG metrics; good HF integration | Relies on LLM judge; can be slow |
| DeepEval | General LLM testing | Hallucination, bias, toxicity, custom metrics | pytest integration; CI/CD friendly | Less RAG-specific depth than Ragas |
| Phoenix (Arize) | Observability + eval | Trace-level metrics, embedding analysis | Visual UI; traces + evals combined | Heavier infrastructure requirement |
| TruLens | Feedback functions | Groundedness, relevance, custom feedback | Modular feedback system; provider-agnostic | Smaller community than alternatives |
| promptfoo | Prompt testing | Assertion-based, custom evals | CLI-first; fast iteration; CI/CD native | Less suited for complex agent evaluation |
```python
from deepeval import assert_test
from deepeval.test_case import LLMTestCase
from deepeval.metrics import (
    FaithfulnessMetric,
    AnswerRelevancyMetric,
    HallucinationMetric,
)

def test_rag_with_deepeval():
    """Example: Testing RAG output quality with DeepEval."""
    context_docs = [
        "Solar energy is a renewable source of power that reduces "
        "dependence on fossil fuels.",
        "Solar panels require minimal maintenance and can reduce "
        "electricity bills by up to 50%.",
    ]
    test_case = LLMTestCase(
        input="What are the benefits of solar energy?",
        actual_output="Solar energy is renewable, reduces electricity bills, "
        "and has low maintenance costs.",
        retrieval_context=context_docs,
        # HallucinationMetric evaluates against `context` rather than
        # `retrieval_context`, so supply both
        context=context_docs,
    )

    # Define metrics with thresholds
    faithfulness = FaithfulnessMetric(threshold=0.7, model="gpt-4o")
    relevancy = AnswerRelevancyMetric(threshold=0.7, model="gpt-4o")
    hallucination = HallucinationMetric(threshold=0.5, model="gpt-4o")

    # Assert all metrics pass (integrates with pytest)
    assert_test(test_case, [faithfulness, relevancy, hallucination])

# Run with: pytest test_rag.py -v
```
All framework-computed metrics that rely on LLM judges inherit the biases and limitations of the judge model. Faithfulness scores can be unreliable when the context is ambiguous or when the judge model hallucinates its own assessment. Always validate framework metrics against a set of human-annotated examples before trusting them for production decisions. Establish the correlation between the automated metric and human judgment on your specific data.
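Establishing that correlation can be as simple as computing Spearman's rank correlation between the automated scores and human labels. A dependency-free sketch with hypothetical data (in practice `scipy.stats.spearmanr` is the usual tool):

```python
import math

def _avg_ranks(values: list[float]) -> list[float]:
    """1-based ranks, averaging over ties (needed for binary human labels)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average 1-based rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(xs: list[float], ys: list[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    rx, ry = _avg_ranks(xs), _avg_ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)

# Hypothetical data: automated faithfulness scores vs. binary human labels
auto_scores = [0.9, 0.4, 0.7, 0.2, 0.8]
human_labels = [1, 0, 1, 0, 1]
print(round(spearman(auto_scores, human_labels), 3))  # 0.866
```

A correlation well below roughly 0.7 on your own annotated sample is a signal that the automated metric is not measuring what your annotators care about, and its thresholds should not gate deployments.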
Key Takeaways
- Evaluate RAG at component level, not just end-to-end. Use retrieval metrics (context precision, context recall) and generation metrics (faithfulness, answer relevancy) to isolate where failures occur and guide targeted improvements.
- Faithfulness is the most critical RAG metric. A RAG system that generates unfaithful answers (hallucinating beyond context) undermines the entire purpose of retrieval-augmented generation. Monitor faithfulness continuously in production.
- Agent evaluation requires trajectory analysis. Assess tool accuracy, parameter correctness, and trajectory efficiency alongside task completion. An agent that completes tasks through unsafe or wasteful paths is a production liability.
- Choose frameworks based on your primary need. Ragas excels at RAG-specific metrics, DeepEval integrates with testing pipelines, Phoenix combines observability with evaluation, and promptfoo enables rapid prompt iteration.
- Validate automated metrics against human judgment. LLM-based evaluation metrics inherit judge model biases. Always establish correlation with human annotations before trusting automated scores for deployment decisions.