Module 25 · Section 25.3

RAG & Agent Evaluation

RAGAS metrics, faithfulness, context precision and recall, agent trajectory evaluation, and evaluation frameworks
★ Big Picture

RAG systems and agents introduce evaluation challenges that go far beyond standard LLM metrics. A RAG pipeline can fail at retrieval (wrong documents), at generation (hallucinating beyond retrieved context), or at both. An agent can select the wrong tool, call tools in the wrong order, or produce a correct final answer through an unsafe trajectory. This section covers specialized evaluation metrics and frameworks for these compound systems, including RAGAS metrics (faithfulness, answer relevancy, context precision, context recall), agent task completion and trajectory evaluation, and practical frameworks like DeepEval, Ragas, and Phoenix.

1. Why RAG and Agent Evaluation Is Different

Standard LLM evaluation treats the model as a black box: question in, answer out. RAG and agent systems are multi-component pipelines where failures can occur at any stage. Evaluating only the final answer misses critical information about where the system failed and how to fix it. Component-level evaluation isolates retrieval quality from generation quality, enabling targeted improvements.

For RAG systems, the two fundamental questions are: (1) Did the retriever find the right information? (2) Did the generator use that information faithfully? For agents, the questions expand to: Did the agent choose the right tools? Did it call them with correct parameters? Did it follow a safe and efficient trajectory? Was the final answer correct?

[Figure: RAG pipeline — User Query → Retriever (vector search) → Context (retrieved chunks) → Generator (LLM answer). Retrieval metrics: context precision, context recall, context relevancy. Generation metrics: faithfulness, answer relevancy, hallucination rate. End-to-end metrics: answer correctness, answer similarity, latency. Ground-truth annotations are required for context recall and answer correctness.]
Figure 25.7: Evaluation metrics mapped to each stage of the RAG pipeline.

2. RAGAS Metrics

RAGAS (Retrieval Augmented Generation Assessment) is a framework that decomposes RAG evaluation into component-level metrics. Each metric isolates a specific aspect of the pipeline, making it possible to diagnose whether failures originate in retrieval, generation, or both.

Core RAGAS Metrics

| Metric | What It Measures | Requires Ground Truth? | Score Range |
| --- | --- | --- | --- |
| Faithfulness | Whether the answer is supported by the retrieved context | No (uses context only) | 0 to 1 |
| Answer Relevancy | Whether the answer addresses the question | No | 0 to 1 |
| Context Precision | Whether retrieved chunks are relevant (not noisy) | Yes | 0 to 1 |
| Context Recall | Whether all necessary information was retrieved | Yes | 0 to 1 |
| Answer Correctness | Factual accuracy of the final answer | Yes | 0 to 1 |
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
)
from datasets import Dataset

# Prepare evaluation dataset with required columns
eval_data = {
    "question": [
        "What is the capital of France?",
        "How does photosynthesis work?",
        "What causes tides?",
    ],
    "answer": [
        "The capital of France is Paris, which is also its largest city.",
        "Photosynthesis converts CO2 and water into glucose using sunlight.",
        "Tides are caused by gravitational pull of the moon and sun.",
    ],
    "contexts": [
        ["Paris is the capital and most populous city of France."],
        ["Photosynthesis is a process by which plants convert light energy into chemical energy."],
        ["Ocean tides are caused by the gravitational forces of the moon."],
    ],
    "ground_truth": [
        "Paris is the capital of France.",
        "Photosynthesis converts light energy, CO2, and water into glucose and oxygen.",
        "Tides are primarily caused by the moon's gravitational pull on Earth's oceans.",
    ],
}

dataset = Dataset.from_dict(eval_data)

# Run RAGAS evaluation
results = evaluate(
    dataset,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)

print(results)
print("\nPer-question breakdown:")
df = results.to_pandas()
print(df.to_string(index=False))
{'faithfulness': 0.9444, 'answer_relevancy': 0.9231, 'context_precision': 0.8889, 'context_recall': 0.8333}

Per-question breakdown:
                      question  faithfulness  answer_relevancy  context_precision  context_recall
What is the capital of France?        1.0000            0.9500             1.0000          1.0000
 How does photosynthesis work?        0.8333            0.9193             0.6667          0.7500
            What causes tides?        1.0000            0.9000             1.0000          0.7500
📝 Faithfulness vs. Correctness

Faithfulness measures whether the answer is supported by the retrieved context, while correctness measures whether it matches the ground truth. An answer can be faithful (derived only from context) but incorrect (the context itself was wrong or incomplete). Conversely, an answer can be correct but unfaithful (the model "knew" the answer and ignored the context). Both metrics are needed to fully diagnose RAG failures.
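The distinction can be made concrete with a toy claim-overlap sketch. The claim strings and set arithmetic below are illustrative only; real frameworks extract claims with an LLM judge rather than comparing literal strings:

```python
# Toy sketch: faithfulness compares answer claims to the retrieved context,
# correctness compares them to the ground truth. Claim strings are hypothetical.
context_claims = {"Paris is the capital of France", "Eiffel Tower is 300m tall"}  # stale source
truth_claims = {"Paris is the capital of France", "Eiffel Tower is 330m tall"}

answer_claims = {"Eiffel Tower is 300m tall"}  # answer drawn only from the context

faithfulness = len(answer_claims & context_claims) / len(answer_claims)
correctness = len(answer_claims & truth_claims) / len(answer_claims)

print(faithfulness, correctness)  # 1.0 0.0 -- faithful but incorrect
```

Here the answer scores perfect faithfulness because it repeats the (outdated) context verbatim, yet zero correctness because the context itself was wrong, which is exactly the failure mode the note above describes.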

Custom Faithfulness Scorer

from openai import OpenAI
import json

client = OpenAI()

def score_faithfulness(question: str, answer: str, contexts: list[str]) -> dict:
    """Score faithfulness: is the answer grounded in the provided context?

    Uses an LLM judge to decompose the answer into claims and verify
    each claim against the context.
    """
    context_text = "\n\n".join(f"Context {i+1}: {c}" for i, c in enumerate(contexts))

    prompt = f"""Evaluate the faithfulness of an answer to the provided context.

QUESTION: {question}
CONTEXT:
{context_text}

ANSWER: {answer}

Instructions:
1. Decompose the answer into individual factual claims
2. For each claim, determine if it is SUPPORTED or NOT SUPPORTED by the context
3. Return JSON with:
   - "claims": list of {{"claim": str, "supported": bool, "evidence": str}}
   - "faithfulness_score": fraction of supported claims (0.0 to 1.0)"""

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
        temperature=0,
    )
    return json.loads(response.choices[0].message.content)

# Example usage
result = score_faithfulness(
    question="What is the capital of France?",
    answer="Paris is the capital of France with a population of 2.1 million.",
    contexts=["Paris is the capital and most populous city of France."]
)
print(json.dumps(result, indent=2))
{
  "claims": [
    {"claim": "Paris is the capital of France", "supported": true, "evidence": "Context states Paris is the capital of France"},
    {"claim": "Paris has a population of 2.1 million", "supported": false, "evidence": "Population figure not mentioned in context"}
  ],
  "faithfulness_score": 0.5
}

3. Agent Evaluation

Evaluating agents is fundamentally harder than evaluating simple question-answering systems because agents take multi-step actions with real-world side effects. A correct final answer does not mean the agent followed a safe or efficient path to get there. Agent evaluation therefore requires assessing multiple dimensions: task completion, tool selection accuracy, parameter correctness, trajectory efficiency, and safety.

Evaluation Dimensions for Agents

[Figure: Ideal trajectory (3 steps): search_db(q) → calculate(data) → format_report(r) → final answer. Actual trajectory (6 steps): search_web(q) → search_db(q) → search_db(q) → calculate(data) → format_report(r) → answer. Issues: wrong tool first (search_web), duplicate call (search_db ×2), 2× token cost. Task completion: correct, but trajectory efficiency: 3/6 = 0.50.]
Figure 25.8: Comparing ideal vs. actual agent trajectories. The agent got the right answer but through an inefficient path.
from dataclasses import dataclass
from typing import Optional

@dataclass
class ToolCall:
    """A single tool call in an agent trajectory."""
    tool_name: str
    parameters: dict
    result: Optional[str] = None
    is_correct_tool: Optional[bool] = None

@dataclass
class AgentTrajectory:
    """Complete record of an agent's execution."""
    task: str
    tool_calls: list[ToolCall]
    final_answer: str
    total_tokens: int = 0
    total_latency_ms: float = 0

def evaluate_agent_trajectory(
    trajectory: AgentTrajectory,
    ideal_tools: list[str],
    expected_answer: str,
    answer_checker=None
) -> dict:
    """Evaluate an agent trajectory against an ideal reference.

    Args:
        trajectory: The actual execution trajectory
        ideal_tools: Ordered list of expected tool names
        expected_answer: The ground truth answer
        answer_checker: Optional function for fuzzy answer matching
    """
    actual_tools = [tc.tool_name for tc in trajectory.tool_calls]

    # Task completion: did the agent get the right answer?
    if answer_checker:
        task_complete = answer_checker(trajectory.final_answer, expected_answer)
    else:
        task_complete = trajectory.final_answer.strip().lower() == expected_answer.strip().lower()

    # Tool accuracy: fraction of calls using correct tools
    correct_tools = sum(
        1 for t in actual_tools if t in ideal_tools
    )
    tool_accuracy = correct_tools / len(actual_tools) if actual_tools else 0

    # Trajectory efficiency: ideal steps / actual steps
    efficiency = min(len(ideal_tools) / len(actual_tools), 1.0) if actual_tools else 0

    # Redundancy: duplicate consecutive tool calls
    redundant = sum(
        1 for i in range(1, len(actual_tools))
        if actual_tools[i] == actual_tools[i - 1]
    )

    return {
        "task_completed": task_complete,
        "tool_accuracy": round(tool_accuracy, 3),
        "trajectory_efficiency": round(efficiency, 3),
        "num_steps": len(actual_tools),
        "ideal_steps": len(ideal_tools),
        "redundant_calls": redundant,
        "total_tokens": trajectory.total_tokens,
    }
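Applying these formulas to the trajectory in Figure 25.8, using plain tool-name lists so the arithmetic is easy to follow (note the figure counts the final-answer step, giving 3/6 = 0.50, while counting tool calls only, as `evaluate_agent_trajectory` does, gives 3/5 = 0.60):

```python
# Trajectory from Figure 25.8, as plain tool-name lists.
ideal = ["search_db", "calculate", "format_report"]
actual = ["search_web", "search_db", "search_db", "calculate", "format_report"]

tool_accuracy = sum(t in ideal for t in actual) / len(actual)  # 4 of 5 calls used a valid tool
efficiency = min(len(ideal) / len(actual), 1.0)                # 3 ideal steps / 5 actual calls
redundant = sum(actual[i] == actual[i - 1] for i in range(1, len(actual)))  # duplicate search_db

print(tool_accuracy, efficiency, redundant)  # 0.8 0.6 1
```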
💡 Key Insight

Agent evaluation should weight task completion most heavily, since a correct answer through an inefficient path is better than an efficient trajectory with a wrong answer. However, trajectory quality matters for cost, latency, and safety. In production, an agent that consistently takes extra steps will cost more and may expose the system to more failure points. Evaluate both dimensions and set acceptable thresholds for each.
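One way to operationalize this is a weighted composite score. The weights below are illustrative, not prescriptive; choose them to match your own cost, latency, and safety priorities:

```python
# Hypothetical weighting: task completion dominates; trajectory quality refines.
weights = {"task_completed": 0.6, "tool_accuracy": 0.2, "trajectory_efficiency": 0.2}
scores = {"task_completed": 1.0, "tool_accuracy": 0.8, "trajectory_efficiency": 0.6}

composite = sum(weights[k] * scores[k] for k in weights)
print(round(composite, 2))  # 0.88
```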

4. Evaluation Frameworks Comparison

| Framework | Focus | Key Metrics | Strengths | Limitations |
| --- | --- | --- | --- | --- |
| Ragas | RAG evaluation | Faithfulness, relevancy, context precision/recall | Most comprehensive RAG metrics; good HF integration | Relies on LLM judge; can be slow |
| DeepEval | General LLM testing | Hallucination, bias, toxicity, custom metrics | pytest integration; CI/CD friendly | Less RAG-specific depth than Ragas |
| Phoenix (Arize) | Observability + eval | Trace-level metrics, embedding analysis | Visual UI; traces + evals combined | Heavier infrastructure requirement |
| TruLens | Feedback functions | Groundedness, relevance, custom feedback | Modular feedback system; provider-agnostic | Smaller community than alternatives |
| promptfoo | Prompt testing | Assertion-based, custom evals | CLI-first; fast iteration; CI/CD native | Less suited for complex agent evaluation |
from deepeval import assert_test
from deepeval.test_case import LLMTestCase
from deepeval.metrics import (
    FaithfulnessMetric,
    AnswerRelevancyMetric,
    HallucinationMetric,
)

def test_rag_with_deepeval():
    """Example: Testing RAG output quality with DeepEval."""
    test_case = LLMTestCase(
        input="What are the benefits of solar energy?",
        actual_output="Solar energy is renewable, reduces electricity bills, "
                      "and has low maintenance costs.",
        retrieval_context=[
            "Solar energy is a renewable source of power that reduces "
            "dependence on fossil fuels.",
            "Solar panels require minimal maintenance and can reduce "
            "electricity bills by up to 50%.",
        ],
        # HallucinationMetric evaluates against `context` (treated as ground
        # truth), not `retrieval_context`, so both fields are provided here.
        context=[
            "Solar energy is a renewable source of power that reduces "
            "dependence on fossil fuels.",
            "Solar panels require minimal maintenance and can reduce "
            "electricity bills by up to 50%.",
        ],
    )

    # Define metrics with thresholds
    faithfulness = FaithfulnessMetric(threshold=0.7, model="gpt-4o")
    relevancy = AnswerRelevancyMetric(threshold=0.7, model="gpt-4o")
    hallucination = HallucinationMetric(threshold=0.5, model="gpt-4o")

    # Assert all metrics pass (integrates with pytest)
    assert_test(test_case, [faithfulness, relevancy, hallucination])

# Run with: pytest test_rag.py -v
⚠ LLM Judges Are Not Perfect

All framework-computed metrics that rely on LLM judges inherit the biases and limitations of the judge model. Faithfulness scores can be unreliable when the context is ambiguous or when the judge model hallucinates its own assessment. Always validate framework metrics against a set of human-annotated examples before trusting them for production decisions. Establish the correlation between the automated metric and human judgment on your specific data.
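For the calibration check, a pure-Python Spearman rank correlation is sufficient (this simple version does not handle tied scores; use scipy.stats.spearmanr in practice). The human and automated score lists below are made up for illustration:

```python
def spearman(xs: list[float], ys: list[float]) -> float:
    """Spearman rank correlation (no tie handling)."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0] * len(v)
        for rank, i in enumerate(order, start=1):
            r[i] = rank  # rank 1 = smallest value
        return r

    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Hypothetical calibration set: human vs. automated faithfulness scores
human = [0.9, 0.2, 0.7, 1.0, 0.4, 0.6]
auto = [0.8, 0.3, 0.6, 0.9, 0.5, 0.7]
print(round(spearman(human, auto), 3))  # 0.943 -- above the 0.7 rule of thumb
```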

📝 Knowledge Check

1. What is the difference between faithfulness and answer correctness in RAG evaluation?
Show Answer
Faithfulness measures whether the answer is supported by the retrieved context (regardless of whether the context is correct). Answer correctness measures whether the answer matches the ground truth. An answer can be faithful but incorrect (the retrieved context was wrong) or correct but unfaithful (the model ignored context and used its own knowledge). Faithfulness does not require ground truth labels, while answer correctness does.
2. Why is trajectory evaluation important for agents, even when the final answer is correct?
Show Answer
A correct final answer does not guarantee a safe or efficient execution path. The agent may have called unnecessary tools (increasing cost and latency), used dangerous tools it should have avoided, leaked sensitive data through intermediate steps, or reached the correct answer through an unreliable chain of reasoning that could fail on slightly different inputs. Trajectory evaluation helps identify these issues and improves the robustness and cost-efficiency of the agent.
3. When would you choose DeepEval over Ragas for RAG evaluation?
Show Answer
DeepEval is preferable when you need tight integration with pytest and CI/CD pipelines, when you want to evaluate dimensions beyond RAG quality (such as toxicity, bias, or hallucination), or when you want a unified framework for testing multiple types of LLM applications. Ragas is preferable when you need the most comprehensive set of RAG-specific metrics (especially component-level retrieval metrics like context precision and recall) and when RAG evaluation is your primary use case.
4. How does context precision differ from context recall, and which requires ground truth?
Show Answer
Context precision measures the proportion of retrieved chunks that are actually relevant to answering the question (penalizing noise in retrieval). Context recall measures the proportion of information needed to answer the question that was actually retrieved (penalizing missed information). Both require ground truth in the RAGAS framework: context precision needs annotated relevance judgments, and context recall needs a ground truth answer to assess whether all necessary information was retrieved.
5. What is a practical strategy for validating that an LLM-based evaluation metric correlates with human judgment?
Show Answer
Create a calibration set of 50 to 100 examples with human-annotated scores. Run the LLM-based metric on the same examples and compute the correlation (Spearman or Pearson) between automated and human scores. A correlation above 0.7 is generally acceptable. Also examine the examples where the automated metric and human judgment disagree most strongly to understand the metric's failure modes. Repeat this validation periodically, especially after changing the judge model or prompt.

Key Takeaways