RAG systems and agents introduce evaluation challenges that go far beyond standard LLM metrics. A RAG pipeline can fail at retrieval (wrong documents), at generation (hallucinating beyond retrieved context), or at both. An agent can select the wrong tool, call tools in the wrong order, or produce a correct final answer through an unsafe trajectory. This section covers specialized evaluation metrics and frameworks for these compound systems, including RAGAS metrics (faithfulness, answer relevancy, context precision, context recall), agent task completion and trajectory evaluation, and practical frameworks like DeepEval, Ragas, and Phoenix.
1. Why RAG and Agent Evaluation Is Different
Standard LLM evaluation treats the model as a black box: question in, answer out. RAG and agent systems are multi-component pipelines where failures can occur at any stage. Evaluating only the final answer misses critical information about where the system failed and how to fix it. Component-level evaluation isolates retrieval quality from generation quality, enabling targeted improvements.
For RAG systems, the two fundamental questions are: (1) Did the retriever find the right information? (2) Did the generator use that information faithfully? For agents, the questions expand to: Did the agent choose the right tools? Did it call them with correct parameters? Did it follow a safe and efficient trajectory? Was the final answer correct?
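The retrieval half of that split can be checked with plain set arithmetic, before any LLM judge is involved. A minimal sketch, assuming you have labeled the relevant chunk IDs for each question (the document IDs here are hypothetical):

```python
def retrieval_scores(retrieved_ids: list[str], relevant_ids: set[str]) -> dict:
    """Set-based retrieval metrics for a single query."""
    hits = [doc_id for doc_id in retrieved_ids if doc_id in relevant_ids]
    return {
        # Of what we retrieved, how much was relevant?
        "precision": len(hits) / len(retrieved_ids) if retrieved_ids else 0.0,
        # Of what was relevant, how much did we retrieve?
        "recall": len(hits) / len(relevant_ids) if relevant_ids else 0.0,
        # Did we retrieve at least one relevant chunk?
        "hit_rate": 1.0 if hits else 0.0,
    }

scores = retrieval_scores(
    retrieved_ids=["doc_12", "doc_40", "doc_7"],
    relevant_ids={"doc_12", "doc_33"},
)
print(scores)  # precision 1/3, recall 0.5, hit_rate 1.0
```

These are the same quantities the RAGAS context precision and context recall metrics estimate, except RAGAS uses an LLM judge to decide relevance instead of requiring chunk-level labels.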
2. RAGAS Metrics
RAGAS (Retrieval Augmented Generation Assessment) is a framework that decomposes RAG evaluation into component-level metrics. Each metric isolates a specific aspect of the pipeline, making it possible to diagnose whether failures originate in retrieval, generation, or both.
Core RAGAS Metrics
| Metric | What It Measures | Requires Ground Truth? | Score Range |
|---|---|---|---|
| Faithfulness | Whether the answer is supported by the retrieved context | No (uses context only) | 0 to 1 |
| Answer Relevancy | Whether the answer addresses the question | No | 0 to 1 |
| Context Precision | Whether retrieved chunks are relevant (not noisy) | Yes | 0 to 1 |
| Context Recall | Whether all necessary information was retrieved | Yes | 0 to 1 |
| Answer Correctness | Factual accuracy of the final answer | Yes | 0 to 1 |
```python
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
)
from datasets import Dataset

# Prepare evaluation dataset with required columns
eval_data = {
    "question": [
        "What is the capital of France?",
        "How does photosynthesis work?",
        "What causes tides?",
    ],
    "answer": [
        "The capital of France is Paris, which is also its largest city.",
        "Photosynthesis converts CO2 and water into glucose using sunlight.",
        "Tides are caused by gravitational pull of the moon and sun.",
    ],
    "contexts": [
        ["Paris is the capital and most populous city of France."],
        ["Photosynthesis is a process by which plants convert light energy into chemical energy."],
        ["Ocean tides are caused by the gravitational forces of the moon."],
    ],
    "ground_truth": [
        "Paris is the capital of France.",
        "Photosynthesis converts light energy, CO2, and water into glucose and oxygen.",
        "Tides are primarily caused by the moon's gravitational pull on Earth's oceans.",
    ],
}
dataset = Dataset.from_dict(eval_data)

# Run RAGAS evaluation
results = evaluate(
    dataset,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)
print(results)

print("\nPer-question breakdown:")
df = results.to_pandas()
print(df.to_string(index=False))
```

Faithfulness measures whether the answer is supported by the retrieved context, while correctness measures whether it matches the ground truth. An answer can be faithful (derived only from context) but incorrect (the context itself was wrong or incomplete). Conversely, an answer can be correct but unfaithful (the model "knew" the answer and ignored the context). Both metrics are needed to fully diagnose RAG failures.
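This two-by-two distinction can be made operational with a small diagnostic that maps a (faithfulness, correctness) score pair to a likely failure site. The 0.7 threshold below is an illustrative choice, not a RAGAS default:

```python
def diagnose_rag_failure(faithfulness: float, correctness: float,
                         threshold: float = 0.7) -> str:
    """Map a (faithfulness, correctness) score pair to a likely failure site."""
    faithful = faithfulness >= threshold
    correct = correctness >= threshold
    if faithful and correct:
        return "healthy"
    if faithful and not correct:
        # Grounded in context, but wrong: the context itself was bad
        return "retrieval failure: answer grounded in wrong or incomplete context"
    if not faithful and correct:
        # Right answer, but not from the context
        return "generation bypassed context: answer likely from parametric memory"
    return "compound failure: inspect both retrieval and generation"

print(diagnose_rag_failure(0.9, 0.2))
```

In practice you would run this over a whole evaluation set and aggregate the labels, which tells you whether to invest in the retriever or the generator first.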
Custom Faithfulness Scorer
```python
from openai import OpenAI
import json

client = OpenAI()

def score_faithfulness(question: str, answer: str, contexts: list[str]) -> dict:
    """Score faithfulness: is the answer grounded in the provided context?

    Uses an LLM judge to decompose the answer into claims and verify
    each claim against the context.
    """
    context_text = "\n\n".join(f"Context {i+1}: {c}" for i, c in enumerate(contexts))
    prompt = f"""Evaluate the faithfulness of an answer to the provided context.

QUESTION: {question}

CONTEXT:
{context_text}

ANSWER: {answer}

Instructions:
1. Decompose the answer into individual factual claims
2. For each claim, determine if it is SUPPORTED or NOT SUPPORTED by the context
3. Return JSON with:
   - "claims": list of {{"claim": str, "supported": bool, "evidence": str}}
   - "faithfulness_score": fraction of supported claims (0.0 to 1.0)"""

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
        temperature=0,
    )
    return json.loads(response.choices[0].message.content)

# Example usage
result = score_faithfulness(
    question="What is the capital of France?",
    answer="Paris is the capital of France with a population of 2.1 million.",
    contexts=["Paris is the capital and most populous city of France."],
)
print(json.dumps(result, indent=2))
```
3. Agent Evaluation
Evaluating agents is fundamentally harder than evaluating simple question-answering systems because agents take multi-step actions with real-world side effects. A correct final answer does not mean the agent followed a safe or efficient path to get there. Agent evaluation therefore requires assessing multiple dimensions: task completion, tool selection accuracy, parameter correctness, trajectory efficiency, and safety.
Evaluation Dimensions for Agents
- Task completion: Did the agent achieve the stated goal?
- Tool accuracy: Did the agent select the correct tools for each step?
- Parameter correctness: Were tool calls made with valid and appropriate parameters?
- Trajectory efficiency: Did the agent take unnecessary steps or redundant tool calls?
- Safety: Did the agent avoid dangerous or unauthorized actions?
- Cost efficiency: How many LLM calls and tokens were consumed?
```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ToolCall:
    """A single tool call in an agent trajectory."""
    tool_name: str
    parameters: dict
    result: Optional[str] = None
    is_correct_tool: Optional[bool] = None

@dataclass
class AgentTrajectory:
    """Complete record of an agent's execution."""
    task: str
    tool_calls: list[ToolCall]
    final_answer: str
    total_tokens: int = 0
    total_latency_ms: float = 0

def evaluate_agent_trajectory(
    trajectory: AgentTrajectory,
    ideal_tools: list[str],
    expected_answer: str,
    answer_checker=None,
) -> dict:
    """Evaluate an agent trajectory against an ideal reference.

    Args:
        trajectory: The actual execution trajectory
        ideal_tools: Ordered list of expected tool names
        expected_answer: The ground truth answer
        answer_checker: Optional function for fuzzy answer matching
    """
    actual_tools = [tc.tool_name for tc in trajectory.tool_calls]

    # Task completion: did the agent get the right answer?
    if answer_checker:
        task_complete = answer_checker(trajectory.final_answer, expected_answer)
    else:
        task_complete = (
            trajectory.final_answer.strip().lower() == expected_answer.strip().lower()
        )

    # Tool accuracy: fraction of calls using correct tools
    correct_tools = sum(1 for t in actual_tools if t in ideal_tools)
    tool_accuracy = correct_tools / len(actual_tools) if actual_tools else 0

    # Trajectory efficiency: ideal steps / actual steps, capped at 1.0
    efficiency = min(len(ideal_tools) / len(actual_tools), 1.0) if actual_tools else 0

    # Redundancy: duplicate consecutive tool calls
    redundant = sum(
        1 for i in range(1, len(actual_tools))
        if actual_tools[i] == actual_tools[i - 1]
    )

    return {
        "task_completed": task_complete,
        "tool_accuracy": round(tool_accuracy, 3),
        "trajectory_efficiency": round(efficiency, 3),
        "num_steps": len(actual_tools),
        "ideal_steps": len(ideal_tools),
        "redundant_calls": redundant,
        "total_tokens": trajectory.total_tokens,
    }
```
Agent evaluation should weight task completion most heavily, since a correct answer through an inefficient path is better than an efficient trajectory with a wrong answer. However, trajectory quality matters for cost, latency, and safety. In production, an agent that consistently takes extra steps will cost more and may expose the system to more failure points. Evaluate both dimensions and set acceptable thresholds for each.
4. Evaluation Frameworks Comparison
| Framework | Focus | Key Metrics | Strengths | Limitations |
|---|---|---|---|---|
| Ragas | RAG evaluation | Faithfulness, relevancy, context precision/recall | Most comprehensive RAG metrics; good HF integration | Relies on LLM judge; can be slow |
| DeepEval | General LLM testing | Hallucination, bias, toxicity, custom metrics | pytest integration; CI/CD friendly | Less RAG-specific depth than Ragas |
| Phoenix (Arize) | Observability + eval | Trace-level metrics, embedding analysis | Visual UI; traces + evals combined | Heavier infrastructure requirement |
| TruLens | Feedback functions | Groundedness, relevance, custom feedback | Modular feedback system; provider-agnostic | Smaller community than alternatives |
| promptfoo | Prompt testing | Assertion-based, custom evals | CLI-first; fast iteration; CI/CD native | Less suited for complex agent evaluation |
```python
from deepeval import assert_test
from deepeval.test_case import LLMTestCase
from deepeval.metrics import (
    FaithfulnessMetric,
    AnswerRelevancyMetric,
    HallucinationMetric,
)

def test_rag_with_deepeval():
    """Example: Testing RAG output quality with DeepEval."""
    context_docs = [
        "Solar energy is a renewable source of power that reduces "
        "dependence on fossil fuels.",
        "Solar panels require minimal maintenance and can reduce "
        "electricity bills by up to 50%.",
    ]
    test_case = LLMTestCase(
        input="What are the benefits of solar energy?",
        actual_output="Solar energy is renewable, reduces electricity bills, "
        "and has low maintenance costs.",
        retrieval_context=context_docs,
        # HallucinationMetric evaluates against `context` rather than
        # `retrieval_context`, so supply both
        context=context_docs,
    )

    # Define metrics with thresholds
    faithfulness = FaithfulnessMetric(threshold=0.7, model="gpt-4o")
    relevancy = AnswerRelevancyMetric(threshold=0.7, model="gpt-4o")
    hallucination = HallucinationMetric(threshold=0.5, model="gpt-4o")

    # Assert all metrics pass (integrates with pytest)
    assert_test(test_case, [faithfulness, relevancy, hallucination])

# Run with: pytest test_rag.py -v
```
All framework-computed metrics that rely on LLM judges inherit the biases and limitations of the judge model. Faithfulness scores can be unreliable when the context is ambiguous or when the judge model hallucinates its own assessment. Always validate framework metrics against a set of human-annotated examples before trusting them for production decisions. Establish the correlation between the automated metric and human judgment on your specific data.
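Establishing that correlation can be as simple as computing Spearman's rank correlation between the automated scores and human labels. A dependency-free sketch with hypothetical data (in practice `scipy.stats.spearmanr` is the usual tool):

```python
import math

def _avg_ranks(values: list[float]) -> list[float]:
    """1-based ranks, averaging over ties (needed for binary human labels)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average 1-based rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(xs: list[float], ys: list[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    rx, ry = _avg_ranks(xs), _avg_ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)

# Hypothetical data: automated faithfulness scores vs. binary human labels
auto_scores = [0.9, 0.4, 0.7, 0.2, 0.8]
human_labels = [1, 0, 1, 0, 1]
print(round(spearman(auto_scores, human_labels), 3))  # 0.866
```

A correlation well below roughly 0.7 on your own annotated sample is a signal that the automated metric is not measuring what your annotators care about, and its thresholds should not gate deployments.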
Key Takeaways
- Evaluate RAG at component level, not just end-to-end. Use retrieval metrics (context precision, context recall) and generation metrics (faithfulness, answer relevancy) to isolate where failures occur and guide targeted improvements.
- Faithfulness is the most critical RAG metric. A RAG system that generates unfaithful answers (hallucinating beyond context) undermines the entire purpose of retrieval-augmented generation. Monitor faithfulness continuously in production.
- Agent evaluation requires trajectory analysis. Assess tool accuracy, parameter correctness, and trajectory efficiency alongside task completion. An agent that completes tasks through unsafe or wasteful paths is a production liability.
- Choose frameworks based on your primary need. Ragas excels at RAG-specific metrics, DeepEval integrates with testing pipelines, Phoenix combines observability with evaluation, and promptfoo enables rapid prompt iteration.
- Validate automated metrics against human judgment. LLM-based evaluation metrics inherit judge model biases. Always establish correlation with human annotations before trusting automated scores for deployment decisions.