Naive RAG performs a single retrieval step, but complex research questions require multiple rounds of searching, reading, reflecting, and refining. Agentic RAG systems give the LLM the ability to decide what to search for, evaluate whether retrieved results are sufficient, generate follow-up queries, and synthesize findings from multiple sources. This transforms RAG from a simple retrieve-and-generate pattern into an autonomous research workflow that can tackle multi-faceted questions requiring information from diverse sources: document stores, web search, databases, and APIs.
1. From Single-Shot to Iterative Retrieval
Consider the research question: "How do the climate policies of the top 5 GDP countries compare in their approach to carbon taxation, and what evidence exists for the effectiveness of each approach?" This question cannot be answered with a single retrieval step. It requires identifying the top 5 GDP countries, finding each country's climate policy, extracting carbon taxation details, finding effectiveness studies for each approach, and then synthesizing the comparison.
Agentic RAG addresses this by giving the LLM a loop: plan what information is needed, retrieve it, evaluate whether it is sufficient, and either proceed to synthesis or generate follow-up queries. This iterative approach mirrors how a human researcher would tackle such a question.
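The loop described above can be sketched as a small driver function. This is a minimal illustration, not a production implementation: the `plan`, `retrieve`, `evaluate`, and `synthesize` callables are hypothetical stand-ins that later sections back with real LLM calls and search APIs.

```python
def agentic_rag(question, plan, retrieve, evaluate, synthesize, max_rounds=3):
    """Run the plan-retrieve-evaluate loop until evidence is sufficient
    or the iteration budget is exhausted."""
    queries = plan(question)              # decompose into sub-queries
    evidence = []
    for _ in range(max_rounds):
        evidence.extend(retrieve(queries))
        verdict = evaluate(question, evidence)
        if verdict["sufficient"]:
            break
        queries = verdict["follow_up_queries"]  # refine and repeat
    return synthesize(question, evidence)
```

The `max_rounds` cap matters: without it, a strict evaluator could keep the loop running indefinitely, a failure mode discussed later in this section.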
1.1 Query Decomposition
The first step in agentic RAG is decomposing a complex query into smaller, answerable sub-queries. Each sub-query targets a specific piece of information needed to construct the final answer. The decomposition can be sequential (each sub-query depends on the previous answer) or parallel (sub-queries are independent and can be executed concurrently).
```python
from openai import OpenAI
import json

client = OpenAI()

def decompose_query(query):
    """Break a complex question into sub-queries."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "system",
            "content": """Decompose the user's research question into sub-queries.
Return JSON with:
- "sub_queries": list of specific, searchable questions
- "dependencies": dict mapping query index to indices it depends on
  (empty list if independent)
- "strategy": "parallel" or "sequential"
Keep sub-queries focused and searchable."""
        }, {
            "role": "user",
            "content": query
        }],
        response_format={"type": "json_object"}
    )
    return json.loads(response.choices[0].message.content)

# Example usage
plan = decompose_query(
    "How do carbon tax policies in the EU and US compare, "
    "and what evidence exists for their effectiveness?"
)
# Returns sub-queries like:
# 1. "What are the current carbon tax policies in the EU?"
# 2. "What are the current carbon tax policies in the US?"
# 3. "What studies evaluate EU carbon tax effectiveness?"
# 4. "What studies evaluate US carbon pricing effectiveness?"
```
2. Parallel Search and Multi-Source Retrieval
Once sub-queries are generated, an agentic RAG system can execute searches in parallel across multiple sources. Unlike naive RAG, which searches a single vector store, agentic RAG can simultaneously query document stores, web search APIs, databases, and specialized APIs, then combine results from all sources.
```python
import asyncio
from typing import List, Dict

async def search_web(query: str) -> List[Dict]:
    """Search the web using a search API."""
    # Implementation with Tavily, Serper, or Brave Search
    return []

async def search_documents(query: str, collection) -> List[Dict]:
    """Search internal document store."""
    results = collection.query(query_texts=[query], n_results=5)
    return [{"text": d, "source": "internal_docs"}
            for d in results["documents"][0]]

async def search_database(query: str) -> List[Dict]:
    """Convert query to SQL and search database."""
    # Text-to-SQL pipeline (covered in Section 19.5)
    return []

async def multi_source_search(sub_queries: List[str], collection):
    """Execute sub-queries in parallel across sources."""

    async def search_one(query):
        # Search all sources in parallel for each query;
        # return_exceptions=True keeps one failing source from
        # aborting the others
        web, docs, db = await asyncio.gather(
            search_web(query),
            search_documents(query, collection),
            search_database(query),
            return_exceptions=True
        )
        return {
            "query": query,
            "web": web if not isinstance(web, Exception) else [],
            "docs": docs if not isinstance(docs, Exception) else [],
            "db": db if not isinstance(db, Exception) else []
        }

    # Execute all sub-queries in parallel
    tasks = [search_one(q) for q in sub_queries]
    return await asyncio.gather(*tasks)
```
3. Iterative Refinement and Follow-Up Generation
After initial retrieval, the agent evaluates whether the gathered information is sufficient to answer the original question. If gaps remain, the agent generates follow-up queries targeting the missing information. This loop continues until the agent determines it has enough evidence or reaches a maximum iteration limit.
```python
def evaluate_and_refine(original_query, gathered_info, max_iterations=3):
    """Iteratively refine retrieval until sufficient."""
    for iteration in range(max_iterations):
        # Ask the LLM to evaluate sufficiency
        eval_response = client.chat.completions.create(
            model="gpt-4o",
            messages=[{
                "role": "system",
                "content": """Evaluate whether the gathered information is
sufficient to comprehensively answer the question. Return JSON with:
- "sufficient": true/false
- "missing": list of what information is still needed
- "follow_up_queries": list of queries to fill gaps
- "confidence": 0.0 to 1.0"""
            }, {
                "role": "user",
                "content": f"""Original question: {original_query}

Gathered information:
{json.dumps(gathered_info, indent=2)}"""
            }],
            response_format={"type": "json_object"}
        )
        evaluation = json.loads(eval_response.choices[0].message.content)

        if evaluation["sufficient"] or evaluation["confidence"] > 0.85:
            return gathered_info, evaluation

        # Execute follow-up queries (retrieve_for_queries wraps the
        # multi-source retrieval pipeline above)
        follow_ups = evaluation["follow_up_queries"]
        new_info = retrieve_for_queries(follow_ups)
        gathered_info.extend(new_info)

    return gathered_info, evaluation
```
Production agentic RAG systems must balance thoroughness with cost and latency. Each iteration involves LLM calls for evaluation and follow-up generation, plus retrieval costs. Common budget strategies include: (1) a hard iteration cap (typically 3 to 5 rounds), (2) a total token budget across all iterations, (3) diminishing returns detection (stop when new iterations add little new information), and (4) time budgets for latency-sensitive applications.
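Diminishing-returns detection (strategy 3) can be as simple as tracking how many documents in each round have not been seen before. The sketch below is an illustrative heuristic: it deduplicates on raw text, whereas production systems might hash normalized text or compare embeddings; the `min_novelty` threshold is an assumed tuning knob.

```python
def should_stop(seen: set, new_docs: list, min_novelty: float = 0.2) -> bool:
    """Return True when this round's novelty rate falls below the threshold.

    seen is mutated in place so the caller can reuse it across rounds.
    """
    if not new_docs:
        return True  # retrieval came back empty; iterating further is futile
    novel = [d for d in new_docs if d not in seen]
    seen.update(new_docs)
    return len(novel) / len(new_docs) < min_novelty
```

A driver loop would call `should_stop` after each retrieval round and break out early, combining this with the hard iteration cap rather than replacing it.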
4. Source Credibility Assessment
Not all retrieved sources are equally trustworthy. A critical feature of agentic RAG systems is the ability to assess source credibility and weight information accordingly. This is especially important when combining web search results (which may include misinformation) with curated internal documents.
4.1 Credibility Signals
- Source authority: Is the source a recognized authority in the domain? Academic papers, government agencies, and established organizations carry more weight than anonymous blogs.
- Recency: For time-sensitive topics, more recent sources are generally preferred. A 2024 policy document supersedes a 2019 version.
- Consistency: Claims corroborated by multiple independent sources are more reliable than claims from a single source.
- Specificity: Sources that provide specific data, citations, and methodology are more credible than those making vague claims.
- Bias indicators: Sources with obvious commercial interests, political slant, or advocacy goals should be flagged and their claims treated with additional scrutiny.
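One simple way to fold these signals into a single score is a weighted sum over boolean judgments, scaled to the 0-to-5 range used in the synthesis step below. The weights and the idea of reducing each signal to a boolean are illustrative assumptions; in practice each signal is often rated by an LLM or backed by a curated per-domain authority list.

```python
# Illustrative weights: authority counts most, bias the least.
SIGNAL_WEIGHTS = {
    "authority": 1.5,    # recognized domain authority?
    "recency": 1.0,      # recent enough for the topic?
    "consistency": 1.0,  # corroborated by independent sources?
    "specificity": 1.0,  # concrete data, citations, methodology?
    "low_bias": 0.5,     # free of obvious commercial/political slant?
}

def credibility_score(signals: dict) -> float:
    """Weighted sum of boolean signals, scaled to a 0-5 score."""
    total = sum(SIGNAL_WEIGHTS.values())
    raw = sum(w for name, w in SIGNAL_WEIGHTS.items() if signals.get(name))
    return round(5.0 * raw / total, 2)
```

A source satisfying every signal scores 5.0; an anonymous blog post with no corroboration scores near zero and gets correspondingly little weight at synthesis time.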
5. Synthesis and Report Generation
The final stage of agentic RAG synthesizes all gathered information into a coherent, cited response. For complex research questions, this often means generating a structured report with sections, findings from multiple sources, points of agreement and disagreement, and explicit citations.
```python
def synthesize_research(original_query, gathered_info, credibility_scores):
    """Synthesize gathered information into a research report."""
    # Sort by credibility, place highest-trust sources first
    sorted_info = sorted(
        zip(gathered_info, credibility_scores),
        key=lambda x: x[1],
        reverse=True
    )

    context = "\n\n".join([
        f"[Source {i+1} | Credibility: {score:.1f}/5]\n{info['text']}"
        for i, (info, score) in enumerate(sorted_info)
    ])

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "system",
            "content": """You are a research analyst. Synthesize the provided
sources into a comprehensive answer.

Guidelines:
- Cite sources by number [Source N]
- Note where sources agree or disagree
- Flag claims supported by only one source
- Prioritize high-credibility sources
- Acknowledge gaps in the evidence
- Structure the answer with clear sections"""
        }, {
            "role": "user",
            "content": f"""Research Question: {original_query}

Sources:
{context}"""
        }],
        temperature=0.2
    )
    return response.choices[0].message.content
```
The most effective synthesis prompts instruct the LLM to handle source disagreement explicitly rather than silently picking one version. When two high-credibility sources contradict each other, the system should present both perspectives with their supporting evidence and let the reader decide. This "epistemic honesty" approach builds far more user trust than confidently presenting a single answer that papers over genuine uncertainty.
6. Deep Research Architectures
Several production systems have implemented deep research capabilities that go well beyond simple agentic RAG. These systems typically combine query planning, multi-source retrieval, iterative refinement, and long-form synthesis into a unified workflow.
6.1 Architecture Comparison
| Feature | Naive RAG | Agentic RAG | Deep Research |
|---|---|---|---|
| Retrieval steps | 1 | 2 to 5 | 10+ |
| Sources | Single vector store | Multiple stores | Web + docs + DB + APIs |
| Query planning | None | Decomposition | Hierarchical plan tree |
| Self-evaluation | None | Sufficiency check | Multi-criteria assessment |
| Output format | Short answer | Cited answer | Structured report |
| Typical latency | 2 to 5 seconds | 10 to 30 seconds | 1 to 10 minutes |
| Cost per query | $0.01 to $0.05 | $0.05 to $0.50 | $0.50 to $5.00 |
Agentic RAG introduces new failure modes beyond those of naive RAG. Query drift occurs when follow-up queries gradually shift away from the original question, retrieving increasingly irrelevant information. Infinite loops occur when the agent never reaches a "sufficient" evaluation. Conflation occurs when the agent mixes information from different sub-queries, creating false associations. Mitigate these with hard iteration limits, query relevance checks against the original question, and explicit source tracking throughout the pipeline.
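The query relevance check mentioned above can start as a cheap lexical filter before follow-up queries are executed. This sketch uses token overlap as an illustrative heuristic; production systems typically use embedding cosine similarity against the original question instead, and the `min_overlap` threshold is an assumed tuning parameter.

```python
def filter_drifting_queries(original: str, follow_ups: list,
                            min_overlap: float = 0.2) -> list:
    """Drop follow-up queries that share too little vocabulary with
    the original question, a cheap guard against query drift."""
    orig_tokens = set(original.lower().split())
    kept = []
    for q in follow_ups:
        q_tokens = set(q.lower().split())
        overlap = len(orig_tokens & q_tokens) / max(len(q_tokens), 1)
        if overlap >= min_overlap:
            kept.append(q)
    return kept
```

Running every candidate follow-up through a filter like this each iteration keeps the loop anchored to the original question even as sub-queries get more specific.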
Section 19.4 Quiz
Key Takeaways
- Agentic RAG transforms retrieval into research: By giving the LLM a plan-retrieve-evaluate loop, complex multi-faceted questions become tractable through iterative decomposition and refinement.
- Query decomposition is the foundation: Breaking complex questions into focused sub-queries (parallel or sequential) enables targeted retrieval and prevents the system from missing critical pieces of information.
- Multi-source retrieval combines complementary strengths: Web search provides breadth and recency; document stores provide curated depth; databases provide structured data. Searching all three in parallel yields the most comprehensive results.
- Source credibility prevents misinformation amplification: Assessing authority, recency, consistency, specificity, and bias before synthesis ensures the final answer is grounded in trustworthy evidence.
- Budget your agent carefully: Each iteration costs money and time. Use hard iteration limits, diminishing returns detection, and relevance checks to prevent runaway costs and query drift.