Naive RAG performs a single retrieval step, but complex research questions require multiple rounds of searching, reading, reflecting, and refining. Agentic RAG systems give the LLM the ability to decide what to search for, evaluate whether retrieved results are sufficient, generate follow-up queries, and synthesize findings from multiple sources. This transforms RAG from a simple retrieve-and-generate pattern into an autonomous research workflow that can tackle multi-faceted questions requiring information from diverse sources: document stores, web search, databases, and APIs.
1. From Single-Shot to Iterative Retrieval
Consider the research question: "How do the climate policies of the top 5 GDP countries compare in their approach to carbon taxation, and what evidence exists for the effectiveness of each approach?" This question cannot be answered with a single retrieval step. It requires identifying the top 5 GDP countries, finding each country's climate policy, extracting carbon taxation details, finding effectiveness studies for each approach, and then synthesizing the comparison.
Agentic RAG addresses this by giving the LLM a loop: plan what information is needed, retrieve it, evaluate whether it is sufficient, and either proceed to synthesis or generate follow-up queries. This iterative approach mirrors how a human researcher would tackle such a question.
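The loop described above can be sketched as a small driver function. This is a minimal illustration, not a production implementation: the `plan`, `retrieve`, `evaluate`, and `synthesize` callables are hypothetical stand-ins that later sections back with real LLM calls and search APIs.

```python
def agentic_rag(question, plan, retrieve, evaluate, synthesize, max_rounds=3):
    """Run the plan-retrieve-evaluate loop until evidence is sufficient
    or the iteration budget is exhausted."""
    queries = plan(question)              # decompose into sub-queries
    evidence = []
    for _ in range(max_rounds):
        evidence.extend(retrieve(queries))
        verdict = evaluate(question, evidence)
        if verdict["sufficient"]:
            break
        queries = verdict["follow_up_queries"]  # refine and repeat
    return synthesize(question, evidence)
```

The `max_rounds` cap matters: without it, a strict evaluator could keep the loop running indefinitely, a failure mode discussed later in this section.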
1.1 Query Decomposition
The first step in agentic RAG is decomposing a complex query into smaller, answerable sub-queries. Each sub-query targets a specific piece of information needed to construct the final answer. The decomposition can be sequential (each sub-query depends on the previous answer) or parallel (sub-queries are independent and can be executed concurrently).
```python
from openai import OpenAI
import json

client = OpenAI()

def decompose_query(query):
    """Break a complex question into sub-queries."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "system",
            "content": """Decompose the user's research question into sub-queries.
Return JSON with:
- "sub_queries": list of specific, searchable questions
- "dependencies": dict mapping query index to indices it depends on
  (empty list if independent)
- "strategy": "parallel" or "sequential"
Keep sub-queries focused and searchable."""
        }, {
            "role": "user",
            "content": query
        }],
        response_format={"type": "json_object"}
    )
    return json.loads(response.choices[0].message.content)

# Example usage
plan = decompose_query(
    "How do carbon tax policies in the EU and US compare, "
    "and what evidence exists for their effectiveness?"
)
# Returns sub-queries like:
# 1. "What are the current carbon tax policies in the EU?"
# 2. "What are the current carbon tax policies in the US?"
# 3. "What studies evaluate EU carbon tax effectiveness?"
# 4. "What studies evaluate US carbon pricing effectiveness?"
```
2. Parallel Search and Multi-Source Retrieval
Once sub-queries are generated, an agentic RAG system can execute searches in parallel across multiple sources. Unlike naive RAG, which searches a single vector store, agentic RAG can simultaneously query document stores, web search APIs, databases, and specialized APIs, then combine results from all sources.
```python
import asyncio
from typing import List, Dict

async def search_web(query: str) -> List[Dict]:
    """Search the web using a search API."""
    # Implementation with Tavily, Serper, or Brave Search
    return []

async def search_documents(query: str, collection) -> List[Dict]:
    """Search internal document store."""
    results = collection.query(query_texts=[query], n_results=5)
    return [{"text": d, "source": "internal_docs"}
            for d in results["documents"][0]]

async def search_database(query: str) -> List[Dict]:
    """Convert query to SQL and search database."""
    # Text-to-SQL pipeline (covered in Section 19.5)
    return []

async def multi_source_search(sub_queries: List[str], collection):
    """Execute sub-queries in parallel across sources."""

    async def search_one(query):
        # Search all sources in parallel for each query;
        # return_exceptions=True keeps one failing source from
        # aborting the others
        web, docs, db = await asyncio.gather(
            search_web(query),
            search_documents(query, collection),
            search_database(query),
            return_exceptions=True
        )
        return {
            "query": query,
            "web": web if not isinstance(web, Exception) else [],
            "docs": docs if not isinstance(docs, Exception) else [],
            "db": db if not isinstance(db, Exception) else []
        }

    # Execute all sub-queries in parallel
    tasks = [search_one(q) for q in sub_queries]
    return await asyncio.gather(*tasks)
```
3. Iterative Refinement and Follow-Up Generation
After initial retrieval, the agent evaluates whether the gathered information is sufficient to answer the original question. If gaps remain, the agent generates follow-up queries targeting the missing information. This loop continues until the agent determines it has enough evidence or reaches a maximum iteration limit.
```python
def evaluate_and_refine(original_query, gathered_info, max_iterations=3):
    """Iteratively refine retrieval until sufficient."""
    for iteration in range(max_iterations):
        # Ask the LLM to evaluate sufficiency
        eval_response = client.chat.completions.create(
            model="gpt-4o",
            messages=[{
                "role": "system",
                "content": """Evaluate whether the gathered information is
sufficient to comprehensively answer the question. Return JSON with:
- "sufficient": true/false
- "missing": list of what information is still needed
- "follow_up_queries": list of queries to fill gaps
- "confidence": 0.0 to 1.0"""
            }, {
                "role": "user",
                "content": f"""Original question: {original_query}

Gathered information:
{json.dumps(gathered_info, indent=2)}"""
            }],
            response_format={"type": "json_object"}
        )
        evaluation = json.loads(eval_response.choices[0].message.content)

        if evaluation["sufficient"] or evaluation["confidence"] > 0.85:
            return gathered_info, evaluation

        # Execute follow-up queries (retrieve_for_queries wraps the
        # multi-source retrieval pipeline above)
        follow_ups = evaluation["follow_up_queries"]
        new_info = retrieve_for_queries(follow_ups)
        gathered_info.extend(new_info)

    return gathered_info, evaluation
```
Production agentic RAG systems must balance thoroughness with cost and latency. Each iteration involves LLM calls for evaluation and follow-up generation, plus retrieval costs. Common budget strategies include: (1) a hard iteration cap (typically 3 to 5 rounds), (2) a total token budget across all iterations, (3) diminishing returns detection (stop when new iterations add little new information), and (4) time budgets for latency-sensitive applications.
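Diminishing-returns detection (strategy 3) can be as simple as tracking how many documents in each round have not been seen before. The sketch below is an illustrative heuristic: it deduplicates on raw text, whereas production systems might hash normalized text or compare embeddings; the `min_novelty` threshold is an assumed tuning knob.

```python
def should_stop(seen: set, new_docs: list, min_novelty: float = 0.2) -> bool:
    """Return True when this round's novelty rate falls below the threshold.

    seen is mutated in place so the caller can reuse it across rounds.
    """
    if not new_docs:
        return True  # retrieval came back empty; iterating further is futile
    novel = [d for d in new_docs if d not in seen]
    seen.update(new_docs)
    return len(novel) / len(new_docs) < min_novelty
```

A driver loop would call `should_stop` after each retrieval round and break out early, combining this with the hard iteration cap rather than replacing it.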
4. Source Credibility Assessment
Not all retrieved sources are equally trustworthy. A critical feature of agentic RAG systems is the ability to assess source credibility and weight information accordingly. This is especially important when combining web search results (which may include misinformation) with curated internal documents.
4.1 Credibility Signals
- Source authority: Is the source a recognized authority in the domain? Academic papers, government agencies, and established organizations carry more weight than anonymous blogs.
- Recency: For time-sensitive topics, more recent sources are generally preferred. A 2024 policy document supersedes a 2019 version.
- Consistency: Claims corroborated by multiple independent sources are more reliable than claims from a single source.
- Specificity: Sources that provide specific data, citations, and methodology are more credible than those making vague claims.
- Bias indicators: Sources with obvious commercial interests, political slant, or advocacy goals should be flagged and their claims treated with additional scrutiny.
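One simple way to fold these signals into a single score is a weighted sum over boolean judgments, scaled to the 0-to-5 range used in the synthesis step below. The weights and the idea of reducing each signal to a boolean are illustrative assumptions; in practice each signal is often rated by an LLM or backed by a curated per-domain authority list.

```python
# Illustrative weights: authority counts most, bias the least.
SIGNAL_WEIGHTS = {
    "authority": 1.5,    # recognized domain authority?
    "recency": 1.0,      # recent enough for the topic?
    "consistency": 1.0,  # corroborated by independent sources?
    "specificity": 1.0,  # concrete data, citations, methodology?
    "low_bias": 0.5,     # free of obvious commercial/political slant?
}

def credibility_score(signals: dict) -> float:
    """Weighted sum of boolean signals, scaled to a 0-5 score."""
    total = sum(SIGNAL_WEIGHTS.values())
    raw = sum(w for name, w in SIGNAL_WEIGHTS.items() if signals.get(name))
    return round(5.0 * raw / total, 2)
```

A source satisfying every signal scores 5.0; an anonymous blog post with no corroboration scores near zero and gets correspondingly little weight at synthesis time.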
5. Synthesis and Report Generation
The final stage of agentic RAG synthesizes all gathered information into a coherent, cited response. For complex research questions, this often means generating a structured report with sections, findings from multiple sources, points of agreement and disagreement, and explicit citations.
```python
def synthesize_research(original_query, gathered_info, credibility_scores):
    """Synthesize gathered information into a research report."""
    # Sort by credibility, place highest-trust sources first
    sorted_info = sorted(
        zip(gathered_info, credibility_scores),
        key=lambda x: x[1],
        reverse=True
    )

    context = "\n\n".join([
        f"[Source {i+1} | Credibility: {score:.1f}/5]\n{info['text']}"
        for i, (info, score) in enumerate(sorted_info)
    ])

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "system",
            "content": """You are a research analyst. Synthesize the provided
sources into a comprehensive answer.

Guidelines:
- Cite sources by number [Source N]
- Note where sources agree or disagree
- Flag claims supported by only one source
- Prioritize high-credibility sources
- Acknowledge gaps in the evidence
- Structure the answer with clear sections"""
        }, {
            "role": "user",
            "content": f"""Research Question: {original_query}

Sources:
{context}"""
        }],
        temperature=0.2
    )
    return response.choices[0].message.content
```
The most effective synthesis prompts instruct the LLM to handle source disagreement explicitly rather than silently picking one version. When two high-credibility sources contradict each other, the system should present both perspectives with their supporting evidence and let the reader decide. This "epistemic honesty" approach builds far more user trust than confidently presenting a single answer that papers over genuine uncertainty.
6. Deep Research Architectures
Several production systems have implemented deep research capabilities that go well beyond simple agentic RAG. These systems typically combine query planning, multi-source retrieval, iterative refinement, and long-form synthesis into a unified workflow.
6.1 Architecture Comparison
| Feature | Naive RAG | Agentic RAG | Deep Research |
|---|---|---|---|
| Retrieval steps | 1 | 2 to 5 | 10+ |
| Sources | Single vector store | Multiple stores | Web + docs + DB + APIs |
| Query planning | None | Decomposition | Hierarchical plan tree |
| Self-evaluation | None | Sufficiency check | Multi-criteria assessment |
| Output format | Short answer | Cited answer | Structured report |
| Typical latency | 2 to 5 seconds | 10 to 30 seconds | 1 to 10 minutes |
| Cost per query | $0.01 to $0.05 | $0.05 to $0.50 | $0.50 to $5.00 |
Agentic RAG introduces new failure modes beyond those of naive RAG. Query drift occurs when follow-up queries gradually shift away from the original question, retrieving increasingly irrelevant information. Infinite loops occur when the agent never reaches a "sufficient" evaluation. Conflation occurs when the agent mixes information from different sub-queries, creating false associations. Mitigate these with hard iteration limits, query relevance checks against the original question, and explicit source tracking throughout the pipeline.
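The query relevance check mentioned above can start as a cheap lexical filter before follow-up queries are executed. This sketch uses token overlap as an illustrative heuristic; production systems typically use embedding cosine similarity against the original question instead, and the `min_overlap` threshold is an assumed tuning parameter.

```python
def filter_drifting_queries(original: str, follow_ups: list,
                            min_overlap: float = 0.2) -> list:
    """Drop follow-up queries that share too little vocabulary with
    the original question, a cheap guard against query drift."""
    orig_tokens = set(original.lower().split())
    kept = []
    for q in follow_ups:
        q_tokens = set(q.lower().split())
        overlap = len(orig_tokens & q_tokens) / max(len(q_tokens), 1)
        if overlap >= min_overlap:
            kept.append(q)
    return kept
```

Running every candidate follow-up through a filter like this each iteration keeps the loop anchored to the original question even as sub-queries get more specific.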
Section 19.4 Quiz
Key Takeaways
- Agentic RAG transforms retrieval into research: By giving the LLM a plan-retrieve-evaluate loop, complex multi-faceted questions become tractable through iterative decomposition and refinement.
- Query decomposition is the foundation: Breaking complex questions into focused sub-queries (parallel or sequential) enables targeted retrieval and prevents the system from missing critical pieces of information.
- Multi-source retrieval combines complementary strengths: Web search provides breadth and recency; document stores provide curated depth; databases provide structured data. Searching all three in parallel yields the most comprehensive results.
- Source credibility prevents misinformation amplification: Assessing authority, recency, consistency, specificity, and bias before synthesis ensures the final answer is grounded in trustworthy evidence.
- Budget your agent carefully: Each iteration costs money and time. Use hard iteration limits, diminishing returns detection, and relevance checks to prevent runaway costs and query drift.