Module 19 · Section 19.4

Deep Research & Agentic RAG

Building autonomous research agents that decompose queries, search iteratively, synthesize findings, and assess source credibility
★ Big Picture

Naive RAG performs a single retrieval step, but complex research questions require multiple rounds of searching, reading, reflecting, and refining. Agentic RAG systems give the LLM the ability to decide what to search for, evaluate whether retrieved results are sufficient, generate follow-up queries, and synthesize findings from multiple sources. This transforms RAG from a simple retrieve-and-generate pattern into an autonomous research workflow that can tackle multi-faceted questions requiring information from diverse sources: document stores, web search, databases, and APIs.

1. From Single-Shot to Iterative Retrieval

Consider the research question: "How do the climate policies of the top 5 GDP countries compare in their approach to carbon taxation, and what evidence exists for the effectiveness of each approach?" This question cannot be answered with a single retrieval step. It requires identifying the top 5 GDP countries, finding each country's climate policy, extracting carbon taxation details, finding effectiveness studies for each approach, and then synthesizing the comparison.

Agentic RAG addresses this by giving the LLM a loop: plan what information is needed, retrieve it, evaluate whether it is sufficient, and either proceed to synthesis or generate follow-up queries. This iterative approach mirrors how a human researcher would tackle such a question.

1.1 Query Decomposition

The first step in agentic RAG is decomposing a complex query into smaller, answerable sub-queries. Each sub-query targets a specific piece of information needed to construct the final answer. The decomposition can be sequential (each sub-query depends on the previous answer) or parallel (sub-queries are independent and can be executed concurrently).

from openai import OpenAI
import json

client = OpenAI()

def decompose_query(query):
    """Break a complex question into sub-queries."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "system",
            "content": """Decompose the user's research question into
sub-queries. Return JSON with:
- "sub_queries": list of specific, searchable questions
- "dependencies": dict mapping query index to indices
  it depends on (empty list if independent)
- "strategy": "parallel" or "sequential"

Keep sub-queries focused and searchable."""
        }, {
            "role": "user",
            "content": query
        }],
        response_format={"type": "json_object"}
    )
    return json.loads(response.choices[0].message.content)


# Example usage
plan = decompose_query(
    "How do carbon tax policies in the EU and US compare, "
    "and what evidence exists for their effectiveness?"
)
# Returns sub-queries like:
# 1. "What are the current carbon tax policies in the EU?"
# 2. "What are the current carbon tax policies in the US?"
# 3. "What studies evaluate EU carbon tax effectiveness?"
# 4. "What studies evaluate US carbon pricing effectiveness?"
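The "dependencies" map in the returned plan determines execution order. As a sketch of the scheduling logic alone (no API calls; note that real JSON output returns the dependency keys as strings, which would need converting to int), sub-queries can be grouped into waves, where every query in a wave can run concurrently:

```python
from typing import Dict, List

def execution_waves(dependencies: Dict[int, List[int]]) -> List[List[int]]:
    """Group sub-query indices into waves: each wave contains only
    queries whose dependencies were all satisfied by earlier waves."""
    remaining = set(dependencies)
    done: set = set()
    waves = []
    while remaining:
        # Ready = every dependency already completed
        ready = [q for q in remaining
                 if all(d in done for d in dependencies[q])]
        if not ready:
            raise ValueError("Cyclic dependencies in query plan")
        waves.append(sorted(ready))
        done.update(ready)
        remaining.difference_update(ready)
    return waves

# "Who are the top 5 GDP countries?" (0) must resolve before the
# per-country queries (1, 2); the comparison query (3) needs both.
plan = {0: [], 1: [0], 2: [0], 3: [1, 2]}
print(execution_waves(plan))  # [[0], [1, 2], [3]]
```

Queries within a wave map directly onto the parallel multi-source search in the next section; waves themselves run sequentially.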
Figure 19.9: Agentic RAG iterates through decomposition, multi-source retrieval, and sufficiency evaluation before synthesizing a final answer.

2. Parallel Search and Multi-Source Retrieval

Once sub-queries are generated, an agentic RAG system can execute searches in parallel across multiple sources. Unlike naive RAG, which searches a single vector store, agentic RAG can simultaneously query document stores, web search APIs, databases, and specialized APIs, then combine results from all sources.

import asyncio
from typing import List, Dict

async def search_web(query: str) -> List[Dict]:
    """Search the web using a search API."""
    # Implementation with Tavily, Serper, or Brave Search
    return []  # placeholder: callers expect a list, not None

async def search_documents(query: str, collection) -> List[Dict]:
    """Search internal document store."""
    results = collection.query(query_texts=[query], n_results=5)
    return [{"text": d, "source": "internal_docs"}
            for d in results["documents"][0]]

async def search_database(query: str) -> List[Dict]:
    """Convert query to SQL and search database."""
    # Text-to-SQL pipeline (covered in Section 19.5)
    return []  # placeholder: callers expect a list, not None

async def multi_source_search(sub_queries: List[str], collection):
    """Execute sub-queries in parallel across sources."""
    all_results = {}

    async def search_one(query):
        # Search all sources in parallel for each query
        web, docs, db = await asyncio.gather(
            search_web(query),
            search_documents(query, collection),
            search_database(query),
            return_exceptions=True
        )
        return {
            "query": query,
            "web": web if not isinstance(web, Exception) else [],
            "docs": docs if not isinstance(docs, Exception) else [],
            "db": db if not isinstance(db, Exception) else []
        }

    # Execute all sub-queries in parallel
    tasks = [search_one(q) for q in sub_queries]
    results = await asyncio.gather(*tasks)
    return results

3. Iterative Refinement and Follow-Up Generation

After initial retrieval, the agent evaluates whether the gathered information is sufficient to answer the original question. If gaps remain, the agent generates follow-up queries targeting the missing information. This loop continues until the agent determines it has enough evidence or reaches a maximum iteration limit.

def evaluate_and_refine(original_query, gathered_info, max_iterations=3):
    """Iteratively refine retrieval until sufficient."""

    for iteration in range(max_iterations):
        # Ask the LLM to evaluate sufficiency
        eval_response = client.chat.completions.create(
            model="gpt-4o",
            messages=[{
                "role": "system",
                "content": """Evaluate whether the gathered information is
sufficient to comprehensively answer the question.
Return JSON with:
- "sufficient": true/false
- "missing": list of what information is still needed
- "follow_up_queries": list of queries to fill gaps
- "confidence": 0.0 to 1.0"""
            }, {
                "role": "user",
                "content": f"""Original question: {original_query}

Gathered information:
{json.dumps(gathered_info, indent=2)}"""
            }],
            response_format={"type": "json_object"}
        )

        evaluation = json.loads(
            eval_response.choices[0].message.content
        )

        if evaluation["sufficient"] or evaluation["confidence"] > 0.85:
            return gathered_info, evaluation

        # Execute follow-up queries (retrieve_for_queries wraps the
        # multi-source search from Section 2)
        follow_ups = evaluation["follow_up_queries"]
        new_info = retrieve_for_queries(follow_ups)
        gathered_info.extend(new_info)

    return gathered_info, evaluation
ⓘ Iteration Budgets

Production agentic RAG systems must balance thoroughness with cost and latency. Each iteration involves LLM calls for evaluation and follow-up generation, plus retrieval costs. Common budget strategies include: (1) a hard iteration cap (typically 3 to 5 rounds), (2) a total token budget across all iterations, (3) diminishing returns detection (stop when new iterations add little new information), and (4) time budgets for latency-sensitive applications.
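The budget strategies above can be combined in one small helper. This sketch (a hypothetical class, not part of any library; the limits and the 0.2 novelty floor are illustrative assumptions) stops on an iteration cap, a token or time budget, or diminishing returns, measured as the fraction of genuinely new sources per round:

```python
import time

class ResearchBudget:
    """Stop iteration on a hard cap, token budget, time budget,
    or diminishing returns (too few new sources per round)."""

    def __init__(self, max_iterations=5, max_tokens=50_000,
                 max_seconds=120, min_novelty=0.2):
        self.max_iterations = max_iterations
        self.max_tokens = max_tokens
        self.max_seconds = max_seconds
        self.min_novelty = min_novelty
        self.iterations = 0
        self.tokens_used = 0
        self.start = time.monotonic()
        self.seen_sources: set = set()

    def record(self, tokens: int, source_ids: list) -> float:
        """Record one round; return the novelty ratio
        (fraction of sources not seen in earlier rounds)."""
        self.iterations += 1
        self.tokens_used += tokens
        new = [s for s in source_ids if s not in self.seen_sources]
        self.seen_sources.update(source_ids)
        return len(new) / len(source_ids) if source_ids else 0.0

    def exhausted(self, last_novelty: float) -> bool:
        return (self.iterations >= self.max_iterations
                or self.tokens_used >= self.max_tokens
                or time.monotonic() - self.start >= self.max_seconds
                or last_novelty < self.min_novelty)

budget = ResearchBudget(max_iterations=3)
novelty = budget.record(tokens=1_200, source_ids=["a", "b", "c"])
print(budget.exhausted(novelty))   # False: under every limit
novelty = budget.record(tokens=1_100, source_ids=["a", "b", "c"])
print(budget.exhausted(novelty))   # True: nothing new was retrieved
```

A loop like evaluate_and_refine above would call record() after each retrieval round and break as soon as exhausted() returns True.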

4. Source Credibility Assessment

Not all retrieved sources are equally trustworthy. A critical feature of agentic RAG systems is the ability to assess source credibility and weight information accordingly. This is especially important when combining web search results (which may include misinformation) with curated internal documents.

4.1 Credibility Signals

Figure 19.10: Source credibility assessment weights retrieved information by authority and reliability before passing it to the LLM.
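Typical signals include source authority, recency, consistency with independent sources, specificity of claims, and bias indicators. Production systems usually ask an LLM to assess these; as an offline sketch (the tier weights, decay rates, and source schema here are hypothetical, illustration only), a heuristic scorer over simple source metadata might look like:

```python
from datetime import date

# Hypothetical authority tiers -- tune for your domain.
AUTHORITY = {"peer_reviewed": 5.0, "government": 4.5,
             "news": 3.0, "blog": 1.5, "social": 1.0}

def credibility_score(source: dict, today: date = date(2025, 1, 1)) -> float:
    """Score a source 0-5 from authority, recency, and corroboration."""
    authority = AUTHORITY.get(source["kind"], 1.0)

    # Recency: full weight within a year, then gradual decay
    age_years = (today - source["published"]).days / 365
    recency = max(0.0, 1.0 - 0.15 * max(0.0, age_years - 1))

    # Consistency: boosted when independent sources corroborate
    corroboration = min(1.0, 0.5 + 0.25 * source["corroborating_sources"])

    return round(authority * recency * corroboration, 2)

paper = {"kind": "peer_reviewed", "published": date(2023, 6, 1),
         "corroborating_sources": 2}
post = {"kind": "social", "published": date(2024, 11, 1),
        "corroborating_sources": 0}
print(credibility_score(paper) > credibility_score(post))  # True
```

Scores like these feed directly into the credibility-sorted synthesis in the next section.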

5. Synthesis and Report Generation

The final stage of agentic RAG synthesizes all gathered information into a coherent, cited response. For complex research questions, this often means generating a structured report with sections, findings from multiple sources, points of agreement and disagreement, and explicit citations.

def synthesize_research(original_query, gathered_info, credibility_scores):
    """Synthesize gathered information into a research report."""

    # Sort by credibility, place highest-trust sources first
    sorted_info = sorted(
        zip(gathered_info, credibility_scores),
        key=lambda x: x[1],
        reverse=True
    )

    context = "\n\n".join([
        f"[Source {i+1} | Credibility: {score:.1f}/5]\n{info['text']}"
        for i, (info, score) in enumerate(sorted_info)
    ])

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "system",
            "content": """You are a research analyst. Synthesize the
provided sources into a comprehensive answer.

Guidelines:
- Cite sources by number [Source N]
- Note where sources agree or disagree
- Flag claims supported by only one source
- Prioritize high-credibility sources
- Acknowledge gaps in the evidence
- Structure the answer with clear sections"""
        }, {
            "role": "user",
            "content": f"""Research Question: {original_query}

Sources:
{context}"""
        }],
        temperature=0.2
    )

    return response.choices[0].message.content
★ Key Insight

The most effective synthesis prompts instruct the LLM to handle source disagreement explicitly rather than silently picking one version. When two high-credibility sources contradict each other, the system should present both perspectives with their supporting evidence and let the reader decide. This "epistemic honesty" approach builds far more user trust than confidently presenting a single answer that papers over genuine uncertainty.

6. Deep Research Architectures

Several production systems have implemented deep research capabilities that go well beyond simple agentic RAG. These systems typically combine query planning, multi-source retrieval, iterative refinement, and long-form synthesis into a unified workflow.

6.1 Architecture Comparison

| Feature | Naive RAG | Agentic RAG | Deep Research |
|---|---|---|---|
| Retrieval steps | 1 | 2 to 5 | 10+ |
| Sources | Single vector store | Multiple stores | Web + docs + DB + APIs |
| Query planning | None | Decomposition | Hierarchical plan tree |
| Self-evaluation | None | Sufficiency check | Multi-criteria assessment |
| Output format | Short answer | Cited answer | Structured report |
| Typical latency | 2 to 5 seconds | 10 to 30 seconds | 1 to 10 minutes |
| Cost per query | $0.01 to $0.05 | $0.05 to $0.50 | $0.50 to $5.00 |
⚠ Agentic RAG Failure Modes

Agentic RAG introduces new failure modes beyond those of naive RAG. Query drift occurs when follow-up queries gradually shift away from the original question, retrieving increasingly irrelevant information. Infinite loops occur when the agent never reaches a "sufficient" evaluation. Conflation occurs when the agent mixes information from different sub-queries, creating false associations. Mitigate these with hard iteration limits, query relevance checks against the original question, and explicit source tracking throughout the pipeline.
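The query relevance check for drift can be as simple as comparing each follow-up query against the original question before executing it. Production systems typically use embedding cosine similarity; this dependency-free sketch substitutes token-set overlap (Jaccard similarity), and the 0.15 threshold and stopword list are illustrative assumptions:

```python
import re

# Minimal stopword list for illustration only
STOP = {"the", "and", "how", "what", "does", "for", "are",
        "was", "with", "that", "this", "from", "into"}

def token_set(text: str) -> set:
    """Lowercase word tokens, minus stopwords and 1-2 letter words."""
    return {t for t in re.findall(r"[a-z0-9]+", text.lower())
            if len(t) > 2 and t not in STOP}

def drift_check(original: str, follow_up: str,
                threshold: float = 0.15) -> bool:
    """Return True if the follow-up query is still on-topic.
    Jaccard overlap stands in for embedding similarity here."""
    a, b = token_set(original), token_set(follow_up)
    if not a or not b:
        return False
    jaccard = len(a & b) / len(a | b)
    return jaccard >= threshold

original = "How do carbon tax policies in the EU and US compare?"
print(drift_check(original, "What carbon tax rate does the EU charge?"))
# True: shares "carbon", "tax" with the original question
print(drift_check(original, "History of medieval European trade guilds"))
# False: no meaningful overlap -- drop or rewrite this follow-up
```

Follow-up queries that fail the check are discarded or regenerated with the original question re-included in the prompt.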

Figure 19.11: Deep research pipelines chain planning, gathering, verification, refinement, and synthesis into a multi-phase workflow with iteration.

Section 19.4 Quiz

1. Why does a complex research question require iterative retrieval rather than a single retrieval step?
Show Answer
Complex questions often require multiple pieces of information that cannot be captured by a single query. They may need sequential resolution (the answer to one sub-question is needed to formulate the next), information from different source types (web, documents, databases), and verification through cross-referencing. A single retrieval step would miss these dependencies and return incomplete or tangentially relevant results.
2. What is the difference between parallel and sequential query decomposition?
Show Answer
Parallel decomposition produces independent sub-queries that can all be executed simultaneously (e.g., "What is country A's policy?" and "What is country B's policy?"). Sequential decomposition produces dependent sub-queries where each depends on the answer to the previous one (e.g., "Who are the top 5 GDP countries?" must be answered before "What is each country's carbon tax rate?"). Many complex queries involve a mix of both patterns.
3. What credibility signals should an agentic RAG system consider when evaluating sources?
Show Answer
Five key credibility signals: (1) Source authority (recognized expertise in the domain), (2) Recency (newer information for time-sensitive topics), (3) Consistency (corroboration across multiple independent sources), (4) Specificity (concrete data and citations vs. vague claims), and (5) Bias indicators (commercial interests, political slant, or advocacy motivations that might skew the information).
4. What is query drift and how can it be mitigated in agentic RAG?
Show Answer
Query drift occurs when follow-up queries gradually shift away from the original question, retrieving increasingly irrelevant information with each iteration. It happens because follow-up queries are generated based on recent context rather than the original question. Mitigation strategies include: always comparing follow-up queries against the original question for relevance, including the original question in every refinement prompt, setting hard iteration limits, and implementing a relevance score that measures how related each new result is to the original query.
5. How should a synthesis prompt handle conflicting information from multiple sources?
Show Answer
Rather than silently picking one version, the synthesis should present both perspectives with their supporting evidence and source citations. It should note which sources agree and disagree, highlight claims supported by only one source versus multiple sources, prioritize high-credibility sources, and let the reader make the final judgment. This "epistemic honesty" approach builds more user trust than confidently presenting a single answer that hides genuine disagreement.

Key Takeaways