Module 19 · Section 19.2

Advanced RAG Techniques

Query transformation, hybrid retrieval, re-ranking, and self-corrective RAG for production-quality retrieval pipelines
★ Big Picture

Naive RAG fails when the query and the relevant documents use different words, when top-k retrieval misses the best result, or when the model generates claims not supported by context. Advanced RAG techniques attack each of these failure modes: query transformation rewrites the query to improve retrieval, hybrid search combines dense and sparse signals, re-ranking uses powerful cross-encoders to refine initial results, and self-corrective approaches like CRAG and Self-RAG let the system verify and improve its own outputs. Mastering these techniques is the difference between a demo and a production system.

1. Query Transformation

The user's raw query is often a poor match for the retrieval index. Queries may be vague, use different terminology than the source documents, or bundle multiple sub-questions into one. Query transformation techniques rewrite, expand, or decompose the original query to improve retrieval recall and precision.

1.1 HyDE: Hypothetical Document Embeddings

HyDE (Gao et al., 2022) takes a counterintuitive approach: instead of embedding the query directly, it first asks the LLM to generate a hypothetical answer to the query, then embeds that hypothetical answer and uses it for retrieval. The intuition is that a hypothetical answer, even if factually incorrect, will be more lexically and semantically similar to real documents that contain the actual answer than the short query itself.

from openai import OpenAI

client = OpenAI()

def hyde_retrieve(query, collection, k=5):
    """HyDE: Generate hypothetical answer, embed it, retrieve."""

    # Step 1: Generate a hypothetical document
    hypo_response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "system",
            "content": "Write a short passage that would answer the "
                       "following question. Be specific and detailed."
        }, {
            "role": "user",
            "content": query
        }],
        temperature=0.7
    )
    hypothetical_doc = hypo_response.choices[0].message.content

    # Step 2: Retrieve using the hypothetical document
    results = collection.query(
        query_texts=[hypothetical_doc],
        n_results=k
    )

    return results["documents"][0], results["metadatas"][0]

1.2 Multi-Query Expansion

Multi-query expansion generates several rephrased versions of the original query, retrieves results for each variant, and merges the result sets. This approach captures different phrasings and perspectives that might match different documents in the corpus.

def multi_query_retrieve(query, collection, k=5, num_variants=3):
    """Generate multiple query variants and merge results."""

    # Generate query variants
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "system",
            "content": f"Generate {num_variants} alternative phrasings of "
                       "the following search query. Return one per line."
        }, {
            "role": "user",
            "content": query
        }]
    )
    raw = response.choices[0].message.content.strip().split("\n")
    variants = [v.strip() for v in raw if v.strip()]  # drop blank lines
    all_queries = [query] + variants

    # Retrieve for each variant
    seen_ids = set()
    merged_results = []

    for q in all_queries:
        results = collection.query(query_texts=[q], n_results=k)
        for doc, meta, doc_id in zip(
            results["documents"][0],
            results["metadatas"][0],
            results["ids"][0]
        ):
            if doc_id not in seen_ids:
                seen_ids.add(doc_id)
                merged_results.append({
                    "document": doc,
                    "metadata": meta
                })

    return merged_results[:k * 2]  # Return expanded set

1.3 Step-Back Prompting

Step-back prompting (Zheng et al., 2023) generates a more abstract or general version of the query before retrieval. For example, the query "What was the GDP growth rate of Japan in Q3 2024?" might be stepped back to "What are the recent economic trends in Japan?" The broader query retrieves documents that provide necessary background context, which is then combined with results from the specific query.
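The two-pass flow can be sketched as a small function. This is a minimal sketch, not the paper's implementation: the LLM call and the retriever are injected as callables (`generate_step_back` and `retrieve` are placeholders for your own chat-completion and vector-store calls).

```python
def step_back_retrieve(query, generate_step_back, retrieve, k=5):
    """Step-back prompting: retrieve for both the specific query and
    a broader, more abstract version, then merge the result sets.

    Args:
        generate_step_back: callable(str) -> str, asks an LLM for a
            more general version of the query (placeholder here).
        retrieve: callable(str, int) -> list of (doc_id, text) pairs.
    """
    broad_query = generate_step_back(query)

    specific = retrieve(query, k)        # answers the exact question
    background = retrieve(broad_query, k)  # provides context

    # Merge, de-duplicating by doc_id; specific results rank first
    seen, merged = set(), []
    for doc_id, text in specific + background:
        if doc_id not in seen:
            seen.add(doc_id)
            merged.append((doc_id, text))
    return merged[:2 * k]
```

Because the dependencies are injected, the same control flow works with any LLM client and any vector store.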

[Figure: the original query feeds three transformation strategies — HyDE (LLM generates a hypothetical answer; embed that instead), Multi-Query (generate N rephrasings; retrieve for each; merge result sets), and Step-Back (abstract to a broader question; retrieve background context) — all leading to improved retrieval with higher recall and precision.]
Figure 19.4: Three query transformation strategies, each addressing different causes of retrieval failure.

2. Hybrid Retrieval: Dense + Sparse

Dense retrieval (embedding similarity) excels at semantic matching but can miss exact keyword matches. Sparse retrieval (BM25) excels at keyword matching but misses semantic relationships. Hybrid retrieval combines both signals, typically using Reciprocal Rank Fusion (RRF) to merge the ranked result lists.

2.1 BM25 for Sparse Retrieval

BM25 is a term-frequency scoring function that has been the backbone of search engines for decades. It assigns higher scores to documents containing query terms that are rare in the corpus (high IDF) and that appear frequently in the specific document (high TF), with saturation to prevent long documents from dominating.
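For reference, the standard BM25 score of a document $D$ for query $Q$ is:

```latex
\mathrm{BM25}(D, Q) = \sum_{q_i \in Q} \mathrm{IDF}(q_i) \cdot
  \frac{f(q_i, D)\,(k_1 + 1)}
       {f(q_i, D) + k_1 \left(1 - b + b \cdot \frac{|D|}{\mathrm{avgdl}}\right)}
```

where $f(q_i, D)$ is the frequency of term $q_i$ in $D$, $|D|$ is the document length, and $\mathrm{avgdl}$ is the average document length in the corpus. The parameter $k_1$ (typically 1.2 to 2.0) controls term-frequency saturation, and $b$ (typically 0.75) controls length normalization — exactly the saturation behavior described above.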

from rank_bm25 import BM25Okapi
import numpy as np

class HybridRetriever:
    """Combine dense (vector) and sparse (BM25) retrieval."""

    def __init__(self, documents, collection):
        self.documents = documents
        self.collection = collection  # ChromaDB collection

        # Build BM25 index
        tokenized = [doc.lower().split() for doc in documents]
        self.bm25 = BM25Okapi(tokenized)

    def retrieve(self, query, k=5, alpha=0.5):
        """Hybrid retrieval with Reciprocal Rank Fusion.

        Args:
            alpha: Weight for dense results (1-alpha for sparse).
        """
        # Dense retrieval
        dense_results = self.collection.query(
            query_texts=[query], n_results=k * 2
        )
        dense_ids = dense_results["ids"][0]

        # Sparse retrieval (BM25)
        tokenized_query = query.lower().split()
        bm25_scores = self.bm25.get_scores(tokenized_query)
        sparse_top = np.argsort(bm25_scores)[::-1][:k * 2]

        # Reciprocal Rank Fusion
        rrf_scores = {}
        rrf_k = 60  # Standard RRF constant

        for rank, doc_id in enumerate(dense_ids):
            rrf_scores[doc_id] = rrf_scores.get(doc_id, 0)
            rrf_scores[doc_id] += alpha / (rrf_k + rank + 1)

        for rank, idx in enumerate(sparse_top):
            # Assumes the collection's IDs follow a "doc_{index}" naming
            # convention aligned with the order of self.documents
            doc_id = f"doc_{idx}"
            rrf_scores[doc_id] = rrf_scores.get(doc_id, 0)
            rrf_scores[doc_id] += (1 - alpha) / (rrf_k + rank + 1)

        # Sort by fused score
        ranked = sorted(
            rrf_scores.items(),
            key=lambda x: x[1],
            reverse=True
        )
        return ranked[:k]

★ Key Insight

Hybrid retrieval consistently outperforms either dense or sparse retrieval alone across benchmarks. In the BEIR benchmark, combining BM25 with a dense retriever using RRF improved NDCG@10 by 5 to 15% compared to using either method alone. The gains are largest on technical domains where exact terminology matters (legal, medical, code) but dense semantic understanding is also needed.

3. Re-Ranking with Cross-Encoders

Initial retrieval (whether dense, sparse, or hybrid) uses fast but approximate scoring. Re-ranking applies a more powerful but slower model to the initial candidate set. Cross-encoder models are particularly effective because they jointly encode the query and document together, enabling fine-grained interaction between query and passage tokens.

3.1 How Cross-Encoders Differ from Bi-Encoders

Bi-encoders (used for initial retrieval) encode the query and document independently, then compute similarity via dot product or cosine. This allows pre-computing document embeddings but limits interaction between query and document representations. Cross-encoders encode the query and document as a single concatenated input, enabling full token-level attention between them. This produces much more accurate relevance scores but requires running inference for every (query, document) pair, making it too slow for searching millions of documents.
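The re-ranking step itself is simple once a pairwise scorer exists. A minimal sketch follows, with the cross-encoder abstracted as a `score_fn(query, doc) -> float` callable; in practice that callable would wrap a real cross-encoder model, and the token-overlap scorer below is only a toy stand-in so the example is self-contained.

```python
def rerank(query, documents, score_fn, top_n=5):
    """Score every (query, document) pair and keep the top_n.

    score_fn stands in for a cross-encoder: it sees the query and
    document together and returns a relevance score.
    """
    scored = [(doc, score_fn(query, doc)) for doc in documents]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]

def overlap_score(query, doc):
    """Toy lexical-overlap scorer (NOT a real cross-encoder)."""
    q_tokens = set(query.lower().split())
    d_tokens = set(doc.lower().split())
    return len(q_tokens & d_tokens) / max(len(q_tokens), 1)
```

Note the cost structure this makes visible: `score_fn` runs once per candidate document, which is why cross-encoders are applied only to a small candidate set rather than the whole corpus.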

[Figure: a bi-encoder (retrieval) runs the query and document through separate encoders, producing q and d vectors compared by cosine — fast (doc vectors are pre-computed), limited (no token interaction), used for initial retrieval over millions of documents. A cross-encoder (re-ranking) takes [CLS] Query [SEP] Document [SEP] as one input and outputs a relevance score from 0 to 1 — slow (per-pair inference), accurate (full cross-attention), used for re-ranking the top 20 to 100 candidates.]
Figure 19.5: Bi-encoders enable fast retrieval by encoding independently; cross-encoders enable accurate re-ranking through joint encoding.

3.2 Using Cohere Rerank

import cohere

co = cohere.ClientV2("YOUR_API_KEY")

def rerank_results(query, documents, top_n=5):
    """Re-rank retrieved documents using Cohere Rerank."""
    response = co.rerank(
        model="rerank-v3.5",
        query=query,
        documents=documents,
        top_n=top_n,
        return_documents=True
    )

    reranked = []
    for result in response.results:
        reranked.append({
            "text": result.document.text,
            "relevance_score": result.relevance_score,
            "original_index": result.index
        })

    return reranked

4. Contextual Retrieval

Standard chunking strips away the surrounding context that gives a chunk its meaning. A chunk reading "The company reported 15% growth" is ambiguous without knowing which company, which metric, and which time period. Contextual retrieval (Anthropic, 2024) prepends each chunk with a short contextual summary generated by an LLM, creating self-contained chunks that embed and retrieve much more accurately.

ⓘ Contextual Retrieval in Practice

Anthropic's experiments showed that contextual retrieval reduced retrieval failure rates by 49% compared to standard chunking, and by 67% when combined with BM25 hybrid search. The contextual prefix is typically 50 to 100 tokens describing the document title, section heading, and the chunk's role within the broader document. This prefix is included when embedding but can be omitted when presenting the chunk to the LLM for generation.
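The ingestion-time step can be sketched as follows. This is a simplified illustration of the idea, not Anthropic's exact prompt: the LLM call is injected as a `generate` callable (a placeholder for your chat-completion client), and `max_doc_chars` is an assumed truncation limit to keep the prompt bounded.

```python
def contextualize_chunks(document, chunks, generate, max_doc_chars=8000):
    """Prepend each chunk with an LLM-generated contextual prefix.

    Args:
        generate: callable(str) -> str that runs the LLM (injected
            here; in practice an OpenAI/Anthropic chat call).
    """
    contextualized = []
    for chunk in chunks:
        prompt = (
            f"Document:\n{document[:max_doc_chars]}\n\n"
            f"Chunk:\n{chunk}\n\n"
            "In 1-2 sentences, situate this chunk within the document "
            "(title, section, and what its references refer to)."
        )
        prefix = generate(prompt)
        # The prefixed text is what gets embedded; the raw chunk can
        # still be shown to the generator LLM without the prefix.
        contextualized.append(f"{prefix}\n\n{chunk}")
    return contextualized
```

This runs once per chunk at ingestion time, which is where the one-time LLM cost mentioned above comes from.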

5. Self-Corrective RAG

Standard RAG blindly trusts the retrieval results and generates from whatever context is provided. Self-corrective RAG systems evaluate the quality of retrieved documents and the faithfulness of generated answers, triggering corrective actions when problems are detected.

5.1 CRAG: Corrective Retrieval-Augmented Generation

CRAG (Yan et al., 2024) adds a retrieval evaluator that classifies each retrieved document as "correct," "incorrect," or "ambiguous." If all documents are incorrect, the system falls back to web search. If documents are ambiguous, the system refines the query and re-retrieves. Only when documents are classified as correct does generation proceed normally.
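The three-way branch can be sketched as plain control flow. This is a simplified sketch of CRAG's decision logic, not the paper's trained evaluator: the retriever, evaluator, query rewriter, web-search fallback, and generator are all injected callables standing in for real components.

```python
def crag_pipeline(query, retrieve, evaluate, refine_query,
                  web_search, generate, k=5):
    """Sketch of CRAG's three-way correction branch.

    Args:
        evaluate: callable(query, docs) -> one of
            "correct", "ambiguous", "incorrect".
        Other args are injected stand-ins for the retriever,
        query rewriter, web-search fallback, and generator.
    """
    docs = retrieve(query, k)
    verdict = evaluate(query, docs)

    if verdict == "correct":
        context = docs                           # standard RAG path
    elif verdict == "ambiguous":
        docs = retrieve(refine_query(query), k)  # refine, re-retrieve
        context = docs
    else:  # "incorrect": retrieval failed, fall back to web search
        context = web_search(query)

    return generate(query, context)
```

The key point the sketch makes explicit: generation never proceeds from context the evaluator has rejected.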

5.2 Self-RAG

Self-RAG (Asai et al., 2023) trains the LLM itself to generate special reflection tokens that assess whether retrieval is needed, whether retrieved passages are relevant, whether the generated response is supported by the evidence, and whether the response is useful. These self-assessments allow the model to adaptively decide when to retrieve, which passages to use, and when to regenerate.

[Figure: CRAG pipeline — the query flows to "Retrieve top-k docs", then to an evaluator with three outcomes: correct → generate (standard RAG); ambiguous → refine query and re-retrieve; incorrect → fall back to web search.]
Figure 19.6: CRAG evaluates retrieved documents and branches into three correction paths based on retrieval quality.

6. Fusion Retrieval and Multi-Modal RAG

Fusion retrieval goes beyond combining dense and sparse signals. RAG Fusion (Raudaschl, 2023) generates multiple search queries, retrieves results for each, and applies RRF across all result sets. This approach captures diverse perspectives on the query and is particularly effective for complex, multi-faceted questions.
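The fusion step at the heart of RAG Fusion is a small pure function over the per-query ranked ID lists. A sketch, assuming `result_lists` comes from running the retriever once per generated query:

```python
def reciprocal_rank_fusion(result_lists, rrf_k=60):
    """Merge several ranked lists of doc IDs with RRF.

    Each document's fused score is the sum of 1 / (rrf_k + rank + 1)
    over every list in which it appears, so documents ranked highly
    by multiple queries rise to the top.
    """
    scores = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (rrf_k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)
```

The constant `rrf_k = 60` is the same standard value used in the hybrid-retrieval example earlier; it dampens the influence of any single list's top rank.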

6.1 Multi-Modal RAG

Multi-modal RAG extends retrieval beyond text to include images, tables, charts, and diagrams. This is essential for domains where critical information is encoded visually, such as scientific papers (figures and plots), financial reports (tables and charts), or technical documentation (architecture diagrams). Vision-language models like GPT-4o and Claude can process both retrieved text and images in their context window.

⚠ Multi-Modal RAG Challenges

Multi-modal RAG introduces several unique challenges: (1) embedding images and text into a shared vector space is still an active research area, with models like CLIP providing only coarse alignment; (2) table extraction from PDFs is error-prone, often requiring specialized tools; (3) the token cost of including images in the context is high (a single image may consume 500+ tokens); and (4) evaluation is more complex because both visual and textual relevance must be assessed.

7. Comparison of Advanced RAG Techniques

Technique            | What It Fixes                 | Latency Cost              | Best For
---------------------|-------------------------------|---------------------------|------------------------------
HyDE                 | Query-document vocabulary gap | +1 LLM call               | Technical/domain queries
Multi-Query          | Single-perspective retrieval  | +1 LLM call, N retrievals | Ambiguous or broad queries
Step-Back            | Missing background context    | +1 LLM call, 2 retrievals | Specific factual questions
BM25 Hybrid          | Missed keyword matches        | Minimal (BM25 is fast)    | Technical, legal, medical
Cross-Encoder Rerank | Imprecise initial ranking     | +N model inferences       | High-precision applications
Contextual Retrieval | Context-stripped chunks       | Ingestion-time LLM cost   | Large document corpora
CRAG / Self-RAG      | Blind trust in bad retrieval  | +1 to 3 LLM calls         | Safety-critical applications

Section 19.2 Quiz

1. How does HyDE improve retrieval compared to directly embedding the user query?
Answer:
HyDE generates a hypothetical answer to the query using an LLM, then embeds this hypothetical document for retrieval instead of the raw query. The hypothetical answer is longer and more semantically similar to real documents than the short query, bridging the vocabulary and length gap between queries and documents. Even if the hypothetical answer is factually wrong, it uses the same style, terminology, and structure as real documents in the index.
2. Why does hybrid retrieval (dense + BM25) outperform either method alone?
Answer:
Dense retrieval captures semantic similarity (paraphrases, synonyms, conceptual matches) but can miss exact keyword matches. BM25 captures exact term matches and handles rare terms well but misses semantic relationships. By combining both with Reciprocal Rank Fusion (RRF), hybrid retrieval gets the best of both worlds: documents that are semantically relevant and those that contain exact query terms both contribute to the final ranking.
3. Why are cross-encoders more accurate than bi-encoders for relevance scoring?
Answer:
Cross-encoders encode the query and document as a single concatenated input, enabling full token-level attention between query and document tokens. This allows fine-grained interaction and comparison. Bi-encoders encode query and document independently, computing similarity only through a simple dot product or cosine of the final vectors. The trade-off is that cross-encoders are too slow for searching millions of documents, so they are used only for re-ranking a small candidate set (typically 20 to 100 documents).
4. How does CRAG differ from standard RAG in handling retrieval failures?
Answer:
Standard RAG blindly trusts whatever documents are retrieved and generates from them regardless of quality. CRAG adds a retrieval evaluator that classifies each document as correct, ambiguous, or incorrect. If documents are correct, generation proceeds normally. If ambiguous, the query is refined and retrieval is repeated. If incorrect, the system falls back to web search. This three-way branching prevents the model from generating answers grounded in irrelevant or misleading context.
5. What problem does contextual retrieval solve, and what is the cost?
Answer:
Contextual retrieval solves the problem of context-stripped chunks that lose their meaning when isolated from surrounding text. It prepends each chunk with an LLM-generated contextual summary (50 to 100 tokens) describing the document, section, and the chunk's role. This makes chunks self-contained for embedding and retrieval. The cost is an additional LLM call per chunk during ingestion (not at query time), which can be significant for large corpora but is a one-time expense.

Key Takeaways