Naive RAG fails when the query and the relevant documents use different words, when top-k retrieval misses the best result, or when the model generates claims not supported by context. Advanced RAG techniques attack each of these failure modes: query transformation rewrites the query to improve retrieval, hybrid search combines dense and sparse signals, re-ranking uses powerful cross-encoders to refine initial results, and self-corrective approaches like CRAG and Self-RAG let the system verify and improve its own outputs. Mastering these techniques is the difference between a demo and a production system.
1. Query Transformation
The user's raw query is often a poor match for the retrieval index. Queries may be vague, use different terminology than the source documents, or bundle multiple sub-questions into one. Query transformation techniques rewrite, expand, or decompose the original query to improve retrieval recall and precision.
1.1 HyDE: Hypothetical Document Embeddings
HyDE (Gao et al., 2022) takes a counterintuitive approach: instead of embedding the query directly, it first asks the LLM to generate a hypothetical answer to the query, then embeds that hypothetical answer and uses it for retrieval. The intuition is that a hypothetical answer, even if factually incorrect, will be more lexically and semantically similar to real documents that contain the actual answer than the short query itself.
```python
from openai import OpenAI

client = OpenAI()

def hyde_retrieve(query, collection, k=5):
    """HyDE: Generate hypothetical answer, embed it, retrieve."""
    # Step 1: Generate a hypothetical document
    hypo_response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "system",
            "content": "Write a short passage that would answer the "
                       "following question. Be specific and detailed."
        }, {
            "role": "user",
            "content": query
        }],
        temperature=0.7
    )
    hypothetical_doc = hypo_response.choices[0].message.content

    # Step 2: Retrieve using the hypothetical document
    results = collection.query(
        query_texts=[hypothetical_doc],
        n_results=k
    )
    return results["documents"][0], results["metadatas"][0]
```
1.2 Multi-Query Expansion
Multi-query expansion generates several rephrased versions of the original query, retrieves results for each variant, and merges the result sets. This approach captures different phrasings and perspectives that might match different documents in the corpus.
```python
def multi_query_retrieve(query, collection, k=5, num_variants=3):
    """Generate multiple query variants and merge results."""
    # Generate query variants (reuses the OpenAI client from the
    # HyDE example above)
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "system",
            "content": f"Generate {num_variants} alternative phrasings of "
                       "the following search query. Return one per line."
        }, {
            "role": "user",
            "content": query
        }]
    )
    variants = response.choices[0].message.content.strip().split("\n")
    all_queries = [query] + variants

    # Retrieve for each variant, deduplicating by document id
    seen_ids = set()
    merged_results = []
    for q in all_queries:
        results = collection.query(query_texts=[q], n_results=k)
        for doc, meta, doc_id in zip(
            results["documents"][0],
            results["metadatas"][0],
            results["ids"][0]
        ):
            if doc_id not in seen_ids:
                seen_ids.add(doc_id)
                merged_results.append({
                    "document": doc,
                    "metadata": meta
                })
    return merged_results[:k * 2]  # Return expanded set
```
1.3 Step-Back Prompting
Step-back prompting (Zheng et al., 2023) generates a more abstract or general version of the query before retrieval. For example, the query "What was the GDP growth rate of Japan in Q3 2024?" might be stepped back to "What are the recent economic trends in Japan?" The broader query retrieves documents that provide necessary background context, which is then combined with results from the specific query.
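The control flow can be sketched as follows. This is a minimal illustration, not the paper's exact prompt: `generate` stands in for any prompt-to-text callable (e.g. a thin wrapper around the chat-completions call shown earlier), and `collection` is a ChromaDB-style collection as in the previous examples.

```python
def step_back_retrieve(query, collection, generate, k=5):
    """Step-back prompting: retrieve with both the original query and
    an LLM-generated broader version of it, then merge the results.

    `generate` is any callable mapping a prompt string to text.
    """
    step_back_query = generate(
        "Rewrite the following question as a broader, more general "
        f"question about the underlying topic:\n{query}"
    )
    merged, seen = [], set()
    for q in (query, step_back_query):
        results = collection.query(query_texts=[q], n_results=k)
        for doc, doc_id in zip(results["documents"][0], results["ids"][0]):
            if doc_id not in seen:
                seen.add(doc_id)
                merged.append(doc)
    return merged
```

The specific query contributes precise matches; the stepped-back query contributes background passages that the specific phrasing would never retrieve.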
2. Hybrid Retrieval: Dense + Sparse
Dense retrieval (embedding similarity) excels at semantic matching but can miss exact keyword matches. Sparse retrieval (BM25) excels at keyword matching but misses semantic relationships. Hybrid retrieval combines both signals, typically using Reciprocal Rank Fusion (RRF) to merge the ranked result lists.
2.1 BM25 for Sparse Retrieval
BM25 is a term-frequency scoring function that has been the backbone of search engines for decades. It assigns higher scores to documents containing query terms that are rare in the corpus (high IDF) and that appear frequently in the specific document (high TF), with saturation to prevent long documents from dominating.
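The scoring function itself is compact enough to write out. The sketch below uses one common IDF variant (ln((N − df + 0.5)/(df + 0.5) + 1)); libraries differ slightly in how they handle rare-term IDF, and in practice you would use a library such as `rank_bm25` rather than this hand-rolled version.

```python
import math

def bm25_score(query_terms, doc, corpus, k1=1.5, b=0.75):
    """Score one tokenized document against a query with BM25.

    k1 controls term-frequency saturation; b controls how strongly
    long documents are penalized relative to the average length.
    """
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N
    score = 0.0
    for term in query_terms:
        df = sum(1 for d in corpus if term in d)  # document frequency
        if df == 0:
            continue
        idf = math.log((N - df + 0.5) / (df + 0.5) + 1)  # rare terms score high
        tf = doc.count(term)  # term frequency in this document
        # TF saturates as tf grows; length normalization via b
        score += idf * (tf * (k1 + 1)) / (
            tf + k1 * (1 - b + b * len(doc) / avgdl)
        )
    return score
```

Note how a term that appears in every document (high df) contributes almost nothing, while a rare term in a short document dominates the score.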
```python
from rank_bm25 import BM25Okapi
import numpy as np

class HybridRetriever:
    """Combine dense (vector) and sparse (BM25) retrieval."""

    def __init__(self, documents, collection):
        self.documents = documents
        self.collection = collection  # ChromaDB collection
        # Build BM25 index
        tokenized = [doc.lower().split() for doc in documents]
        self.bm25 = BM25Okapi(tokenized)

    def retrieve(self, query, k=5, alpha=0.5):
        """Hybrid retrieval with Reciprocal Rank Fusion.

        Args:
            alpha: Weight for dense results (1-alpha for sparse).
        """
        # Dense retrieval
        dense_results = self.collection.query(
            query_texts=[query],
            n_results=k * 2
        )
        dense_ids = dense_results["ids"][0]

        # Sparse retrieval (BM25)
        tokenized_query = query.lower().split()
        bm25_scores = self.bm25.get_scores(tokenized_query)
        sparse_top = np.argsort(bm25_scores)[::-1][:k * 2]

        # Reciprocal Rank Fusion
        rrf_scores = {}
        rrf_k = 60  # Standard RRF constant
        for rank, doc_id in enumerate(dense_ids):
            rrf_scores[doc_id] = rrf_scores.get(doc_id, 0)
            rrf_scores[doc_id] += alpha / (rrf_k + rank + 1)
        for rank, idx in enumerate(sparse_top):
            # Assumes documents were ingested with ids "doc_0", "doc_1", ...
            # so BM25 positions map onto the collection's id space
            doc_id = f"doc_{idx}"
            rrf_scores[doc_id] = rrf_scores.get(doc_id, 0)
            rrf_scores[doc_id] += (1 - alpha) / (rrf_k + rank + 1)

        # Sort by fused score
        ranked = sorted(
            rrf_scores.items(),
            key=lambda x: x[1],
            reverse=True
        )
        return ranked[:k]
```
Hybrid retrieval consistently outperforms either dense or sparse retrieval alone across benchmarks. In the BEIR benchmark, combining BM25 with a dense retriever using RRF improved NDCG@10 by 5 to 15% compared to using either method alone. The gains are largest on technical domains where exact terminology matters (legal, medical, code) but dense semantic understanding is also needed.
3. Re-Ranking with Cross-Encoders
Initial retrieval (whether dense, sparse, or hybrid) uses fast but approximate scoring. Re-ranking applies a more powerful but slower model to the initial candidate set. Cross-encoder models are particularly effective because they jointly encode the query and document together, enabling fine-grained interaction between query and passage tokens.
3.1 How Cross-Encoders Differ from Bi-Encoders
Bi-encoders (used for initial retrieval) encode the query and document independently, then compute similarity via dot product or cosine. This allows pre-computing document embeddings but limits interaction between query and document representations. Cross-encoders encode the query and document as a single concatenated input, enabling full token-level attention between them. This produces much more accurate relevance scores but requires running inference for every (query, document) pair, making it too slow for searching millions of documents.
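The cost asymmetry is easiest to see in code. In this sketch, `encode` and `score_pair` are placeholders for real models (e.g. a sentence-transformers bi-encoder and a cross-encoder's pairwise scorer); the point is the shape of the computation, not the models themselves.

```python
import numpy as np

def bi_encoder_search(query, doc_embeddings, encode):
    """Bi-encoder: document embeddings were computed once at index
    time; query time costs one encode plus cheap dot products."""
    q = encode(query)
    scores = doc_embeddings @ q
    return np.argsort(scores)[::-1]  # indices, best first

def cross_encoder_rerank(query, docs, score_pair, top_k=5):
    """Cross-encoder: one full forward pass per (query, doc) pair,
    so it is only applied to a small candidate set."""
    scores = [score_pair(query, d) for d in docs]
    order = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
    return [docs[i] for i in order[:top_k]]
```

This is why the standard pipeline retrieves ~50-100 candidates with a bi-encoder and re-ranks only those with the cross-encoder: the expensive pairwise scoring runs dozens of times, not millions.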
3.2 Using Cohere Rerank
```python
import cohere

co = cohere.ClientV2("YOUR_API_KEY")

def rerank_results(query, documents, top_n=5):
    """Re-rank retrieved documents using Cohere Rerank."""
    response = co.rerank(
        model="rerank-v3.5",
        query=query,
        documents=documents,
        top_n=top_n,
        return_documents=True
    )
    reranked = []
    for result in response.results:
        reranked.append({
            "text": result.document.text,
            "relevance_score": result.relevance_score,
            "original_index": result.index
        })
    return reranked
```
4. Contextual Retrieval
Standard chunking strips away the surrounding context that gives a chunk its meaning. A chunk reading "The company reported 15% growth" is ambiguous without knowing which company, which metric, and which time period. Contextual retrieval (Anthropic, 2024) prepends each chunk with a short contextual summary generated by an LLM, creating self-contained chunks that embed and retrieve much more accurately.
Anthropic's experiments showed that contextual retrieval reduced retrieval failure rates by 49% compared to standard chunking, and by 67% when combined with BM25 hybrid search. The contextual prefix is typically 50 to 100 tokens describing the document title, section heading, and the chunk's role within the broader document. This prefix is included when embedding but can be omitted when presenting the chunk to the LLM for generation.
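The ingestion-time step can be sketched as below. The prompt wording is illustrative, not Anthropic's exact template, and `generate` stands in for any prompt-to-text callable (e.g. the chat client used earlier); the key design point is storing two views of each chunk, one for embedding and one for generation.

```python
def contextualize_chunks(document, chunks, generate):
    """Prepend an LLM-generated context prefix to each chunk for
    embedding, while keeping the bare chunk for generation time."""
    contextualized = []
    for chunk in chunks:
        prompt = (
            "Here is a document:\n" + document +
            "\n\nHere is a chunk from it:\n" + chunk +
            "\n\nWrite a 1-2 sentence context situating this chunk "
            "within the document, to improve search retrieval."
        )
        prefix = generate(prompt)
        contextualized.append({
            "embed_text": prefix + "\n" + chunk,  # what gets embedded
            "chunk": chunk,  # what the LLM sees at generation time
        })
    return contextualized
```

Because the prefix depends only on the document, this cost is paid once at ingestion; prompt caching on the full document makes the per-chunk calls much cheaper.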
5. Self-Corrective RAG
Standard RAG blindly trusts the retrieval results and generates from whatever context is provided. Self-corrective RAG systems evaluate the quality of retrieved documents and the faithfulness of generated answers, triggering corrective actions when problems are detected.
5.1 CRAG: Corrective Retrieval-Augmented Generation
CRAG (Yan et al., 2024) adds a retrieval evaluator that classifies each retrieved document as "correct," "incorrect," or "ambiguous." If all documents are incorrect, the system falls back to web search. If documents are ambiguous, the system refines the query and re-retrieves. Only when documents are classified as correct does generation proceed normally.
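The decision logic reduces to a short control flow. All five callables in this sketch are placeholders for real components (`evaluate` plays the role of CRAG's retrieval evaluator, labeling each document "correct", "incorrect", or "ambiguous"); the paper's full method also includes a knowledge-refinement step omitted here.

```python
def crag_answer(query, retrieve, evaluate, web_search, refine, generate):
    """Corrective RAG control flow (sketch)."""
    docs = retrieve(query)
    labels = [evaluate(query, d) for d in docs]
    if all(label == "incorrect" for label in labels):
        docs = web_search(query)          # all bad: fall back to web search
    elif "ambiguous" in labels:
        docs = retrieve(refine(query))    # unclear: refine query, re-retrieve
    else:
        # keep only the documents the evaluator trusts
        docs = [d for d, label in zip(docs, labels) if label == "correct"]
    return generate(query, docs)
```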
5.2 Self-RAG
Self-RAG (Asai et al., 2023) trains the LLM itself to generate special reflection tokens that assess whether retrieval is needed, whether retrieved passages are relevant, whether the generated response is supported by the evidence, and whether the response is useful. These self-assessments allow the model to adaptively decide when to retrieve, which passages to use, and when to regenerate.
6. Fusion Retrieval and Multi-Modal RAG
Fusion retrieval goes beyond combining dense and sparse signals. RAG Fusion (Raudaschl, 2023) generates multiple search queries, retrieves results for each, and applies RRF across all result sets. This approach captures diverse perspectives on the query and is particularly effective for complex, multi-faceted questions.
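A minimal sketch of that fusion step, assuming a variant generator (like the multi-query example in section 1.2) and a retriever that returns a ranked list of document ids per query:

```python
def rag_fusion(query, make_variants, retrieve, k=5, rrf_k=60):
    """RAG Fusion: retrieve per query variant, fuse rankings with
    Reciprocal Rank Fusion across all result lists."""
    queries = [query] + make_variants(query)
    fused = {}
    for q in queries:
        for rank, doc_id in enumerate(retrieve(q)):
            # A document ranked highly by several variants accumulates score
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (rrf_k + rank + 1)
    ranked = sorted(fused.items(), key=lambda x: x[1], reverse=True)
    return [doc_id for doc_id, _ in ranked[:k]]
```

Documents that surface for multiple variants accumulate score across lists, so consensus results rise to the top even if no single query ranked them first.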
6.1 Multi-Modal RAG
Multi-modal RAG extends retrieval beyond text to include images, tables, charts, and diagrams. This is essential for domains where critical information is encoded visually, such as scientific papers (figures and plots), financial reports (tables and charts), or technical documentation (architecture diagrams). Vision-language models like GPT-4o and Claude can process both retrieved text and images in their context window.
Multi-modal RAG introduces several unique challenges: (1) embedding images and text into a shared vector space is still an active research area, with models like CLIP providing only coarse alignment; (2) table extraction from PDFs is error-prone, often requiring specialized tools; (3) the token cost of including images in the context is high (a single image may consume 500+ tokens); and (4) evaluation is more complex because both visual and textual relevance must be assessed.
7. Comparison of Advanced RAG Techniques
| Technique | What It Fixes | Latency Cost | Best For |
|---|---|---|---|
| HyDE | Query-document vocabulary gap | +1 LLM call | Technical/domain queries |
| Multi-Query | Single-perspective retrieval | +1 LLM call, N retrievals | Ambiguous or broad queries |
| Step-Back | Missing background context | +1 LLM call, 2 retrievals | Specific factual questions |
| BM25 Hybrid | Missed keyword matches | Minimal (BM25 is fast) | Technical, legal, medical |
| Cross-Encoder Rerank | Imprecise initial ranking | +N model inferences | High-precision applications |
| Contextual Retrieval | Context-stripped chunks | Ingestion-time LLM cost | Large document corpora |
| CRAG / Self-RAG | Blind trust in bad retrieval | +1 to 3 LLM calls | Safety-critical applications |
Section 19.2 Quiz
Key Takeaways
- Query transformation bridges the vocabulary gap: HyDE, multi-query, and step-back prompting each address different causes of retrieval failure by rewriting the query before it reaches the index.
- Hybrid retrieval is almost always better: Combining dense and sparse (BM25) retrieval with Reciprocal Rank Fusion consistently outperforms either method alone, especially in technical domains.
- Re-ranking is high-impact and low-effort: Adding a cross-encoder re-ranker on top of initial retrieval is one of the highest-ROI improvements you can make to a RAG pipeline.
- Contextual retrieval makes chunks self-contained: Prepending LLM-generated context to chunks at ingestion time reduces retrieval failures by 49% (67% with BM25 hybrid).
- Self-corrective RAG prevents blind trust: CRAG and Self-RAG evaluate retrieval quality and generation faithfulness, triggering corrective actions when problems are detected.