RAG bridges the gap between what an LLM knows and what it needs to know. Rather than encoding all knowledge in model parameters, RAG retrieves relevant documents at inference time and injects them into the prompt. This simple idea yields enormous practical benefits: reduced hallucination, up-to-date information, domain-specific expertise, and full source attribution. Understanding the fundamental architecture, its failure modes, and when to choose RAG over fine-tuning is the foundation for everything else in this module.
1. Why Retrieval-Augmented Generation?
Large language models store knowledge implicitly in their parameters during pretraining. This parametric knowledge has three fundamental limitations. First, it has a knowledge cutoff: the model knows nothing about events after its training data was collected. Second, it is incomplete: no model can memorize every fact from its training corpus, especially rare or domain-specific information. Third, it is unverifiable: when a model generates a claim, there is no way to trace that claim back to a specific source document.
Retrieval-Augmented Generation, introduced by Lewis et al. (2020), addresses all three limitations by adding an explicit retrieval step before generation. The model receives both the user's query and a set of retrieved documents, then generates a response grounded in the retrieved evidence. This approach combines the generative fluency of LLMs with the factual precision of information retrieval systems.
1.1 The Core RAG Loop
Every RAG system follows the same fundamental loop: the user submits a query, the system retrieves relevant documents from a knowledge base, the retrieved documents are inserted into the LLM's context window along with the query, and the LLM generates a response grounded in the retrieved context. This loop can be as simple as a single retrieval step or as complex as a multi-turn agentic workflow with iterative refinement.
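The loop above can be sketched in a few lines. This is a schematic toy, not a production pipeline: retrieval here is plain keyword overlap standing in for the dense-embedding search covered in Section 2.3, and `generate_fn` is a placeholder for an LLM call.

```python
def rag_answer(query, knowledge_base, generate_fn, k=2):
    """Schematic single-pass RAG loop: retrieve, augment, generate.
    Retrieval is toy keyword overlap; real systems use embeddings."""
    def overlap(doc):
        # Count shared words between query and document
        return len(set(query.lower().split()) & set(doc.lower().split()))

    # Retrieve: top-k documents by overlap score
    retrieved = sorted(knowledge_base, key=overlap, reverse=True)[:k]
    # Augment: insert retrieved text into the prompt
    prompt = "Context:\n" + "\n".join(retrieved) + f"\n\nQuestion: {query}"
    # Generate: delegate to the LLM (stubbed here)
    return generate_fn(prompt)
```

Swapping `overlap` for vector similarity and `generate_fn` for a chat-completion call turns this sketch into the naive RAG pipeline of Section 3.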
2. The Ingestion Pipeline
Before retrieval can happen, documents must be processed and indexed. The ingestion pipeline transforms raw documents (PDFs, web pages, databases, Markdown files) into searchable chunks stored in a vector database. The quality of this pipeline directly determines the quality of retrieved results, making it one of the most important components of any RAG system.
2.1 Document Loading and Preprocessing
The first step is loading documents from their source format and extracting clean text. This involves handling diverse formats (PDF, HTML, DOCX, CSV), removing boilerplate content (headers, footers, navigation), preserving meaningful structure (headings, tables, lists), and extracting metadata (title, author, date, source URL) for later filtering.
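As a minimal illustration of boilerplate removal, the sketch below strips `nav`, `script`, and similar elements from HTML using only Python's standard-library parser. Production pipelines typically rely on dedicated extraction libraries; the tag list here is an assumption for the example.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, skipping common boilerplate elements."""
    SKIP = {"script", "style", "nav", "header", "footer"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0  # >0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())

def html_to_text(html: str) -> str:
    """Return cleaned visible text from an HTML string."""
    parser = TextExtractor()
    parser.feed(html)
    return "\n".join(parser.parts)
```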
2.2 Chunking Strategies
Raw documents are typically too long to fit in a single retrieval result. Chunking splits documents into smaller segments that can be independently embedded and retrieved. The choice of chunking strategy profoundly affects retrieval quality: chunks that are too small lose context, while chunks that are too large dilute relevance and waste context window space.
Common Chunking Approaches
```python
import tiktoken

def chunk_by_tokens(text, max_tokens=512, overlap=50):
    """Split text into chunks with token-level control."""
    encoder = tiktoken.encoding_for_model("gpt-4")
    tokens = encoder.encode(text)
    chunks = []
    start = 0
    while start < len(tokens):
        end = start + max_tokens
        chunk_tokens = tokens[start:end]
        chunk_text = encoder.decode(chunk_tokens)
        chunks.append({
            "text": chunk_text,
            "token_count": len(chunk_tokens),
            "start_token": start
        })
        start = end - overlap  # Slide window with overlap
    return chunks

def chunk_by_structure(markdown_text):
    """Split markdown by headings, preserving document structure."""
    sections = []
    current_section = {"heading": "", "content": []}
    for line in markdown_text.split("\n"):
        if line.startswith("#"):
            if current_section["content"]:
                sections.append(current_section)
            current_section = {
                "heading": line.strip("# "),
                "content": []
            }
        else:
            current_section["content"].append(line)
    if current_section["content"]:
        sections.append(current_section)
    return [{
        "text": "\n".join(s["content"]),
        "metadata": {"heading": s["heading"]}
    } for s in sections]
```
The optimal chunk size depends on your use case. For question-answering, 256 to 512 tokens works well because each chunk should contain a single coherent answer. For summarization, larger chunks (1024+ tokens) preserve more context. Always include overlap between consecutive chunks (10 to 15% of chunk size) to avoid splitting important information across chunk boundaries.
2.3 Embedding and Indexing
After chunking, each chunk is converted into a dense vector using an embedding model and stored in a vector database. The embedding model's quality is critical: it determines whether semantically similar queries and documents will have similar vector representations. Popular embedding models include OpenAI's text-embedding-3-small, Cohere's embed-v3, and open-source options like BAAI/bge-large-en-v1.5.
```python
from openai import OpenAI
import chromadb

client = OpenAI()
chroma = chromadb.PersistentClient(path="./chroma_db")
collection = chroma.get_or_create_collection(
    name="documents",
    metadata={"hnsw:space": "cosine"}
)

def ingest_chunks(chunks, source_doc):
    """Embed and store chunks in ChromaDB."""
    texts = [c["text"] for c in chunks]
    # Batch embed (max 2048 texts per API call)
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=texts
    )
    embeddings = [item.embedding for item in response.data]
    # Store with metadata for filtering
    collection.add(
        ids=[f"{source_doc}_chunk_{i}" for i in range(len(chunks))],
        embeddings=embeddings,
        documents=texts,
        metadatas=[{
            "source": source_doc,
            "chunk_index": i,
            "heading": chunks[i].get("metadata", {}).get("heading", "")
        } for i in range(len(chunks))]
    )
    return len(chunks)
```
3. Naive RAG: The Retrieve-and-Generate Pattern
The simplest RAG implementation follows a straightforward pattern: embed the user query, retrieve the top-k most similar chunks from the vector database, concatenate them into a prompt, and pass the augmented prompt to the LLM. Despite its simplicity, this "naive RAG" approach delivers substantial improvements over ungrounded generation for many use cases.
```python
def naive_rag(query, k=5):
    """Simple retrieve-and-generate RAG pipeline."""
    # Step 1: Embed the query with the same model used at ingestion,
    # then retrieve the k nearest chunks. (Passing query_texts would
    # fall back to ChromaDB's default embedder, which does not live in
    # the same vector space as the text-embedding-3-small vectors
    # stored above.)
    query_embedding = client.embeddings.create(
        model="text-embedding-3-small",
        input=[query]
    ).data[0].embedding
    results = collection.query(
        query_embeddings=[query_embedding],
        n_results=k
    )
    retrieved_docs = results["documents"][0]
    sources = results["metadatas"][0]

    # Step 2: Build augmented prompt
    context = "\n\n---\n\n".join(
        [f"[Source: {s['source']}]\n{doc}"
         for doc, s in zip(retrieved_docs, sources)]
    )
    prompt = f"""Answer the question based on the provided context.
If the context does not contain enough information, say so clearly.
Cite the source documents used in your answer.

Context:
{context}

Question: {query}

Answer:"""

    # Step 3: Generate grounded response
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.1
    )
    return {
        "answer": response.choices[0].message.content,
        "sources": sources,
        "num_chunks_used": len(retrieved_docs)
    }
```
4. Context Window Management
Modern LLMs have large context windows (128K tokens for GPT-4o, 200K for Claude), but stuffing the entire context window with retrieved documents is rarely optimal. Research has revealed important patterns in how LLMs process long contexts that directly affect RAG system design.
4.1 The Lost-in-the-Middle Problem
Liu et al. (2024) demonstrated that LLMs attend more strongly to information at the beginning and end of their context, with reduced attention to content in the middle. This "U-shaped" attention pattern means that documents placed in the middle of a long context are more likely to be ignored. For RAG systems, this implies that simply concatenating many retrieved documents can actually hurt performance if the most relevant document ends up in the middle of the context.
Experiments show that LLMs correctly use information placed at position 1 or position 20 in a list of 20 documents roughly 80% of the time, but performance drops to around 60% for documents at positions 8 through 12. To mitigate this effect: (1) limit the number of retrieved documents to 3 to 5, (2) place the most relevant document first, and (3) consider reranking by relevance before context insertion.
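One lightweight mitigation is to reorder the ranked results so the strongest evidence sits at the context edges rather than the middle. The sketch below is an illustrative "sandwich" ordering, not a technique from the cited paper: odd ranks go to the front, even ranks to the back.

```python
def reorder_for_attention(docs_by_relevance):
    """Place the most relevant documents at the context edges,
    pushing the least relevant toward the middle, to work with
    the U-shaped attention pattern rather than against it."""
    front, back = [], []
    for i, doc in enumerate(docs_by_relevance):
        if i % 2 == 0:
            front.append(doc)   # ranks 1, 3, 5, ... at the start
        else:
            back.append(doc)    # ranks 2, 4, 6, ... at the end
    return front + back[::-1]   # reverse so rank 2 lands last
```

With five documents ranked d1 through d5, this yields the order d1, d3, d5, d4, d2: the top two results occupy the first and last positions, where attention is strongest.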
4.2 Optimal Context Sizing
```python
def build_context_with_budget(retrieved_chunks, token_budget=4000):
    """Pack chunks into the context respecting a token budget.
    Places highest-relevance chunks first (primacy effect)."""
    encoder = tiktoken.encoding_for_model("gpt-4o")
    context_parts = []
    total_tokens = 0
    for chunk in retrieved_chunks:  # Already sorted by relevance
        chunk_tokens = len(encoder.encode(chunk["text"]))
        if total_tokens + chunk_tokens > token_budget:
            break
        context_parts.append(chunk["text"])
        total_tokens += chunk_tokens
    return "\n\n---\n\n".join(context_parts), total_tokens
```
5. When RAG Beats Fine-Tuning
RAG and fine-tuning are complementary approaches to adapting LLMs, not competing ones. However, understanding when each approach is more appropriate helps practitioners avoid costly mistakes. The decision framework depends on several factors including knowledge volatility, the nature of the task, and available resources.
| Factor | Favor RAG | Favor Fine-Tuning |
|---|---|---|
| Knowledge freshness | Data changes frequently (news, docs) | Stable knowledge domain |
| Source attribution | Citations required | Attribution not needed |
| Data volume | Large corpus (thousands of docs) | Small, curated dataset |
| Task type | Factual Q&A, search, research | Style adaptation, format control |
| Latency tolerance | Slight additional latency acceptable | Minimal latency required |
| Hallucination risk | Must be minimized with evidence | Acceptable with guardrails |
| Cost model | Per-query retrieval cost | One-time training cost |
In practice, the best production systems often combine RAG and fine-tuning. Fine-tuning teaches the model how to use retrieved context effectively (following instructions, citing sources, admitting uncertainty), while RAG provides the what (the actual knowledge). This combination outperforms either approach alone for most enterprise applications.
6. Indexing Strategies for Large Corpora
When your knowledge base contains millions of documents, naive flat indexing becomes impractical. Several indexing strategies help maintain retrieval quality at scale.
6.1 Hierarchical Indexing
Hierarchical indexing creates multiple levels of abstraction. At the top level, document summaries are indexed. When a query matches a summary, the system then searches within that document's chunks for specific passages. This two-stage approach dramatically reduces the search space while maintaining recall.
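The two-stage flow can be sketched with plain Python and toy vectors. The `doc_index` structure (documents carrying a `summary_vec` plus a list of chunk vectors) is an assumption for illustration; in production both levels would be embedded with the same model and served by a vector database's approximate-nearest-neighbor index.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def hierarchical_search(query_vec, doc_index, top_docs=2, top_chunks=3):
    """Stage 1: rank documents by summary similarity.
    Stage 2: search chunks only within the top-ranked documents."""
    ranked_docs = sorted(
        doc_index,
        key=lambda d: cosine(query_vec, d["summary_vec"]),
        reverse=True
    )[:top_docs]
    candidates = [c for d in ranked_docs for c in d["chunks"]]
    return sorted(
        candidates,
        key=lambda c: cosine(query_vec, c["vec"]),
        reverse=True
    )[:top_chunks]
```

Because stage 2 only scores chunks from the surviving documents, the per-query work scales with the number of documents rather than the total number of chunks.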
6.2 Metadata Filtering
Adding metadata to chunks enables pre-retrieval filtering that narrows the search space before vector similarity is computed. Common metadata fields include document type, creation date, author, department, language, and topic tags. This filtering can be combined with vector search for efficient hybrid retrieval.
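The pattern reduces to "filter, then rank." Vector databases expose this natively (ChromaDB, for example, accepts a `where` argument on `query`); the self-contained sketch below makes the mechanics explicit with exact-match metadata filters and a dot-product score over toy vectors.

```python
def filtered_search(query_vec, chunks, metadata_filter, k=3):
    """Restrict candidates by exact-match metadata before computing
    vector similarity (a plain dot product on toy vectors here)."""
    def matches(meta):
        return all(meta.get(key) == val
                   for key, val in metadata_filter.items())

    def score(vec):
        return sum(q * v for q, v in zip(query_vec, vec))

    # Pre-filter: only chunks whose metadata satisfies every condition
    candidates = [c for c in chunks if matches(c["metadata"])]
    # Rank survivors by similarity and keep the top k
    return sorted(candidates, key=lambda c: score(c["vec"]),
                  reverse=True)[:k]
```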
7. Evaluation and Common Failure Modes
Evaluating a RAG system requires measuring both retrieval quality and generation quality independently. The RAG triad framework assesses three dimensions: context relevance (did we retrieve the right documents?), groundedness (does the answer stick to the retrieved context?), and answer relevance (does the answer address the original question?).
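The retrieval leg of the triad can be measured without an LLM at all, given a golden test set of queries with known-relevant chunk ids. The sketch below computes recall@k under that assumption (the `golden_set` schema and `retrieve_fn` signature are illustrative); groundedness and answer relevance typically require LLM-as-judge scoring on top.

```python
def retrieval_recall_at_k(golden_set, retrieve_fn, k=5):
    """Fraction of golden queries whose known-relevant chunk id
    appears among the top-k retrieved results."""
    hits = 0
    for example in golden_set:
        retrieved_ids = retrieve_fn(example["query"], k)
        if example["relevant_id"] in retrieved_ids:
            hits += 1
    return hits / len(golden_set)
```

Tracking this number separately from end-to-end answer quality makes it possible to tell whether a bad answer came from the retriever or the generator.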
7.1 Common Failure Modes
- Retrieval failure: The correct document exists in the knowledge base but is not retrieved, often because the query and document use different terminology.
- Context poisoning: Irrelevant or contradictory documents are retrieved, causing the LLM to generate incorrect answers grounded in bad context.
- Lost in the middle: The relevant document is retrieved but placed in the middle of the context where the LLM pays less attention to it.
- Abstention failure: The model generates a confident answer instead of admitting the context is insufficient to answer the question.
- Context overflow: Too many retrieved chunks exceed the token budget, causing truncation or performance degradation.
Tools like RAGAS (Retrieval Augmented Generation Assessment), TruLens, and DeepEval provide automated metrics for evaluating RAG pipelines. RAGAS computes faithfulness, answer relevance, and context precision scores using LLM-as-judge approaches. For production systems, a combination of automated metrics and human evaluation on a golden test set provides the most reliable quality signal.
Key Takeaways
- RAG = Retrieve + Augment + Generate: The pattern retrieves relevant documents, injects them into the prompt, and generates grounded responses. This addresses LLM knowledge cutoffs, incompleteness, and unverifiability.
- Ingestion quality determines retrieval quality: The chunking strategy, chunk size, overlap, metadata preservation, and embedding model choice all critically affect downstream retrieval performance.
- Context window management matters: The lost-in-the-middle effect means that simply adding more documents can hurt performance. Limit to 3 to 5 high-quality chunks and place the most relevant ones first.
- RAG and fine-tuning are complementary: Use fine-tuning to teach the model how to use context effectively; use RAG to supply the knowledge. The combination outperforms either alone.
- Evaluate both retrieval and generation: The RAG triad (context relevance, groundedness, answer relevance) provides a comprehensive framework for diagnosing failures at each pipeline stage.