RAG bridges the gap between what an LLM knows and what it needs to know. Rather than encoding all knowledge in model parameters, RAG retrieves relevant documents at inference time and injects them into the prompt. This simple idea yields enormous practical benefits: reduced hallucination, up-to-date information, domain-specific expertise, and full source attribution. Understanding the fundamental architecture, its failure modes, and when to choose RAG over fine-tuning is the foundation for everything else in this module.
1. Why Retrieval-Augmented Generation?
Large language models store knowledge implicitly in their parameters during pretraining. This parametric knowledge has three fundamental limitations. First, it has a knowledge cutoff: the model knows nothing about events after its training data was collected. Second, it is incomplete: no model can memorize every fact from its training corpus, especially rare or domain-specific information. Third, it is unverifiable: when a model generates a claim, there is no way to trace that claim back to a specific source document.
Retrieval-Augmented Generation, introduced by Lewis et al. (2020), addresses all three limitations by adding an explicit retrieval step before generation. The model receives both the user's query and a set of retrieved documents, then generates a response grounded in the retrieved evidence. This approach combines the generative fluency of LLMs with the factual precision of information retrieval systems.
1.1 The Core RAG Loop
Every RAG system follows the same fundamental loop: the user submits a query, the system retrieves relevant documents from a knowledge base, the retrieved documents are inserted into the LLM's context window along with the query, and the LLM generates a response grounded in the retrieved context. This loop can be as simple as a single retrieval step or as complex as a multi-turn agentic workflow with iterative refinement.
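The loop above can be sketched in a few lines. This is a schematic toy, not a production pipeline: retrieval here is plain keyword overlap standing in for the dense-embedding search covered in Section 2.3, and `generate_fn` is a placeholder for an LLM call.

```python
def rag_answer(query, knowledge_base, generate_fn, k=2):
    """Schematic single-pass RAG loop: retrieve, augment, generate.
    Retrieval is toy keyword overlap; real systems use embeddings."""
    def overlap(doc):
        # Count shared words between query and document
        return len(set(query.lower().split()) & set(doc.lower().split()))

    # Retrieve: top-k documents by overlap score
    retrieved = sorted(knowledge_base, key=overlap, reverse=True)[:k]
    # Augment: insert retrieved text into the prompt
    prompt = "Context:\n" + "\n".join(retrieved) + f"\n\nQuestion: {query}"
    # Generate: delegate to the LLM (stubbed here)
    return generate_fn(prompt)
```

Swapping `overlap` for vector similarity and `generate_fn` for a chat-completion call turns this sketch into the naive RAG pipeline of Section 3.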
2. The Ingestion Pipeline
Before retrieval can happen, documents must be processed and indexed. The ingestion pipeline transforms raw documents (PDFs, web pages, databases, Markdown files) into searchable chunks stored in a vector database. The quality of this pipeline directly determines the quality of retrieved results, making it one of the most important components of any RAG system.
2.1 Document Loading and Preprocessing
The first step is loading documents from their source format and extracting clean text. This involves handling diverse formats (PDF, HTML, DOCX, CSV), removing boilerplate content (headers, footers, navigation), preserving meaningful structure (headings, tables, lists), and extracting metadata (title, author, date, source URL) for later filtering.
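As a minimal illustration of boilerplate removal, the sketch below strips `nav`, `script`, and similar elements from HTML using only Python's standard-library parser. Production pipelines typically rely on dedicated extraction libraries; the tag list here is an assumption for the example.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, skipping common boilerplate elements."""
    SKIP = {"script", "style", "nav", "header", "footer"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0  # >0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())

def html_to_text(html: str) -> str:
    """Return cleaned visible text from an HTML string."""
    parser = TextExtractor()
    parser.feed(html)
    return "\n".join(parser.parts)
```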
2.2 Chunking Strategies
Raw documents are typically too long to fit in a single retrieval result. Chunking splits documents into smaller segments that can be independently embedded and retrieved. The choice of chunking strategy profoundly affects retrieval quality: chunks that are too small lose context, while chunks that are too large dilute relevance and waste context window space.
Common Chunking Approaches
```python
import tiktoken

def chunk_by_tokens(text, max_tokens=512, overlap=50):
    """Split text into chunks with token-level control."""
    encoder = tiktoken.encoding_for_model("gpt-4")
    tokens = encoder.encode(text)
    chunks = []
    start = 0
    while start < len(tokens):
        end = start + max_tokens
        chunk_tokens = tokens[start:end]
        chunk_text = encoder.decode(chunk_tokens)
        chunks.append({
            "text": chunk_text,
            "token_count": len(chunk_tokens),
            "start_token": start
        })
        start = end - overlap  # Slide window with overlap
    return chunks

def chunk_by_structure(markdown_text):
    """Split markdown by headings, preserving document structure."""
    sections = []
    current_section = {"heading": "", "content": []}
    for line in markdown_text.split("\n"):
        if line.startswith("#"):
            if current_section["content"]:
                sections.append(current_section)
            current_section = {
                "heading": line.strip("# "),
                "content": []
            }
        else:
            current_section["content"].append(line)
    if current_section["content"]:
        sections.append(current_section)
    return [{
        "text": "\n".join(s["content"]),
        "metadata": {"heading": s["heading"]}
    } for s in sections]
```
The optimal chunk size depends on your use case. For question-answering, 256 to 512 tokens works well because each chunk should contain a single coherent answer. For summarization, larger chunks (1024+ tokens) preserve more context. Always include overlap between consecutive chunks (10 to 15% of chunk size) to avoid splitting important information across chunk boundaries.
2.3 Embedding and Indexing
After chunking, each chunk is converted into a dense vector using an embedding model and stored in a vector database. The embedding model's quality is critical: it determines whether semantically similar queries and documents will have similar vector representations. Popular embedding models include OpenAI's text-embedding-3-small, Cohere's embed-v3, and open-source options like BAAI/bge-large-en-v1.5.
```python
from openai import OpenAI
import chromadb

client = OpenAI()
chroma = chromadb.PersistentClient(path="./chroma_db")
collection = chroma.get_or_create_collection(
    name="documents",
    metadata={"hnsw:space": "cosine"}
)

def ingest_chunks(chunks, source_doc):
    """Embed and store chunks in ChromaDB."""
    texts = [c["text"] for c in chunks]
    # Batch embed (max 2048 texts per API call)
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=texts
    )
    embeddings = [item.embedding for item in response.data]
    # Store with metadata for filtering
    collection.add(
        ids=[f"{source_doc}_chunk_{i}" for i in range(len(chunks))],
        embeddings=embeddings,
        documents=texts,
        metadatas=[{
            "source": source_doc,
            "chunk_index": i,
            "heading": chunks[i].get("metadata", {}).get("heading", "")
        } for i in range(len(chunks))]
    )
    return len(chunks)
```
3. Naive RAG: The Retrieve-and-Generate Pattern
The simplest RAG implementation follows a straightforward pattern: embed the user query, retrieve the top-k most similar chunks from the vector database, concatenate them into a prompt, and pass the augmented prompt to the LLM. Despite its simplicity, this "naive RAG" approach delivers substantial improvements over ungrounded generation for many use cases.
```python
def naive_rag(query, k=5):
    """Simple retrieve-and-generate RAG pipeline."""
    # Step 1: Embed the query with the same model used at ingestion,
    # then retrieve the k nearest chunks. (Passing query_texts would
    # fall back to ChromaDB's default embedder, which does not live in
    # the same vector space as the text-embedding-3-small vectors
    # stored above.)
    query_embedding = client.embeddings.create(
        model="text-embedding-3-small",
        input=[query]
    ).data[0].embedding
    results = collection.query(
        query_embeddings=[query_embedding],
        n_results=k
    )
    retrieved_docs = results["documents"][0]
    sources = results["metadatas"][0]

    # Step 2: Build augmented prompt
    context = "\n\n---\n\n".join(
        [f"[Source: {s['source']}]\n{doc}"
         for doc, s in zip(retrieved_docs, sources)]
    )
    prompt = f"""Answer the question based on the provided context.
If the context does not contain enough information, say so clearly.
Cite the source documents used in your answer.

Context:
{context}

Question: {query}

Answer:"""

    # Step 3: Generate grounded response
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.1
    )
    return {
        "answer": response.choices[0].message.content,
        "sources": sources,
        "num_chunks_used": len(retrieved_docs)
    }
```
4. Context Window Management
Modern LLMs have large context windows (128K tokens for GPT-4o, 200K for Claude), but stuffing the entire context window with retrieved documents is rarely optimal. Research has revealed important patterns in how LLMs process long contexts that directly affect RAG system design.
4.1 The Lost-in-the-Middle Problem
Liu et al. (2024) demonstrated that LLMs attend more strongly to information at the beginning and end of their context, with reduced attention to content in the middle. This "U-shaped" attention pattern means that documents placed in the middle of a long context are more likely to be ignored. For RAG systems, this implies that simply concatenating many retrieved documents can actually hurt performance if the most relevant document ends up in the middle of the context.
Experiments show that LLMs correctly use information placed at position 1 or position 20 in a list of 20 documents roughly 80% of the time, but performance drops to around 60% for documents at positions 8 through 12. To mitigate this effect: (1) limit the number of retrieved documents to 3 to 5, (2) place the most relevant document first, and (3) consider reranking by relevance before context insertion.
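One lightweight mitigation is to reorder the ranked results so the strongest evidence sits at the context edges rather than the middle. The sketch below is an illustrative "sandwich" ordering, not a technique from the cited paper: odd ranks go to the front, even ranks to the back.

```python
def reorder_for_attention(docs_by_relevance):
    """Place the most relevant documents at the context edges,
    pushing the least relevant toward the middle, to work with
    the U-shaped attention pattern rather than against it."""
    front, back = [], []
    for i, doc in enumerate(docs_by_relevance):
        if i % 2 == 0:
            front.append(doc)   # ranks 1, 3, 5, ... at the start
        else:
            back.append(doc)    # ranks 2, 4, 6, ... at the end
    return front + back[::-1]   # reverse so rank 2 lands last
```

With five documents ranked d1 through d5, this yields the order d1, d3, d5, d4, d2: the top two results occupy the first and last positions, where attention is strongest.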
4.2 Optimal Context Sizing
```python
def build_context_with_budget(retrieved_chunks, token_budget=4000):
    """Pack chunks into the context respecting a token budget.
    Places highest-relevance chunks first (primacy effect)."""
    encoder = tiktoken.encoding_for_model("gpt-4o")
    context_parts = []
    total_tokens = 0
    for chunk in retrieved_chunks:  # Already sorted by relevance
        chunk_tokens = len(encoder.encode(chunk["text"]))
        if total_tokens + chunk_tokens > token_budget:
            break
        context_parts.append(chunk["text"])
        total_tokens += chunk_tokens
    return "\n\n---\n\n".join(context_parts), total_tokens
```
5. When RAG Beats Fine-Tuning
RAG and fine-tuning are complementary approaches to adapting LLMs, not competing ones. However, understanding when each approach is more appropriate helps practitioners avoid costly mistakes. The decision framework depends on several factors including knowledge volatility, the nature of the task, and available resources.
| Factor | Favor RAG | Favor Fine-Tuning |
|---|---|---|
| Knowledge freshness | Data changes frequently (news, docs) | Stable knowledge domain |
| Source attribution | Citations required | Attribution not needed |
| Data volume | Large corpus (thousands of docs) | Small, curated dataset |
| Task type | Factual Q&A, search, research | Style adaptation, format control |
| Latency tolerance | Slight additional latency acceptable | Minimal latency required |
| Hallucination risk | Must be minimized with evidence | Acceptable with guardrails |
| Cost model | Per-query retrieval cost | One-time training cost |
In practice, the best production systems often combine RAG and fine-tuning. Fine-tuning teaches the model how to use retrieved context effectively (following instructions, citing sources, admitting uncertainty), while RAG provides the what (the actual knowledge). This combination outperforms either approach alone for most enterprise applications.
6. Indexing Strategies for Large Corpora
When your knowledge base contains millions of documents, naive flat indexing becomes impractical. Several indexing strategies help maintain retrieval quality at scale.
6.1 Hierarchical Indexing
Hierarchical indexing creates multiple levels of abstraction. At the top level, document summaries are indexed. When a query matches a summary, the system then searches within that document's chunks for specific passages. This two-stage approach dramatically reduces the search space while maintaining recall.
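The two-stage flow can be sketched with plain Python and toy vectors. The `doc_index` structure (documents carrying a `summary_vec` plus a list of chunk vectors) is an assumption for illustration; in production both levels would be embedded with the same model and served by a vector database's approximate-nearest-neighbor index.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def hierarchical_search(query_vec, doc_index, top_docs=2, top_chunks=3):
    """Stage 1: rank documents by summary similarity.
    Stage 2: search chunks only within the top-ranked documents."""
    ranked_docs = sorted(
        doc_index,
        key=lambda d: cosine(query_vec, d["summary_vec"]),
        reverse=True
    )[:top_docs]
    candidates = [c for d in ranked_docs for c in d["chunks"]]
    return sorted(
        candidates,
        key=lambda c: cosine(query_vec, c["vec"]),
        reverse=True
    )[:top_chunks]
```

Because stage 2 only scores chunks from the surviving documents, the per-query work scales with the number of documents rather than the total number of chunks.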
6.2 Metadata Filtering
Adding metadata to chunks enables pre-retrieval filtering that narrows the search space before vector similarity is computed. Common metadata fields include document type, creation date, author, department, language, and topic tags. This filtering can be combined with vector search for efficient hybrid retrieval.
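The pattern reduces to "filter, then rank." Vector databases expose this natively (ChromaDB, for example, accepts a `where` argument on `query`); the self-contained sketch below makes the mechanics explicit with exact-match metadata filters and a dot-product score over toy vectors.

```python
def filtered_search(query_vec, chunks, metadata_filter, k=3):
    """Restrict candidates by exact-match metadata before computing
    vector similarity (a plain dot product on toy vectors here)."""
    def matches(meta):
        return all(meta.get(key) == val
                   for key, val in metadata_filter.items())

    def score(vec):
        return sum(q * v for q, v in zip(query_vec, vec))

    # Pre-filter: only chunks whose metadata satisfies every condition
    candidates = [c for c in chunks if matches(c["metadata"])]
    # Rank survivors by similarity and keep the top k
    return sorted(candidates, key=lambda c: score(c["vec"]),
                  reverse=True)[:k]
```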
7. Evaluation and Common Failure Modes
Evaluating a RAG system requires measuring both retrieval quality and generation quality independently. The RAG triad framework assesses three dimensions: context relevance (did we retrieve the right documents?), groundedness (does the answer stick to the retrieved context?), and answer relevance (does the answer address the original question?).
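The retrieval leg of the triad can be measured without an LLM at all, given a golden test set of queries with known-relevant chunk ids. The sketch below computes recall@k under that assumption (the `golden_set` schema and `retrieve_fn` signature are illustrative); groundedness and answer relevance typically require LLM-as-judge scoring on top.

```python
def retrieval_recall_at_k(golden_set, retrieve_fn, k=5):
    """Fraction of golden queries whose known-relevant chunk id
    appears among the top-k retrieved results."""
    hits = 0
    for example in golden_set:
        retrieved_ids = retrieve_fn(example["query"], k)
        if example["relevant_id"] in retrieved_ids:
            hits += 1
    return hits / len(golden_set)
```

Tracking this number separately from end-to-end answer quality makes it possible to tell whether a bad answer came from the retriever or the generator.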
7.1 Common Failure Modes
- Retrieval failure: The correct document exists in the knowledge base but is not retrieved, often because the query and document use different terminology.
- Context poisoning: Irrelevant or contradictory documents are retrieved, causing the LLM to generate incorrect answers grounded in bad context.
- Lost in the middle: The relevant document is retrieved but placed in the middle of the context where the LLM pays less attention to it.
- Abstention failure: The model generates a confident answer instead of admitting the context is insufficient to answer the question.
- Context overflow: Too many retrieved chunks exceed the token budget, causing truncation or performance degradation.
Tools like RAGAS (Retrieval Augmented Generation Assessment), TruLens, and DeepEval provide automated metrics for evaluating RAG pipelines. RAGAS computes faithfulness, answer relevance, and context precision scores using LLM-as-judge approaches. For production systems, a combination of automated metrics and human evaluation on a golden test set provides the most reliable quality signal.
Key Takeaways
- RAG = Retrieve + Augment + Generate: The pattern retrieves relevant documents, injects them into the prompt, and generates grounded responses. This addresses LLM knowledge cutoffs, incompleteness, and unverifiability.
- Ingestion quality determines retrieval quality: The chunking strategy, chunk size, overlap, metadata preservation, and embedding model choice all critically affect downstream retrieval performance.
- Context window management matters: The lost-in-the-middle effect means that simply adding more documents can hurt performance. Limit to 3 to 5 high-quality chunks and place the most relevant ones first.
- RAG and fine-tuning are complementary: Use fine-tuning to teach the model how to use context effectively; use RAG to supply the knowledge. The combination outperforms either alone.
- Evaluate both retrieval and generation: The RAG triad (context relevance, groundedness, answer relevance) provides a comprehensive framework for diagnosing failures at each pipeline stage.