The quality of your RAG system is bounded by the quality of your chunks. No embedding model or vector database can compensate for poorly chunked documents. If a relevant answer spans two chunks that were split in the wrong place, the retriever will never surface it as a single coherent result. Document processing and chunking is where most RAG systems succeed or fail, yet it receives far less attention than model selection or index tuning. This section covers chunking strategies from basic to advanced, document parsing tools for complex formats, and the engineering of production-grade ingestion pipelines.
1. The Document Processing Pipeline
Before text can be embedded and indexed, raw documents must pass through a multi-stage processing pipeline. Each stage introduces potential failure modes that can degrade retrieval quality downstream.
- Loading: Reading raw files from various sources (file systems, S3, URLs, databases, APIs).
- Parsing: Extracting text and structure from complex formats (PDF, DOCX, HTML, slides, scanned images).
- Cleaning: Removing headers, footers, page numbers, boilerplate, and artifacts from parsing.
- Chunking: Splitting cleaned text into segments suitable for embedding and retrieval.
- Enrichment: Adding metadata (source, page number, section title, date) to each chunk.
- Embedding: Converting chunks to vectors using the selected embedding model.
- Indexing: Storing vectors and metadata in the vector database.
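The stages above can be sketched as a chain of small functions. This is a minimal illustration of the pipeline's shape, not a real implementation: the function names, the `Chunk` dataclass, and the placeholder bodies are all invented for this sketch.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Chunk:
    text: str
    metadata: Dict[str, str] = field(default_factory=dict)

def load(path: str) -> str:
    # Loading: read raw text from a source (placeholder)
    return "Raw document text from " + path

def clean(text: str) -> str:
    # Cleaning: strip boilerplate and normalize whitespace (placeholder)
    return " ".join(text.split())

def chunk(text: str, size: int = 50) -> List[Chunk]:
    # Chunking: naive fixed-size split, just for illustration
    return [Chunk(text[i:i + size]) for i in range(0, len(text), size)]

def enrich(chunks: List[Chunk], source: str) -> List[Chunk]:
    # Enrichment: attach source metadata to every chunk
    for c in chunks:
        c.metadata["source"] = source
    return chunks

# Embedding and indexing would follow, using your embedding model and vector DB
doc_chunks = enrich(chunk(clean(load("report.pdf"))), source="report.pdf")
print(len(doc_chunks), doc_chunks[0].metadata)
```

The point of the decomposition is that each stage can be tested and swapped independently; the real versions of these stages are developed throughout the rest of this section.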
2. Document Parsing
The PDF Challenge
PDFs are the most common and most difficult document format for RAG systems. A PDF is fundamentally a page layout format, not a text format. Text is stored as positioned glyphs on a page, with no inherent reading order, paragraph structure, or semantic hierarchy. Tables, multi-column layouts, headers, footers, and embedded images all require specialized handling. Scanned PDFs contain only images, requiring OCR before text extraction is even possible.
Parsing Tools
- PyPDF / pdfplumber: Basic Python libraries for text extraction from digital PDFs. Fast and lightweight, but struggle with complex layouts, tables, and multi-column text.
- Unstructured.io: An open-source library that combines multiple parsing backends (Tesseract OCR, Detectron2 layout detection) to handle diverse document types. Identifies elements such as titles, narrative text, tables, and images with layout-aware processing.
- LlamaParse: A cloud-based document parsing service from LlamaIndex that uses LLMs to understand document structure. Excels at tables, charts, and complex layouts but introduces latency and API costs.
- Docling: An open-source document parser from IBM that uses vision models for layout analysis. Handles PDFs, DOCX, PPTX, and HTML with high-fidelity structure extraction.
```python
# Document parsing with Unstructured.io
from unstructured.partition.pdf import partition_pdf
from unstructured.documents.elements import Title, NarrativeText, Table

# Parse a PDF with layout detection
elements = partition_pdf(
    filename="technical_report.pdf",
    strategy="hi_res",            # Use layout detection model
    infer_table_structure=True,   # Extract table structure
    include_page_breaks=True,     # Track page boundaries
)

# Inspect extracted elements
for element in elements[:10]:
    print(f"Type: {type(element).__name__:20s} | "
          f"Page: {element.metadata.page_number} | "
          f"Text: {str(element)[:60]}...")

# Filter by element type
titles = [e for e in elements if isinstance(e, Title)]
text_blocks = [e for e in elements if isinstance(e, NarrativeText)]
tables = [e for e in elements if isinstance(e, Table)]

print(f"\nExtracted: {len(titles)} titles, "
      f"{len(text_blocks)} text blocks, "
      f"{len(tables)} tables")
```
3. Chunking Strategies
Chunk size involves a fundamental tradeoff. Smaller chunks (100 to 200 tokens) produce more precise embeddings because each chunk covers a single topic, improving retrieval precision. However, they may lack sufficient context for the LLM to generate a good answer. Larger chunks (500 to 1000 tokens) provide more context but may cover multiple topics, reducing embedding precision and retrieval recall. Most production systems settle on 256 to 512 tokens as a baseline, then tune based on evaluation results.
Fixed-Size Chunking
The simplest approach splits text into chunks of a fixed number of characters or tokens. While naive, fixed-size chunking is fast, deterministic, and serves as a reasonable baseline.
```python
# Fixed-size chunking with overlap
from typing import List

def fixed_size_chunk(
    text: str,
    chunk_size: int = 500,
    chunk_overlap: int = 50,
) -> List[str]:
    """
    Split text into fixed-size chunks with overlap.

    Args:
        text: Input text to chunk
        chunk_size: Maximum characters per chunk
        chunk_overlap: Characters to overlap between consecutive chunks
    """
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        # If not the last chunk, try to break at a sentence boundary
        if end < len(text):
            # Look for a sentence boundary in the second half of the window
            for boundary in [". ", ".\n", "? ", "! "]:
                last_boundary = text[start:end].rfind(boundary)
                if last_boundary > chunk_size * 0.5:
                    end = start + last_boundary + len(boundary)
                    break
        chunk = text[start:end].strip()
        if chunk:
            chunks.append(chunk)
        if end >= len(text):
            break  # Done: stepping back for overlap here would duplicate the tail
        # Move the start position back to create the overlap
        start = end - chunk_overlap
    return chunks

# Example
sample_text = """
Vector databases are specialized systems designed for storing and querying
high-dimensional vectors. They use approximate nearest neighbor algorithms
to find similar vectors efficiently.

The most common algorithm is HNSW, which builds a multi-layer graph structure.
Each layer connects vectors to their nearest neighbors, enabling fast navigation
from any starting point to the target region of the vector space.

Product Quantization reduces memory usage by compressing vectors. Each vector
is split into sub-vectors, and each sub-vector is replaced by its nearest
codebook entry. This can achieve 32x compression with acceptable accuracy loss.
"""

chunks = fixed_size_chunk(sample_text, chunk_size=200, chunk_overlap=30)
for i, chunk in enumerate(chunks):
    print(f"Chunk {i} ({len(chunk)} chars): {chunk[:70]}...")
```
Recursive Character Splitting
Recursive character splitting (popularized by LangChain) attempts to split text at the most semantically meaningful boundary possible. It tries a hierarchy of separators: first paragraphs (`\n\n`), then lines (`\n`), then words (`" "`), and finally individual characters. At each level, if a chunk exceeds the size limit, it is split using the next separator in the hierarchy.
```python
# Recursive character text splitting (LangChain-style)
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=400,
    chunk_overlap=50,
    separators=["\n\n", "\n", ". ", " ", ""],
    length_function=len,
    is_separator_regex=False,
)

document = """# Introduction to Embeddings

Text embeddings convert natural language into dense vector representations.
These vectors capture semantic meaning, allowing mathematical operations
like cosine similarity to measure how related two pieces of text are.

## Training Approaches

Modern embedding models use contrastive learning. The model is trained to
produce similar vectors for semantically related text pairs and different
vectors for unrelated pairs. Hard negative mining improves training by
providing challenging negative examples that force the model to learn
fine-grained distinctions.

## Applications

Embeddings power semantic search, recommendation systems, clustering,
and retrieval-augmented generation. They serve as the foundation for
virtually every modern NLP application that requires understanding
meaning beyond keyword matching.
"""

chunks = splitter.split_text(document)
for i, chunk in enumerate(chunks):
    print(f"Chunk {i} ({len(chunk)} chars):")
    print(f"  {chunk[:80]}...")
    print()
```
Semantic Chunking
Semantic chunking uses the embedding model itself to determine chunk boundaries. It computes embeddings for each sentence (or small segment), then identifies natural breakpoints where the cosine similarity between consecutive segments drops below a threshold. This produces chunks that are semantically coherent, with boundaries aligned to topic transitions.
```python
# Semantic chunking based on embedding similarity
import re
from typing import List

import numpy as np
from sentence_transformers import SentenceTransformer

def semantic_chunk(
    text: str,
    model: SentenceTransformer,
    threshold_percentile: int = 25,
    min_chunk_size: int = 100,
) -> List[str]:
    """
    Split text into semantically coherent chunks by detecting
    topic boundaries using embedding similarity.
    """
    # Split into sentences
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    sentences = [s for s in sentences if len(s) > 10]
    if len(sentences) <= 1:
        return [text]

    # Embed all sentences (normalized, so dot product == cosine similarity)
    embeddings = model.encode(sentences, normalize_embeddings=True)

    # Compute cosine similarity between consecutive sentences
    similarities = [
        float(np.dot(embeddings[i], embeddings[i + 1]))
        for i in range(len(embeddings) - 1)
    ]

    # Find breakpoints where similarity drops below the percentile threshold
    threshold = np.percentile(similarities, threshold_percentile)
    breakpoints = [i + 1 for i, sim in enumerate(similarities) if sim < threshold]

    # Build chunks from breakpoints; a segment below min_chunk_size is
    # merged into the following chunk rather than emitted on its own
    chunks = []
    start = 0
    for bp in breakpoints:
        chunk = " ".join(sentences[start:bp])
        if len(chunk) >= min_chunk_size:
            chunks.append(chunk)
            start = bp

    # Add the remaining sentences
    final_chunk = " ".join(sentences[start:])
    if final_chunk:
        chunks.append(final_chunk)
    return chunks

# Example usage
model = SentenceTransformer("all-MiniLM-L6-v2")
text = """
Machine learning models learn patterns from data. They adjust internal
parameters to minimize prediction errors. The training process uses
gradient descent to iteratively improve the model.
Vector databases store high-dimensional vectors. They use algorithms like
HNSW for fast approximate nearest neighbor search. These systems are
critical for semantic search applications.
Python is the most popular language for data science. It provides libraries
like NumPy, pandas, and scikit-learn. The ecosystem continues to grow rapidly.
"""
chunks = semantic_chunk(text, model)
for i, chunk in enumerate(chunks):
    print(f"Semantic Chunk {i}: {chunk[:70]}...")
```
Structure-Aware Chunking
When documents have clear structural elements (headings, sections, subsections), the most effective strategy respects this structure. Structure-aware chunking uses document hierarchy to create chunks that align with the author's intended organization. A section with its heading forms a natural chunk; a table stays intact rather than being split across chunks.
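The idea can be sketched for Markdown using only the standard library. The `split_by_headings` helper below is hypothetical (not a library API): it produces one chunk per heading-delimited section and keeps each section's heading as metadata, which is the core of structure-aware chunking.

```python
import re
from typing import Dict, List

def split_by_headings(markdown: str) -> List[Dict]:
    """Split Markdown into one chunk per heading-delimited section,
    keeping each section's heading as metadata."""
    chunks: List[Dict] = []
    title = None
    lines: List[str] = []

    def flush():
        # Emit the accumulated section, if it has any content
        text = "\n".join(lines).strip()
        if text:
            chunks.append({"section": title, "text": text})

    for line in markdown.splitlines():
        m = re.match(r"^(#{1,6})\s+(.*)", line)
        if m:
            flush()              # close the previous section
            title = m.group(2)   # this heading labels the next section
            lines = []
        else:
            lines.append(line)
    flush()                      # close the final section
    return chunks

doc = "# Setup\nInstall the package.\n## Configuration\nEdit the config file."
for c in split_by_headings(doc):
    print(c["section"], "->", c["text"])
```

A production version would track the full heading path (H1 > H2 > H3) and fall back to size-based splitting for oversized sections; libraries such as LangChain and LlamaIndex ship header-aware splitters that do this for Markdown and HTML.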
4. Overlap and Parent-Child Retrieval
Chunk Overlap
Adding overlap between consecutive chunks ensures that sentences near chunk boundaries appear with surrounding context in at least one chunk. A typical overlap of 10 to 20% of the chunk size (e.g., 50 to 100 tokens for a 500-token chunk) provides continuity without excessive duplication. Too much overlap wastes storage and can introduce duplicate results; too little risks losing context at boundaries.
Parent-Child (Small-to-Big) Retrieval
The parent-child strategy addresses the chunk-size dilemma by decoupling the retrieval unit from the context unit. Small chunks (child chunks, 100 to 200 tokens) are used for embedding and retrieval because their focused content produces precise embeddings. When a child chunk is retrieved, the system returns the larger parent chunk (500 to 1000 tokens) that contains it, providing the LLM with sufficient context to generate a high-quality answer.
```python
# Parent-child chunking strategy
import uuid
from typing import Dict, List

from langchain_text_splitters import RecursiveCharacterTextSplitter

def create_parent_child_chunks(
    text: str,
    parent_chunk_size: int = 1000,
    child_chunk_size: int = 200,
    child_overlap: int = 20,
) -> List[Dict]:
    """
    Create a two-tier chunking structure for parent-child retrieval.
    Child chunks are used for embedding and retrieval.
    Parent chunks are returned for LLM context.
    """
    parent_splitter = RecursiveCharacterTextSplitter(
        chunk_size=parent_chunk_size,
        chunk_overlap=0,
    )
    child_splitter = RecursiveCharacterTextSplitter(
        chunk_size=child_chunk_size,
        chunk_overlap=child_overlap,
    )

    all_chunks = []
    for parent_text in parent_splitter.split_text(text):
        parent_id = str(uuid.uuid4())
        # Store the parent chunk
        all_chunks.append({
            "id": parent_id,
            "text": parent_text,
            "type": "parent",
            "parent_id": None,
        })
        # Create child chunks from this parent
        for child_text in child_splitter.split_text(parent_text):
            all_chunks.append({
                "id": str(uuid.uuid4()),
                "text": child_text,
                "type": "child",
                "parent_id": parent_id,
            })

    parents = [c for c in all_chunks if c["type"] == "parent"]
    children = [c for c in all_chunks if c["type"] == "child"]
    print(f"Created {len(parents)} parents, {len(children)} children")
    print(f"Avg parent size: {sum(len(p['text']) for p in parents) / len(parents):.0f} chars")
    print(f"Avg child size: {sum(len(c['text']) for c in children) / len(children):.0f} chars")
    return all_chunks

# Usage: embed children, retrieve parents.
# At query time:
#   1. Search child embeddings for the top-k matches
#   2. For each matching child, look up its parent_id
#   3. Return deduplicated parent chunks to the LLM
```
A variation of parent-child retrieval is sentence window retrieval. Each sentence is embedded individually for maximum retrieval precision. When a sentence matches, the system returns a window of surrounding sentences (e.g., 3 sentences before and after) as context. This provides a fine-grained retrieval unit with a flexible context window, and it avoids the need to predefine parent chunk boundaries. LlamaIndex provides a built-in SentenceWindowNodeParser for this pattern.
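The pattern behind sentence window retrieval fits in a few lines. This is a standard-library sketch of the idea, not LlamaIndex's implementation; the helper name and dictionary keys are invented here. Each unit pairs the text to embed (one sentence) with the text to return (the window).

```python
import re
from typing import Dict, List

def sentence_windows(text: str, window: int = 2) -> List[Dict]:
    """Build sentence-window retrieval units: each sentence is embedded
    on its own, but carries a window of neighbors as its context."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    units = []
    for i, sent in enumerate(sentences):
        lo = max(0, i - window)
        hi = min(len(sentences), i + window + 1)
        units.append({
            "embed_text": sent,                     # embedded for retrieval
            "context": " ".join(sentences[lo:hi]),  # returned to the LLM
        })
    return units

for u in sentence_windows("HNSW builds a graph. Each layer links neighbors. "
                          "Search descends the layers. PQ compresses vectors.",
                          window=1):
    print(u["embed_text"], "|", u["context"])
```

At index time you embed `embed_text` and store `context` alongside it as metadata, so no second lookup is needed at query time; the cost is one embedding per sentence and some duplicated stored text.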
5. Chunking Strategy Comparison
| Strategy | Pros | Cons | Best For |
|---|---|---|---|
| Fixed-size | Simple, fast, predictable | Splits mid-sentence, ignores structure | Baseline, homogeneous text |
| Recursive | Respects natural boundaries, configurable | May still break complex elements | General purpose (default choice) |
| Semantic | Topic-coherent chunks, data-driven boundaries | Slower (requires embeddings), variable sizes | Long-form content, mixed topics |
| Structure-aware | Preserves document hierarchy, best quality | Requires structural parsing, format-specific | Structured docs (manuals, reports) |
| Parent-child | Precise retrieval with rich context | More complex pipeline, extra storage | High-stakes RAG applications |
| Sentence window | Maximum retrieval precision | Many embeddings, higher index cost | Q&A over dense technical content |
6. Production RAG ETL Pipelines
A production ingestion pipeline must handle document updates, deletions, and versioning in addition to initial loading. The key engineering challenges include:
Incremental Indexing
When documents are updated, you must re-chunk and re-embed only the changed documents, not the entire corpus. This requires tracking document versions (typically via content hashes or timestamps) and maintaining a mapping between source documents and their chunks in the vector database.
```python
# Incremental indexing with content hashing
import hashlib
import json
from pathlib import Path
from typing import Dict, List

class IncrementalIndexer:
    """
    Tracks document versions to enable incremental re-indexing.
    Only processes documents that have changed since the last run.
    """

    def __init__(self, state_file: str = "indexer_state.json"):
        self.state_file = Path(state_file)
        self.state: Dict[str, str] = {}
        if self.state_file.exists():
            self.state = json.loads(self.state_file.read_text())

    def content_hash(self, content: str) -> str:
        return hashlib.sha256(content.encode()).hexdigest()

    def get_changes(self, documents: Dict[str, str]) -> Dict[str, List[str]]:
        """
        Compare current documents against stored state.

        Args:
            documents: dict of {doc_id: content}
        Returns:
            {"added": [...], "modified": [...], "deleted": [...]}
        """
        current_ids = set(documents.keys())
        stored_ids = set(self.state.keys())
        added = current_ids - stored_ids
        deleted = stored_ids - current_ids
        modified = set()
        for doc_id in current_ids & stored_ids:
            if self.content_hash(documents[doc_id]) != self.state[doc_id]:
                modified.add(doc_id)
        return {
            "added": list(added),
            "modified": list(modified),
            "deleted": list(deleted),
        }

    def update_state(self, documents: Dict[str, str]):
        """Rebuild stored hashes after successful indexing.
        Rebuilding (rather than merging into the old state) drops entries
        for deleted documents, so they are not reported again next run."""
        self.state = {
            doc_id: self.content_hash(content)
            for doc_id, content in documents.items()
        }
        self.state_file.write_text(json.dumps(self.state, indent=2))

    def process_changes(self, documents: Dict[str, str]):
        """Main entry point for incremental processing."""
        changes = self.get_changes(documents)
        print(f"Added:    {len(changes['added'])} documents")
        print(f"Modified: {len(changes['modified'])} documents")
        print(f"Deleted:  {len(changes['deleted'])} documents")

        # For added/modified: chunk, embed, upsert
        to_process = changes["added"] + changes["modified"]
        if to_process:
            print(f"Processing {len(to_process)} documents...")
            # chunk_and_embed(to_process)
            # vector_db.upsert(chunks)

        # For deleted: remove from the vector DB
        if changes["deleted"]:
            print(f"Removing {len(changes['deleted'])} documents...")
            # vector_db.delete(filter={"doc_id": {"$in": changes["deleted"]}})

        # For modified: also remove old chunks before upserting new ones
        if changes["modified"]:
            print(f"Replacing chunks for {len(changes['modified'])} documents...")
            # vector_db.delete(filter={"doc_id": {"$in": changes["modified"]}})
            # vector_db.upsert(new_chunks)

        self.update_state(documents)

# Usage
indexer = IncrementalIndexer()
docs = {
    "report_2024.pdf": "Full text of the 2024 report...",
    "manual_v3.pdf": "Updated product manual content...",
    "faq.md": "Frequently asked questions...",
}
indexer.process_changes(docs)
```
Metadata Enrichment
Every chunk should carry metadata that enables effective filtering and attribution. Essential metadata fields include:
- Source: The original file name or URL for citation and deduplication.
- Page/section: Location within the source document for precise references.
- Title hierarchy: Section and subsection headings for contextual understanding.
- Date: Creation or last-modified date for recency filtering.
- Document type: Category labels (policy, FAQ, report, transcript) for scoped search.
- Access permissions: User or group identifiers for access-controlled retrieval.
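As an illustration, a chunk record carrying these fields might look like the following. All values and field names here are hypothetical; adapt them to the payload schema your vector database expects.

```python
# Hypothetical chunk payload in the shape most vector DBs accept for upserts
chunk_record = {
    "id": "manual_v3-p12-c04",  # stable id: doc, page, chunk index
    "text": "To reset the device, hold the power button for ten seconds...",
    "metadata": {
        "source": "manual_v3.pdf",                         # citation / dedup
        "page": 12,                                        # precise reference
        "section": "Troubleshooting > Resetting the Device",  # title hierarchy
        "date": "2024-03-18",                              # recency filtering
        "doc_type": "manual",                              # scoped search
        "allowed_groups": ["support", "engineering"],      # access control
    },
}

# At query time, metadata enables filtered search, e.g. (pseudo-filter):
# vector_db.search(query_vec, filter={"doc_type": "manual", "date": {"$gte": "2024-01-01"}})
print(chunk_record["metadata"]["section"])
```

Storing the title hierarchy as a single `"A > B > C"` string keeps it human-readable in citations, while the structured fields (`page`, `date`, `doc_type`) stay filterable.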
The most common mistakes in document processing are:
- Not evaluating chunking quality by measuring retrieval performance with different strategies and parameters on representative queries.
- Ignoring document structure by applying the same chunking strategy to all document types.
- Losing metadata context by stripping headers, section titles, or table captions during chunking.
- Using the default settings of your framework without tuning chunk size and overlap for your specific content and queries.
- Not handling tables and figures as special elements that should either be kept intact or described textually.
7. Evaluation and Iteration
Chunking is not a one-time configuration; it requires ongoing evaluation and tuning. The most effective approach is to build a small evaluation set of 50 to 100 representative queries with known relevant passages, then measure retrieval metrics (recall@k, MRR, NDCG) across different chunking configurations. Systematic A/B testing of chunking strategies often reveals that the optimal configuration depends heavily on the document type and query patterns specific to your application.
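The measurement loop itself is simple. Below is a minimal sketch of a recall@k harness, assuming you already have ranked chunk ids from your retriever and a hand-labeled gold set; the helper names and data shapes are invented for this example.

```python
from typing import Dict, List, Set

def recall_at_k(retrieved: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the known-relevant passages found in the top-k results."""
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / len(relevant) if relevant else 0.0

def mean_recall_at_k(results: Dict[str, List[str]],
                     gold: Dict[str, Set[str]],
                     k: int = 5) -> float:
    """Average recall@k over the evaluation set.
    `results` maps query -> ranked chunk ids from the retriever;
    `gold` maps query -> ids of the passages known to be relevant."""
    scores = [recall_at_k(results.get(q, []), rel, k) for q, rel in gold.items()]
    return sum(scores) / len(scores)

# Toy evaluation: compare two chunking configurations on the same queries
gold = {"q1": {"a", "b"}, "q2": {"c"}}
config_a = {"q1": ["a", "x", "b"], "q2": ["y", "z"]}
config_b = {"q1": ["a", "b", "x"], "q2": ["c", "y"]}
print("config A:", mean_recall_at_k(config_a, gold, k=3))
print("config B:", mean_recall_at_k(config_b, gold, k=3))
```

Run the same harness for each chunking configuration (strategy, size, overlap) and keep the one with the best scores; MRR and NDCG follow the same pattern with rank-weighted scoring.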
Section 18.4 Quiz
1. Why is chunking quality often the most important factor in RAG system performance?
2. What is the fundamental tradeoff in choosing chunk size?
3. How does semantic chunking differ from recursive character splitting?
4. How does parent-child retrieval solve the chunk-size dilemma?
5. What is incremental indexing and why is it necessary for production systems?
Key Takeaways
- Chunking quality bounds RAG quality. No downstream component can compensate for chunks that split relevant information or mix unrelated topics.
- Recursive character splitting is the best default for most text content, balancing simplicity with respect for natural text boundaries.
- Semantic chunking produces the most coherent chunks by detecting topic boundaries via embedding similarity, at the cost of additional computation.
- Structure-aware chunking is essential for formatted documents (PDFs, HTML, Markdown) where headings, tables, and figures define natural semantic units.
- Parent-child retrieval resolves the chunk-size tradeoff by using small chunks for precise retrieval and large chunks for LLM context.
- Always enrich chunks with metadata (source, page, section title, date) to enable filtered search and proper attribution.
- Build an evaluation set of representative queries with known relevant passages, and systematically test chunking configurations against retrieval metrics.
- Incremental indexing with content hashing is essential for production pipelines that process evolving document collections.