Module 18 · Section 18.4

Document Processing & Chunking

Chunking strategies, document parsing pipelines, overlap design, and building production RAG ETL workflows
★ Big Picture

The quality of your RAG system is bounded by the quality of your chunks. No embedding model or vector database can compensate for poorly chunked documents. If a relevant answer spans two chunks that were split in the wrong place, the retriever will never surface it as a single coherent result. Document processing and chunking are where most RAG systems succeed or fail, yet they receive far less attention than model selection or index tuning. This section covers chunking strategies from basic to advanced, document parsing tools for complex formats, and the engineering of production-grade ingestion pipelines.

1. The Document Processing Pipeline

Before text can be embedded and indexed, raw documents must pass through a multi-stage processing pipeline. Each stage introduces potential failure modes that can degrade retrieval quality downstream.

  1. Loading: Reading raw files from various sources (file systems, S3, URLs, databases, APIs).
  2. Parsing: Extracting text and structure from complex formats (PDF, DOCX, HTML, slides, scanned images).
  3. Cleaning: Removing headers, footers, page numbers, boilerplate, and artifacts from parsing.
  4. Chunking: Splitting cleaned text into segments suitable for embedding and retrieval.
  5. Enrichment: Adding metadata (source, page number, section title, date) to each chunk.
  6. Embedding: Converting chunks to vectors using the selected embedding model.
  7. Indexing: Storing vectors and metadata in the vector database.
Figure 18.8: The RAG document ingestion pipeline from raw files to indexed vectors.
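The stages above can be sketched as a minimal linear pipeline. Everything here is illustrative: the cleaning rule, chunk size, and `Chunk` container are placeholder choices, and loading, parsing, embedding, and indexing are stubbed out.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Chunk:
    text: str
    metadata: dict = field(default_factory=dict)

def run_pipeline(raw_text: str, source: str) -> List[Chunk]:
    """Minimal clean -> chunk -> enrich pipeline over already-parsed text."""
    # Clean: strip whitespace and drop bare page-number lines
    lines = [ln.strip() for ln in raw_text.splitlines()]
    cleaned = " ".join(ln for ln in lines if ln and not ln.isdigit())

    # Chunk: naive fixed-size split (real systems use the smarter
    # strategies covered later in this section)
    size = 200
    pieces = [cleaned[i:i + size] for i in range(0, len(cleaned), size)]

    # Enrich: attach metadata to every chunk before embedding/indexing
    return [Chunk(text=p, metadata={"source": source, "chunk_index": i})
            for i, p in enumerate(pieces)]

chunks = run_pipeline("Page 1 content here.\n42\nMore content follows.", "doc.pdf")
print(len(chunks), chunks[0].metadata)
```

Each stage stays a pure function over the previous stage's output, which makes failures easy to isolate and individual stages easy to swap.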

2. Document Parsing

The PDF Challenge

PDFs are the most common and most difficult document format for RAG systems. A PDF is fundamentally a page layout format, not a text format. Text is stored as positioned glyphs on a page, with no inherent reading order, paragraph structure, or semantic hierarchy. Tables, multi-column layouts, headers, footers, and embedded images all require specialized handling. Scanned PDFs contain only images, requiring OCR before text extraction is even possible.

Parsing Tools

# Document parsing with Unstructured.io
from unstructured.partition.pdf import partition_pdf

# Parse a PDF with layout detection
elements = partition_pdf(
    filename="technical_report.pdf",
    strategy="hi_res",              # Use layout detection model
    infer_table_structure=True,     # Extract table structure
    include_page_breaks=True,       # Track page boundaries
)

# Inspect extracted elements
for element in elements[:10]:
    print(f"Type: {type(element).__name__:20s} | "
          f"Page: {element.metadata.page_number} | "
          f"Text: {str(element)[:60]}...")

# Filter by element type
from unstructured.documents.elements import Title, NarrativeText, Table

titles = [e for e in elements if isinstance(e, Title)]
text_blocks = [e for e in elements if isinstance(e, NarrativeText)]
tables = [e for e in elements if isinstance(e, Table)]

print(f"\nExtracted: {len(titles)} titles, "
      f"{len(text_blocks)} text blocks, "
      f"{len(tables)} tables")
Type: Title                | Page: 1 | Text: Technical Report: Vector Database Performance Analy...
Type: NarrativeText        | Page: 1 | Text: This report presents a comprehensive benchmark of v...
Type: NarrativeText        | Page: 1 | Text: We evaluated five vector database systems across th...
Type: Title                | Page: 2 | Text: Methodology...
Type: NarrativeText        | Page: 2 | Text: Our benchmark framework measures three primary dime...
Type: Table                | Page: 2 | Text: System | QPS | Recall@10 | P99 Latency...

Extracted: 8 titles, 24 text blocks, 3 tables
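Once elements are extracted, a common next step is to group them into title-bounded sections before chunking (Unstructured also ships a chunk_by_title helper for this pattern). A minimal sketch of the grouping logic, using plain (type, text) pairs in place of real element objects:

```python
from typing import List, Tuple

def group_by_title(elements: List[Tuple[str, str]]) -> List[str]:
    """Group parsed elements into title-bounded sections.
    Each element is a (type, text) pair, e.g. ("Title", "Methodology")."""
    sections, current = [], []
    for etype, text in elements:
        # A new title closes the previous section
        if etype == "Title" and current:
            sections.append("\n".join(current))
            current = []
        current.append(text)
    if current:
        sections.append("\n".join(current))
    return sections

elements = [
    ("Title", "Introduction"),
    ("NarrativeText", "This report presents a benchmark."),
    ("Title", "Methodology"),
    ("NarrativeText", "We measured QPS and recall."),
    ("Table", "System | QPS | Recall@10"),
]
sections = group_by_title(elements)
print(len(sections))                 # 2
print(sections[1].splitlines()[0])   # Methodology
```

Keeping the heading at the top of each section pays off twice: it anchors the chunk's embedding to its topic and survives as citation metadata.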

3. Chunking Strategies

🔑 Key Insight: The Chunking Dilemma

Chunk size involves a fundamental tradeoff. Smaller chunks (100 to 200 tokens) produce more precise embeddings because each chunk covers a single topic, improving retrieval precision. However, they may lack sufficient context for the LLM to generate a good answer. Larger chunks (500 to 1000 tokens) provide more context but may cover multiple topics, reducing embedding precision and retrieval recall. Most production systems settle on 256 to 512 tokens as a baseline, then tune based on evaluation results.
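Because the tradeoff is stated in tokens, chunkers should measure size in tokens rather than characters. A rough sketch using whitespace tokens as a stand-in; a real system would count with the embedding model's own tokenizer (e.g. via tiktoken):

```python
from typing import List

def token_chunk(text: str, max_tokens: int = 256, overlap: int = 32) -> List[str]:
    """Split on whitespace tokens as a rough proxy for model tokens."""
    tokens = text.split()
    step = max_tokens - overlap  # stride between chunk starts
    chunks = []
    for start in range(0, len(tokens), step):
        window = tokens[start:start + max_tokens]
        chunks.append(" ".join(window))
        if start + max_tokens >= len(tokens):
            break  # the window already reached the end of the text
    return chunks

# 600 synthetic tokens -> three 256-token windows with 32-token overlap
words = " ".join(f"w{i}" for i in range(600))
chunks = token_chunk(words, max_tokens=256, overlap=32)
print(len(chunks), len(chunks[0].split()))  # 3 256
```

Whitespace counts understate true token counts for subword tokenizers, so treat this only as a budget estimate when picking the 256-to-512-token baseline.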

Fixed-Size Chunking

The simplest approach splits text into chunks of a fixed number of characters or tokens. While naive, fixed-size chunking is fast, deterministic, and serves as a reasonable baseline.

# Fixed-size chunking with overlap
from typing import List

def fixed_size_chunk(
    text: str,
    chunk_size: int = 500,
    chunk_overlap: int = 50
) -> List[str]:
    """
    Split text into fixed-size chunks with overlap.

    Args:
        text: Input text to chunk
        chunk_size: Maximum characters per chunk
        chunk_overlap: Characters to overlap between consecutive chunks
    """
    chunks = []
    start = 0

    while start < len(text):
        end = start + chunk_size

        # If not the last chunk, try to break at a sentence boundary
        if end < len(text):
            # Look for a sentence boundary near the end of the window
            for boundary in [". ", ".\n", "? ", "! "]:
                last_boundary = text[start:end].rfind(boundary)
                if last_boundary > chunk_size * 0.5:
                    end = start + last_boundary + len(boundary)
                    break

        chunk = text[start:end].strip()
        if chunk:
            chunks.append(chunk)

        # Stop once the end of the text is reached; otherwise the final
        # overlap window would be re-emitted as a tiny duplicate chunk
        if end >= len(text):
            break

        # Move start position, accounting for overlap
        start = end - chunk_overlap

    return chunks

# Example
sample_text = """
Vector databases are specialized systems designed for storing and querying
high-dimensional vectors. They use approximate nearest neighbor algorithms
to find similar vectors efficiently.

The most common algorithm is HNSW, which builds a multi-layer graph structure.
Each layer connects vectors to their nearest neighbors, enabling fast navigation
from any starting point to the target region of the vector space.

Product Quantization reduces memory usage by compressing vectors. Each vector
is split into sub-vectors, and each sub-vector is replaced by its nearest
codebook entry. This can achieve 32x compression with acceptable accuracy loss.
"""

chunks = fixed_size_chunk(sample_text, chunk_size=200, chunk_overlap=30)
for i, chunk in enumerate(chunks):
    print(f"Chunk {i} ({len(chunk)} chars): {chunk[:70]}...")
Chunk 0 (198 chars): Vector databases are specialized systems designed for storing and query...
Chunk 1 (203 chars): The most common algorithm is HNSW, which builds a multi-layer graph st...
Chunk 2 (196 chars): Product Quantization reduces memory usage by compressing vectors. Each...

Recursive Character Splitting

Recursive character splitting (popularized by LangChain) attempts to split text at the most semantically meaningful boundary possible. It tries a hierarchy of separators: first by paragraph (\n\n), then by sentence (\n), then by word ( ), and finally by character. At each level, if a chunk exceeds the size limit, it is split using the next separator in the hierarchy.

# Recursive character text splitting (LangChain-style)
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=400,
    chunk_overlap=50,
    separators=["\n\n", "\n", ". ", " ", ""],
    length_function=len,
    is_separator_regex=False,
)

document = """# Introduction to Embeddings

Text embeddings convert natural language into dense vector representations.
These vectors capture semantic meaning, allowing mathematical operations
like cosine similarity to measure how related two pieces of text are.

## Training Approaches

Modern embedding models use contrastive learning. The model is trained to
produce similar vectors for semantically related text pairs and different
vectors for unrelated pairs. Hard negative mining improves training by
providing challenging negative examples that force the model to learn
fine-grained distinctions.

## Applications

Embeddings power semantic search, recommendation systems, clustering,
and retrieval-augmented generation. They serve as the foundation for
virtually every modern NLP application that requires understanding
meaning beyond keyword matching.
"""

chunks = splitter.split_text(document)
for i, chunk in enumerate(chunks):
    print(f"Chunk {i} ({len(chunk)} chars):")
    print(f"  {chunk[:80]}...")
    print()
Chunk 0 (251 chars):
  # Introduction to Embeddings Text embeddings convert natural language into den...

Chunk 1 (308 chars):
  ## Training Approaches Modern embedding models use contrastive learning. The ...

Chunk 2 (241 chars):
  ## Applications Embeddings power semantic search, recommendation systems, cl...

Semantic Chunking

Semantic chunking uses the embedding model itself to determine chunk boundaries. It computes embeddings for each sentence (or small segment), then identifies natural breakpoints where the cosine similarity between consecutive segments drops below a threshold. This produces chunks that are semantically coherent, with boundaries aligned to topic transitions.

# Semantic chunking based on embedding similarity
import numpy as np
from sentence_transformers import SentenceTransformer
from typing import List, Tuple
import re

def semantic_chunk(
    text: str,
    model: SentenceTransformer,
    threshold_percentile: int = 25,
    min_chunk_size: int = 100,
) -> List[str]:
    """
    Split text into semantically coherent chunks by detecting
    topic boundaries using embedding similarity.
    """
    # Split into sentences
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    sentences = [s for s in sentences if len(s) > 10]

    if len(sentences) <= 1:
        return [text]

    # Embed all sentences
    embeddings = model.encode(sentences, normalize_embeddings=True)

    # Compute cosine similarity between consecutive sentences
    similarities = []
    for i in range(len(embeddings) - 1):
        sim = np.dot(embeddings[i], embeddings[i + 1])
        similarities.append(sim)

    # Find breakpoints where similarity drops below threshold
    threshold = np.percentile(similarities, threshold_percentile)
    breakpoints = [i + 1 for i, sim in enumerate(similarities)
                   if sim < threshold]

    # Build chunks from breakpoints
    chunks = []
    start = 0
    for bp in breakpoints:
        chunk = " ".join(sentences[start:bp])
        if len(chunk) >= min_chunk_size:
            chunks.append(chunk)
            start = bp

    # Add remaining sentences
    final_chunk = " ".join(sentences[start:])
    if final_chunk:
        chunks.append(final_chunk)

    return chunks

# Example usage
model = SentenceTransformer("all-MiniLM-L6-v2")
text = """
Machine learning models learn patterns from data. They adjust internal
parameters to minimize prediction errors. The training process uses
gradient descent to iteratively improve the model.

Vector databases store high-dimensional vectors. They use algorithms like
HNSW for fast approximate nearest neighbor search. These systems are
critical for semantic search applications.

Python is the most popular language for data science. It provides libraries
like NumPy, pandas, and scikit-learn. The ecosystem continues to grow rapidly.
"""

chunks = semantic_chunk(text, model)
for i, chunk in enumerate(chunks):
    print(f"Semantic Chunk {i}: {chunk[:70]}...")

Structure-Aware Chunking

When documents have clear structural elements (headings, sections, subsections), the most effective strategy respects this structure. Structure-aware chunking uses document hierarchy to create chunks that align with the author's intended organization. A section with its heading forms a natural chunk; a table stays intact rather than being split across chunks.
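A minimal sketch of heading-bounded splitting for Markdown, keeping each heading as metadata for its section; frameworks such as LangChain provide a MarkdownHeaderTextSplitter with this behavior built in:

```python
import re
from typing import Dict, List

def split_by_headings(markdown: str) -> List[Dict[str, str]]:
    """Split a Markdown document at headings, keeping each section
    (heading + body) as one chunk with the heading as metadata."""
    sections = []
    heading, body = None, []
    for line in markdown.splitlines():
        m = re.match(r"^(#{1,6})\s+(.*)", line)
        if m:
            # A new heading closes the previous section
            if heading is not None or body:
                sections.append({"heading": heading or "",
                                 "text": "\n".join(body).strip()})
            heading, body = m.group(2), []
        else:
            body.append(line)
    sections.append({"heading": heading or "", "text": "\n".join(body).strip()})
    return sections

doc = """# Intro
Embeddings map text to vectors.

## Training
Contrastive learning with hard negatives.
"""
for s in split_by_headings(doc):
    print(s["heading"], "->", s["text"][:40])
```

Sections that exceed the token budget can then be sub-split with a recursive splitter, so structural boundaries are always respected first.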

Figure 18.9: Fixed-size chunking breaks at arbitrary points; recursive splitting respects paragraphs; structure-aware chunking preserves semantic units.

4. Overlap and Parent-Child Retrieval

Chunk Overlap

Adding overlap between consecutive chunks ensures that sentences at chunk boundaries appear with surrounding context in at least one chunk. A typical overlap of 10 to 20% of the chunk size (e.g., 50 to 100 tokens for a 500-token chunk) provides continuity without excessive duplication. Too much overlap wastes storage and can introduce near-duplicate retrieval results; too little risks losing context at boundaries.

Parent-Child (Small-to-Big) Retrieval

The parent-child strategy addresses the chunk-size dilemma by decoupling the retrieval unit from the context unit. Small chunks (child chunks, 100 to 200 tokens) are used for embedding and retrieval because their focused content produces precise embeddings. When a child chunk is retrieved, the system returns the larger parent chunk (500 to 1000 tokens) that contains it, providing the LLM with sufficient context to generate a high-quality answer.

# Parent-child chunking strategy
from langchain_text_splitters import RecursiveCharacterTextSplitter
from typing import List, Dict
import uuid

def create_parent_child_chunks(
    text: str,
    parent_chunk_size: int = 1000,
    child_chunk_size: int = 200,
    child_overlap: int = 20,
) -> List[Dict]:
    """
    Create a two-tier chunking structure for parent-child retrieval.

    Child chunks are used for embedding and retrieval.
    Parent chunks are returned for LLM context.
    """
    # Create parent chunks
    parent_splitter = RecursiveCharacterTextSplitter(
        chunk_size=parent_chunk_size,
        chunk_overlap=0,
    )
    parent_chunks = parent_splitter.split_text(text)

    all_chunks = []

    for parent_idx, parent_text in enumerate(parent_chunks):
        parent_id = str(uuid.uuid4())

        # Store parent chunk
        all_chunks.append({
            "id": parent_id,
            "text": parent_text,
            "type": "parent",
            "parent_id": None,
        })

        # Create child chunks from this parent
        child_splitter = RecursiveCharacterTextSplitter(
            chunk_size=child_chunk_size,
            chunk_overlap=child_overlap,
        )
        child_texts = child_splitter.split_text(parent_text)

        for child_idx, child_text in enumerate(child_texts):
            all_chunks.append({
                "id": str(uuid.uuid4()),
                "text": child_text,
                "type": "child",
                "parent_id": parent_id,
            })

    parents = [c for c in all_chunks if c["type"] == "parent"]
    children = [c for c in all_chunks if c["type"] == "child"]
    print(f"Created {len(parents)} parents, {len(children)} children")
    print(f"Avg parent size: {sum(len(p['text']) for p in parents) / len(parents):.0f} chars")
    print(f"Avg child size: {sum(len(c['text']) for c in children) / len(children):.0f} chars")

    return all_chunks

# Usage: embed children, retrieve parents
# At query time:
# 1. Search child embeddings for top-k matches
# 2. For each matching child, look up its parent_id
# 3. Return deduplicated parent chunks to the LLM
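The lookup in steps 2 and 3 can be sketched as a small helper over the chunk records produced above (the toy index and ids here are illustrative):

```python
from typing import Dict, List

def resolve_parents(child_hits: List[str], chunks: List[Dict]) -> List[Dict]:
    """Map retrieved child-chunk ids to their parent chunks,
    deduplicated and in first-hit order."""
    by_id = {c["id"]: c for c in chunks}
    seen, parents = set(), []
    for child_id in child_hits:
        parent_id = by_id[child_id]["parent_id"]
        if parent_id and parent_id not in seen:
            seen.add(parent_id)
            parents.append(by_id[parent_id])
    return parents

# Toy index: one parent with two children
chunks = [
    {"id": "p1", "text": "parent text", "type": "parent", "parent_id": None},
    {"id": "c1", "text": "child one", "type": "child", "parent_id": "p1"},
    {"id": "c2", "text": "child two", "type": "child", "parent_id": "p1"},
]
print([p["id"] for p in resolve_parents(["c2", "c1"], chunks)])  # ['p1']
```

Preserving first-hit order keeps the retriever's ranking intact, and deduplication matters because several children of one parent often match the same query.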
📘 Sentence Window Retrieval

A variation of parent-child retrieval is sentence window retrieval. Each sentence is embedded individually for maximum retrieval precision. When a sentence matches, the system returns a window of surrounding sentences (e.g., 3 sentences before and after) as context. This provides a fine-grained retrieval unit with a flexible context window, and it avoids the need to predefine parent chunk boundaries. LlamaIndex provides a built-in SentenceWindowNodeParser for this pattern.
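A sketch of the window-building step, using the same naive regex sentence splitter as the semantic chunking example above (a production pipeline would typically assemble the window at query time around each retrieved sentence):

```python
import re
from typing import List, Tuple

def sentence_windows(text: str, window: int = 3) -> List[Tuple[str, str]]:
    """For each sentence, pair the sentence (the retrieval unit) with
    a window of +/- `window` surrounding sentences (the context unit)."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    pairs = []
    for i, sent in enumerate(sentences):
        lo, hi = max(0, i - window), min(len(sentences), i + window + 1)
        pairs.append((sent, " ".join(sentences[lo:hi])))
    return pairs

text = "One. Two. Three. Four. Five. Six. Seven. Eight."
pairs = sentence_windows(text, window=1)
print(pairs[3])  # ('Four.', 'Three. Four. Five.')
```

The first element of each pair is embedded and indexed; the second is stored as metadata and handed to the LLM when that sentence matches.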

5. Chunking Strategy Comparison

Strategy        | Pros                                           | Cons                                          | Best For
----------------|------------------------------------------------|-----------------------------------------------|-----------------------------------
Fixed-size      | Simple, fast, predictable                      | Splits mid-sentence, ignores structure        | Baseline, homogeneous text
Recursive       | Respects natural boundaries, configurable      | May still break complex elements              | General purpose (default choice)
Semantic        | Topic-coherent chunks, data-driven boundaries  | Slower (requires embeddings), variable sizes  | Long-form content, mixed topics
Structure-aware | Preserves document hierarchy, best quality     | Requires structural parsing, format-specific  | Structured docs (manuals, reports)
Parent-child    | Precise retrieval with rich context            | More complex pipeline, extra storage          | High-stakes RAG applications
Sentence window | Maximum retrieval precision                    | Many embeddings, higher index cost            | Q&A over dense technical content

6. Production RAG ETL Pipelines

A production ingestion pipeline must handle document updates, deletions, and versioning in addition to initial loading. The key engineering challenges include:

Incremental Indexing

When documents are updated, you must re-chunk and re-embed only the changed documents, not the entire corpus. This requires tracking document versions (typically via content hashes or timestamps) and maintaining a mapping between source documents and their chunks in the vector database.

# Incremental indexing with content hashing
import hashlib
import json
from typing import Dict, List, Optional
from pathlib import Path

class IncrementalIndexer:
    """
    Tracks document versions to enable incremental re-indexing.
    Only processes documents that have changed since the last run.
    """

    def __init__(self, state_file: str = "indexer_state.json"):
        self.state_file = Path(state_file)
        self.state: Dict[str, str] = {}
        if self.state_file.exists():
            self.state = json.loads(self.state_file.read_text())

    def content_hash(self, content: str) -> str:
        return hashlib.sha256(content.encode()).hexdigest()

    def get_changes(
        self, documents: Dict[str, str]
    ) -> Dict[str, List[str]]:
        """
        Compare current documents against stored state.

        Args:
            documents: dict of {doc_id: content}

        Returns:
            {"added": [...], "modified": [...], "deleted": [...]}
        """
        current_ids = set(documents.keys())
        stored_ids = set(self.state.keys())

        added = current_ids - stored_ids
        deleted = stored_ids - current_ids
        modified = set()

        for doc_id in current_ids & stored_ids:
            new_hash = self.content_hash(documents[doc_id])
            if new_hash != self.state[doc_id]:
                modified.add(doc_id)

        return {
            "added": list(added),
            "modified": list(modified),
            "deleted": list(deleted),
        }

    def update_state(self, documents: Dict[str, str]):
        """Update stored hashes after successful indexing."""
        for doc_id, content in documents.items():
            self.state[doc_id] = self.content_hash(content)
        self.state_file.write_text(json.dumps(self.state, indent=2))

    def process_changes(self, documents: Dict[str, str]):
        """Main entry point for incremental processing."""
        changes = self.get_changes(documents)

        print(f"Added:    {len(changes['added'])} documents")
        print(f"Modified: {len(changes['modified'])} documents")
        print(f"Deleted:  {len(changes['deleted'])} documents")

        # For added/modified: chunk, embed, upsert
        to_process = changes["added"] + changes["modified"]
        if to_process:
            print(f"Processing {len(to_process)} documents...")
            # chunk_and_embed(to_process)
            # vector_db.upsert(chunks)

        # For deleted: remove from vector DB
        if changes["deleted"]:
            print(f"Removing {len(changes['deleted'])} documents...")
            # vector_db.delete(filter={"doc_id": {"$in": changes["deleted"]}})

        # For modified: also remove old chunks before upserting new ones
        if changes["modified"]:
            print(f"Replacing chunks for {len(changes['modified'])} documents...")
            # vector_db.delete(filter={"doc_id": {"$in": changes["modified"]}})
            # vector_db.upsert(new_chunks)

        self.update_state(documents)

# Usage
indexer = IncrementalIndexer()
docs = {
    "report_2024.pdf": "Full text of the 2024 report...",
    "manual_v3.pdf": "Updated product manual content...",
    "faq.md": "Frequently asked questions...",
}
indexer.process_changes(docs)
Added:    3 documents
Modified: 0 documents
Deleted:  0 documents
Processing 3 documents...

Metadata Enrichment

Every chunk should carry metadata that enables effective filtering and attribution. Essential metadata fields include:

  - Source identifier (file path, URL, or document ID) for attribution and incremental re-indexing.
  - Page number and section title, so answers can cite their exact location in the source.
  - Document date or version, enabling freshness filtering.
  - Access permissions, so retrieval can be filtered to what the querying user is allowed to see.
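One way to keep these fields consistent across the pipeline is a typed metadata record attached to every chunk; the field names below are illustrative, not a fixed schema:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ChunkMetadata:
    doc_id: str             # stable source-document identifier
    source: str             # file path or URL for attribution
    page: Optional[int]     # page number for citations
    section: Optional[str]  # nearest heading, kept for context
    ingested_at: str        # ISO timestamp of the indexing run
    acl: tuple = ()         # permission tags for filtered retrieval

meta = ChunkMetadata(
    doc_id="report_2024",
    source="s3://corpus/report_2024.pdf",
    page=7,
    section="Methodology",
    ingested_at="2024-06-01T12:00:00Z",
    acl=("engineering",),
)
print(meta.section, meta.page)  # Methodology 7
```

A typed record catches missing fields at ingestion time rather than at query time, when a chunk without a page number can no longer be attributed.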

⚠ Common Chunking Mistakes

The most common mistakes in document processing are:

  1. Not evaluating chunking quality by measuring retrieval performance with different strategies and parameters on representative queries.
  2. Ignoring document structure by applying the same chunking strategy to all document types.
  3. Losing metadata context by stripping headers, section titles, or table captions during chunking.
  4. Using your framework's default settings without tuning chunk size and overlap for your specific content and queries.
  5. Not handling tables and figures as special elements that should either be kept intact or described textually.

7. Evaluation and Iteration

Chunking is not a one-time configuration; it requires ongoing evaluation and tuning. The most effective approach is to build a small evaluation set of 50 to 100 representative queries with known relevant passages, then measure retrieval metrics (recall@k, MRR, NDCG) across different chunking configurations. Systematic A/B testing of chunking strategies often reveals that the optimal configuration depends heavily on the document type and query patterns specific to your application.
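The metrics themselves are simple to compute once you have ranked results and ground-truth relevance judgments per query. A minimal sketch of recall@k and MRR over a toy evaluation set (query and chunk ids are illustrative):

```python
from typing import Dict, List, Set

def recall_at_k(retrieved: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of relevant chunks found in the top-k results."""
    hits = sum(1 for doc in retrieved[:k] if doc in relevant)
    return hits / len(relevant)

def mrr(retrieved: List[str], relevant: Set[str]) -> float:
    """Reciprocal rank of the first relevant result (0 if none found)."""
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0

def evaluate(runs: Dict[str, List[str]],
             truth: Dict[str, Set[str]], k: int = 5):
    """Average recall@k and MRR over an evaluation set of queries."""
    n = len(runs)
    avg_recall = sum(recall_at_k(runs[q], truth[q], k) for q in runs) / n
    avg_mrr = sum(mrr(runs[q], truth[q]) for q in runs) / n
    return avg_recall, avg_mrr

runs = {"q1": ["c3", "c1", "c9"], "q2": ["c7", "c2", "c4"]}
truth = {"q1": {"c1"}, "q2": {"c4", "c8"}}
r, m = evaluate(runs, truth, k=3)
print(f"recall@3={r:.2f} mrr={m:.2f}")
```

Run the same evaluation set against each chunking configuration; the configuration is the only variable, so metric deltas are attributable to chunking alone.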

Figure 18.10: Chunking quality requires iterative evaluation against representative queries with ground-truth relevance judgments.

Section 18.4 Quiz

1. Why is chunking quality often the most important factor in RAG system performance?

Show Answer
Chunking determines what the retriever can find. If a relevant answer is split across two chunks at an unfortunate boundary, the retriever may never surface it as a coherent result. If a chunk mixes two unrelated topics, its embedding will be a noisy average that matches neither topic well. No embedding model or vector database can compensate for poorly chunked documents because the quality of retrieval is fundamentally bounded by the quality of the units being retrieved.

2. What is the fundamental tradeoff in choosing chunk size?

Show Answer
Smaller chunks (100 to 200 tokens) produce more focused embeddings that match specific queries precisely, improving retrieval precision. However, they may lack sufficient context for the LLM to generate a complete answer. Larger chunks (500 to 1000 tokens) provide richer context but may cover multiple topics, making their embeddings less precise and reducing retrieval recall. The parent-child strategy resolves this by using small chunks for retrieval and returning larger parent chunks for LLM context.

3. How does semantic chunking differ from recursive character splitting?

Show Answer
Recursive character splitting uses predefined text separators (paragraph breaks, newlines, spaces) in a hierarchical order to find chunk boundaries. It is rule-based and deterministic. Semantic chunking uses the embedding model itself to determine boundaries: it embeds each sentence, computes similarity between consecutive sentences, and splits where similarity drops significantly. This produces chunks aligned with actual topic transitions rather than formatting conventions. Semantic chunking is slower (it requires embedding all sentences) but produces more coherent chunks for content with complex topic structure.

4. How does parent-child retrieval solve the chunk-size dilemma?

Show Answer
Parent-child retrieval decouples the retrieval unit from the context unit. Small child chunks (100 to 200 tokens) are embedded and used for retrieval because their focused content produces precise, topic-specific embeddings. When a child chunk matches a query, the system looks up its associated parent chunk (500 to 1000 tokens) and returns the parent to the LLM. This provides the precision of small chunks for matching while giving the LLM the broader context of large chunks for answer generation.

5. What is incremental indexing and why is it necessary for production systems?

Show Answer
Incremental indexing tracks document versions (via content hashes or timestamps) and processes only documents that have been added, modified, or deleted since the last indexing run. It is necessary because re-processing an entire corpus on every update is expensive and slow. A production system with thousands of documents that change daily must detect which documents have changed, remove old chunks for modified or deleted documents, and insert new chunks, all without reprocessing unchanged documents. This requires maintaining a mapping between source documents and their chunks in the vector database.

Key Takeaways