RAG frameworks transform weeks of plumbing into hours of configuration. Building a production RAG system from raw API calls requires wiring together embedding models, vector stores, retrievers, rerankers, prompt templates, and LLM calls. Frameworks like LangChain, LlamaIndex, and Haystack provide pre-built abstractions for these components, letting you swap implementations without rewriting your pipeline. Understanding each framework's philosophy, strengths, and trade-offs is essential for choosing the right tool (or deciding to go without one entirely).
1. Why Use a RAG Framework?
A minimal RAG pipeline requires at least five distinct operations: loading documents, splitting them into chunks, computing embeddings, storing vectors in a database, and orchestrating retrieval with LLM generation. Each of these steps has multiple implementation choices (sentence splitters vs. recursive splitters, OpenAI embeddings vs. Cohere embeddings, Pinecone vs. Chroma vs. pgvector). Without a framework, every component switch requires rewriting integration code.
RAG frameworks solve this by providing a common interface layer. A retriever is a retriever regardless of whether it queries Pinecone or Weaviate underneath. A text splitter is a text splitter whether it uses token counts or recursive character boundaries. This abstraction brings three concrete benefits: faster prototyping, easier component swapping during evaluation, and a shared vocabulary that simplifies team communication.
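The value of a common interface layer can be sketched in plain Python: any object that satisfies a `retrieve(query, k)` contract is interchangeable, regardless of what backend sits underneath. The following is a minimal, framework-agnostic illustration (the class and function names are invented for this sketch, not taken from any framework):

```python
from typing import Protocol

class Retriever(Protocol):
    """Anything with this method can serve as a retriever."""
    def retrieve(self, query: str, k: int = 5) -> list[str]: ...

class KeywordRetriever:
    """Toy backend: rank documents by word overlap with the query."""
    def __init__(self, docs: list[str]):
        self.docs = docs

    def retrieve(self, query: str, k: int = 5) -> list[str]:
        words = set(query.lower().split())
        scored = sorted(
            self.docs,
            key=lambda d: len(words & set(d.lower().split())),
            reverse=True,
        )
        return scored[:k]

class StaticRetriever:
    """Toy stand-in for a vector store that returns fixed docs."""
    def __init__(self, docs: list[str]):
        self.docs = docs

    def retrieve(self, query: str, k: int = 5) -> list[str]:
        return self.docs[:k]

def answer_context(retriever: Retriever, query: str) -> str:
    """Pipeline code depends only on the interface, not the backend."""
    return "\n\n".join(retriever.retrieve(query, k=2))

docs = ["RAG combines retrieval with generation.",
        "Vector stores index embeddings.",
        "Chunking splits documents."]
for r in (KeywordRetriever(docs), StaticRetriever(docs)):
    print(answer_context(r, "retrieval generation"))
```

Swapping `KeywordRetriever` for `StaticRetriever` requires no change to `answer_context`; this is exactly the property the frameworks provide at scale for Pinecone, Weaviate, BM25, and dozens of other backends.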
However, frameworks also introduce complexity. They add layers of abstraction that can obscure what is actually happening, they impose opinions about pipeline structure that may not match your needs, and they evolve rapidly, sometimes introducing breaking changes. The decision to adopt a framework should weigh these trade-offs against the complexity of your specific use case.
2. LangChain
LangChain is the most widely adopted framework for LLM application development. Originally built
around the concept of "chains" (sequential pipelines of operations), it has evolved into a
comprehensive ecosystem with separate packages for core abstractions (langchain-core),
community integrations (langchain-community), and the orchestration runtime
(langgraph). For RAG specifically, LangChain provides document loaders, text splitters,
embedding models, vector stores, retrievers, and output parsers as composable building blocks.
2.1 Core Concepts
LangChain's architecture revolves around several key abstractions. Document loaders ingest
data from PDFs, web pages, databases, and dozens of other sources into a uniform Document
object. Text splitters break documents into chunks with configurable size and overlap.
Retrievers provide a standard interface for fetching relevant documents, whether from
a vector store, a BM25 index, or a custom API. Chains wire these components together
into executable pipelines.
2.2 LCEL (LangChain Expression Language)
LCEL is LangChain's declarative composition syntax, introduced to replace imperative chain construction.
Using the pipe operator (|), LCEL lets you compose components into readable pipelines
that support streaming, batching, and async execution out of the box. Each component in an LCEL
pipeline implements the Runnable interface, meaning it has invoke,
stream, batch, and ainvoke methods automatically.
Example 1: RAG Pipeline with LangChain LCEL
```python
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

# Initialize components
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = Chroma(
    collection_name="docs",
    embedding_function=embeddings,
    persist_directory="./chroma_db"
)
retriever = vectorstore.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 5}
)
llm = ChatOpenAI(model="gpt-4o", temperature=0)

# Define the prompt template
template = """Answer the question based only on the following context:

{context}

Question: {question}

Provide a detailed answer. If the context does not contain
enough information, say so explicitly."""
prompt = ChatPromptTemplate.from_template(template)

# Helper to format retrieved documents
def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

# LCEL pipeline: pipe operator composes Runnables
rag_chain = (
    {
        "context": retriever | format_docs,
        "question": RunnablePassthrough()
    }
    | prompt
    | llm
    | StrOutputParser()
)

# Invoke the pipeline
answer = rag_chain.invoke("What are the key benefits of RAG?")
print(answer)

# Streaming is automatic with LCEL
for chunk in rag_chain.stream("Explain hybrid search approaches"):
    print(chunk, end="", flush=True)
```
2.3 Memory and Conversation
For conversational RAG, LangChain provides memory modules that persist chat history across turns.
The simplest is ConversationBufferMemory, which stores all messages. For long conversations,
ConversationSummaryMemory uses an LLM to compress earlier turns into a summary, keeping
the context window manageable. In the newer LangGraph paradigm, state management replaces these
memory classes with explicit graph state that flows between nodes, providing more control over
how conversation context evolves.
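The buffer-versus-summary trade-off can be illustrated without any framework. The sketch below (class and parameter names are invented here, and a stub stands in for the LLM summarizer) keeps the last few turns verbatim and folds older turns into a compressed summary, which is the essential mechanism behind `ConversationSummaryMemory`:

```python
class SummarizingMemory:
    """Keep recent turns verbatim; fold older turns into a summary."""
    def __init__(self, summarize, keep_last: int = 4):
        self.summarize = summarize   # callable: list[str] -> str
        self.keep_last = keep_last
        self.summary = ""
        self.turns: list[str] = []

    def add(self, turn: str) -> None:
        self.turns.append(turn)
        if len(self.turns) > self.keep_last:
            overflow = self.turns[:-self.keep_last]
            self.summary = self.summarize(
                ([self.summary] if self.summary else []) + overflow
            )
            self.turns = self.turns[-self.keep_last:]

    def context(self) -> str:
        parts = [f"Summary: {self.summary}"] if self.summary else []
        return "\n".join(parts + self.turns)

# Stub summarizer; a real system would call an LLM here
stub = lambda texts: f"<{len(texts)} items compressed>"
mem = SummarizingMemory(stub, keep_last=2)
for t in ["user: hi", "ai: hello", "user: what is RAG?", "ai: retrieval..."]:
    mem.add(t)
print(mem.context())
```

Note that the running summary is re-summarized together with each batch of overflow turns, so the context passed to the LLM stays bounded no matter how long the conversation runs.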
LangChain has undergone significant architectural changes since its early days. The original monolithic langchain package has been split into langchain-core (stable interfaces), langchain-community (third-party integrations), and vendor-specific packages like langchain-openai. For complex agent workflows, langgraph is now the recommended approach over legacy chain classes. When reading tutorials or documentation, check the version carefully, as patterns from six months ago may already be deprecated.
3. LlamaIndex
LlamaIndex (formerly GPT Index) takes a data-centric approach to RAG. While LangChain provides general-purpose LLM application primitives, LlamaIndex focuses specifically on connecting LLMs with external data. Its core philosophy is that different data structures and query patterns require different index types, and the framework should help you choose and combine them.
3.1 Index Types
LlamaIndex offers several index types, each optimized for different query patterns. VectorStoreIndex is the most common, storing embeddings for semantic similarity search. SummaryIndex (formerly ListIndex) stores all nodes and iterates through them sequentially, useful when you need to process every document. TreeIndex builds a hierarchical tree of summaries, enabling top-down traversal for broad questions. KeywordTableIndex extracts keywords from each node and uses keyword matching for retrieval.
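To see what distinguishes these index types mechanically, it helps to look at what a keyword-table index does under the hood. The following is a framework-free illustration of the core idea only; LlamaIndex's actual `KeywordTableIndex` extracts keywords with an LLM or regex and stores richer node objects:

```python
from collections import defaultdict

def build_keyword_table(nodes: list[str]) -> dict[str, set[int]]:
    """Map each keyword to the set of node ids that contain it."""
    table: dict[str, set[int]] = defaultdict(set)
    for i, node in enumerate(nodes):
        for word in node.lower().split():
            table[word.strip(".,?")].add(i)
    return table

def keyword_retrieve(table, nodes, query: str) -> list[str]:
    """Return nodes sharing the most keywords with the query."""
    counts: dict[int, int] = defaultdict(int)
    for word in query.lower().split():
        for node_id in table.get(word.strip(".,?"), ()):
            counts[node_id] += 1
    ranked = sorted(counts, key=counts.get, reverse=True)
    return [nodes[i] for i in ranked]

nodes = [
    "Embeddings map text to vectors.",
    "Chunk overlap preserves context across splits.",
    "Vectors enable semantic similarity search.",
]
table = build_keyword_table(nodes)
print(keyword_retrieve(table, nodes, "How do vectors and embeddings work?"))
```

A vector index would instead compare embedding similarity, and a tree index would descend through summaries; the point is that each index type trades indexing cost against a different query pattern.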
3.2 Query Engines and Response Synthesizers
A query engine in LlamaIndex combines a retriever with a response synthesizer. The retriever fetches relevant nodes (chunks), and the response synthesizer determines how to construct the final answer from those nodes. LlamaIndex provides several synthesis strategies: compact (stuff all context into one prompt), refine (iteratively refine the answer by processing one chunk at a time), and tree_summarize (recursively summarize groups of chunks in a tree structure). The choice of synthesizer affects both answer quality and token consumption.
Example 2: RAG Pipeline with LlamaIndex
```python
from llama_index.core import (
    VectorStoreIndex,
    SimpleDirectoryReader,
    Settings
)
from llama_index.core.node_parser import SentenceSplitter
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding

# Configure global settings
Settings.llm = OpenAI(model="gpt-4o", temperature=0)
Settings.embed_model = OpenAIEmbedding(model_name="text-embedding-3-small")
Settings.node_parser = SentenceSplitter(
    chunk_size=1024,
    chunk_overlap=200
)

# Load documents from a directory
documents = SimpleDirectoryReader("./data").load_data()
print(f"Loaded {len(documents)} documents")

# Build the vector index (embeds and stores automatically)
index = VectorStoreIndex.from_documents(documents)

# Create a query engine with custom parameters
query_engine = index.as_query_engine(
    similarity_top_k=5,
    response_mode="compact",  # or "refine", "tree_summarize"
    streaming=True
)

# Query with streaming response
response = query_engine.query(
    "What are the key benefits of RAG?"
)

# Stream the response
response.print_response_stream()

# Access source nodes for citations
for node in response.source_nodes:
    print(f"\nSource: {node.metadata.get('file_name', 'unknown')}")
    print(f"Score: {node.score:.4f}")
    print(f"Text: {node.text[:200]}...")
```
3.3 Routers and Multi-Index Queries
One of LlamaIndex's distinctive features is its routing system. A RouterQueryEngine selects which sub-query engine to use based on the question. For example, you might route factual questions to a vector index, summary questions to a tree index, and comparison questions to a SQL query engine. This enables a single application to handle diverse query types by dispatching each question to the most appropriate retrieval strategy.
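The routing idea itself can be sketched without LlamaIndex: classify the question, then dispatch to the matching engine. In a real `RouterQueryEngine` the selector is typically an LLM choosing among described sub-engines; here a simple keyword heuristic and stub engines (all names invented for this sketch) stand in:

```python
def vector_engine(q: str) -> str:
    return f"[vector index] factual answer to: {q}"

def summary_engine(q: str) -> str:
    return f"[summary index] overview for: {q}"

def sql_engine(q: str) -> str:
    return f"[SQL engine] comparison result for: {q}"

# Keyword triggers mapped to engines; first match wins
ROUTES = [
    (("summarize", "overview", "overall"), summary_engine),
    (("compare", "versus", "vs."), sql_engine),
]

def route(question: str) -> str:
    """Dispatch to the first matching engine; default to vector search."""
    q = question.lower()
    for keywords, engine in ROUTES:
        if any(k in q for k in keywords):
            return engine(question)
    return vector_engine(question)

print(route("Summarize the quarterly report"))
print(route("Compare revenue in Q1 versus Q2"))
print(route("What is the refund policy?"))
```

Replacing the keyword heuristic with an LLM call that reads each sub-engine's description is precisely what the framework's router does for you.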
4. Haystack by deepset
Haystack takes a pipeline-first approach to NLP and RAG applications. Developed by deepset, it models every workflow as a directed graph of components. Each component has typed inputs and outputs, and pipelines are validated at construction time to ensure that component connections are compatible. This design philosophy emphasizes explicit data flow, type safety, and reproducibility.
4.1 Pipeline-Based Architecture
In Haystack, a pipeline is a directed acyclic graph (DAG) where each node is a component that performs a specific operation. Components declare their input and output types using Python dataclasses, and the pipeline validates that connected components have compatible types. This strict typing catches configuration errors at build time rather than at runtime, which is valuable for complex production pipelines with many components.
Example 3: RAG Pipeline with Haystack
```python
from haystack import Pipeline
from haystack.components.converters import TextFileToDocument
from haystack.components.preprocessors import DocumentSplitter
from haystack.components.embedders import (
    SentenceTransformersDocumentEmbedder,
    SentenceTransformersTextEmbedder
)
from haystack.components.writers import DocumentWriter
from haystack.components.builders import PromptBuilder
from haystack.components.generators import OpenAIGenerator
from haystack_integrations.document_stores.chroma import (
    ChromaDocumentStore
)
from haystack_integrations.components.retrievers.chroma import (
    ChromaEmbeddingRetriever
)

# ---- Indexing Pipeline ----
document_store = ChromaDocumentStore()

indexing_pipeline = Pipeline()
indexing_pipeline.add_component("converter", TextFileToDocument())
indexing_pipeline.add_component("splitter", DocumentSplitter(
    split_by="sentence",
    split_length=3,
    split_overlap=1
))
indexing_pipeline.add_component("embedder",
    SentenceTransformersDocumentEmbedder()
)
indexing_pipeline.add_component("writer",
    DocumentWriter(document_store=document_store)
)

# Connect components explicitly
indexing_pipeline.connect("converter", "splitter")
indexing_pipeline.connect("splitter", "embedder")
indexing_pipeline.connect("embedder", "writer")

# Run indexing
indexing_pipeline.run({
    "converter": {"sources": ["./data/doc1.txt", "./data/doc2.txt"]}
})

# ---- Query Pipeline ----
template = """Given the following context, answer the question.

Context:
{% for doc in documents %}
{{ doc.content }}
{% endfor %}

Question: {{ question }}

Answer:"""

query_pipeline = Pipeline()
query_pipeline.add_component("text_embedder",
    SentenceTransformersTextEmbedder()
)
query_pipeline.add_component("retriever",
    ChromaEmbeddingRetriever(document_store=document_store)
)
query_pipeline.add_component("prompt_builder",
    PromptBuilder(template=template)
)
query_pipeline.add_component("llm", OpenAIGenerator(model="gpt-4o"))

query_pipeline.connect("text_embedder.embedding", "retriever.query_embedding")
query_pipeline.connect("retriever.documents", "prompt_builder.documents")
query_pipeline.connect("prompt_builder", "llm")

# Run the query pipeline
result = query_pipeline.run({
    "text_embedder": {"text": "What are the key benefits of RAG?"},
    "prompt_builder": {"question": "What are the key benefits of RAG?"}
})
print(result["llm"]["replies"][0])
```
Haystack's explicit component wiring may feel verbose compared to LangChain's LCEL pipes, but it provides a major advantage for production systems: the pipeline graph can be serialized to YAML, versioned in Git, and reconstructed identically across environments. This makes Haystack pipelines highly reproducible and easy to audit, which matters in regulated industries where you must document exactly how your system processes data.
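The serialize-and-reconstruct round trip that makes this reproducibility possible looks conceptually like the sketch below. It uses JSON from the standard library so the example stays self-contained; Haystack itself serializes pipelines to YAML, and the schema here is invented purely for illustration:

```python
import json

# A pipeline described as pure data: components plus typed connections
pipeline_spec = {
    "components": {
        "retriever": {"type": "EmbeddingRetriever", "top_k": 5},
        "prompt_builder": {"type": "PromptBuilder"},
        "llm": {"type": "OpenAIGenerator", "model": "gpt-4o"},
    },
    "connections": [
        ["retriever.documents", "prompt_builder.documents"],
        ["prompt_builder.prompt", "llm.prompt"],
    ],
}

# Serialize: this string can be committed to Git and diffed in review
serialized = json.dumps(pipeline_spec, indent=2, sort_keys=True)

# Reconstruct in another environment and verify it is identical
restored = json.loads(serialized)
assert restored == pipeline_spec
print(f"{len(restored['components'])} components, "
      f"{len(restored['connections'])} connections restored")
```

Because the whole pipeline is plain data, two environments that load the same file are guaranteed to build the same graph, which is the property auditors care about.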
5. Framework Comparison
Each framework reflects a different philosophy about how RAG applications should be built. LangChain prioritizes breadth of integrations and developer velocity, LlamaIndex focuses on data-aware retrieval strategies, and Haystack emphasizes pipeline clarity and production robustness. The following table summarizes the key differences.
| Dimension | LangChain | LlamaIndex | Haystack |
|---|---|---|---|
| Philosophy | General-purpose LLM toolkit | Data-centric RAG framework | Pipeline-first NLP framework |
| Primary strength | Breadth of integrations (700+) | Index types and query routing | Type-safe pipeline composition |
| Composition model | LCEL pipes, LangGraph | Query engines, routers | DAG pipelines with typed I/O |
| Learning curve | Moderate (many concepts) | Lower for RAG tasks | Lower (explicit data flow) |
| Agent support | LangGraph (strong) | AgentRunner (growing) | Agent components (newer) |
| Production tooling | LangSmith tracing, LangServe | Observability callbacks | Hayhooks, pipeline YAML |
| Community size | Largest (90k+ GitHub stars) | Large (35k+ GitHub stars) | Growing (17k+ GitHub stars) |
| API stability | Frequent changes (improving) | More stable core API | Stable since Haystack 2.0 |
| Best for | Prototyping, diverse use cases | Complex data retrieval | Production NLP pipelines |
6. When to Use a Framework vs. Building from Scratch
Frameworks are not always the right choice. For simple RAG pipelines (embed, retrieve, generate), the overhead of learning and maintaining a framework may exceed the effort of writing the integration code yourself. The decision depends on several factors: pipeline complexity, team size, iteration speed requirements, and the need for component swapping.
6.1 Choose a Framework When
- You need rapid prototyping: Frameworks let you test different vector stores, embedding models, and retrieval strategies in hours instead of days.
- Your pipeline has many components: Once you need rerankers, query routers, hybrid search, or multi-step retrieval, the wiring code multiplies with every component you add. Frameworks manage this complexity.
- Your team is growing: Frameworks provide a shared vocabulary and structure that makes onboarding easier and code reviews more productive.
- You want observability tooling: LangSmith, LlamaTrace, and Haystack's pipeline visualization provide tracing and debugging that would take weeks to build from scratch.
6.2 Build from Scratch When
- Your pipeline is simple and stable: If you know you are using OpenAI embeddings, Pinecone, and GPT-4o, and this will not change, direct API calls are simpler and faster.
- Performance is critical: Frameworks add latency overhead (typically 5 to 50ms per component call). For latency-sensitive applications, direct API calls eliminate this overhead.
- You need deep customization: If your retrieval logic requires custom scoring functions, specialized chunk merging, or non-standard pipeline patterns, fighting a framework's abstractions can be harder than building your own.
- You want minimal dependencies: Frameworks pull in dozens of transitive dependencies. For lightweight deployments (Lambda functions, edge computing), a minimal implementation is often preferable.
Example 4: Minimal RAG Without a Framework
```python
import openai
import chromadb

# Direct API calls: no framework needed
client = openai.OpenAI()
chroma = chromadb.PersistentClient(path="./chroma_db")
collection = chroma.get_or_create_collection(
    name="docs",
    metadata={"hnsw:space": "cosine"}
)

def embed(text: str) -> list[float]:
    """Get embedding from OpenAI."""
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=text
    )
    return response.data[0].embedding

def retrieve(query: str, k: int = 5) -> list[str]:
    """Retrieve top-k relevant documents."""
    results = collection.query(
        query_embeddings=[embed(query)],
        n_results=k
    )
    return results["documents"][0]

def generate(query: str, context_docs: list[str]) -> str:
    """Generate answer using retrieved context."""
    context = "\n\n".join(context_docs)
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": (
                "Answer based on the provided context. "
                "If the context is insufficient, say so."
            )},
            {"role": "user", "content": (
                f"Context:\n{context}\n\nQuestion: {query}"
            )}
        ],
        temperature=0
    )
    return response.choices[0].message.content

# The entire RAG pipeline in three function calls
query = "What are the key benefits of RAG?"
docs = retrieve(query)
answer = generate(query, docs)
print(answer)
```
Be cautious about deep framework coupling. If you use LangChain's custom prompt classes, LlamaIndex's specialized node postprocessors, and framework-specific serialization formats throughout your codebase, migrating to a different framework (or to raw API calls) becomes expensive. As a safeguard, keep your core business logic in plain Python functions that accept and return standard types (strings, lists, dictionaries). Use the framework for orchestration and wiring, not for your domain logic. This layered approach lets you swap the orchestration layer without rewriting your retrieval and generation logic.
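Concretely, the layered approach means your domain functions accept and return plain strings and lists, while framework objects appear only at the edges. A minimal sketch of this structure (the function names here are illustrative, not prescribed by any framework):

```python
from typing import Callable

# Domain layer: plain Python, no framework imports
def build_prompt(question: str, docs: list[str]) -> str:
    context = "\n\n".join(docs)
    return f"Context:\n{context}\n\nQuestion: {question}"

def answer(question: str,
           retrieve: Callable[[str], list[str]],
           generate: Callable[[str], str]) -> str:
    """Core logic: depends only on callables and standard types."""
    return generate(build_prompt(question, retrieve(question)))

# Orchestration layer: swap these stubs for LangChain, LlamaIndex,
# Haystack, or raw API calls without touching the domain layer
fake_retrieve = lambda q: ["RAG grounds answers in retrieved text."]
fake_generate = lambda prompt: f"stub answer for: {prompt.splitlines()[-1]}"

print(answer("What is RAG?", fake_retrieve, fake_generate))
```

Because `answer` knows nothing about any framework, replacing the orchestration layer is a change to two adapter functions, not a rewrite of the business logic.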
7. Lab: Comparing Frameworks Side by Side
The best way to evaluate frameworks is to implement the same pipeline in each one and compare the developer experience. In this lab, we build identical RAG pipelines in LangChain and LlamaIndex, then measure lines of code, setup complexity, retrieval quality, and response latency.
Example 5: Side-by-Side Comparison Test Harness
```python
import time
import json

# --- LangChain Implementation ---
def build_langchain_rag(docs_path: str):
    from langchain_community.document_loaders import DirectoryLoader
    from langchain.text_splitter import RecursiveCharacterTextSplitter
    from langchain_openai import OpenAIEmbeddings, ChatOpenAI
    from langchain_community.vectorstores import Chroma
    from langchain_core.prompts import ChatPromptTemplate
    from langchain_core.output_parsers import StrOutputParser
    from langchain_core.runnables import RunnablePassthrough

    # Load and split
    loader = DirectoryLoader(docs_path, glob="**/*.txt")
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=1024,
        chunk_overlap=200
    )
    chunks = splitter.split_documents(loader.load())

    # Index
    vectorstore = Chroma.from_documents(
        chunks,
        OpenAIEmbeddings(model="text-embedding-3-small")
    )

    # Build chain
    template = """Context: {context}\n\nQuestion: {question}\nAnswer:"""
    chain = (
        {
            "context": vectorstore.as_retriever(search_kwargs={"k": 5})
                | (lambda docs: "\n".join(d.page_content for d in docs)),
            "question": RunnablePassthrough()
        }
        | ChatPromptTemplate.from_template(template)
        | ChatOpenAI(model="gpt-4o", temperature=0)
        | StrOutputParser()
    )
    return chain

# --- LlamaIndex Implementation ---
def build_llamaindex_rag(docs_path: str):
    from llama_index.core import (
        VectorStoreIndex,
        SimpleDirectoryReader,
        Settings
    )
    from llama_index.core.node_parser import SentenceSplitter
    from llama_index.llms.openai import OpenAI
    from llama_index.embeddings.openai import OpenAIEmbedding

    Settings.llm = OpenAI(model="gpt-4o", temperature=0)
    Settings.embed_model = OpenAIEmbedding(
        model_name="text-embedding-3-small"
    )
    Settings.node_parser = SentenceSplitter(
        chunk_size=1024,
        chunk_overlap=200
    )

    documents = SimpleDirectoryReader(docs_path).load_data()
    index = VectorStoreIndex.from_documents(documents)
    query_engine = index.as_query_engine(similarity_top_k=5)
    return query_engine

# --- Comparison ---
test_questions = [
    "What are the main components of a RAG system?",
    "How does hybrid search improve retrieval?",
    "What are best practices for chunking documents?",
]
docs_path = "./test_data"

# Build both pipelines
lc_chain = build_langchain_rag(docs_path)
li_engine = build_llamaindex_rag(docs_path)

results = []
for question in test_questions:
    # LangChain timing
    start = time.time()
    lc_answer = lc_chain.invoke(question)
    lc_time = time.time() - start

    # LlamaIndex timing
    start = time.time()
    li_answer = li_engine.query(question)
    li_time = time.time() - start

    results.append({
        "question": question,
        "langchain_time": round(lc_time, 3),
        "llamaindex_time": round(li_time, 3),
        "langchain_answer_len": len(lc_answer),
        "llamaindex_answer_len": len(str(li_answer)),
    })

# Summary
print("Framework Comparison Results:")
print(json.dumps(results, indent=2))
avg_lc = sum(r["langchain_time"] for r in results) / len(results)
avg_li = sum(r["llamaindex_time"] for r in results) / len(results)
print(f"\nAvg LangChain latency: {avg_lc:.3f}s")
print(f"Avg LlamaIndex latency: {avg_li:.3f}s")
```
To deepen your comparison, try these extensions: (1) Add a Haystack implementation as a third pipeline and compare all three. (2) Swap the vector store from Chroma to FAISS or Pinecone and measure how much framework code changes in each case. (3) Add a reranker step (such as Cohere Rerank) to each pipeline and compare the integration effort. (4) Test with larger document sets (1,000+ documents) to measure indexing performance differences. (5) Evaluate answer quality using an LLM judge that scores relevance and completeness for each framework's output.
8. Production Considerations
Moving a framework-based RAG pipeline from prototype to production introduces additional requirements: observability, error handling, caching, rate limiting, and deployment packaging. Each framework addresses these concerns differently.
8.1 Observability and Tracing
LangChain offers LangSmith, a hosted tracing platform that records every step of your pipeline (retriever calls, LLM requests, latency breakdowns). LlamaIndex provides callback handlers and integrations with observability platforms like Arize and Weights & Biases. Haystack pipelines can export their graph structure as YAML, making it straightforward to visualize and audit the processing flow. Regardless of framework, production RAG systems should log the query, retrieved documents, generated answer, and latency for every request.
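A framework-agnostic way to meet that logging requirement is a thin wrapper that times the call and emits one structured record per request. This is a sketch with stub retrieval and generation functions standing in for a real pipeline (all names are illustrative):

```python
import json
import time

def logged_rag(query: str, retrieve, generate, log=print) -> str:
    """Run retrieve + generate and emit one structured log record."""
    start = time.perf_counter()
    docs = retrieve(query)
    answer = generate(query, docs)
    log(json.dumps({
        "query": query,
        "retrieved": docs,
        "answer": answer,
        "latency_ms": round((time.perf_counter() - start) * 1000, 1),
    }))
    return answer

# Stubs stand in for real retrieval and generation
records = []
result = logged_rag(
    "What is RAG?",
    retrieve=lambda q: ["doc-1", "doc-2"],
    generate=lambda q, d: f"answer using {len(d)} docs",
    log=records.append,
)
print(result)
print(records[0])
```

Writing one JSON line per request is enough to answer the questions that matter in production: which documents were retrieved for a bad answer, and where the latency went.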
8.2 Error Handling and Fallbacks
Production pipelines must handle failures gracefully. Common failure modes include embedding API timeouts, vector store connection errors, and LLM rate limiting. Frameworks provide varying levels of built-in retry logic. LangChain supports configurable retry with exponential backoff on Runnable components. LlamaIndex exposes retry settings on its LLM integrations (for example, a max_retries parameter on its OpenAI wrapper). Haystack lets you define fallback components in the pipeline graph. For any framework, you should also implement application-level fallbacks (returning cached results, falling back to a simpler model, or showing a helpful error message).
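An application-level fallback of the kind described above can be as small as a retry wrapper with exponential backoff plus a last-resort handler. This framework-independent sketch uses a deliberately flaky stub in place of a real LLM call to show the behavior:

```python
import time

def with_retry(fn, attempts: int = 3, base_delay: float = 0.01,
               fallback=None):
    """Call fn; on failure wait base_delay * 2**i, then try again."""
    def wrapped(*args, **kwargs):
        for i in range(attempts):
            try:
                return fn(*args, **kwargs)
            except Exception:
                if i == attempts - 1:
                    if fallback is not None:
                        return fallback(*args, **kwargs)
                    raise
                time.sleep(base_delay * (2 ** i))
    return wrapped

# Flaky stub: fails twice, then succeeds (simulates rate limiting)
calls = {"n": 0}
def flaky_llm(prompt: str) -> str:
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("simulated rate limit")
    return f"ok: {prompt}"

safe_llm = with_retry(flaky_llm, attempts=3,
                      fallback=lambda p: "cached fallback answer")
print(safe_llm("hello"))   # succeeds on the third attempt
```

The same wrapper shape works around an embedding call, a vector store query, or a generation call; frameworks package this pattern, but it is worth understanding on its own.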
8.3 Caching Strategies
Embedding computation and LLM calls are the most expensive operations in a RAG pipeline. Caching these results can dramatically reduce both cost and latency. All three frameworks support caching at multiple levels: embedding caches (avoid re-embedding identical text), retrieval caches (return the same documents for identical queries), and LLM caches (return the same answer for identical prompts). For production systems, Redis or a similar distributed cache is recommended over in-memory caching to support horizontal scaling.
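Of the three cache levels, the embedding cache is the simplest to illustrate. The sketch below keys on a hash of the text and counts backend calls with a stub embedder; the in-process dict is an assumption for the example, and a production system would swap it for Redis behind the same interface:

```python
import hashlib

class EmbeddingCache:
    """Wrap an embed function; recompute only on cache misses."""
    def __init__(self, embed_fn):
        self.embed_fn = embed_fn
        self.store: dict[str, list[float]] = {}  # swap for Redis in prod
        self.hits = 0
        self.misses = 0

    def embed(self, text: str) -> list[float]:
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key in self.store:
            self.hits += 1
        else:
            self.misses += 1
            self.store[key] = self.embed_fn(text)
        return self.store[key]

# Stub embedder standing in for a paid API call
backend_calls = {"n": 0}
def fake_embed(text: str) -> list[float]:
    backend_calls["n"] += 1
    return [float(len(text))]

cache = EmbeddingCache(fake_embed)
for text in ["chunk a", "chunk b", "chunk a", "chunk a"]:
    cache.embed(text)
print(f"hits={cache.hits} misses={cache.misses} "
      f"backend={backend_calls['n']}")
```

Hashing the text (rather than using it directly as a key) keeps keys fixed-size, which matters once the cache moves to an external store like Redis.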
The most pragmatic approach to framework adoption is the "prototype with, produce without" pattern. Use a framework during the exploration phase to rapidly test different retrieval strategies, embedding models, and LLM configurations. Once you have identified the winning combination, evaluate whether the framework's abstractions are still earning their keep. For simple, stable pipelines, replacing the framework layer with direct API calls often yields a faster, more maintainable system. For complex pipelines with many components, the framework's orchestration value usually justifies its continued use.
Section 19.6 Quiz
Key Takeaways
- Frameworks accelerate development, not replace understanding: LangChain, LlamaIndex, and Haystack automate wiring and integration, but you still need to understand embedding, retrieval, and generation fundamentals to debug and optimize your pipeline.
- LangChain excels at breadth and prototyping: With 700+ integrations and LCEL's composable Runnables, LangChain is the fastest path from idea to working prototype, especially for diverse use cases beyond pure RAG.
- LlamaIndex is purpose-built for data retrieval: Its variety of index types, query routing, and response synthesis strategies make it the strongest choice when your primary challenge is connecting LLMs with complex, heterogeneous data sources.
- Haystack prioritizes production reliability: Type-safe pipelines, YAML serialization, and explicit component wiring make Haystack well-suited for teams that need reproducible, auditable, and maintainable production systems.
- Simple pipelines often do not need a framework: For straightforward embed, retrieve, generate workflows with stable component choices, direct API calls are simpler, faster, and easier to maintain than framework abstractions.
- Guard against framework lock-in: Keep domain logic in plain Python functions that accept standard types. Use frameworks for orchestration only, so you can swap or remove the framework without rewriting business logic.