A vector database is more than an ANN index with an API. Production vector search requires persistent storage, metadata filtering, access control, horizontal scaling, real-time updates, and monitoring. The vector database ecosystem has expanded rapidly, offering options that range from fully managed cloud services to embedded libraries that run in-process. Choosing the right solution depends on your scale, infrastructure, operational maturity, and whether you need features like hybrid search, multi-tenancy, or built-in reranking. This section provides a practical, comparative guide to the major systems.
## 1. Vector Database Architecture
A vector database extends the ANN index algorithms covered in Section 18.2 with the operational features needed for production deployments. The core architectural components include:
- Storage engine: Persists vectors and metadata to disk with write-ahead logging for durability. Must support efficient bulk loading and incremental updates.
- Index manager: Builds and maintains ANN indexes (HNSW, IVF, etc.), handles index rebuilds, and manages index parameters.
- Query engine: Processes search requests, combining vector similarity with metadata filters and optional reranking.
- API layer: Exposes gRPC and/or REST endpoints for CRUD operations, search, and administration.
- Distributed coordinator: (For distributed systems) Manages sharding, replication, consistency, and load balancing across nodes.
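How storage, filtering, and similarity search fit together can be sketched as a toy in-memory store. This is a minimal illustration, not a production design; the class and method names are invented for this sketch, and a real system would add persistence, ANN indexing, and concurrency control:

```python
import math

class ToyVectorStore:
    """Toy in-memory vector store: storage + filtered cosine-similarity search."""

    def __init__(self):
        self._records = {}  # id -> (vector, metadata)

    def upsert(self, doc_id, vector, metadata=None):
        self._records[doc_id] = (vector, metadata or {})

    @staticmethod
    def _cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb) if na and nb else 0.0

    def query(self, vector, top_k=5, filter=None):
        # Pre-filter on metadata equality, then rank survivors by similarity.
        candidates = (
            (doc_id, vec, meta)
            for doc_id, (vec, meta) in self._records.items()
            if filter is None or all(meta.get(k) == v for k, v in filter.items())
        )
        scored = [(doc_id, self._cosine(vector, vec), meta)
                  for doc_id, vec, meta in candidates]
        scored.sort(key=lambda t: t[1], reverse=True)
        return scored[:top_k]

store = ToyVectorStore()
store.upsert("a", [1.0, 0.0], {"category": "technical"})
store.upsert("b", [0.9, 0.1], {"category": "financial"})
store.upsert("c", [0.0, 1.0], {"category": "technical"})
hits = store.query([1.0, 0.0], top_k=2, filter={"category": "technical"})
print(hits)  # "a" ranks first among the technical documents
```

Everything a production system adds (WAL-backed storage, ANN indexes, sharding) replaces one of these naive pieces without changing the overall shape of the query path.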
## 2. Purpose-Built Vector Databases
### Pinecone
Pinecone is a fully managed, cloud-native vector database. It eliminates operational overhead by handling infrastructure, scaling, and index management automatically. Pinecone provides serverless and pod-based deployment options, with serverless being the more cost-effective choice for workloads with variable traffic patterns.
```python
# Pinecone: managed vector database
from pinecone import Pinecone, ServerlessSpec

pc = Pinecone(api_key="your-api-key")

# Create a serverless index
pc.create_index(
    name="document-search",
    dimension=768,
    metric="cosine",
    spec=ServerlessSpec(cloud="aws", region="us-east-1"),
)
index = pc.Index("document-search")

# Upsert vectors with metadata
vectors = [
    {
        "id": "doc-001",
        "values": [0.12, -0.34, 0.56, ...],  # 768-dim embedding
        "metadata": {
            "source": "annual-report-2024.pdf",
            "page": 15,
            "category": "financial",
            "date": "2024-03-15",
        },
    },
    {
        "id": "doc-002",
        "values": [0.45, 0.23, -0.78, ...],
        "metadata": {
            "source": "product-manual.pdf",
            "page": 42,
            "category": "technical",
            "date": "2024-01-20",
        },
    },
]
index.upsert(vectors=vectors)

# Search with metadata filtering
results = index.query(
    vector=[0.11, -0.32, 0.54, ...],
    top_k=5,
    filter={
        "category": {"$eq": "financial"},
        "date": {"$gte": "2024-01-01"},
    },
    include_metadata=True,
)
for match in results["matches"]:
    print(f"Score: {match['score']:.4f}, "
          f"Source: {match['metadata']['source']}, "
          f"Page: {match['metadata']['page']}")
```
### Qdrant
Qdrant is an open-source vector database written in Rust, designed for high performance and operational flexibility. It supports both self-hosted and cloud-managed deployments. Qdrant's key differentiators include rich payload filtering with indexed fields, built-in support for sparse vectors (enabling hybrid search natively), and quantization options (scalar, product, and binary) for memory optimization.
```python
# Qdrant: high-performance open-source vector database
from qdrant_client import QdrantClient
from qdrant_client.models import (
    Distance, VectorParams, PointStruct,
    SparseVector, SparseVectorParams, SparseIndexParams,
    Prefetch, FusionQuery, Fusion,
)

client = QdrantClient(url="http://localhost:6333")

# Create a collection with both dense and sparse vectors (hybrid search)
client.create_collection(
    collection_name="documents",
    vectors_config={
        "dense": VectorParams(size=768, distance=Distance.COSINE),
    },
    sparse_vectors_config={
        "sparse": SparseVectorParams(
            index=SparseIndexParams(on_disk=False)
        ),
    },
)

# Upsert points with dense vectors, sparse vectors, and payload
client.upsert(
    collection_name="documents",
    points=[
        PointStruct(
            id=1,
            vector={
                "dense": [0.12, -0.34, 0.56] + [0.0] * 765,
                "sparse": SparseVector(
                    indices=[102, 507, 1024, 3891],
                    values=[0.8, 0.6, 0.9, 0.3],
                ),
            },
            payload={
                "title": "Introduction to RAG Systems",
                "category": "technical",
                "word_count": 2500,
            },
        ),
    ],
)

# Hybrid search: combine dense and sparse retrieval
results = client.query_points(
    collection_name="documents",
    prefetch=[
        # Dense vector search
        Prefetch(query=[0.11, -0.32, 0.54] + [0.0] * 765,
                 using="dense", limit=20),
        # Sparse vector search (BM25-style)
        Prefetch(query=SparseVector(indices=[102, 1024], values=[0.9, 0.7]),
                 using="sparse", limit=20),
    ],
    # Reciprocal Rank Fusion to merge the two result lists
    query=FusionQuery(fusion=Fusion.RRF),
    limit=10,
)
for point in results.points:
    print(f"ID: {point.id}, Score: {point.score:.4f}")
```
### Weaviate
Weaviate is an open-source vector database that integrates embedding generation directly into its query pipeline. Through its module system, Weaviate can automatically vectorize text at ingestion and query time using built-in integrations with OpenAI, Cohere, Hugging Face, and other providers. This simplifies the development workflow by eliminating the need to manage embedding generation separately.
### Milvus
Milvus is an open-source distributed vector database designed for billion-scale workloads. Its disaggregated architecture separates storage and compute, allowing independent scaling of query nodes, data nodes, and index nodes. Milvus supports the widest range of index types (HNSW, IVF-Flat, IVF-PQ, IVF-SQ8, DiskANN, GPU indexes) and provides strong consistency guarantees through a log-based architecture.
## 3. Lightweight and Embedded Solutions
### ChromaDB
ChromaDB is an open-source embedding database designed for simplicity and rapid prototyping. It runs in-process (embedded mode) or as a lightweight server, making it the most popular choice for local development, tutorials, and small-scale applications. ChromaDB handles embedding generation automatically when configured with an embedding function.
```python
# ChromaDB: lightweight embedded vector database
import chromadb
from chromadb.utils.embedding_functions import SentenceTransformerEmbeddingFunction

# Initialize with persistent storage
client = chromadb.PersistentClient(path="./chroma_data")

# Use Sentence Transformers for automatic embedding
embedding_fn = SentenceTransformerEmbeddingFunction(
    model_name="all-MiniLM-L6-v2"
)

# Create or get the collection
collection = client.get_or_create_collection(
    name="knowledge_base",
    embedding_function=embedding_fn,
    metadata={"hnsw:space": "cosine"},
)

# Add documents (embeddings generated automatically)
collection.add(
    documents=[
        "RAG combines retrieval with language model generation.",
        "Vector databases store and index high-dimensional embeddings.",
        "HNSW provides fast approximate nearest neighbor search.",
        "Chunking strategies affect retrieval quality significantly.",
    ],
    metadatas=[
        {"topic": "rag", "level": "intro"},
        {"topic": "vector-db", "level": "intro"},
        {"topic": "algorithms", "level": "advanced"},
        {"topic": "preprocessing", "level": "intermediate"},
    ],
    ids=["doc1", "doc2", "doc3", "doc4"],
)

# Query with automatic embedding and metadata filtering
results = collection.query(
    query_texts=["How does semantic search work?"],
    n_results=3,
    where={"level": {"$ne": "advanced"}},
)
for doc, distance, metadata in zip(
    results["documents"][0],
    results["distances"][0],
    results["metadatas"][0],
):
    print(f"Distance: {distance:.4f} | {metadata['topic']}: {doc[:60]}...")
```
### FAISS as a Library
FAISS (Facebook AI Similarity Search) is not a database but a library for building and querying ANN indexes. It provides the fastest index implementations available, with GPU acceleration for both index building and search. FAISS is the right choice when you need maximum search performance, have a static or infrequently updated dataset, and are willing to handle persistence and metadata filtering yourself.
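Handling persistence and metadata yourself typically means pairing the index with a sidecar that maps row positions back to document IDs and metadata. The sketch below shows that pattern under loose assumptions: a brute-force NumPy inner-product search stands in for the FAISS index (you would swap in something like `faiss.IndexFlatIP` for real workloads), and the wrapper class and file layout are invented for illustration:

```python
import json
import numpy as np

class IndexWithSidecar:
    """Pairs a vector index with a sidecar list of IDs and metadata.

    A brute-force inner-product search stands in for a FAISS index here;
    the surrounding bookkeeping is exactly what FAISS leaves to the caller.
    """

    def __init__(self, dim):
        self.vectors = np.empty((0, dim), dtype=np.float32)
        self.sidecar = []  # row i -> {"id": ..., "metadata": ...}

    def add(self, doc_id, vector, metadata):
        v = np.asarray(vector, dtype=np.float32)
        v = v / np.linalg.norm(v)  # normalize so inner product == cosine
        self.vectors = np.vstack([self.vectors, v])
        self.sidecar.append({"id": doc_id, "metadata": metadata})

    def search(self, query, top_k):
        q = np.asarray(query, dtype=np.float32)
        q = q / np.linalg.norm(q)
        scores = self.vectors @ q
        order = np.argsort(-scores)[:top_k]
        return [(self.sidecar[i]["id"], float(scores[i]),
                 self.sidecar[i]["metadata"]) for i in order]

    def save(self, prefix):
        # Persistence is also the caller's job: vectors and sidecar
        # are written as separate files that must stay in sync.
        np.save(prefix + ".npy", self.vectors)
        with open(prefix + ".json", "w") as f:
            json.dump(self.sidecar, f)

idx = IndexWithSidecar(dim=4)
idx.add("doc1", [1, 0, 0, 0], {"source": "a.pdf"})
idx.add("doc2", [0, 1, 0, 0], {"source": "b.pdf"})
hits = idx.search([1, 0, 0, 0], top_k=1)
print(hits)
```

Keeping the index file and the sidecar in sync across updates is precisely the operational burden that a full vector database takes off your hands.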
### LanceDB
LanceDB is a serverless, embedded vector database built on the Lance columnar format. Its distinguishing feature is storing vectors alongside structured data in a single table, much like a traditional database that happens to support vector search. LanceDB supports automatic versioning of data, zero-copy access through memory-mapped files, and integration with the broader data ecosystem through Apache Arrow compatibility.
### pgvector
pgvector is a PostgreSQL extension that adds vector similarity search capabilities to the world's most popular open-source relational database. For teams already running PostgreSQL, pgvector eliminates the need to introduce and operate a separate vector database. It supports HNSW and IVF-Flat indexes and benefits from PostgreSQL's mature ecosystem of tooling, replication, and backup solutions.
```python
# pgvector: vector search in PostgreSQL
import numpy as np
import psycopg2
from pgvector.psycopg2 import register_vector

conn = psycopg2.connect("postgresql://user:pass@localhost/mydb")
cur = conn.cursor()

# Enable the extension and create the table
cur.execute("CREATE EXTENSION IF NOT EXISTS vector;")
cur.execute("""
    CREATE TABLE IF NOT EXISTS documents (
        id SERIAL PRIMARY KEY,
        content TEXT,
        category TEXT,
        embedding vector(768)
    );
""")
conn.commit()

# Teach psycopg2 to adapt NumPy arrays to the vector type
register_vector(conn)

# Create an HNSW index for cosine distance
cur.execute("""
    CREATE INDEX IF NOT EXISTS documents_embedding_idx
    ON documents
    USING hnsw (embedding vector_cosine_ops)
    WITH (m = 16, ef_construction = 200);
""")

# Insert a document with its embedding
sample_embedding = np.random.randn(768).astype(np.float32)
sample_embedding /= np.linalg.norm(sample_embedding)
cur.execute(
    """INSERT INTO documents (content, category, embedding)
       VALUES (%s, %s, %s)""",
    ("Vector databases enable semantic search.", "technical", sample_embedding),
)
conn.commit()

# Semantic search with SQL: combine vector similarity with standard filters
query_embedding = np.random.randn(768).astype(np.float32)
query_embedding /= np.linalg.norm(query_embedding)
cur.execute("""
    SELECT id, content, category,
           1 - (embedding <=> %s) AS similarity
    FROM documents
    WHERE category = 'technical'
    ORDER BY embedding <=> %s
    LIMIT 5;
""", (query_embedding, query_embedding))
for row in cur.fetchall():
    print(f"ID: {row[0]}, Similarity: {row[3]:.4f}, Content: {row[1][:50]}...")

cur.close()
conn.close()
```
pgvector is an excellent choice when: (1) you already operate PostgreSQL and want to minimize infrastructure complexity; (2) your vector collection is under 10 million items; (3) you need transactional consistency between vector data and relational data; or (4) your queries combine vector similarity with complex SQL filters and joins. For collections above 10 million vectors or workloads requiring sub-millisecond latency at scale, a purpose-built vector database will generally outperform pgvector.
## 4. Comparison Matrix
| System | Type | Language | Managed Cloud | Hybrid Search | Best For |
|---|---|---|---|---|---|
| Pinecone | Managed DB | N/A (SaaS) | Yes (only) | Yes | Zero-ops production |
| Qdrant | Open-source DB | Rust | Yes | Yes (native) | High performance, flexibility |
| Weaviate | Open-source DB | Go | Yes | Yes | Built-in vectorization |
| Milvus | Open-source DB | Go/C++ | Yes (Zilliz) | Yes | Billion-scale distributed |
| ChromaDB | Embedded DB | Python | No | No | Prototyping, small scale |
| FAISS | Library | C++/Python | No | No | Maximum raw performance |
| LanceDB | Embedded DB | Rust | Yes | Yes | Data-native workflows |
| pgvector | Extension | C | Via PG hosts | Via SQL | Existing PostgreSQL stacks |
## 5. Metadata Filtering
Real-world retrieval rarely uses pure vector similarity. Most queries include metadata constraints such as date ranges, document categories, access permissions, or language filters. The efficiency of metadata filtering varies significantly across systems.
### Pre-filtering vs. Post-filtering
- Pre-filtering: Apply metadata filters first, then search only within the matching subset. This is efficient when the filter is highly selective (eliminates most vectors) but can degrade ANN quality if the filtered subset is small relative to the index structure.
- Post-filtering: Perform the ANN search first, then remove results that do not match the metadata filter. This preserves ANN quality but may return fewer results than requested if many top results are filtered out.
- Integrated filtering: The most sophisticated approach interleaves filtering with the ANN search algorithm. Qdrant and Weaviate implement this by checking metadata predicates during graph traversal, combining the benefits of both approaches.
Metadata filtering can silently degrade retrieval quality. If your filter eliminates 99% of vectors and you are using post-filtering, a search for top-10 results might return only 1 or 2 matches. If you are using pre-filtering with an HNSW index built on the full dataset, the search may miss relevant results because the graph connectivity has been disrupted. Always monitor the number of results returned and the score distribution to detect filtering-related quality issues.
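The short-result failure mode is easy to demonstrate on a toy corpus: post-filtering a global top-k can come back nearly empty when the filter is selective, while pre-filtering searches only the matching subset. A pure-Python sketch with invented data, using a scalar "distance" in place of real embeddings:

```python
def top_k(candidates, query, k):
    # Toy similarity search: rank by absolute difference of scalar "vectors".
    return sorted(candidates, key=lambda d: abs(d["vec"] - query))[:k]

# 100 documents; only every tenth one is English (a 99%... well, 90% selective filter)
corpus = [{"id": i, "vec": float(i), "lang": "en" if i % 10 == 0 else "de"}
          for i in range(100)]
query, k = 0.0, 10

# Post-filtering: search the whole corpus first, then drop non-matching results.
post = [d for d in top_k(corpus, query, k) if d["lang"] == "en"]

# Pre-filtering: restrict to matching documents first, then search.
pre = top_k([d for d in corpus if d["lang"] == "en"], query, k)

print(len(post))  # 1  -- only one of the global top-10 survives the filter
print(len(pre))   # 10 -- the full requested result count
```

With a brute-force search, pre-filtering is strictly better; the trade-off described above only appears once an ANN index built on the full dataset enters the picture.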
## 6. Hybrid Search and Reciprocal Rank Fusion
Hybrid search combines dense vector retrieval (semantic similarity) with sparse keyword retrieval (lexical matching, typically BM25). This addresses a fundamental limitation of pure semantic search: embedding models may miss exact keyword matches that are critical for some queries, especially those involving proper nouns, product codes, or technical terminology.
### Reciprocal Rank Fusion (RRF)
RRF is the most common method for merging ranked result lists from different retrieval systems. For each document, RRF sums a contribution from every list it appears in, based only on its rank position: score(d) = Σ_i 1 / (k + rank_i(d)), where k is a smoothing constant (typically 60) and rank_i(d) is the document's position in the i-th result list. Documents that appear near the top of multiple lists receive the highest fused scores.
```python
# Reciprocal Rank Fusion implementation
from typing import Dict, List, Tuple

def reciprocal_rank_fusion(
    result_lists: List[List[str]],
    k: int = 60,
) -> List[Tuple[str, float]]:
    """Merge multiple ranked result lists using RRF.

    Args:
        result_lists: List of ranked document ID lists
        k: RRF constant (default 60, as per the original paper)

    Returns:
        List of (doc_id, fused_score) tuples, sorted by fused score
    """
    fused_scores: Dict[str, float] = {}
    for result_list in result_lists:
        for rank, doc_id in enumerate(result_list, start=1):
            if doc_id not in fused_scores:
                fused_scores[doc_id] = 0.0
            fused_scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score, descending
    return sorted(fused_scores.items(), key=lambda x: x[1], reverse=True)

# Example: merge dense and sparse retrieval results
dense_results = ["doc_A", "doc_B", "doc_C", "doc_D", "doc_E"]
sparse_results = ["doc_C", "doc_F", "doc_A", "doc_G", "doc_B"]
fused = reciprocal_rank_fusion([dense_results, sparse_results])
print("Hybrid search results (RRF):")
for doc_id, score in fused[:5]:
    print(f"  {doc_id}: {score:.6f}")
```
Hybrid search provides the largest improvement over pure dense retrieval when queries contain specific entities (company names, product IDs, error codes) or when the embedding model was not trained on your domain vocabulary. In benchmarks, hybrid search with RRF typically improves recall@10 by 5 to 15% over dense-only retrieval. The improvement is smallest for broad, conceptual queries where semantic matching already excels.
## 7. Operational Considerations
### Scaling Patterns
- Vertical scaling: Add more RAM and faster CPUs to a single node. This is the simplest approach and works well up to approximately 10 to 50 million vectors, depending on dimensionality and quantization.
- Sharding: Distribute vectors across multiple nodes based on a partition key. Each shard handles a subset of the data. Queries are fanned out to all shards and results are merged.
- Replication: Copy each shard to multiple nodes for fault tolerance and read throughput. Read queries can be load-balanced across replicas.
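The fan-out-and-merge (scatter-gather) pattern behind sharded queries can be sketched in a few lines: send the query to every shard, take each shard's local top-k, then merge the partial lists into a global top-k. The shard contents and scoring function below are invented for illustration:

```python
import heapq

def shard_search(shard, query, k):
    # Each shard returns its local top-k as (score, doc_id); higher is better.
    scored = [(1.0 / (1.0 + abs(vec - query)), doc_id)
              for doc_id, vec in shard.items()]
    return heapq.nlargest(k, scored)

# Three shards, each holding a disjoint subset of the corpus
shards = [
    {"a1": 0.1, "a2": 5.0},
    {"b1": 0.2, "b2": 9.0},
    {"c1": 0.05, "c2": 7.0},
]
query, k = 0.0, 2

# Scatter: query every shard in parallel (sequentially here for simplicity).
partials = [shard_search(s, query, k) for s in shards]

# Gather: merge the per-shard top-k lists into a global top-k.
merged = heapq.nlargest(k, (hit for p in partials for hit in p))
print([doc_id for _, doc_id in merged])  # ['c1', 'a1']
```

Note that correctness requires each shard to return at least k results; tail latency is set by the slowest shard, which is why production systems replicate shards and hedge requests.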
### Cost Optimization
Vector databases can become expensive at scale because they require significant memory. Key strategies for cost reduction include: using quantization (scalar, product, or binary) to reduce memory per vector; using disk-based indexes (DiskANN) for cold data; implementing tiered storage where frequently accessed data stays in memory while archival data resides on SSD; and choosing serverless pricing models that scale to zero during idle periods.
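Scalar quantization, the simplest of these techniques, maps each float32 component to an int8 code, cutting memory per vector by 4x at a small accuracy cost. A minimal sketch using a per-vector symmetric scale (real systems typically calibrate quantization ranges over the whole dataset rather than per vector):

```python
import numpy as np

def quantize_int8(vec):
    """Map float32 components to int8 codes with a per-vector symmetric scale."""
    scale = float(np.abs(vec).max()) / 127.0 or 1.0  # guard the all-zero vector
    codes = np.round(vec / scale).astype(np.int8)
    return codes, scale

def dequantize_int8(codes, scale):
    """Recover an approximation of the original vector."""
    return codes.astype(np.float32) * scale

vec = np.random.randn(768).astype(np.float32)
codes, scale = quantize_int8(vec)
approx = dequantize_int8(codes, scale)

print(vec.nbytes, codes.nbytes)  # 3072 vs 768 bytes: 4x smaller
# Reconstruction error is bounded by half the quantization step
print(float(np.abs(vec - approx).max()) <= scale / 2 + 1e-3)  # True
```

Product and binary quantization push the same idea further (32x or more compression), usually paired with a reranking pass over full-precision vectors to recover accuracy.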
## Section 18.3 Quiz
1. What distinguishes a vector database from a vector search library like FAISS?
2. What is the difference between pre-filtering and post-filtering in vector search?
3. How does Reciprocal Rank Fusion (RRF) combine results from multiple retrieval systems?
4. When is pgvector a better choice than a purpose-built vector database?
5. Why does hybrid search (dense + sparse) outperform pure dense retrieval for queries with specific entities?
## Key Takeaways
- Vector databases add production features (persistence, filtering, scaling, APIs) on top of ANN algorithms, going well beyond raw index performance.
- Pinecone offers zero-ops managed service; Qdrant and Milvus provide high-performance open-source alternatives with cloud options.
- ChromaDB is ideal for prototyping and small-scale applications; FAISS delivers maximum raw performance as a library.
- pgvector is the pragmatic choice for teams already running PostgreSQL with collections under 10M vectors.
- Metadata filtering strategy (pre-filter, post-filter, or integrated) significantly affects both result quality and query latency.
- Hybrid search with RRF typically improves recall by 5 to 15% over dense-only retrieval, especially for entity-rich queries.
- Start simple, scale up. Begin with ChromaDB or pgvector for prototyping, then migrate to a purpose-built solution when scale or feature requirements demand it.