Module 18 · Section 18.2

Vector Index Data Structures & Algorithms

HNSW graphs, IVF partitioning, Product Quantization, and the tradeoff landscape of approximate nearest neighbor search
★ Big Picture

Finding the most similar vectors in a collection of millions is a computational challenge that brute-force search cannot solve at scale. Approximate Nearest Neighbor (ANN) algorithms trade a small amount of accuracy for dramatic speedups, enabling sub-millisecond search over billions of vectors. Understanding the internals of HNSW, IVF, and Product Quantization is essential for configuring vector indexes that meet your application's latency, recall, and memory requirements. This section provides a deep, implementation-level understanding of the algorithms that power every vector database.

1. The Nearest Neighbor Problem

Given a query vector q and a collection of N vectors in d dimensions, the k-nearest neighbor (k-NN) problem asks for the k vectors most similar to q. The brute-force solution computes the distance between q and every vector in the collection, then returns the top k. This requires O(N × d) operations per query, which becomes prohibitive as N grows into the millions.

# Brute-force k-NN search with NumPy
import numpy as np
import time

def brute_force_knn(query, vectors, k=10, metric="cosine"):
    """
    Exact nearest neighbor search.
    query: (d,) vector
    vectors: (N, d) matrix
    Returns: indices of the k nearest neighbors and their scores
    (cosine similarities or L2 distances, depending on metric)
    """
    if metric == "cosine":
        # For normalized vectors, cosine similarity = dot product
        similarities = np.dot(vectors, query)
        # Higher similarity = more similar, so negate for argsort
        top_k_indices = np.argsort(-similarities)[:k]
        top_k_scores = similarities[top_k_indices]
    elif metric == "l2":
        distances = np.linalg.norm(vectors - query, axis=1)
        top_k_indices = np.argsort(distances)[:k]
        top_k_scores = distances[top_k_indices]
    else:
        raise ValueError(f"Unknown metric: {metric}")
    return top_k_indices, top_k_scores

# Benchmark at different scales
dim = 768
for n_vectors in [10_000, 100_000, 1_000_000]:
    vectors = np.random.randn(n_vectors, dim).astype(np.float32)
    vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)
    query = np.random.randn(dim).astype(np.float32)
    query /= np.linalg.norm(query)

    start = time.time()
    indices, distances = brute_force_knn(query, vectors, k=10)
    elapsed = time.time() - start

    print(f"N={n_vectors:>10,}: {elapsed*1000:.1f}ms")
N=    10,000: 2.3ms
N=   100,000: 18.7ms
N= 1,000,000: 193.4ms

At one million vectors with 768 dimensions, brute-force search takes nearly 200ms per query. For real-time applications that need sub-10ms latency, this is unacceptable. This motivates the use of Approximate Nearest Neighbor (ANN) algorithms, which build index structures that enable much faster search at the cost of occasionally missing the true nearest neighbor.

Measuring ANN Quality: Recall@k

The standard quality metric for ANN algorithms is recall@k: the fraction of true k-nearest neighbors that appear in the ANN algorithm's top-k results. A recall@10 of 0.95 means that on average, 9.5 of the 10 returned results are actual top-10 nearest neighbors. For most RAG applications, recall@10 above 0.95 is sufficient.
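The metric is easy to compute once you have both an exact result set (from brute force on a held-out sample) and the ANN result set. A minimal sketch, assuming both are available as index arrays:

```python
import numpy as np

def recall_at_k(ann_indices, exact_indices):
    """Fraction of the true top-k neighbors that the ANN search returned."""
    ann = set(int(i) for i in ann_indices)
    exact = set(int(i) for i in exact_indices)
    return len(ann & exact) / len(exact)

# Toy example: the ANN search found 9 of the true top-10 neighbors
exact = np.array([3, 7, 12, 19, 25, 31, 40, 44, 58, 60])
approx = np.array([3, 7, 12, 19, 25, 31, 40, 44, 58, 99])
print(recall_at_k(approx, exact))  # 0.9
```

In practice you would average this over a few hundred representative queries rather than trust a single one.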

2. HNSW: Hierarchical Navigable Small World Graphs

HNSW is the most widely used ANN algorithm in production vector databases. It builds a multi-layer graph where each vector is a node, and edges connect vectors that are close to each other in the embedding space. The graph is navigable: starting from any node, you can reach any other node by following a short path of edges, and greedy traversal (always moving to the neighbor closest to the query) finds excellent approximate nearest neighbors.

Graph Construction

HNSW builds the graph incrementally. When a new vector is inserted, it is assigned a random maximum layer according to an exponential distribution. The algorithm then searches for the nearest neighbors of the new vector at each layer and connects them with bidirectional edges. Two parameters control the graph structure:

  1. M: the maximum number of edges per node at each layer. Higher M produces a denser, better-connected graph, improving recall at the cost of memory and build time.
  2. ef_construction: the number of candidate neighbors evaluated when inserting a new node. Higher values produce a higher-quality graph but slow down construction.

Figure 18.4: HNSW builds a multi-layer graph. Search starts at the sparse top layer and descends to the dense bottom layer for local refinement.
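The random layer assignment mentioned above can be sketched in a few lines. In the HNSW paper, a node's maximum layer is drawn as floor(-ln(U) × mL) with U uniform in (0, 1) and the normalization constant mL = 1/ln(M), so the vast majority of nodes live only on layer 0; the constants below follow that convention:

```python
import math
import random

def random_level(M=32, rng=random.Random(0)):
    """Draw a node's maximum layer: floor(-ln(U) * mL), with mL = 1/ln(M)."""
    m_L = 1.0 / math.log(M)
    u = 1.0 - rng.random()  # uniform in (0, 1], avoids log(0)
    return int(-math.log(u) * m_L)

levels = [random_level() for _ in range(100_000)]
# The distribution is geometric-like: roughly 1/M of nodes reach layer 1,
# 1/M^2 reach layer 2, and so on
for layer in range(4):
    frac = sum(1 for v in levels if v >= layer) / len(levels)
    print(f"fraction of nodes on layer >= {layer}: {frac:.4f}")
```

With M = 32 this puts about 3% of nodes on layer 1 or above, which is what keeps the upper layers sparse enough for fast long-range navigation.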

Search Algorithm

At query time, HNSW search begins at the top layer's entry point and greedily navigates toward the query vector. At each layer, it finds the closest node, then descends to the same node in the layer below. At the bottom layer (layer 0, which contains all vectors), it performs a more thorough local search controlled by the ef_search parameter. Higher ef_search values explore more candidates, improving recall at the cost of latency.
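The bottom-layer refinement can be illustrated in isolation. The sketch below is a simplified best-first search over an exact precomputed k-NN graph with a fixed entry point — not real HNSW, which maintains layered neighbor lists and min/max heaps — but it shows the core ef-style loop: keep a bounded candidate set, always expand the closest unexpanded candidate, and stop when the frontier can no longer improve the results:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k_graph = 500, 32, 16
vectors = rng.standard_normal((n, d)).astype(np.float32)

# Precompute an exact k-NN graph; HNSW builds an approximation of this
# incrementally, but the search loop over it is the same idea
sq = (vectors ** 2).sum(axis=1)
dist2 = sq[:, None] + sq[None, :] - 2.0 * vectors @ vectors.T
neighbors = np.argsort(dist2, axis=1)[:, 1:k_graph + 1]  # skip self

def greedy_search(query, entry=0, ef=32, k=10):
    """Best-first search: expand the closest unexpanded candidate,
    keep at most ef results, stop when the frontier can't improve them."""
    def dist_to(i):
        return float(np.linalg.norm(vectors[i] - query))
    visited = {entry}
    frontier = [(dist_to(entry), entry)]
    best = [(dist_to(entry), entry)]     # running top-ef results, sorted
    while frontier:
        frontier.sort()
        d_node, node = frontier.pop(0)
        if d_node > best[-1][0] and len(best) >= ef:
            break                        # closest candidate is too far
        for nb in neighbors[node]:
            if nb in visited:
                continue
            visited.add(nb)
            d_nb = dist_to(nb)
            if len(best) < ef or d_nb < best[-1][0]:
                frontier.append((d_nb, nb))
                best.append((d_nb, nb))
                best.sort()
                best = best[:ef]
    return [i for _, i in best[:k]]

query = rng.standard_normal(d).astype(np.float32)
approx = greedy_search(query, ef=32)
exact = np.argsort(np.linalg.norm(vectors - query, axis=1))[:10]
print("overlap with exact top-10:", len(set(approx) & set(exact)))
```

Raising ef widens the candidate set and delays the stopping condition, which is exactly how ef_search trades latency for recall in the real algorithm.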

# HNSW index with FAISS
import faiss
import numpy as np
import time

dim = 768
n_vectors = 1_000_000

# Generate random normalized vectors
np.random.seed(42)
vectors = np.random.randn(n_vectors, dim).astype(np.float32)
faiss.normalize_L2(vectors)

query = np.random.randn(1, dim).astype(np.float32)
faiss.normalize_L2(query)

# Build HNSW index
# M=32: each node connected to ~32 neighbors
# ef_construction=200: search width during build
index = faiss.IndexHNSWFlat(dim, 32)
index.hnsw.efConstruction = 200

print("Building HNSW index...")
start = time.time()
index.add(vectors)
build_time = time.time() - start
print(f"Build time: {build_time:.1f}s")
print(f"Memory: ~{index.ntotal * dim * 4 / 1e9:.2f} GB (vectors only)")

# Search with different ef values to show recall/latency tradeoff
for ef_search in [16, 64, 256]:
    index.hnsw.efSearch = ef_search

    start = time.time()
    distances, indices = index.search(query, k=10)
    search_time = (time.time() - start) * 1000

    print(f"ef_search={ef_search:>3}: {search_time:.2f}ms, "
          f"top result idx={indices[0][0]}")
Building HNSW index...
Build time: 142.3s
Memory: ~3.07 GB (vectors only)
ef_search= 16: 0.42ms, top result idx=384721
ef_search= 64: 0.89ms, top result idx=384721
ef_search=256: 2.31ms, top result idx=384721
🔑 Key Insight: HNSW Tradeoffs

HNSW provides excellent recall and low latency, but it requires storing the full vectors in memory along with the graph structure. For 1 million 768-dimensional float32 vectors, the vectors alone consume ~3 GB. The graph metadata adds another 20 to 50% overhead depending on M. This makes HNSW memory-intensive for very large collections (100M+ vectors), which is where IVF and quantization techniques become essential.

3. IVF: Inverted File Index

The Inverted File (IVF) index partitions the vector space into nlist regions using k-means clustering. Each vector is assigned to its nearest cluster centroid. At query time, the algorithm first identifies the nprobe closest centroids to the query, then performs an exhaustive search only within those clusters. This reduces the number of distance computations from N to approximately N × nprobe / nlist.

IVF Construction

Building an IVF index requires a training phase in which k-means clustering is run on a representative sample of the data. The number of clusters (nlist) is typically on the order of the square root of N; the FAISS guidelines recommend values between 4×sqrt(N) and 16×sqrt(N). More clusters make the search more selective, but more centroids must be compared against the query to pick which cells to probe.
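The sizing rule of thumb above is easy to tabulate. This sketch just prints candidate nlist values and the resulting average cell size, assuming roughly balanced clusters:

```python
import math

for n in [1_000_000, 10_000_000, 100_000_000]:
    base = math.isqrt(n)  # integer square root of N
    for factor in [1, 4, 16]:
        nlist = factor * base
        per_cell = n / nlist  # average vectors scanned per probed cell
        print(f"N={n:>12,}  nlist={nlist:>8,}  ~{per_cell:,.0f} vectors/cell")
```

The total work per query is roughly nlist centroid comparisons plus nprobe × (N / nlist) vector comparisons, so larger nlist shifts cost from the scan to the centroid step.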

# IVF index with FAISS
import faiss
import numpy as np
import time

dim = 768
n_vectors = 1_000_000

np.random.seed(42)
vectors = np.random.randn(n_vectors, dim).astype(np.float32)
faiss.normalize_L2(vectors)

query = np.random.randn(1, dim).astype(np.float32)
faiss.normalize_L2(query)

# IVF with flat (uncompressed) storage within each cell
nlist = 1024  # Number of Voronoi cells (clusters)
quantizer = faiss.IndexFlatIP(dim)  # Inner product for normalized vectors
index = faiss.IndexIVFFlat(quantizer, dim, nlist, faiss.METRIC_INNER_PRODUCT)

# IVF requires a training step (k-means on the data)
print("Training IVF index...")
start = time.time()
index.train(vectors)
print(f"Training time: {time.time() - start:.1f}s")

# Add vectors to index
start = time.time()
index.add(vectors)
print(f"Add time: {time.time() - start:.1f}s")

# Search with different nprobe values
for nprobe in [1, 8, 32, 128]:
    index.nprobe = nprobe

    start = time.time()
    distances, indices = index.search(query, k=10)
    search_time = (time.time() - start) * 1000

    print(f"nprobe={nprobe:>3}: {search_time:.2f}ms, "
          f"searched {nprobe * n_vectors // nlist:,} vectors")
Training IVF index...
Training time: 28.4s
Add time: 7.2s
nprobe=  1: 0.31ms, searched 976 vectors
nprobe=  8: 1.02ms, searched 7,812 vectors
nprobe= 32: 3.41ms, searched 31,250 vectors
nprobe=128: 12.89ms, searched 125,000 vectors

4. Product Quantization (PQ)

Product Quantization compresses vectors by splitting each d-dimensional vector into m sub-vectors, then quantizing each sub-vector independently using a small codebook. A 768-dimensional vector split into 96 sub-vectors of 8 dimensions each, with 256 codes per sub-vector, requires only 96 bytes of storage instead of 3,072 bytes (768 × 4 for float32). This represents a 32x compression ratio.

How PQ Works

  1. Split: Divide each d-dimensional vector into m sub-vectors of d/m dimensions.
  2. Train codebooks: For each sub-vector position, run k-means with 256 centroids (or another power of 2) on the training data. This produces m codebooks, each with 256 entries.
  3. Encode: Replace each sub-vector with the index of its nearest codebook entry (a single byte for 256 codes).
  4. Compute distances: At query time, precompute a distance lookup table from the query sub-vectors to all codebook entries, then sum table lookups for each compressed vector.
Figure 18.5: Product Quantization splits each vector into sub-vectors and replaces each with a codebook index, achieving high compression ratios.
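The four steps can be sketched end to end with NumPy. This toy version trains each codebook with a few iterations of Lloyd's k-means rather than production-quality clustering, and uses small sizes (d=32, m=4, 16 codes per codebook) so it runs instantly; the structure is what matters, not the constants:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m, n_codes = 5000, 32, 4, 16
ds = d // m                        # dimensions per sub-vector
data = rng.standard_normal((n, d)).astype(np.float32)

# Steps 1-2: split, then train one codebook per sub-vector position
codebooks = np.empty((m, n_codes, ds), dtype=np.float32)
for j in range(m):
    sub = data[:, j * ds:(j + 1) * ds]
    centroids = sub[rng.choice(n, n_codes, replace=False)].copy()
    for _ in range(10):            # a few Lloyd iterations
        assign = np.argmin(
            ((sub[:, None, :] - centroids[None, :, :]) ** 2).sum(-1), axis=1)
        for c in range(n_codes):
            members = sub[assign == c]
            if len(members):
                centroids[c] = members.mean(axis=0)
    codebooks[j] = centroids

# Step 3: encode each sub-vector as the index of its nearest centroid
codes = np.empty((n, m), dtype=np.uint8)
for j in range(m):
    sub = data[:, j * ds:(j + 1) * ds]
    codes[:, j] = np.argmin(
        ((sub[:, None, :] - codebooks[j][None, :, :]) ** 2).sum(-1), axis=1)

# Step 4: asymmetric distance — precompute query-to-codebook lookup
# tables, then sum table entries along each stored code
query = rng.standard_normal(d).astype(np.float32)
lut = np.empty((m, n_codes), dtype=np.float32)
for j in range(m):
    qsub = query[j * ds:(j + 1) * ds]
    lut[j] = ((codebooks[j] - qsub) ** 2).sum(axis=1)
approx_d2 = lut[np.arange(m), codes].sum(axis=1)  # (n,) approx squared L2

exact_d2 = ((data - query) ** 2).sum(axis=1)
print("PQ top-1:", approx_d2.argmin(), " exact top-1:", exact_d2.argmin())
```

Note that the per-vector distance is just m table lookups and additions — no floating-point work against the stored codes — which is why PQ search stays fast even at high compression.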
# IVF + Product Quantization composite index in FAISS
import faiss
import numpy as np
import time

dim = 768
n_vectors = 1_000_000

np.random.seed(42)
vectors = np.random.randn(n_vectors, dim).astype(np.float32)
faiss.normalize_L2(vectors)

query = np.random.randn(1, dim).astype(np.float32)
faiss.normalize_L2(query)

# IVF-PQ: combines partitioning with compression
nlist = 1024          # Number of IVF partitions
m_subquantizers = 96  # Number of sub-quantizers (must divide dim)
nbits = 8             # Bits per sub-quantizer (256 codes)

index = faiss.IndexIVFPQ(
    faiss.IndexFlatIP(dim),  # Coarse quantizer
    dim,
    nlist,
    m_subquantizers,
    nbits
)

# Train and add
print("Training IVF-PQ index...")
index.train(vectors)
index.add(vectors)

# Memory comparison
flat_memory = n_vectors * dim * 4  # float32
pq_memory = n_vectors * m_subquantizers  # 1 byte per sub-quantizer
print(f"Flat memory: {flat_memory / 1e9:.2f} GB")
print(f"PQ memory:   {pq_memory / 1e6:.0f} MB")
print(f"Compression: {flat_memory / pq_memory:.0f}x")

# Search
index.nprobe = 32
start = time.time()
distances, indices = index.search(query, k=10)
search_time = (time.time() - start) * 1000
print(f"Search time: {search_time:.2f}ms")
Training IVF-PQ index...
Flat memory: 3.07 GB
PQ memory:   96 MB
Compression: 32x
Search time: 1.15ms

5. Composite Indexes and Advanced Techniques

IVF-HNSW

Instead of using a flat (brute-force) coarse quantizer in IVF, the coarse quantizer itself can be an HNSW index. This accelerates the cluster assignment step when there are many clusters (tens of thousands), which is important for very large datasets where you need fine-grained partitioning.

ScaNN (Scalable Nearest Neighbors)

Google's ScaNN algorithm introduces anisotropic vector quantization, which accounts for the direction of quantization error rather than minimizing error magnitude uniformly. Standard PQ minimizes the overall reconstruction error, but ScaNN specifically minimizes the error in the direction that matters for inner product computation. This produces better search quality at the same compression ratio.

📘 Binary Quantization

The simplest and most aggressive form of quantization converts each float dimension to a single bit: 1 if positive, 0 if negative. This reduces memory by 32x and enables ultra-fast Hamming distance computation. While the accuracy loss is significant when used alone, binary quantization works well as a first-pass filter followed by rescoring against full-precision vectors. Some vector databases (Qdrant, Weaviate) support this as a "binary quantization" or "oversampling + rescore" mode.

6. Index Selection Guide

| Index Type         | Build Time | Search Speed | Memory | Recall | Best For                        |
|--------------------|------------|--------------|--------|--------|---------------------------------|
| Flat (brute-force) | None       | Slow         | High   | 100%   | < 50K vectors                   |
| HNSW               | Slow       | Very fast    | High   | 98%+   | Low latency, < 10M vectors      |
| IVF-Flat           | Medium     | Fast         | High   | 95%+   | Moderate scale, updatable       |
| IVF-PQ             | Medium     | Fast         | Low    | 90%+   | Large scale, memory-constrained |
| HNSW + PQ          | Slow       | Fast         | Medium | 93%+   | Balanced latency and memory     |
| IVF-HNSW-PQ        | Slow       | Very fast    | Low    | 92%+   | Billion-scale datasets          |
⚠ Common Pitfalls

Do not over-optimize for benchmarks. ANN benchmark scores (like ann-benchmarks.com) measure performance on specific datasets with specific query patterns. Your production workload may have very different characteristics. Always benchmark with your actual data and query distribution. Additionally, remember that index parameters interact: increasing nprobe in IVF can be more effective than increasing ef_search in HNSW for some data distributions, and vice versa.

7. FAISS Index Factory

FAISS provides an index factory that lets you construct composite indexes using a string descriptor. This is the most convenient way to experiment with different configurations.

# FAISS index factory: composing index types with string descriptors
import faiss
import numpy as np

dim = 768
n_vectors = 500_000

vectors = np.random.randn(n_vectors, dim).astype(np.float32)
faiss.normalize_L2(vectors)

# Index factory string format: "preprocessing,coarse_quantizer,encoding"
index_configs = {
    "Flat":          "Flat",
    "HNSW32":        "HNSW32",
    "IVF1024,Flat":  "IVF1024,Flat",
    "IVF1024,PQ96":  "IVF1024,PQ96",
    "IVF1024,SQ8":   "IVF1024,SQ8",   # Scalar quantization (8-bit)
}

for name, factory_str in index_configs.items():
    index = faiss.index_factory(dim, factory_str, faiss.METRIC_INNER_PRODUCT)

    # Train if needed
    if not index.is_trained:
        index.train(vectors[:100_000])  # Train on subset

    index.add(vectors)

    # Estimate memory (rough approximation)
    if "PQ96" in factory_str:
        mem_mb = n_vectors * 96 / 1e6
    elif "SQ8" in factory_str:
        mem_mb = n_vectors * dim / 1e6
    else:
        mem_mb = n_vectors * dim * 4 / 1e6

    print(f"{name:>20}: {n_vectors:,} vectors, ~{mem_mb:.0f} MB")
                Flat: 500,000 vectors, ~1536 MB
              HNSW32: 500,000 vectors, ~1536 MB
        IVF1024,Flat: 500,000 vectors, ~1536 MB
        IVF1024,PQ96: 500,000 vectors, ~48 MB
         IVF1024,SQ8: 500,000 vectors, ~384 MB

Section 18.2 Quiz

1. Why does brute-force nearest neighbor search become impractical at scale?

Answer:
Brute-force search requires computing the distance between the query vector and every vector in the collection, resulting in O(N × d) operations per query. For millions of high-dimensional vectors, this translates to hundreds of milliseconds per query, which is too slow for real-time applications. ANN algorithms reduce this to sub-millisecond search by trading perfect accuracy for speed.

2. What do the parameters M and ef_construction control in HNSW?

Answer:
M controls the maximum number of edges per node in the HNSW graph. Higher M increases connectivity, improving recall but consuming more memory. ef_construction controls how many candidates the algorithm evaluates when inserting a new node; higher values produce a higher-quality graph at the cost of slower build times. Together, they determine the graph's quality and the memory/build-time tradeoff.

3. How does Product Quantization achieve vector compression?

Answer:
PQ divides each d-dimensional vector into m sub-vectors, then independently quantizes each sub-vector using a learned codebook with 256 entries (8 bits). Each sub-vector is replaced by its nearest codebook index (a single byte). A 768-dimensional float32 vector (3,072 bytes) split into 96 sub-vectors requires only 96 bytes: one byte per sub-vector. Distance computation uses precomputed lookup tables from the query to each codebook entry.

4. In IVF indexing, what happens when you increase the nprobe parameter?

Answer:
Increasing nprobe causes the search to examine more cluster partitions at query time. With nprobe=1, only the cluster whose centroid is closest to the query is searched. With nprobe=32, the 32 closest clusters are searched. Higher nprobe values improve recall (finding more true nearest neighbors) but increase latency because more vectors are compared. The optimal nprobe balances your target recall with acceptable query latency.

5. When would you choose IVF-PQ over HNSW?

Answer:
IVF-PQ is preferred over HNSW when memory is a constraint. HNSW stores full-precision vectors plus graph metadata, consuming substantial memory. IVF-PQ compresses vectors to a fraction of their original size (e.g., 32x compression). For large collections (100M+ vectors) where full-precision storage would require hundreds of gigabytes, IVF-PQ enables the entire index to fit in memory. The tradeoff is slightly lower recall compared to HNSW at the same query latency.

Key Takeaways