Module 18 · Section 18.1

Text Embedding Models & Training

From word vectors to sentence transformers: contrastive learning, hard negative mining, Matryoshka embeddings, and the MTEB benchmark landscape
★ Big Picture

Embeddings are the bridge between human language and machine computation. Every semantic search system, every RAG pipeline, and every vector database depends on the quality of the embeddings that encode text as dense vectors. The choice of embedding model, its training procedure, and the similarity metric used to compare vectors together determine whether a retrieval system returns relevant results or noise. This section covers the full lifecycle of text embeddings: how they are trained, how to evaluate them, and how to adapt them to specific domains.

1. From Words to Sentences: The Embedding Evolution

The journey from word-level to sentence-level embeddings represents one of the most consequential progressions in NLP. Early approaches like Word2Vec and GloVe learned to map individual words into dense vectors where geometric relationships encoded semantic relationships. The classic example of king - man + woman ≈ queen demonstrated that these vector spaces captured meaningful analogies, but word embeddings suffered from a fundamental limitation: they assigned a single vector to each word regardless of context.

Contextual embeddings from models like BERT resolved this ambiguity by producing different representations for the same word in different contexts. However, using BERT directly for sentence similarity proved problematic. Computing the similarity between two sentences required passing both through the model simultaneously (cross-encoding), making it computationally infeasible to search across millions of documents at query time.

The Bi-Encoder Architecture

The key innovation that made large-scale semantic search practical was the bi-encoder architecture, introduced by Sentence-BERT (SBERT) in 2019. Instead of feeding two sentences into one model jointly, the bi-encoder processes each sentence independently through the same transformer encoder, producing a fixed-size vector for each. These vectors can be precomputed and stored in an index, enabling similarity search with a simple dot product or cosine similarity operation at query time.

[Figure: side-by-side comparison. The cross-encoder feeds "[CLS] A [SEP] B" through a single transformer to produce a score directly (one full forward pass per query-document pair; scores cannot be precomputed). The bi-encoder encodes Sentence A and Sentence B separately into vectors and compares them with cosine similarity; document vectors are precomputed offline, leaving only a fast similarity operation at query time.]
Figure 18.1: Cross-encoder vs. bi-encoder architecture. The bi-encoder enables precomputation and fast similarity search.

Pooling Strategies

A transformer encoder produces one vector per input token. To obtain a single sentence-level vector, a pooling operation aggregates these token vectors. The three common strategies are: (1) [CLS] pooling, which takes the embedding of the special [CLS] token and relies on the model having been trained to aggregate sentence meaning there; (2) mean pooling, which averages the token embeddings (masking out padding), the default for most Sentence-Transformers models; and (3) max pooling, which takes the element-wise maximum across token positions.
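Mean pooling can be computed directly from the token embeddings a transformer produces. The sketch below is a minimal illustration using plain PyTorch tensors (the token embeddings and attention mask are synthetic stand-ins for real encoder output); it shows how the attention mask excludes padding positions from the average:

```python
import torch

def mean_pool(token_embeddings, attention_mask):
    """Average token embeddings, ignoring padded positions.

    token_embeddings: (batch, seq_len, dim)
    attention_mask:   (batch, seq_len), 1 for real tokens, 0 for padding
    """
    mask = attention_mask.unsqueeze(-1).float()    # (batch, seq_len, 1)
    summed = (token_embeddings * mask).sum(dim=1)  # sum over real tokens only
    counts = mask.sum(dim=1).clamp(min=1e-9)       # number of real tokens
    return summed / counts

# Synthetic example: batch of 2 sequences, 4 tokens each, 6-dim embeddings
token_embs = torch.randn(2, 4, 6)
mask = torch.tensor([[1, 1, 1, 0],   # first sequence has one padding token
                     [1, 1, 1, 1]])

pooled = mean_pool(token_embs, mask)
print(pooled.shape)  # torch.Size([2, 6])
```

In practice a library like Sentence-Transformers performs this pooling internally, as the next example shows.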

# Sentence embedding with mean pooling using Sentence-Transformers
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

sentences = [
    "The cat sat on the mat.",
    "A feline rested on the rug.",
    "Stock prices rose sharply today."
]

# Encode produces normalized vectors by default
embeddings = model.encode(sentences, normalize_embeddings=True)

print(f"Embedding shape: {embeddings.shape}")
print(f"Embedding dimension: {embeddings.shape[1]}")

# Compute pairwise cosine similarity
similarity_matrix = np.dot(embeddings, embeddings.T)
for i in range(len(sentences)):
    for j in range(i + 1, len(sentences)):
        print(f"Similarity('{sentences[i][:30]}...', "
              f"'{sentences[j][:30]}...'): {similarity_matrix[i][j]:.4f}")
Embedding shape: (3, 384)
Embedding dimension: 384
Similarity('The cat sat on the mat...', 'A feline rested on the rug...'): 0.7523
Similarity('The cat sat on the mat...', 'Stock prices rose sharply toda...'): 0.0412
Similarity('A feline rested on the rug...', 'Stock prices rose sharply toda...'): 0.0287

2. Training Embedding Models: Contrastive Learning

Modern embedding models are trained using contrastive learning, a framework where the model learns to pull similar (positive) pairs together and push dissimilar (negative) pairs apart in the embedding space. The quality of the training data, the choice of loss function, and the strategy for selecting hard negatives together determine the final embedding quality.

Loss Functions

Multiple Negatives Ranking Loss (MNRL)

The most widely used loss function for embedding training is Multiple Negatives Ranking Loss (also called InfoNCE). Given a batch of N positive pairs (query, positive_passage), the loss treats the other N-1 passages in the batch as negatives for each query. This "in-batch negatives" approach is highly efficient because it provides N-1 negatives for free, without requiring explicit negative sampling.

# Simplified InfoNCE / Multiple Negatives Ranking Loss
import torch
import torch.nn.functional as F

def multiple_negatives_ranking_loss(query_emb, passage_emb, temperature=0.05):
    """
    query_emb: (batch_size, embed_dim) - query embeddings
    passage_emb: (batch_size, embed_dim) - positive passage embeddings
    Each query_emb[i] pairs with passage_emb[i] (positive).
    All other passages in the batch serve as negatives.
    """
    # Compute similarity matrix: (batch_size, batch_size)
    similarity = torch.matmul(query_emb, passage_emb.T) / temperature

    # Labels: diagonal entries are the positives
    labels = torch.arange(similarity.size(0), device=similarity.device)

    # Cross-entropy loss treats this as N-way classification
    loss = F.cross_entropy(similarity, labels)
    return loss

# Example: batch of 4 query-passage pairs
batch_size, dim = 4, 384
queries = F.normalize(torch.randn(batch_size, dim), dim=-1)
passages = F.normalize(torch.randn(batch_size, dim), dim=-1)

loss = multiple_negatives_ranking_loss(queries, passages)
print(f"MNRL Loss: {loss.item():.4f}")
MNRL Loss: 1.4217

Triplet Loss and Other Objectives

Earlier approaches used triplet loss, which operates on (anchor, positive, negative) triples and enforces a margin between the positive and negative distances. While simpler conceptually, triplet loss is less sample-efficient than MNRL because it uses only one negative per anchor. Other loss variants include cosine similarity loss for regression-style training on continuous similarity scores, and distillation losses that transfer knowledge from a cross-encoder teacher to a bi-encoder student.
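For comparison with MNRL, a minimal triplet loss can be sketched as follows. This is an illustrative implementation using cosine distance on normalized vectors; the margin value of 0.5 is arbitrary, not a recommendation:

```python
import torch
import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, margin=0.5):
    """Enforce: dist(anchor, positive) + margin < dist(anchor, negative).

    All inputs: (batch_size, embed_dim), assumed L2-normalized.
    """
    pos_dist = 1.0 - (anchor * positive).sum(dim=-1)  # cosine distance
    neg_dist = 1.0 - (anchor * negative).sum(dim=-1)
    # Hinge: loss drops to zero once the margin is satisfied
    return F.relu(pos_dist - neg_dist + margin).mean()

# Example: batch of 4 (anchor, positive, negative) triples
batch_size, dim = 4, 384
anchor = F.normalize(torch.randn(batch_size, dim), dim=-1)
positive = F.normalize(torch.randn(batch_size, dim), dim=-1)
negative = F.normalize(torch.randn(batch_size, dim), dim=-1)

loss = triplet_loss(anchor, positive, negative)
print(f"Triplet loss: {loss.item():.4f}")
```

Note that each anchor sees exactly one negative per update, which is why MNRL's N-1 in-batch negatives are more sample-efficient.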

Hard Negative Mining

The choice of negative examples profoundly impacts embedding quality. Random negatives are typically too easy for the model to distinguish, providing little learning signal. Hard negatives are passages that are superficially similar to the query but are not actually relevant. They force the model to learn fine-grained distinctions.

📘 Hard Negative Strategies

Common approaches for mining hard negatives include: (1) BM25 negatives, where you retrieve the top BM25 results that are not labeled as positive; (2) model-mined negatives, where you use a previous version of the embedding model to find near-miss passages; and (3) cross-encoder reranking, where a cross-encoder scores candidates and borderline cases become hard negatives. The most effective pipelines combine multiple strategies, starting with BM25 negatives for initial training and then mining harder negatives with the trained model for further fine-tuning rounds.

# Hard negative mining with a trained embedding model
from sentence_transformers import SentenceTransformer
import numpy as np

def mine_hard_negatives(model, queries, corpus, positives_map,
                        top_k=30, num_negatives=5):
    """
    Mine hard negatives by finding similar but non-relevant passages.

    Args:
        queries: list of query strings
        corpus: list of all passage strings
        positives_map: dict mapping query_idx -> set of positive passage indices
        top_k: number of candidates to retrieve
        num_negatives: number of hard negatives per query
    """
    # Encode everything
    query_embs = model.encode(queries, normalize_embeddings=True)
    corpus_embs = model.encode(corpus, normalize_embeddings=True)

    hard_negatives = {}

    for q_idx, q_emb in enumerate(query_embs):
        # Compute similarities to all passages
        sims = np.dot(corpus_embs, q_emb)

        # Get top-k most similar passages
        top_indices = np.argsort(sims)[::-1][:top_k]

        # Filter out actual positives to keep only hard negatives
        positive_set = positives_map.get(q_idx, set())
        neg_indices = [idx for idx in top_indices if idx not in positive_set]

        hard_negatives[q_idx] = neg_indices[:num_negatives]

    return hard_negatives

# Example usage
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
queries = ["What causes diabetes?", "How does photosynthesis work?"]
corpus = [
    "Diabetes is caused by insulin resistance or insufficient insulin production.",
    "Type 2 diabetes risk factors include obesity and sedentary lifestyle.",
    "Photosynthesis converts sunlight into chemical energy in plants.",
    "The Calvin cycle fixes carbon dioxide into glucose molecules.",
    "Machine learning models require large datasets for training.",
]
positives_map = {0: {0, 1}, 1: {2, 3}}

negatives = mine_hard_negatives(model, queries, corpus, positives_map)
print("Hard negatives for each query:")
for q_idx, neg_ids in negatives.items():
    print(f"  Query: '{queries[q_idx]}'")
    for nid in neg_ids:
        print(f"    Negative: '{corpus[nid][:60]}...'")

3. Modern Embedding Architectures

Matryoshka Representation Learning (MRL)

Traditional embedding models produce fixed-dimension vectors (e.g., 768 or 1536 dimensions). If you need smaller vectors for storage efficiency, you must train a separate model. Matryoshka Representation Learning (Kusupati et al., 2022) solves this by training a single model whose embeddings are useful at multiple dimensionalities. The first d dimensions of a 768-dimensional embedding form a valid d-dimensional embedding, much like nested Russian dolls.

During training, the loss function is computed at multiple truncation points (e.g., dimensions 32, 64, 128, 256, 512, 768), and the gradients from all truncation levels are summed. This forces the model to pack the most important information into the leading dimensions.
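This summed-truncation objective can be sketched with the in-batch InfoNCE loss from earlier. The sketch below is a simplified illustration: production MRL implementations typically weight each truncation level and operate on the model's live outputs, and the truncation dims chosen here are arbitrary.

```python
import torch
import torch.nn.functional as F

def info_nce(q, p, temperature=0.05):
    """In-batch InfoNCE loss (diagonal entries are the positives)."""
    sim = q @ p.T / temperature
    labels = torch.arange(sim.size(0))
    return F.cross_entropy(sim, labels)

def matryoshka_loss(query_emb, passage_emb, dims=(32, 64, 128, 256, 384)):
    """Sum the contrastive loss over several truncation prefixes.

    Each prefix is re-normalized so truncated embeddings stay unit-length.
    """
    total = 0.0
    for d in dims:
        q = F.normalize(query_emb[:, :d], dim=-1)
        p = F.normalize(passage_emb[:, :d], dim=-1)
        total = total + info_nce(q, p)
    return total

# Example: batch of 8 query-passage pairs with 384-dim raw embeddings
batch_size, dim = 8, 384
queries = torch.randn(batch_size, dim)
passages = torch.randn(batch_size, dim)

loss = matryoshka_loss(queries, passages)
print(f"Matryoshka loss: {loss.item():.4f}")
```

Because every truncation level contributes gradients, the model is penalized whenever discriminative information lands only in the trailing dimensions.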

[Figure: a full 768-dimensional Matryoshka embedding with nested prefixes at 64, 128, 256, and 512 dimensions; recall rises from roughly 90% at small prefixes to 100% at the full 768 dimensions. Any prefix can be truncated off to form a valid embedding.]
Figure 18.2: Matryoshka embeddings pack information hierarchically, enabling flexible dimensionality reduction by truncation.

ColBERT: Late Interaction

ColBERT introduces a late interaction paradigm that sits between bi-encoders and cross-encoders. Instead of compressing each sentence into a single vector, ColBERT retains per-token embeddings for both the query and the document. At search time, it computes the maximum similarity between each query token and all document tokens (MaxSim), then sums these scores across query tokens.

This approach preserves much of the expressiveness of cross-encoders while still allowing document token embeddings to be precomputed. The tradeoff is storage: a document with 200 tokens requires storing 200 vectors instead of one. ColBERT v2 addresses this through residual compression, reducing storage by 6 to 10 times while preserving retrieval quality.

# ColBERT-style MaxSim scoring (simplified)
import torch
import torch.nn.functional as F

def colbert_score(query_tokens, doc_tokens):
    """
    Compute ColBERT late-interaction score.

    query_tokens: (num_query_tokens, dim) - per-token query embeddings
    doc_tokens: (num_doc_tokens, dim) - per-token document embeddings
    """
    # Normalize token embeddings
    query_tokens = F.normalize(query_tokens, dim=-1)
    doc_tokens = F.normalize(doc_tokens, dim=-1)

    # Similarity matrix: (num_query_tokens, num_doc_tokens)
    sim_matrix = torch.matmul(query_tokens, doc_tokens.T)

    # MaxSim: for each query token, find max similarity to any doc token
    max_sim_per_query_token = sim_matrix.max(dim=-1).values

    # Sum across query tokens
    score = max_sim_per_query_token.sum()
    return score

# Example
num_q_tokens, num_d_tokens, dim = 8, 50, 128
q_embs = torch.randn(num_q_tokens, dim)
d_embs = torch.randn(num_d_tokens, dim)

score = colbert_score(q_embs, d_embs)
print(f"ColBERT score: {score.item():.4f}")
ColBERT score: 12.3847

4. Embedding Model Ecosystem and Selection

API Embedding Services

For teams that prefer managed solutions, several providers offer embedding APIs. OpenAI's text-embedding-3-small and text-embedding-3-large models support Matryoshka-style dimension reduction through a dimensions parameter. Cohere's embed-v3 models support separate input types for queries and documents, which can improve retrieval quality. Google's Vertex AI provides the Gecko embedding model with built-in task-type specification.

# Using OpenAI embeddings with dimension control
from openai import OpenAI
import numpy as np

client = OpenAI()

texts = [
    "Retrieval augmented generation combines search with LLMs.",
    "RAG systems ground language model outputs in retrieved documents.",
    "The weather forecast predicts rain tomorrow."
]

# Generate embeddings at reduced dimensionality (Matryoshka-style)
response = client.embeddings.create(
    model="text-embedding-3-small",
    input=texts,
    dimensions=256  # Reduce from default 1536 to 256
)

embeddings = np.array([item.embedding for item in response.data])
print(f"Shape: {embeddings.shape}")

# Compute cosine similarities
norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
normalized = embeddings / norms
similarity = np.dot(normalized, normalized.T)

print(f"RAG-related similarity: {similarity[0][1]:.4f}")
print(f"Unrelated similarity:   {similarity[0][2]:.4f}")
Shape: (3, 256)
RAG-related similarity: 0.8341
Unrelated similarity:   0.1205

The MTEB Benchmark

The Massive Text Embedding Benchmark (MTEB) evaluates embedding models across multiple tasks: classification, clustering, pair classification, reranking, retrieval, semantic textual similarity (STS), and summarization. MTEB covers dozens of datasets across multiple languages, making it the standard reference for comparing embedding models.

⚠ Benchmark Limitations

MTEB scores are useful for narrowing your choices, but they do not guarantee performance on your specific data. Models that score highest on MTEB may underperform on domain-specific tasks (legal, medical, scientific) compared to models fine-tuned on in-domain data. Always evaluate candidate models on a representative sample of your actual queries and documents before committing to a production deployment.

Model                  | Dimensions        | Max Tokens | MTEB Avg | Type
-----------------------|-------------------|------------|----------|------------
text-embedding-3-large | 3072 (adjustable) | 8191       | 64.6     | API
text-embedding-3-small | 1536 (adjustable) | 8191       | 62.3     | API
voyage-3               | 1024              | 32000      | 67.5     | API
GTE-Qwen2-7B           | 3584              | 32768      | 70.2     | Open-source
E5-Mistral-7B          | 4096              | 32768      | 66.6     | Open-source
all-MiniLM-L6-v2       | 384               | 256        | 56.3     | Open-source
bge-large-en-v1.5      | 1024              | 512        | 64.2     | Open-source
nomic-embed-text-v1.5  | 768 (Matryoshka)  | 8192       | 62.3     | Open-source
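Evaluating candidates on your own data requires only ranked result lists and relevance labels. The sketch below computes recall@k and MRR@k from precomputed rankings; it is pure Python with synthetic data, and in practice the rankings would come from your retrieval system:

```python
def recall_at_k(ranked_ids, relevant_ids, k=10):
    """Fraction of relevant documents that appear in the top k results."""
    hits = len(set(ranked_ids[:k]) & relevant_ids)
    return hits / len(relevant_ids)

def mrr_at_k(ranked_ids, relevant_ids, k=10):
    """Reciprocal rank of the first relevant document within the top k."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

# Synthetic rankings for two queries: (ranked doc ids, relevant doc ids)
rankings = [
    ([3, 7, 1, 9, 4], {7, 9}),  # relevant docs at ranks 2 and 4
    ([5, 2, 8, 0, 6], {0}),     # relevant doc at rank 4
]

recall = sum(recall_at_k(r, rel, k=5) for r, rel in rankings) / len(rankings)
mrr = sum(mrr_at_k(r, rel, k=5) for r, rel in rankings) / len(rankings)
print(f"recall@5: {recall:.3f}, MRR@5: {mrr:.3f}")  # recall@5: 1.000, MRR@5: 0.375
```

These are the same metrics (recall@10, MRR@10) recommended later in this section for monitoring fine-tuning runs.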

5. Embedding Space Geometry

Similarity Metrics

The choice of similarity metric affects both retrieval quality and computational performance. The three standard metrics are: (1) cosine similarity, which measures the angle between vectors and ignores magnitude; (2) dot product (inner product), which is sensitive to both angle and magnitude; and (3) Euclidean (L2) distance, which measures the straight-line distance between points in the embedding space.

🔑 Key Insight: Normalize First

For most text embedding models, L2-normalizing your vectors before indexing simplifies everything. When vectors are unit-length, cosine similarity equals dot product, and Euclidean distance is a monotonic transformation of cosine distance. This means you can use the fastest available operation (dot product) and get equivalent results to cosine similarity. Most modern embedding models either normalize output by default or provide a parameter to enable it.
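The equivalence is easy to verify numerically. The snippet below uses random vectors and numpy only; the identity ||a_n - b_n||^2 = 2 - 2·cos(a, b) holds for any unit vectors:

```python
import numpy as np

rng = np.random.default_rng(0)
a, b = rng.standard_normal(384), rng.standard_normal(384)

# Cosine similarity of the raw (unnormalized) vectors
cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# After L2 normalization, plain dot product gives the same value
a_n, b_n = a / np.linalg.norm(a), b / np.linalg.norm(b)
dot = np.dot(a_n, b_n)

# Squared Euclidean distance is a monotonic transform of cosine similarity
sq_euclid = np.sum((a_n - b_n) ** 2)

print(np.isclose(cosine, dot))                # True
print(np.isclose(sq_euclid, 2 - 2 * cosine))  # True
```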

Dimensionality and the Curse of Dimensionality

In high-dimensional spaces, distances between points tend to concentrate: the difference between the nearest and farthest neighbor shrinks relative to the overall distance. This phenomenon, known as the curse of dimensionality, means that exact nearest neighbor search becomes less meaningful as dimensionality grows. In practice, this is why approximate nearest neighbor (ANN) algorithms work so well for embeddings: the approximation error introduced by ANN is often smaller than the noise inherent in high-dimensional distance computations.
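The concentration effect is easy to observe empirically: as dimensionality grows, the ratio between the farthest and nearest neighbor distances from a random query shrinks toward 1. The demo below uses uniform random data and numpy only; the exact ratios depend on the seed, but the downward trend is robust:

```python
import numpy as np

rng = np.random.default_rng(42)
n_points = 2000
ratios = {}

for dim in (2, 32, 512):
    points = rng.random((n_points, dim))
    query = rng.random(dim)
    dists = np.linalg.norm(points - query, axis=1)
    # Ratio of farthest to nearest distance; approaches 1 as dim grows
    ratios[dim] = dists.max() / dists.min()
    print(f"dim={dim:4d}  farthest/nearest distance ratio: {ratios[dim]:.2f}")
```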

[Figure: a 2D projection of an embedding space with three semantic clusters: Animals ("cat", "dog", "kitten", "puppy"), Technology ("GPU", "CUDA", "tensor", "CPU"), and Food & Cooking ("pasta", "recipe", "chef"). A query point lands near its relevant cluster.]
Figure 18.3: Embedding spaces organize semantically related concepts into clusters, enabling nearest-neighbor retrieval.

6. Fine-Tuning Embeddings for Domain Specificity

General-purpose embedding models may not capture domain-specific terminology or relationships well. For example, in a legal domain, "consideration" means something entirely different from its everyday usage. Fine-tuning an embedding model on domain-specific data can substantially improve retrieval quality.

# Fine-tuning a sentence transformer on domain-specific data
from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
    losses,
)
from datasets import Dataset

# Prepare training data: (anchor, positive) pairs
# MultipleNegativesRankingLoss treats the other in-batch positives as negatives
train_data = Dataset.from_dict({
    "anchor": [
        "What is consideration in contract law?",
        "Define the doctrine of estoppel",
        "What constitutes a breach of fiduciary duty?",
    ],
    "positive": [
        "Consideration is something of value exchanged between parties to a contract.",
        "Estoppel prevents a party from asserting a claim inconsistent with prior conduct.",
        "A fiduciary breach occurs when a fiduciary acts against the beneficiary's interest.",
    ],
})

# Load base model
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# Configure training
loss = losses.MultipleNegativesRankingLoss(model)

args = SentenceTransformerTrainingArguments(
    output_dir="./legal-embedding-model",
    num_train_epochs=3,
    per_device_train_batch_size=32,
    learning_rate=2e-5,
    warmup_ratio=0.1,
    fp16=True,
)

trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=train_data,
    loss=loss,
)

trainer.train()
model.save_pretrained("./legal-embedding-model")
📘 Practical Fine-Tuning Tips

Effective embedding fine-tuning typically requires at least 1,000 to 10,000 query-passage pairs from your domain. Start with a strong general-purpose base model (such as bge-large-en-v1.5 or GTE-large) rather than training from scratch. Use a small learning rate (1e-5 to 3e-5) and monitor validation retrieval metrics (recall@10, MRR@10) to detect overfitting early. If you have limited labeled data, consider using LLM-generated synthetic queries to augment your training set.

7. Practical Considerations

Query and Document Prefixes

Many modern embedding models (E5, BGE, GTE) use instruction prefixes to distinguish between different input types. For example, the E5 model family expects queries to be prefixed with "query: " and documents with "passage: ". Forgetting these prefixes can significantly degrade retrieval performance. Always check the model documentation for required formatting conventions.
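A small helper keeps the convention in one place. The sketch below hardcodes the E5-style prefixes mentioned above; prefix strings differ between model families, so always take them from the model card rather than this example:

```python
# E5-style instruction prefixes (verify against the model card)
PREFIXES = {"query": "query: ", "passage": "passage: "}

def with_prefix(texts, input_type):
    """Prepend the model's required prefix before encoding."""
    prefix = PREFIXES[input_type]
    return [prefix + t for t in texts]

queries = with_prefix(["What causes diabetes?"], "query")
passages = with_prefix(["Diabetes is caused by insulin resistance."], "passage")

print(queries[0])   # query: What causes diabetes?
print(passages[0])  # passage: Diabetes is caused by insulin resistance.
```

Centralizing the prefixes this way makes it hard to index documents with one convention and query with another, a common silent failure mode.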

Sequence Length and Truncation

Every embedding model has a maximum sequence length, beyond which input is truncated. Older models like all-MiniLM-L6-v2 support only 256 tokens, while newer models such as nomic-embed-text-v1.5 and GTE-Qwen2 handle 8,192 or even 32,768 tokens. When your documents exceed the model's limit, you must chunk them before embedding, which is covered in detail in Section 18.4.

Batch Size and Throughput

Embedding throughput scales linearly with batch size up to the GPU memory limit. For production workloads, encode documents in large batches (256 to 1024) to maximize GPU utilization. For real-time query encoding, latency is more important than throughput, so smaller batch sizes (1 to 16) are typical. Consider using ONNX Runtime or TensorRT for optimized inference if you are serving embeddings at scale.
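The batching pattern itself is simple. In the sketch below, `encode_fn` stands in for any model's encode call; a dummy encoder is used so the example runs without a GPU or model download:

```python
import numpy as np

def encode_in_batches(texts, encode_fn, batch_size=256):
    """Encode a large list of texts in fixed-size batches, then stack."""
    chunks = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        chunks.append(encode_fn(batch))
    return np.vstack(chunks)

# Dummy encoder standing in for model.encode (fixed-size random vectors)
def dummy_encode(batch):
    return np.random.randn(len(batch), 384)

texts = [f"document {i}" for i in range(1000)]
embeddings = encode_in_batches(texts, dummy_encode, batch_size=256)
print(embeddings.shape)  # (1000, 384)
```

With a real model, `encode_fn` would be `model.encode`; Sentence-Transformers also accepts a `batch_size` argument directly, making this wrapper unnecessary in that specific case.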

Section 18.1 Quiz

1. What is the key architectural difference between a cross-encoder and a bi-encoder?

Show Answer
A cross-encoder processes both sentences jointly through a single transformer pass, producing a similarity score directly. A bi-encoder processes each sentence independently through separate (shared-weight) encoder passes, producing a vector for each sentence. The bi-encoder enables precomputation of document vectors, while the cross-encoder must recompute for every query-document pair.

2. Why is mean pooling generally preferred over [CLS] token pooling for sentence embeddings?

Show Answer
Mean pooling averages information from all token positions, producing a more comprehensive sentence representation that accounts for every word. The [CLS] token was designed for classification tasks and may not aggregate sentence-level semantics as effectively unless the model was specifically trained with a [CLS]-based objective. Empirically, mean pooling tends to produce more robust embeddings across diverse tasks.

3. What problem do Matryoshka embeddings solve, and how?

Show Answer
Matryoshka embeddings solve the problem of needing different embedding dimensionalities for different use cases (e.g., low-latency search vs. high-accuracy search). They achieve this by training a single model whose embeddings are valid at multiple truncation points. During training, the loss is computed at several dimension prefixes simultaneously, forcing the model to concentrate the most important information in the leading dimensions. You can then truncate the vector to any prefix length and still get useful embeddings.

4. Why are hard negatives important for training embedding models?

Show Answer
Hard negatives are passages that are superficially similar to the query (e.g., they share keywords) but are not actually relevant. They force the model to learn fine-grained semantic distinctions rather than relying on simple lexical overlap. Training with only random negatives results in a model that can distinguish obviously different topics but fails to differentiate between closely related passages. Hard negatives push the model to develop deeper understanding of relevance.

5. When vectors are L2-normalized, what is the relationship between cosine similarity and dot product?

Show Answer
When vectors are L2-normalized (unit length), cosine similarity and dot product are mathematically identical. Cosine similarity is defined as dot(a, b) / (||a|| * ||b||). When ||a|| = ||b|| = 1, this simplifies to dot(a, b). This is why normalizing embeddings is a common best practice: it allows you to use the faster dot product operation while getting the same ranking as cosine similarity.

Key Takeaways