Embeddings are the backbone of modern search, retrieval, and recommendation systems. While off-the-shelf embedding models work well for general text, domain-specific applications often benefit enormously from fine-tuned embeddings that understand the nuances of your particular domain. A legal search engine needs embeddings that distinguish between subtly different contract clauses; a medical retrieval system needs embeddings that capture clinical relationships. This section covers why and how to fine-tune models for better representations.
1. Why Fine-Tune for Representations?
Off-the-shelf embedding models like OpenAI's text-embedding-3 or open-source models like BGE and E5 are trained on broad web data. They produce good general-purpose embeddings, but they may not capture the semantic distinctions that matter in your specific domain. Fine-tuning teaches the model which texts should be similar and which should be different according to your application's needs.
1.1 When Off-the-Shelf Falls Short
General embedding models struggle in several common scenarios. Domain-specific vocabulary (medical terms, legal jargon, internal company terminology) may be poorly represented. The notion of "similarity" may differ from the general case: in a customer support system, two tickets describing the same bug should be similar even if they use completely different language. In a patent search system, documents covering the same invention should cluster together despite varying levels of technical detail.
| Scenario | Off-the-Shelf Performance | After Fine-Tuning | Improvement |
|---|---|---|---|
| General web search | NDCG@10: 0.52 | Not needed | N/A |
| Medical literature retrieval | NDCG@10: 0.38 | NDCG@10: 0.56 | +47% |
| Legal clause matching | NDCG@10: 0.31 | NDCG@10: 0.54 | +74% |
| Internal docs search | NDCG@10: 0.42 | NDCG@10: 0.61 | +45% |
| Customer support dedup | F1: 0.65 | F1: 0.84 | +29% |
The more specialized your domain, the more you benefit from fine-tuning. If your domain vocabulary overlaps heavily with general web text (e.g., product reviews, news articles), off-the-shelf embeddings will work reasonably well. But if your domain has specialized terminology, unusual notions of similarity, or if retrieval precision is critical to your application, fine-tuning can yield 30% to 70% improvements.
2. Encoder-Only vs. Decoder-Only for Embeddings
Historically, encoder-only models (BERT, RoBERTa) dominated the embedding space because their bidirectional architecture naturally produces rich token representations. Decoder-only models (GPT, Llama) are autoregressive and were not originally designed for embeddings. However, recent work has shown that decoder-only models can produce competitive embeddings with the right training approach.
| Aspect | Encoder-Only | Decoder-Only |
|---|---|---|
| Architecture | Bidirectional (BERT, RoBERTa) | Causal/autoregressive (Llama, Mistral) |
| Pooling strategy | [CLS] token or mean pooling | Last token or mean pooling |
| Typical model size | 100M to 400M parameters | 1B to 70B parameters |
| Embedding dimension | 768 to 1024 | 2048 to 8192 |
| Max sequence length | 512 tokens (typical) | 4K to 128K tokens |
| Inference speed | Fast (small model) | Slower (large model) |
| Quality for retrieval | Excellent with fine-tuning | Competitive with fine-tuning |
| Best for | High-throughput retrieval | When you already have the model deployed |
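The pooling strategies in the table determine how a sequence of per-token vectors becomes a single embedding. A minimal numpy sketch of the two common choices, mean pooling with an attention mask (encoder style) and last-token pooling (decoder style); the shapes and values are illustrative assumptions, not any library's API:

```python
import numpy as np

def mean_pool(token_embeddings: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Average token embeddings, ignoring padding positions.

    token_embeddings: (seq_len, dim); attention_mask: (seq_len,) of 0/1.
    """
    mask = attention_mask[:, None].astype(float)      # (seq_len, 1)
    summed = (token_embeddings * mask).sum(axis=0)    # sum over real tokens only
    count = mask.sum()                                # number of real tokens
    return summed / np.maximum(count, 1e-9)

def last_token_pool(token_embeddings: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Take the embedding of the last non-padding token (causal-model style)."""
    last_idx = int(attention_mask.sum()) - 1
    return token_embeddings[last_idx]
```

For causal models the last token is the only position that has attended to the full input, which is why last-token pooling is the natural analogue of the encoder's [CLS] vector.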
3. Contrastive Learning for Embeddings
The standard approach for fine-tuning embedding models is contrastive learning. The core idea is simple: train the model so that embeddings of semantically similar texts are close together, while embeddings of dissimilar texts are far apart. This is achieved through carefully constructed training pairs and a contrastive loss function.
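To make "close together, far apart" concrete, here is a minimal sketch of a triplet-style contrastive loss computed on precomputed embedding vectors; the margin value is an illustrative choice, not a recommendation from the text:

```python
import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def triplet_loss(anchor: np.ndarray, positive: np.ndarray,
                 negative: np.ndarray, margin: float = 0.5) -> float:
    """Zero when the positive is at least `margin` more similar to the
    anchor than the negative; positive (a gradient signal) otherwise."""
    return max(0.0, cosine_sim(anchor, negative) - cosine_sim(anchor, positive) + margin)
```

When the triplet is already correctly ordered by more than the margin, the loss is zero and the model is not pushed further, which keeps training focused on the hard cases.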
3.1 Training Data: Pairs and Triplets
```python
# Preparing contrastive training data
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ContrastivePair:
    """A pair of texts with a similarity label."""
    anchor: str                     # The query or reference text
    positive: str                   # Text that should be similar to anchor
    negative: Optional[str] = None  # Text that should be different (for triplet loss)
    score: float = 1.0              # Similarity score (0 to 1) for soft labels


# Example: medical retrieval training data
medical_pairs = [
    ContrastivePair(
        anchor="What are the symptoms of type 2 diabetes?",
        positive="Type 2 diabetes symptoms include increased thirst, frequent "
                 "urination, blurred vision, fatigue, and slow wound healing.",
        negative="Type 1 diabetes is an autoimmune condition where the immune "
                 "system attacks insulin-producing beta cells in the pancreas.",
    ),
    ContrastivePair(
        anchor="Treatment options for hypertension",
        positive="First-line treatments for high blood pressure include ACE "
                 "inhibitors, ARBs, calcium channel blockers, and thiazide "
                 "diuretics, often combined with lifestyle modifications.",
        negative="Hypotension, or low blood pressure, is typically treated by "
                 "increasing fluid intake and wearing compression stockings.",
    ),
]


# Convert to the format expected by Sentence Transformers
def pairs_to_dataset(pairs: List[ContrastivePair]) -> dict:
    """Convert contrastive pairs to training format.

    All pairs should consistently include or omit negatives; mixing the
    two would produce columns of different lengths.
    """
    anchors = [p.anchor for p in pairs]
    positives = [p.positive for p in pairs]
    negatives = [p.negative for p in pairs if p.negative]
    if negatives:
        return {"anchor": anchors, "positive": positives, "negative": negatives}
    return {"anchor": anchors, "positive": positives}
```
3.2 Fine-Tuning with Sentence Transformers
```python
from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
    losses,
)
from datasets import Dataset

# Load a pre-trained embedding model
model = SentenceTransformer("BAAI/bge-base-en-v1.5")

# Prepare training dataset
train_data = Dataset.from_dict({
    "anchor": [
        "What causes migraines?",
        "Side effects of metformin",
        "How to diagnose celiac disease",
    ],
    "positive": [
        "Migraines are caused by abnormal brain activity affecting nerve "
        "signals, chemicals, and blood vessels in the brain.",
        "Common side effects of metformin include nausea, diarrhea, stomach "
        "pain, and a metallic taste in the mouth.",
        "Celiac disease is diagnosed through blood tests for specific "
        "antibodies (tTG-IgA) followed by an intestinal biopsy.",
    ],
    "negative": [
        "Tension headaches are the most common type of headache and are "
        "usually caused by muscle tension in the head and neck.",
        "Metformin is a first-line medication for type 2 diabetes that "
        "works by reducing glucose production in the liver.",
        "Irritable bowel syndrome is a functional disorder affecting the "
        "large intestine with symptoms of cramping and bloating.",
    ],
})

# Choose a loss function
# MultipleNegativesRankingLoss: best for (anchor, positive) pairs; also
#   accepts (anchor, positive, negative) triplets as hard negatives
# TripletLoss: margin-based loss for (anchor, positive, negative) triplets
loss = losses.MultipleNegativesRankingLoss(model)

# Configure training
training_args = SentenceTransformerTrainingArguments(
    output_dir="./models/medical-embeddings",
    num_train_epochs=3,
    per_device_train_batch_size=32,
    learning_rate=2e-5,
    warmup_ratio=0.1,
    fp16=True,
    # To evaluate during training, also pass an eval_dataset to the trainer
    # and set eval_strategy="steps" with eval_steps; without an eval
    # dataset, enabling evaluation raises an error.
    save_strategy="steps",
    save_steps=100,
    logging_steps=10,
)

# Train
trainer = SentenceTransformerTrainer(
    model=model,
    args=training_args,
    train_dataset=train_data,
    loss=loss,
)
trainer.train()
model.save_pretrained("./models/medical-embeddings-final")
```
MultipleNegativesRankingLoss is the workhorse. For each anchor, this loss treats the positives of every other example in the batch as negatives, so you get many negative pairs "for free" from each batch. With a batch size of 32, each anchor has 1 positive and 31 in-batch negatives. This is why larger batch sizes generally produce better embedding models: more negatives lead to a harder and more informative contrastive signal.
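The in-batch mechanism can be sketched on precomputed embeddings: build the anchor-positive similarity matrix and treat the diagonal as the correct class in a softmax cross-entropy. This is a simplified numpy sketch, not the library's implementation; the `scale` temperature is an assumed default:

```python
import numpy as np

def mnrl_loss(anchor_embs: np.ndarray, positive_embs: np.ndarray,
              scale: float = 20.0) -> float:
    """In-batch softmax cross-entropy: anchor i's own positive is class i;
    every other positive in the batch acts as a negative."""
    a = anchor_embs / np.linalg.norm(anchor_embs, axis=1, keepdims=True)
    p = positive_embs / np.linalg.norm(positive_embs, axis=1, keepdims=True)
    scores = scale * (a @ p.T)                    # (batch, batch) scaled cosine sims
    scores -= scores.max(axis=1, keepdims=True)   # numerical stability
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))    # correct pairs sit on the diagonal
```

Because the denominator grows with batch size, each additional example in the batch tightens the contrastive signal at no extra labeling cost.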
4. Evaluating Embedding Quality
Evaluating fine-tuned embeddings requires task-specific metrics. For retrieval, the standard metrics are NDCG@k, Recall@k, and MRR. For clustering and classification, you can measure cluster purity or downstream classifier accuracy. Always evaluate on a held-out test set that was not seen during training.
```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import InformationRetrievalEvaluator


def evaluate_retrieval_quality(
    model: SentenceTransformer,
    queries: dict,   # {qid: query_text}
    corpus: dict,    # {cid: corpus_text}
    relevant: dict,  # {qid: set(cid1, cid2, ...)}
) -> dict:
    """Evaluate embedding model on retrieval task."""
    evaluator = InformationRetrievalEvaluator(
        queries=queries,
        corpus=corpus,
        relevant_docs=relevant,
        name="domain-retrieval",
        ndcg_at_k=[1, 5, 10],
        precision_recall_at_k=[1, 5, 10, 50],
        mrr_at_k=[10],
        show_progress_bar=True,
    )
    return evaluator(model)


def compare_models(
    model_names: list,
    queries: dict,
    corpus: dict,
    relevant: dict,
):
    """Compare multiple embedding models on the same evaluation set."""
    results = {}
    for name in model_names:
        model = SentenceTransformer(name)
        metrics = evaluate_retrieval_quality(model, queries, corpus, relevant)
        # Metric key names can vary between sentence-transformers versions;
        # inspect the returned dict if these lookups come back empty.
        results[name] = {
            "NDCG@10": metrics.get("domain-retrieval_ndcg@10", 0),
            "Recall@10": metrics.get("domain-retrieval_recall@10", 0),
            "MRR@10": metrics.get("domain-retrieval_mrr@10", 0),
        }
        print(f"\n{name}:")
        for metric, value in results[name].items():
            print(f"  {metric}: {value:.4f}")
    return results


# Example comparison
# compare_models(
#     ["BAAI/bge-base-en-v1.5", "./models/medical-embeddings-final"],
#     queries, corpus, relevant_docs,
# )
```
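For intuition about what the evaluator reports, the three metrics can be computed by hand for a single query with binary relevance. A small sketch (these helper names are illustrative, not the library's API):

```python
import math
from typing import List, Set

def recall_at_k(ranked_ids: List[str], relevant_ids: Set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top-k results."""
    return len(set(ranked_ids[:k]) & relevant_ids) / len(relevant_ids)

def mrr_at_k(ranked_ids: List[str], relevant_ids: Set[str], k: int) -> float:
    """Reciprocal rank of the first relevant document in the top-k, else 0."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(ranked_ids: List[str], relevant_ids: Set[str], k: int) -> float:
    """DCG of the ranking divided by the DCG of an ideal ranking (binary gains)."""
    dcg = sum(1.0 / math.log2(r + 1)
              for r, d in enumerate(ranked_ids[:k], start=1) if d in relevant_ids)
    ideal = sum(1.0 / math.log2(r + 1)
                for r in range(1, min(len(relevant_ids), k) + 1))
    return dcg / ideal if ideal else 0.0
```

NDCG discounts hits logarithmically by rank, which is why it rewards models that put relevant documents at the very top rather than merely somewhere in the top k.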
5. When to Fine-Tune vs. Use Off-the-Shelf
Fine-tuned embeddings need reindexing. If you fine-tune your embedding model, all previously computed embeddings in your vector database become stale. You must recompute embeddings for your entire corpus using the new model and reindex them. For large corpora (millions of documents), this can take hours and significant compute. Plan for this cost before committing to embedding fine-tuning, and establish a reindexing pipeline that can run incrementally.
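An incremental reindexing loop can be as simple as batching over document ids so the job can be resumed or run in chunks. In this sketch, `fetch_texts`, `embed`, and `upsert` are hypothetical stand-ins for your document store, the fine-tuned model's encode call, and your vector database client:

```python
from typing import Callable, List

def reindex_corpus(
    doc_ids: List[str],
    fetch_texts: Callable[[List[str]], List[str]],    # load raw docs by id (assumed)
    embed: Callable[[List[str]], List[list]],         # new model's encode fn (assumed)
    upsert: Callable[[List[str], List[list]], None],  # write to vector DB (assumed)
    batch_size: int = 256,
) -> int:
    """Recompute embeddings batch by batch; returns the number of docs reindexed."""
    done = 0
    for start in range(0, len(doc_ids), batch_size):
        batch = doc_ids[start:start + batch_size]
        upsert(batch, embed(fetch_texts(batch)))
        done += len(batch)
    return done
```

Checkpointing the last completed batch (e.g., persisting `start`) turns this into a resumable job, which matters when a multi-hour reindex is interrupted.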
```python
# Practical: deciding whether to fine-tune embeddings
def should_finetune_embeddings(
    baseline_ndcg: float,
    target_ndcg: float,
    corpus_size: int,
    num_training_pairs: int,
    reindex_cost_hours: float,
) -> dict:
    """Decision helper for embedding fine-tuning."""
    gap = target_ndcg - baseline_ndcg
    has_enough_data = num_training_pairs >= 1000
    gap_is_significant = gap > 0.05

    recommendation = "off-the-shelf"
    reasons = []
    if not gap_is_significant:
        reasons.append("Gap to target is small (<0.05 NDCG); fine-tuning unlikely to help")
    elif not has_enough_data:
        reasons.append("Need at least 1,000 training pairs; consider generating "
                       "synthetic pairs with an LLM")
        recommendation = "generate_data_first"
    else:
        expected_improvement = min(gap * 1.5, 0.25)  # Conservative estimate
        expected_ndcg = baseline_ndcg + expected_improvement
        if expected_ndcg >= target_ndcg:
            recommendation = "fine-tune"
            reasons.append(f"Expected NDCG after fine-tuning: ~{expected_ndcg:.2f}")
        else:
            recommendation = "fine-tune + improve retrieval pipeline"
            reasons.append("Fine-tuning alone may not close the gap; "
                           "consider hybrid retrieval (BM25 + dense)")
    reasons.append(f"Reindexing will take ~{reindex_cost_hours:.1f} hours "
                   f"for {corpus_size:,} documents")

    return {
        "recommendation": recommendation,
        "baseline": baseline_ndcg,
        "target": target_ndcg,
        "gap": gap,
        "reasons": reasons,
    }


result = should_finetune_embeddings(
    baseline_ndcg=0.38,
    target_ndcg=0.55,
    corpus_size=500_000,
    num_training_pairs=5_000,
    reindex_cost_hours=3.5,
)
for k, v in result.items():
    print(f"  {k}: {v}")
```
Section 13.5 Quiz
Key Takeaways
- Fine-tuned embeddings provide 30% to 70% improvement over off-the-shelf models in specialized domains where vocabulary and similarity notions differ from general text.
- Encoder-only models (BERT, BGE) remain the practical choice for high-throughput embedding tasks due to their small size and fast inference; decoder-only models are competitive but slower.
- Contrastive learning with MultipleNegativesRankingLoss is the standard approach: it uses in-batch negatives to create a strong training signal without explicit hard negative mining.
- You need at least 1,000 training pairs for effective fine-tuning; if you have fewer, generate synthetic pairs using an LLM before fine-tuning.
- Always benchmark off-the-shelf first to measure the actual performance gap before investing in fine-tuning and the associated reindexing costs.
- Reindexing is a hidden cost: switching to a fine-tuned embedding model requires recomputing embeddings for your entire corpus, which must be planned into the deployment timeline.