Module 13 · Section 13.5

Fine-Tuning for Representation Learning

Training models to produce high-quality embeddings for retrieval, similarity, and clustering through contrastive learning and sentence transformers
★ Big Picture

Embeddings are the backbone of modern search, retrieval, and recommendation systems. While off-the-shelf embedding models work well for general text, domain-specific applications often benefit enormously from fine-tuned embeddings that understand the nuances of your particular domain. A legal search engine needs embeddings that distinguish between subtly different contract clauses; a medical retrieval system needs embeddings that capture clinical relationships. This section covers why and how to fine-tune models for better representations.

1. Why Fine-Tune for Representations?

Off-the-shelf embedding models like OpenAI's text-embedding-3 or open-source models like BGE and E5 are trained on broad web data. They produce good general-purpose embeddings, but they may not capture the semantic distinctions that matter in your specific domain. Fine-tuning teaches the model which texts should be similar and which should be different according to your application's needs.

1.1 When Off-the-Shelf Falls Short

General embedding models struggle in several common scenarios. Domain-specific vocabulary (medical terms, legal jargon, internal company terminology) may be poorly represented. The notion of "similarity" may differ from the general case: in a customer support system, two tickets describing the same bug should be similar even if they use completely different language. In a patent search system, documents covering the same invention should cluster together despite varying levels of technical detail.

| Scenario | Off-the-Shelf Performance | After Fine-Tuning | Improvement |
|---|---|---|---|
| General web search | NDCG@10: 0.52 | Not needed | N/A |
| Medical literature retrieval | NDCG@10: 0.38 | NDCG@10: 0.56 | +47% |
| Legal clause matching | NDCG@10: 0.31 | NDCG@10: 0.54 | +74% |
| Internal docs search | NDCG@10: 0.42 | NDCG@10: 0.61 | +45% |
| Customer support dedup | F1: 0.65 | F1: 0.84 | +29% |
🔑 Key Insight

The more specialized your domain, the more you benefit from fine-tuning. If your domain vocabulary overlaps heavily with general web text (e.g., product reviews, news articles), off-the-shelf embeddings will work reasonably well. But if your domain has specialized terminology, unusual notions of similarity, or if retrieval precision is critical to your application, fine-tuning can yield 30% to 70% improvements.

2. Encoder-Only vs. Decoder-Only for Embeddings

Historically, encoder-only models (BERT, RoBERTa) dominated the embedding space because their bidirectional architecture naturally produces rich token representations. Decoder-only models (GPT, Llama) are autoregressive and were not originally designed for embeddings. However, recent work has shown that decoder-only models can produce competitive embeddings with the right training approach.

[Figure: Embedding extraction, encoder vs. decoder models. Left panel, encoder-only (BERT, BGE): bidirectional attention where all tokens attend to all other tokens, the [CLS] token serves as the embedding, a native fit for embeddings at 100M to 400M params. Right panel, decoder-only (Llama, Mistral): causal attention where each token attends only to previous tokens, the last token serves as the embedding, requiring adaptation for embeddings at 1B to 70B params.]
Figure 13.11: Encoder models use [CLS] or mean pooling; decoder models typically use the last token as the sentence embedding
| Aspect | Encoder-Only | Decoder-Only |
|---|---|---|
| Architecture | Bidirectional (BERT, RoBERTa) | Causal/autoregressive (Llama, Mistral) |
| Pooling strategy | [CLS] token or mean pooling | Last token or mean pooling |
| Typical model size | 100M to 400M parameters | 1B to 70B parameters |
| Embedding dimension | 768 to 1024 | 2048 to 8192 |
| Max sequence length | 512 tokens (typical) | 4K to 128K tokens |
| Inference speed | Fast (small model) | Slower (large model) |
| Quality for retrieval | Excellent with fine-tuning | Competitive with fine-tuning |
| Best for | High-throughput retrieval | When you already have the model deployed |
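To make last-token pooling concrete, here is a minimal PyTorch sketch (the function name and tensor shapes are illustrative; `hidden_states` would be the final-layer output of any decoder-only model, with right-padded inputs):

```python
import torch

def last_token_pool(hidden_states: torch.Tensor,
                    attention_mask: torch.Tensor) -> torch.Tensor:
    """Pick the hidden state of the last non-padding token per sequence.

    hidden_states: (batch, seq_len, dim) final-layer states
    attention_mask: (batch, seq_len) with 1 for real tokens, 0 for padding
    """
    lengths = attention_mask.sum(dim=1) - 1          # index of last real token
    batch_idx = torch.arange(hidden_states.size(0))
    emb = hidden_states[batch_idx, lengths]          # (batch, dim)
    # L2-normalize so cosine similarity reduces to a dot product
    return torch.nn.functional.normalize(emb, p=2, dim=1)
```

In practice you would feed `model(**batch).last_hidden_state` and `batch["attention_mask"]` from a Hugging Face decoder checkpoint into this function; with left-padded inputs the last position works directly, but with right-padding the mask-based indexing above is required.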

3. Contrastive Learning for Embeddings

The standard approach for fine-tuning embedding models is contrastive learning. The core idea is simple: train the model so that embeddings of semantically similar texts are close together, while embeddings of dissimilar texts are far apart. This is achieved through carefully constructed training pairs and a contrastive loss function.

3.1 Training Data: Pairs and Triplets

# Preparing contrastive training data
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class ContrastivePair:
    """A pair of texts with a similarity label."""
    anchor: str          # The query or reference text
    positive: str        # Text that should be similar to anchor
    negative: Optional[str] = None  # Text that should be different (for triplet loss)
    score: float = 1.0   # Similarity score (0 to 1) for soft labels

# Example: medical retrieval training data
medical_pairs = [
    ContrastivePair(
        anchor="What are the symptoms of type 2 diabetes?",
        positive="Type 2 diabetes symptoms include increased thirst, frequent "
                 "urination, blurred vision, fatigue, and slow wound healing.",
        negative="Type 1 diabetes is an autoimmune condition where the immune "
                 "system attacks insulin-producing beta cells in the pancreas."
    ),
    ContrastivePair(
        anchor="Treatment options for hypertension",
        positive="First-line treatments for high blood pressure include ACE "
                 "inhibitors, ARBs, calcium channel blockers, and thiazide "
                 "diuretics, often combined with lifestyle modifications.",
        negative="Hypotension, or low blood pressure, is typically treated by "
                 "increasing fluid intake and wearing compression stockings."
    ),
]

# Convert to the format expected by Sentence Transformers
def pairs_to_dataset(pairs: List[ContrastivePair]) -> dict:
    """Convert contrastive pairs to training format."""
    anchors = [p.anchor for p in pairs]
    positives = [p.positive for p in pairs]
    negatives = [p.negative for p in pairs if p.negative]

    if negatives:
        # Columns must have equal length: either every pair carries a
        # negative or none does
        if len(negatives) != len(pairs):
            raise ValueError("Either all pairs or no pairs should have a negative")
        return {
            "anchor": anchors,
            "positive": positives,
            "negative": negatives,
        }
    return {
        "anchor": anchors,
        "positive": positives,
    }

3.2 Fine-Tuning with Sentence Transformers

from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
    losses,
)
from datasets import Dataset

# Load a pre-trained embedding model
model = SentenceTransformer("BAAI/bge-base-en-v1.5")

# Prepare training dataset
train_data = Dataset.from_dict({
    "anchor": [
        "What causes migraines?",
        "Side effects of metformin",
        "How to diagnose celiac disease",
    ],
    "positive": [
        "Migraines are caused by abnormal brain activity affecting nerve "
        "signals, chemicals, and blood vessels in the brain.",
        "Common side effects of metformin include nausea, diarrhea, stomach "
        "pain, and a metallic taste in the mouth.",
        "Celiac disease is diagnosed through blood tests for specific "
        "antibodies (tTG-IgA) followed by an intestinal biopsy.",
    ],
    "negative": [
        "Tension headaches are the most common type of headache and are "
        "usually caused by muscle tension in the head and neck.",
        "Metformin is a first-line medication for type 2 diabetes that "
        "works by reducing glucose production in the liver.",
        "Irritable bowel syndrome is a functional disorder affecting the "
        "large intestine with symptoms of cramping and bloating.",
    ],
})

# Choose a loss function
# MultipleNegativesRankingLoss: uses in-batch negatives; an optional third
# column is treated as an explicit hard negative for each anchor
# TripletLoss: enforces a margin over (anchor, positive, negative) triplets
loss = losses.MultipleNegativesRankingLoss(model)

# Configure training
training_args = SentenceTransformerTrainingArguments(
    output_dir="./models/medical-embeddings",
    num_train_epochs=3,
    per_device_train_batch_size=32,
    learning_rate=2e-5,
    warmup_ratio=0.1,
    fp16=True,
    # eval_strategy="steps" would require an eval_dataset passed to the
    # trainer; this minimal example trains without a validation split
    save_strategy="steps",
    save_steps=100,
    logging_steps=10,
)

# Train
trainer = SentenceTransformerTrainer(
    model=model,
    args=training_args,
    train_dataset=train_data,
    loss=loss,
)

trainer.train()
model.save_pretrained("./models/medical-embeddings-final")
📝 Note

MultipleNegativesRankingLoss is the workhorse. This loss function treats all other examples in the batch as negatives, meaning you get many negative pairs "for free" from each batch. With a batch size of 32, each anchor has 1 positive and 31 in-batch negatives. This is why larger batch sizes generally produce better embedding models: more negatives lead to a harder and more informative contrastive signal.
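The in-batch-negatives idea can be written out in a few lines. This is a simplified sketch of what MultipleNegativesRankingLoss computes for (anchor, positive) pairs, not the library's implementation; the `scale` temperature of 20 mirrors the library default but treat the constant as an assumption:

```python
import torch
import torch.nn.functional as F

def in_batch_negatives_loss(anchor_emb: torch.Tensor,
                            positive_emb: torch.Tensor,
                            scale: float = 20.0) -> torch.Tensor:
    """Contrastive loss with in-batch negatives.

    anchor_emb, positive_emb: (batch, dim) L2-normalized embeddings.
    Row i of the similarity matrix scores anchor i against every
    positive in the batch; the diagonal entries are the true pairs.
    """
    sim = anchor_emb @ positive_emb.T * scale        # (batch, batch)
    labels = torch.arange(sim.size(0))               # anchor i matches positive i
    # Cross-entropy pushes the diagonal (true pair) above the other
    # batch_size - 1 similarities in each row
    return F.cross_entropy(sim, labels)
```

Each row is effectively a (batch_size)-way classification problem, which is why doubling the batch size makes the task harder and the gradient signal richer.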

4. Evaluating Embedding Quality

Evaluating fine-tuned embeddings requires task-specific metrics. For retrieval, the standard metrics are NDCG@k, Recall@k, and MRR. For clustering and classification, you can measure cluster purity or downstream classifier accuracy. Always evaluate on a held-out test set that was not seen during training.

from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import (
    InformationRetrievalEvaluator,
)
import numpy as np

def evaluate_retrieval_quality(
    model: SentenceTransformer,
    queries: dict,       # {qid: query_text}
    corpus: dict,        # {cid: corpus_text}
    relevant: dict,      # {qid: set(cid1, cid2, ...)}
) -> dict:
    """Evaluate embedding model on retrieval task."""
    evaluator = InformationRetrievalEvaluator(
        queries=queries,
        corpus=corpus,
        relevant_docs=relevant,
        name="domain-retrieval",
        ndcg_at_k=[1, 5, 10],
        recall_at_k=[1, 5, 10, 50],
        mrr_at_k=[10],
        show_progress_bar=True,
    )

    results = evaluator(model)
    return results

def compare_models(
    model_names: list,
    queries: dict,
    corpus: dict,
    relevant: dict,
):
    """Compare multiple embedding models on the same evaluation set."""
    results = {}
    for name in model_names:
        model = SentenceTransformer(name)
        metrics = evaluate_retrieval_quality(model, queries, corpus, relevant)
        results[name] = {
            # Recent sentence-transformers versions prefix metric keys with
            # the evaluator name and the similarity function (e.g. "cosine")
            "NDCG@10": metrics.get("domain-retrieval_cosine_ndcg@10", 0),
            "Recall@10": metrics.get("domain-retrieval_cosine_recall@10", 0),
            "MRR@10": metrics.get("domain-retrieval_cosine_mrr@10", 0),
        }
        print(f"\n{name}:")
        for metric, value in results[name].items():
            print(f"  {metric}: {value:.4f}")

    return results

# Example comparison
# compare_models(
#     ["BAAI/bge-base-en-v1.5", "./models/medical-embeddings-final"],
#     queries, corpus, relevant_docs
# )

5. When to Fine-Tune vs. Use Off-the-Shelf

[Figure: Embedding decision framework flowchart. Is the domain specialized? If no, use off-the-shelf. If yes, do you have training pairs? If no, generate pairs with an LLM, then fine-tune. If yes (>1K pairs), fine-tune embeddings. Always benchmark off-the-shelf first to measure the improvement gap.]
Figure 13.12: Decision framework for choosing between off-the-shelf and fine-tuned embeddings
⚠ Warning

Fine-tuned embeddings need reindexing. If you fine-tune your embedding model, all previously computed embeddings in your vector database become stale. You must recompute embeddings for your entire corpus using the new model and reindex them. For large corpora (millions of documents), this can take hours and significant compute. Plan for this cost before committing to embedding fine-tuning, and establish a reindexing pipeline that can run incrementally.
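A minimal reindexing loop streams the corpus through the new model in batches. The sketch below is deliberately generic: `encode_fn` stands in for something like `SentenceTransformer(...).encode`, and the consumer of each yielded batch would call your vector store's upsert API (both are placeholders, not a specific library's interface):

```python
from typing import Callable, Dict, Iterator, List, Tuple

def reembed_corpus(
    encode_fn: Callable[[List[str]], list],
    documents: Dict[str, str],
    batch_size: int = 256,
) -> Iterator[List[Tuple[str, list]]]:
    """Yield (doc_id, vector) batches for reindexing with a new model.

    encode_fn: maps a list of texts to a list of vectors
    documents: {doc_id: text} for the full corpus
    """
    ids = list(documents)
    for start in range(0, len(ids), batch_size):
        chunk = ids[start:start + batch_size]
        vectors = encode_fn([documents[doc_id] for doc_id in chunk])
        # Caller upserts each batch, e.g. vector_store.upsert(batch),
        # so progress can be checkpointed and resumed incrementally
        yield list(zip(chunk, vectors))
```

Because the generator yields one batch at a time, the pipeline can checkpoint after each upsert and resume after a failure instead of restarting a multi-hour job from scratch.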

# Practical: deciding whether to fine-tune embeddings
def should_finetune_embeddings(
    baseline_ndcg: float,
    target_ndcg: float,
    corpus_size: int,
    num_training_pairs: int,
    reindex_cost_hours: float,
) -> dict:
    """Decision helper for embedding fine-tuning."""
    gap = target_ndcg - baseline_ndcg
    has_enough_data = num_training_pairs >= 1000
    gap_is_significant = gap > 0.05

    recommendation = "off-the-shelf"
    reasons = []

    if not gap_is_significant:
        reasons.append("Gap to target is small (<0.05 NDCG); fine-tuning unlikely to help")
    elif not has_enough_data:
        reasons.append("Need at least 1,000 training pairs; consider generating "
                       "synthetic pairs with an LLM")
        recommendation = "generate_data_first"
    else:
        expected_improvement = min(gap * 1.5, 0.25)  # Conservative estimate
        expected_ndcg = baseline_ndcg + expected_improvement
        if expected_ndcg >= target_ndcg:
            recommendation = "fine-tune"
            reasons.append(f"Expected NDCG after fine-tuning: ~{expected_ndcg:.2f}")
        else:
            recommendation = "fine-tune + improve retrieval pipeline"
            reasons.append("Fine-tuning alone may not close the gap; "
                          "consider hybrid retrieval (BM25 + dense)")

    reasons.append(f"Reindexing will take ~{reindex_cost_hours:.1f} hours "
                   f"for {corpus_size:,} documents")

    return {
        "recommendation": recommendation,
        "baseline": baseline_ndcg,
        "target": target_ndcg,
        "gap": gap,
        "reasons": reasons,
    }

result = should_finetune_embeddings(
    baseline_ndcg=0.38,
    target_ndcg=0.55,
    corpus_size=500_000,
    num_training_pairs=5_000,
    reindex_cost_hours=3.5,
)
for k, v in result.items():
    print(f"  {k}: {v}")
  recommendation: fine-tune
  baseline: 0.38
  target: 0.55
  gap: 0.17000000000000004
  reasons: ['Expected NDCG after fine-tuning: ~0.63', 'Reindexing will take ~3.5 hours for 500,000 documents']

Section 13.5 Quiz

1. Why do specialized domains benefit more from fine-tuned embeddings than general-purpose domains?
Show Answer
Off-the-shelf embedding models are trained on broad web data, which means they understand general-purpose notions of text similarity well. Specialized domains, however, have unique vocabulary, abbreviations, and semantic relationships that are underrepresented in the training data. For example, in medical text, "MI" means myocardial infarction, not a state abbreviation. Fine-tuning teaches the model these domain-specific semantic distinctions, leading to much larger improvements in specialized domains than in general ones.
2. What is the advantage of using MultipleNegativesRankingLoss over basic triplet loss?
Show Answer
MultipleNegativesRankingLoss treats all other examples in the batch as negatives, providing many more negative examples per training step. With a batch size of 32, each anchor gets 31 negatives instead of just 1. This creates a harder and more informative training signal, leading to better embeddings with fewer training examples. It also eliminates the need to explicitly mine hard negatives in your training data.
3. What pooling strategy is typically used for decoder-only models when computing sentence embeddings?
Show Answer
Decoder-only models typically use last-token pooling, where the hidden state of the final token (usually the EOS token) serves as the sentence embedding. Because of the causal attention mask, only the last token has "seen" all previous tokens in the sequence, making it the most informationally rich representation of the entire input. Some approaches also use mean pooling over all token representations, which can work but requires careful handling of the attention mask.
4. What practical cost must you account for when switching to fine-tuned embeddings in a production system?
Show Answer
You must recompute and reindex all embeddings in your vector database. Since the fine-tuned model produces different embeddings than the original model, all previously stored vectors become stale and incompatible with queries encoded by the new model. For large corpora (millions of documents), this can require hours of compute time and careful orchestration to avoid downtime. You need a reindexing pipeline that can handle this process, ideally with the ability to run incrementally or with blue-green deployment.
5. A team has 200 query-document pairs for a legal search system. Is this enough to fine-tune an embedding model?
Show Answer
200 pairs is generally not enough for effective embedding fine-tuning. The minimum recommended is approximately 1,000 pairs, with 5,000 to 10,000 being ideal. With only 200 pairs, the team should first generate synthetic training pairs using an LLM (e.g., prompting GPT-4 to create query-passage pairs from their legal corpus), then combine those with the 200 real pairs for fine-tuning. They should also benchmark off-the-shelf models first to confirm that fine-tuning is actually needed for their use case.

Key Takeaways