Embeddings are the backbone of modern search, retrieval, and recommendation systems. While off-the-shelf embedding models work well for general text, domain-specific applications often benefit enormously from fine-tuned embeddings that understand the nuances of your particular domain. A legal search engine needs embeddings that distinguish between subtly different contract clauses; a medical retrieval system needs embeddings that capture clinical relationships. This section covers why and how to fine-tune models for better representations.
1. Why Fine-Tune for Representations?
Off-the-shelf embedding models like OpenAI's text-embedding-3 or open-source models like BGE and E5 are trained on broad web data. They produce good general-purpose embeddings, but they may not capture the semantic distinctions that matter in your specific domain. Fine-tuning teaches the model which texts should be similar and which should be different according to your application's needs.
1.1 When Off-the-Shelf Falls Short
General embedding models struggle in several common scenarios. Domain-specific vocabulary (medical terms, legal jargon, internal company terminology) may be poorly represented. The notion of "similarity" may differ from the general case: in a customer support system, two tickets describing the same bug should be similar even if they use completely different language. In a patent search system, documents covering the same invention should cluster together despite varying levels of technical detail.
| Scenario | Off-the-Shelf Performance | After Fine-Tuning | Improvement |
|---|---|---|---|
| General web search | NDCG@10: 0.52 | Not needed | N/A |
| Medical literature retrieval | NDCG@10: 0.38 | NDCG@10: 0.56 | +47% |
| Legal clause matching | NDCG@10: 0.31 | NDCG@10: 0.54 | +74% |
| Internal docs search | NDCG@10: 0.42 | NDCG@10: 0.61 | +45% |
| Customer support dedup | F1: 0.65 | F1: 0.84 | +29% |
The more specialized your domain, the more you benefit from fine-tuning. If your domain vocabulary overlaps heavily with general web text (e.g., product reviews, news articles), off-the-shelf embeddings will work reasonably well. But if your domain has specialized terminology, unusual notions of similarity, or if retrieval precision is critical to your application, fine-tuning can yield 30% to 70% improvements.
2. Encoder-Only vs. Decoder-Only for Embeddings
Historically, encoder-only models (BERT, RoBERTa) dominated the embedding space because their bidirectional architecture naturally produces rich token representations. Decoder-only models (GPT, Llama) are autoregressive and were not originally designed for embeddings. However, recent work has shown that decoder-only models can produce competitive embeddings with the right training approach.
| Aspect | Encoder-Only | Decoder-Only |
|---|---|---|
| Architecture | Bidirectional (BERT, RoBERTa) | Causal/autoregressive (Llama, Mistral) |
| Pooling strategy | [CLS] token or mean pooling | Last token or mean pooling |
| Typical model size | 100M to 400M parameters | 1B to 70B parameters |
| Embedding dimension | 768 to 1024 | 2048 to 8192 |
| Max sequence length | 512 tokens (typical) | 4K to 128K tokens |
| Inference speed | Fast (small model) | Slower (large model) |
| Quality for retrieval | Excellent with fine-tuning | Competitive with fine-tuning |
| Best for | High-throughput retrieval | When you already have the model deployed |
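The pooling strategies in the table determine how a sequence of per-token vectors becomes a single embedding. A minimal numpy sketch of the two common choices, mean pooling with an attention mask (encoder style) and last-token pooling (decoder style); the shapes and values are illustrative assumptions, not any library's API:

```python
import numpy as np

def mean_pool(token_embeddings: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Average token embeddings, ignoring padding positions.

    token_embeddings: (seq_len, dim); attention_mask: (seq_len,) of 0/1.
    """
    mask = attention_mask[:, None].astype(float)      # (seq_len, 1)
    summed = (token_embeddings * mask).sum(axis=0)    # sum over real tokens only
    count = mask.sum()                                # number of real tokens
    return summed / np.maximum(count, 1e-9)

def last_token_pool(token_embeddings: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Take the embedding of the last non-padding token (causal-model style)."""
    last_idx = int(attention_mask.sum()) - 1
    return token_embeddings[last_idx]
```

For causal models the last token is the only position that has attended to the full input, which is why last-token pooling is the natural analogue of the encoder's [CLS] vector.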
3. Contrastive Learning for Embeddings
The standard approach for fine-tuning embedding models is contrastive learning. The core idea is simple: train the model so that embeddings of semantically similar texts are close together, while embeddings of dissimilar texts are far apart. This is achieved through carefully constructed training pairs and a contrastive loss function.
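To make "close together, far apart" concrete, here is a minimal sketch of a triplet-style contrastive loss computed on precomputed embedding vectors; the margin value is an illustrative choice, not a recommendation from the text:

```python
import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def triplet_loss(anchor: np.ndarray, positive: np.ndarray,
                 negative: np.ndarray, margin: float = 0.5) -> float:
    """Zero when the positive is at least `margin` more similar to the
    anchor than the negative; positive (a gradient signal) otherwise."""
    return max(0.0, cosine_sim(anchor, negative) - cosine_sim(anchor, positive) + margin)
```

When the triplet is already correctly ordered by more than the margin, the loss is zero and the model is not pushed further, which keeps training focused on the hard cases.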
3.1 Training Data: Pairs and Triplets
```python
# Preparing contrastive training data
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ContrastivePair:
    """A pair of texts with a similarity label."""
    anchor: str                     # The query or reference text
    positive: str                   # Text that should be similar to anchor
    negative: Optional[str] = None  # Text that should be different (for triplet loss)
    score: float = 1.0              # Similarity score (0 to 1) for soft labels


# Example: medical retrieval training data
medical_pairs = [
    ContrastivePair(
        anchor="What are the symptoms of type 2 diabetes?",
        positive="Type 2 diabetes symptoms include increased thirst, frequent "
                 "urination, blurred vision, fatigue, and slow wound healing.",
        negative="Type 1 diabetes is an autoimmune condition where the immune "
                 "system attacks insulin-producing beta cells in the pancreas.",
    ),
    ContrastivePair(
        anchor="Treatment options for hypertension",
        positive="First-line treatments for high blood pressure include ACE "
                 "inhibitors, ARBs, calcium channel blockers, and thiazide "
                 "diuretics, often combined with lifestyle modifications.",
        negative="Hypotension, or low blood pressure, is typically treated by "
                 "increasing fluid intake and wearing compression stockings.",
    ),
]


# Convert to the format expected by Sentence Transformers
def pairs_to_dataset(pairs: List[ContrastivePair]) -> dict:
    """Convert contrastive pairs to training format.

    All pairs should consistently include or omit negatives; mixing the
    two would produce columns of different lengths.
    """
    anchors = [p.anchor for p in pairs]
    positives = [p.positive for p in pairs]
    negatives = [p.negative for p in pairs if p.negative]
    if negatives:
        return {"anchor": anchors, "positive": positives, "negative": negatives}
    return {"anchor": anchors, "positive": positives}
```
3.2 Fine-Tuning with Sentence Transformers
```python
from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
    losses,
)
from datasets import Dataset

# Load a pre-trained embedding model
model = SentenceTransformer("BAAI/bge-base-en-v1.5")

# Prepare training dataset
train_data = Dataset.from_dict({
    "anchor": [
        "What causes migraines?",
        "Side effects of metformin",
        "How to diagnose celiac disease",
    ],
    "positive": [
        "Migraines are caused by abnormal brain activity affecting nerve "
        "signals, chemicals, and blood vessels in the brain.",
        "Common side effects of metformin include nausea, diarrhea, stomach "
        "pain, and a metallic taste in the mouth.",
        "Celiac disease is diagnosed through blood tests for specific "
        "antibodies (tTG-IgA) followed by an intestinal biopsy.",
    ],
    "negative": [
        "Tension headaches are the most common type of headache and are "
        "usually caused by muscle tension in the head and neck.",
        "Metformin is a first-line medication for type 2 diabetes that "
        "works by reducing glucose production in the liver.",
        "Irritable bowel syndrome is a functional disorder affecting the "
        "large intestine with symptoms of cramping and bloating.",
    ],
})

# Choose a loss function
# MultipleNegativesRankingLoss: best for (anchor, positive) pairs; also
#   accepts (anchor, positive, negative) triplets as hard negatives
# TripletLoss: margin-based loss for (anchor, positive, negative) triplets
loss = losses.MultipleNegativesRankingLoss(model)

# Configure training
training_args = SentenceTransformerTrainingArguments(
    output_dir="./models/medical-embeddings",
    num_train_epochs=3,
    per_device_train_batch_size=32,
    learning_rate=2e-5,
    warmup_ratio=0.1,
    fp16=True,
    # To evaluate during training, also pass an eval_dataset to the trainer
    # and set eval_strategy="steps" with eval_steps; without an eval
    # dataset, enabling evaluation raises an error.
    save_strategy="steps",
    save_steps=100,
    logging_steps=10,
)

# Train
trainer = SentenceTransformerTrainer(
    model=model,
    args=training_args,
    train_dataset=train_data,
    loss=loss,
)
trainer.train()
model.save_pretrained("./models/medical-embeddings-final")
```
MultipleNegativesRankingLoss is the workhorse. For each anchor, this loss treats the positives of every other example in the batch as negatives, so you get many negative pairs "for free" from each batch. With a batch size of 32, each anchor has 1 positive and 31 in-batch negatives. This is why larger batch sizes generally produce better embedding models: more negatives lead to a harder and more informative contrastive signal.
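The in-batch mechanism can be sketched on precomputed embeddings: build the anchor-positive similarity matrix and treat the diagonal as the correct class in a softmax cross-entropy. This is a simplified numpy sketch, not the library's implementation; the `scale` temperature is an assumed default:

```python
import numpy as np

def mnrl_loss(anchor_embs: np.ndarray, positive_embs: np.ndarray,
              scale: float = 20.0) -> float:
    """In-batch softmax cross-entropy: anchor i's own positive is class i;
    every other positive in the batch acts as a negative."""
    a = anchor_embs / np.linalg.norm(anchor_embs, axis=1, keepdims=True)
    p = positive_embs / np.linalg.norm(positive_embs, axis=1, keepdims=True)
    scores = scale * (a @ p.T)                    # (batch, batch) scaled cosine sims
    scores -= scores.max(axis=1, keepdims=True)   # numerical stability
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))    # correct pairs sit on the diagonal
```

Because the denominator grows with batch size, each additional example in the batch tightens the contrastive signal at no extra labeling cost.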
4. Evaluating Embedding Quality
Evaluating fine-tuned embeddings requires task-specific metrics. For retrieval, the standard metrics are NDCG@k, Recall@k, and MRR. For clustering and classification, you can measure cluster purity or downstream classifier accuracy. Always evaluate on a held-out test set that was not seen during training.
```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import InformationRetrievalEvaluator


def evaluate_retrieval_quality(
    model: SentenceTransformer,
    queries: dict,   # {qid: query_text}
    corpus: dict,    # {cid: corpus_text}
    relevant: dict,  # {qid: set(cid1, cid2, ...)}
) -> dict:
    """Evaluate embedding model on retrieval task."""
    evaluator = InformationRetrievalEvaluator(
        queries=queries,
        corpus=corpus,
        relevant_docs=relevant,
        name="domain-retrieval",
        ndcg_at_k=[1, 5, 10],
        precision_recall_at_k=[1, 5, 10, 50],
        mrr_at_k=[10],
        show_progress_bar=True,
    )
    return evaluator(model)


def compare_models(
    model_names: list,
    queries: dict,
    corpus: dict,
    relevant: dict,
):
    """Compare multiple embedding models on the same evaluation set."""
    results = {}
    for name in model_names:
        model = SentenceTransformer(name)
        metrics = evaluate_retrieval_quality(model, queries, corpus, relevant)
        # Metric key names can vary between sentence-transformers versions;
        # inspect the returned dict if these lookups come back empty.
        results[name] = {
            "NDCG@10": metrics.get("domain-retrieval_ndcg@10", 0),
            "Recall@10": metrics.get("domain-retrieval_recall@10", 0),
            "MRR@10": metrics.get("domain-retrieval_mrr@10", 0),
        }
        print(f"\n{name}:")
        for metric, value in results[name].items():
            print(f"  {metric}: {value:.4f}")
    return results


# Example comparison
# compare_models(
#     ["BAAI/bge-base-en-v1.5", "./models/medical-embeddings-final"],
#     queries, corpus, relevant_docs,
# )
```
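For intuition about what the evaluator reports, the three metrics can be computed by hand for a single query with binary relevance. A small sketch (these helper names are illustrative, not the library's API):

```python
import math
from typing import List, Set

def recall_at_k(ranked_ids: List[str], relevant_ids: Set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top-k results."""
    return len(set(ranked_ids[:k]) & relevant_ids) / len(relevant_ids)

def mrr_at_k(ranked_ids: List[str], relevant_ids: Set[str], k: int) -> float:
    """Reciprocal rank of the first relevant document in the top-k, else 0."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(ranked_ids: List[str], relevant_ids: Set[str], k: int) -> float:
    """DCG of the ranking divided by the DCG of an ideal ranking (binary gains)."""
    dcg = sum(1.0 / math.log2(r + 1)
              for r, d in enumerate(ranked_ids[:k], start=1) if d in relevant_ids)
    ideal = sum(1.0 / math.log2(r + 1)
                for r in range(1, min(len(relevant_ids), k) + 1))
    return dcg / ideal if ideal else 0.0
```

NDCG discounts hits logarithmically by rank, which is why it rewards models that put relevant documents at the very top rather than merely somewhere in the top k.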
5. When to Fine-Tune vs. Use Off-the-Shelf
Fine-tuned embeddings need reindexing. If you fine-tune your embedding model, all previously computed embeddings in your vector database become stale. You must recompute embeddings for your entire corpus using the new model and reindex them. For large corpora (millions of documents), this can take hours and significant compute. Plan for this cost before committing to embedding fine-tuning, and establish a reindexing pipeline that can run incrementally.
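An incremental reindexing loop can be as simple as batching over document ids so the job can be resumed or run in chunks. In this sketch, `fetch_texts`, `embed`, and `upsert` are hypothetical stand-ins for your document store, the fine-tuned model's encode call, and your vector database client:

```python
from typing import Callable, List

def reindex_corpus(
    doc_ids: List[str],
    fetch_texts: Callable[[List[str]], List[str]],    # load raw docs by id (assumed)
    embed: Callable[[List[str]], List[list]],         # new model's encode fn (assumed)
    upsert: Callable[[List[str], List[list]], None],  # write to vector DB (assumed)
    batch_size: int = 256,
) -> int:
    """Recompute embeddings batch by batch; returns the number of docs reindexed."""
    done = 0
    for start in range(0, len(doc_ids), batch_size):
        batch = doc_ids[start:start + batch_size]
        upsert(batch, embed(fetch_texts(batch)))
        done += len(batch)
    return done
```

Checkpointing the last completed batch (e.g., persisting `start`) turns this into a resumable job, which matters when a multi-hour reindex is interrupted.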
```python
# Practical: deciding whether to fine-tune embeddings
def should_finetune_embeddings(
    baseline_ndcg: float,
    target_ndcg: float,
    corpus_size: int,
    num_training_pairs: int,
    reindex_cost_hours: float,
) -> dict:
    """Decision helper for embedding fine-tuning."""
    gap = target_ndcg - baseline_ndcg
    has_enough_data = num_training_pairs >= 1000
    gap_is_significant = gap > 0.05

    recommendation = "off-the-shelf"
    reasons = []
    if not gap_is_significant:
        reasons.append("Gap to target is small (<0.05 NDCG); fine-tuning unlikely to help")
    elif not has_enough_data:
        reasons.append("Need at least 1,000 training pairs; consider generating "
                       "synthetic pairs with an LLM")
        recommendation = "generate_data_first"
    else:
        expected_improvement = min(gap * 1.5, 0.25)  # Conservative estimate
        expected_ndcg = baseline_ndcg + expected_improvement
        if expected_ndcg >= target_ndcg:
            recommendation = "fine-tune"
            reasons.append(f"Expected NDCG after fine-tuning: ~{expected_ndcg:.2f}")
        else:
            recommendation = "fine-tune + improve retrieval pipeline"
            reasons.append("Fine-tuning alone may not close the gap; "
                           "consider hybrid retrieval (BM25 + dense)")
    reasons.append(f"Reindexing will take ~{reindex_cost_hours:.1f} hours "
                   f"for {corpus_size:,} documents")

    return {
        "recommendation": recommendation,
        "baseline": baseline_ndcg,
        "target": target_ndcg,
        "gap": gap,
        "reasons": reasons,
    }


result = should_finetune_embeddings(
    baseline_ndcg=0.38,
    target_ndcg=0.55,
    corpus_size=500_000,
    num_training_pairs=5_000,
    reindex_cost_hours=3.5,
)
for k, v in result.items():
    print(f"  {k}: {v}")
```
Section 13.5 Quiz
Key Takeaways
- Fine-tuned embeddings provide 30% to 70% improvement over off-the-shelf models in specialized domains where vocabulary and similarity notions differ from general text.
- Encoder-only models (BERT, BGE) remain the practical choice for high-throughput embedding tasks due to their small size and fast inference; decoder-only models are competitive but slower.
- Contrastive learning with MultipleNegativesRankingLoss is the standard approach: it uses in-batch negatives to create a strong training signal without explicit hard negative mining.
- You need at least 1,000 training pairs for effective fine-tuning; if you have fewer, generate synthetic pairs using an LLM before fine-tuning.
- Always benchmark off-the-shelf first to measure the actual performance gap before investing in fine-tuning and the associated reindexing costs.
- Reindexing is a hidden cost: switching to a fine-tuned embedding model requires recomputing embeddings for your entire corpus, which must be planned into the deployment timeline.