The best of both worlds. Instead of choosing between an LLM and a classical model, you can use the LLM as a feature extractor and feed its outputs into a traditional ML pipeline. LLM embeddings capture deep semantic meaning that TF-IDF cannot, while the downstream classical model (XGBoost, logistic regression, a small neural network) provides fast inference, low cost, and far greater interpretability. This pattern is particularly powerful when you want LLM-quality understanding at classical-ML prices, or when you need to combine text understanding with structured features that LLMs handle poorly.
1. Embeddings as Features
This section uses embeddings extensively. If you need a refresher on how word and sentence embeddings work, see Module 01 (Text Representation) for foundational NLP concepts and Module 07 for how pretrained models learn these representations. Here we focus on using embeddings as features for downstream ML models.
LLM providers, and many smaller open models, offer embedding models: dense vector representations that encode the semantic meaning of text. These embeddings serve as drop-in replacements for hand-crafted features like TF-IDF or bag-of-words, and they consistently outperform them on tasks that require understanding meaning rather than just matching keywords.
1.1 Why Embeddings Beat TF-IDF
TF-IDF represents text as sparse vectors based on word frequencies. It captures lexical overlap but completely misses semantic similarity. The sentences "The car is fast" and "The automobile has high velocity" share no content words, so after stop-word removal their TF-IDF vectors are orthogonal despite the sentences being nearly synonymous. Embeddings from a language model map both sentences to nearby points in a dense vector space, because the model has learned that "car" and "automobile," "fast" and "high velocity" are semantically related.
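You can see the lexical-overlap problem directly. A minimal sketch using scikit-learn's TfidfVectorizer with English stop words removed:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["The car is fast", "The automobile has high velocity"]

# Remove stop words so only content words remain
vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(docs)

# No content word appears in both sentences, so the sparse vectors
# are orthogonal: cosine similarity is exactly 0.
sim = cosine_similarity(tfidf[0], tfidf[1])[0, 0]
print(f"TF-IDF cosine similarity: {sim:.3f}")  # 0.000
```

An embedding model would place these two sentences close together in its vector space; TF-IDF has no way to know they mean the same thing.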
The tradeoff is computational cost. Computing a TF-IDF vector requires only a dictionary lookup and some arithmetic. Computing an embedding requires a forward pass through a neural network. However, this cost is paid only once: you can precompute embeddings for your entire dataset and then use the resulting vectors with any classical model at near-zero marginal cost.
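The compute-once pattern can be as simple as a small on-disk cache keyed by a hash of the text. A minimal sketch; the names `cached_embed` and `fake_embed` are illustrative, and `embed_fn` stands in for any real embedding call (API or local model):

```python
import hashlib
import json
import tempfile
from pathlib import Path

def cached_embed(text: str, embed_fn, cache_dir: str) -> list[float]:
    """Return the embedding for text, computing it at most once."""
    cache = Path(cache_dir)
    cache.mkdir(parents=True, exist_ok=True)
    key = hashlib.sha256(text.encode("utf-8")).hexdigest()
    path = cache / f"{key}.json"
    if path.exists():  # cache hit: no model call, near-zero marginal cost
        return json.loads(path.read_text())
    vector = embed_fn(text)  # cache miss: pay the forward pass once
    path.write_text(json.dumps(vector))
    return vector

# Demonstration with a stand-in embedder that counts its calls
calls = []
def fake_embed(text: str) -> list[float]:
    calls.append(text)
    return [0.1, 0.2, 0.3]

cache_dir = tempfile.mkdtemp()
v1 = cached_embed("I was charged twice for my subscription", fake_embed, cache_dir)
v2 = cached_embed("I was charged twice for my subscription", fake_embed, cache_dir)
print(len(calls))  # 1 -- the second lookup never touched the embedder
```

In production you would typically batch the misses and store the vectors in a database or vector store rather than individual JSON files, but the principle is the same: every repeat lookup is free.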
1.2 Generating Embeddings with OpenAI
```python
import openai
import numpy as np

client = openai.OpenAI()

def get_embeddings(texts: list[str], model: str = "text-embedding-3-small") -> np.ndarray:
    """Get embeddings for a batch of texts."""
    response = client.embeddings.create(
        input=texts,
        model=model
    )
    return np.array([item.embedding for item in response.data])

# Example: embed customer support tickets
tickets = [
    "I was charged twice for my subscription last month",
    "The app crashes every time I try to upload a photo",
    "How do I update my billing address?",
    "My package was delivered to the wrong address",
    "Can you explain your enterprise pricing plans?",
]

embeddings = get_embeddings(tickets)
print(f"Embedding shape: {embeddings.shape}")
print(f"Embedding dtype: {embeddings.dtype}")
print(f"First embedding (first 5 dims): {embeddings[0][:5]}")
print("\nCost: ~$0.00002 per 1K tokens (text-embedding-3-small)")
print("Total for these 5 short tickets: a tiny fraction of a cent")
```
1.3 Embeddings + XGBoost Pipeline
Once you have embeddings, plugging them into a classical model is straightforward. The embedding vector becomes the feature vector, and you train the model exactly as you would with any other numeric features.
```python
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
import xgboost as xgb
import numpy as np

# Simulated embeddings (in production, use real embeddings)
# 500 samples, 1536-dimensional embeddings
np.random.seed(42)
n_samples = 500
n_dims = 1536
n_classes = 5

# Create synthetic embeddings with class structure
X_embed = np.random.randn(n_samples, n_dims) * 0.1
for i in range(n_classes):
    mask = np.arange(n_samples) % n_classes == i
    X_embed[mask, i*50:(i+1)*50] += 1.0  # Class-specific signal
y = np.array([i % n_classes for i in range(n_samples)])

# Compare models on embeddings
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000, C=1.0),
    "XGBoost": xgb.XGBClassifier(
        n_estimators=100, max_depth=4, learning_rate=0.1,
        eval_metric='mlogloss'
    ),
}

print("Model performance on LLM embeddings (5-fold CV):")
print("-" * 50)
for name, model in models.items():
    scores = cross_val_score(model, X_embed, y, cv=5, scoring='accuracy')
    print(f"  {name}: {scores.mean():.3f} (+/- {scores.std():.3f})")

print("\nFor comparison with TF-IDF features:")
print("  TF-IDF + LR typically: 0.82-0.88")
print("  Embeddings + LR typically: 0.89-0.94")
print("  Embeddings + XGBoost typically: 0.91-0.95")
```
2. LLM-Powered Feature Engineering
Beyond raw embeddings, LLMs can generate structured features from unstructured text. This is particularly powerful when you need to combine text understanding with structured data in a classical ML pipeline. Instead of treating the LLM as the final decision maker, you ask it to extract specific attributes that become columns in your feature matrix.
2.1 Extracting Structured Features from Text
```python
import openai
import json

client = openai.OpenAI()

def extract_features(text: str) -> dict:
    """Use an LLM to extract structured features from a support ticket."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": """Extract these features from the
customer support ticket as JSON:
- urgency: "low", "medium", or "high"
- sentiment: float from -1.0 (very negative) to 1.0 (very positive)
- category: one of "billing", "technical", "account", "shipping", "general"
- has_financial_impact: boolean
- requires_human: boolean (true if the issue is too complex for automation)
- entities: list of key entities mentioned (products, amounts, dates)
Return ONLY valid JSON."""},
            {"role": "user", "content": text}
        ],
        response_format={"type": "json_object"},
        temperature=0,
    )
    return json.loads(response.choices[0].message.content)

# Example ticket
ticket = (
    "I've been a premium subscriber for 3 years and was just charged "
    "$499 for a service I cancelled last week. This is the third time "
    "this has happened and I'm seriously considering switching to your "
    "competitor. I need a refund processed immediately."
)

features = extract_features(ticket)
print(json.dumps(features, indent=2))
```
LLM-extracted features provide something embeddings alone cannot: interpretable, structured attributes that humans and downstream systems can understand. A support team dashboard can filter by urgency and sentiment. A routing system can use requires_human to escalate tickets. A financial impact flag can trigger automated refund workflows. These features are also composable: you can combine them with structured data (account age, purchase history, previous tickets) in a single feature matrix for a classical model.
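Flattening the extracted dict into a numeric row is the only step left before it can join a feature matrix. A minimal sketch, assuming the schema produced by extract_features above; the helper name `features_to_row` is illustrative:

```python
import numpy as np

URGENCY = {"low": 0, "medium": 1, "high": 2}
CATEGORIES = ["billing", "technical", "account", "shipping", "general"]

def features_to_row(f: dict) -> np.ndarray:
    """Flatten an LLM-extracted feature dict into a numeric vector."""
    category_onehot = [1.0 if f["category"] == c else 0.0 for c in CATEGORIES]
    return np.array([
        float(URGENCY[f["urgency"]]),       # ordinal encoding
        float(f["sentiment"]),              # already numeric
        float(f["has_financial_impact"]),   # bool -> 0/1
        float(f["requires_human"]),
        float(len(f.get("entities", []))),  # simple count feature
        *category_onehot,                   # one-hot category
    ])

example = {
    "urgency": "high", "sentiment": -0.8, "category": "billing",
    "has_financial_impact": True, "requires_human": True,
    "entities": ["$499", "premium subscription", "last week"],
}
row = features_to_row(example)
print(row.shape)  # (10,)
```

Stack one such row per ticket, hstack with your structured columns, and any classical model can train on the result.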
2.2 Enriching Sparse Structured Data
One of the most underappreciated patterns is using an LLM to generate text descriptions of structured data, then embedding those descriptions to create richer features. This is particularly useful when your structured data is sparse or when feature interactions are complex and difficult to engineer manually.
```python
import openai

client = openai.OpenAI()

def describe_product(row: dict) -> str:
    """Generate a text description of a product from structured data."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": (
                "Generate a one-paragraph natural language description "
                "of this product based on its attributes. Include all "
                "relevant details that would help classify or compare it."
            )},
            {"role": "user", "content": str(row)}
        ],
        max_tokens=150,
        temperature=0.3,
    )
    return response.choices[0].message.content

# Sparse product data
product = {
    "name": "UltraFit Pro 3000",
    "category": "fitness",
    "price": 299.99,
    "weight_kg": 2.1,
    "rating": 4.2,
    "reviews_count": 847,
    "brand_tier": "premium",
    "launch_year": 2024,
    "has_bluetooth": True,
    "battery_hours": 48,
}

description = describe_product(product)
print(f"Generated description:\n{description}")

# Now embed this description for richer features
embedding_response = client.embeddings.create(
    input=description,
    model="text-embedding-3-small"
)
enriched_embedding = embedding_response.data[0].embedding
print(f"\nEnriched embedding dims: {len(enriched_embedding)}")
print(f"Original structured features: {len(product)} columns")
print(f"Combined feature vector: {len(product) + len(enriched_embedding)} dims")
```
3. Local Embedding Models
API-based embeddings are convenient but introduce latency, cost, and data privacy concerns. For high-volume pipelines or sensitive data, local embedding models are a better choice. The sentence-transformers library provides dozens of pre-trained models that run on CPU or GPU.
```python
from sentence_transformers import SentenceTransformer
import numpy as np
import time

# Load a local embedding model
model = SentenceTransformer('all-MiniLM-L6-v2')  # ~80MB, fast

texts = [
    "I was charged twice for my subscription",
    "The app crashes every time I try to upload a photo",
    "How do I update my billing address?",
    "My package was delivered to the wrong address",
    "Can you explain your enterprise pricing plans?",
]

# Generate embeddings locally
start = time.perf_counter()
embeddings = model.encode(texts, normalize_embeddings=True)
elapsed = time.perf_counter() - start

print("Model: all-MiniLM-L6-v2")
print(f"Embedding shape: {embeddings.shape}")
print(f"Time for 5 texts: {elapsed*1000:.1f} ms")
print(f"Avg per text: {elapsed/len(texts)*1000:.1f} ms")
print("Cost: $0.00 (local inference)")

# Embeddings are normalized, so dot products are cosine similarities
similarity = np.dot(embeddings, embeddings.T)
print(f"\nSimilarity between 'charged twice' and 'billing address': "
      f"{similarity[0][2]:.3f}")
print(f"Similarity between 'charged twice' and 'app crashes': "
      f"{similarity[0][1]:.3f}")
```
4. Combining Embeddings with Structured Features
The most powerful pattern combines LLM embeddings (capturing text semantics) with traditional structured features (capturing numeric and categorical data) in a single model. This is especially effective for tasks where both text and metadata carry predictive signal, such as support ticket prioritization, product recommendation, and content moderation.
```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score
import xgboost as xgb

np.random.seed(42)
n_samples = 1000

# Simulate embeddings (384-dim from a local model)
text_embeddings = np.random.randn(n_samples, 384) * 0.1

# Simulate structured features
structured_features = np.column_stack([
    np.random.randint(0, 365, n_samples),   # account_age_days
    np.random.randint(0, 50, n_samples),    # previous_tickets
    np.random.uniform(0, 5000, n_samples),  # account_value
    np.random.choice([0, 1], n_samples),    # is_premium
    np.random.uniform(0, 1, n_samples),     # sentiment_score
])

# Add predictive signal
labels = (
    (structured_features[:, 3] == 1).astype(float) * 0.3 +    # premium
    (structured_features[:, 2] > 2500).astype(float) * 0.3 +  # high value
    np.random.randn(n_samples) * 0.2
) > 0.3
labels = labels.astype(int)

# Three feature configurations
configs = {
    "Structured only": structured_features,
    "Embeddings only": text_embeddings,
    "Combined (structured + embeddings)": np.hstack([
        StandardScaler().fit_transform(structured_features),
        text_embeddings
    ]),
}

model = xgb.XGBClassifier(
    n_estimators=100, max_depth=4, learning_rate=0.1,
    eval_metric='logloss'
)

print("Feature ablation study (5-fold CV accuracy):")
print("=" * 55)
for name, features in configs.items():
    scores = cross_val_score(model, features, labels, cv=5)
    print(f"  {name:40s} {scores.mean():.3f} (+/- {scores.std():.3f})")
```
When combining embeddings with structured features, always standardize (zero mean, unit variance) the structured features. Embedding vectors from language models are typically already normalized or on a consistent scale, but structured features like "account_value" (range 0 to 5000) and "is_premium" (0 or 1) need to be rescaled so that the gradient-based optimizer does not disproportionately weight high-magnitude features. Tree-based models like XGBoost are less sensitive to feature scaling, but it is still good practice.
5. Dimensionality Reduction for Embeddings
High-dimensional embeddings (1536 dimensions from OpenAI, 768 or 1024 from many open models) can cause issues with some classical models. Logistic regression may overfit without strong regularization. Tree models may struggle with the high dimensionality. Dimensionality reduction techniques like PCA or UMAP can compress embeddings while preserving most of the information.
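A minimal PCA sketch with scikit-learn; random vectors stand in for real embeddings here. In practice you would fit PCA on your training embeddings and reuse the fitted transform at inference time:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 1536))  # 500 "embeddings", 1536 dims

# Compress to 256 dimensions; fit on training data only
pca = PCA(n_components=256)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)  # (500, 256)
print(f"Variance retained: {pca.explained_variance_ratio_.sum():.2f}")
```

With real (structured, non-random) embeddings, a few hundred components usually retain most of the variance; inspect `explained_variance_ratio_` on your own data to choose the cutoff.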
| Embedding Model | Dimensions | Quality (MTEB) | Cost | Speed |
|---|---|---|---|---|
| text-embedding-3-large | 3072 | Highest | $0.00013/1K tokens | API latency |
| text-embedding-3-small | 1536 | High | $0.00002/1K tokens | API latency |
| all-MiniLM-L6-v2 | 384 | Good | Free (local) | ~5ms/text (GPU) |
| bge-large-en-v1.5 | 1024 | High | Free (local) | ~15ms/text (GPU) |
| nomic-embed-text-v1.5 | 768 | High | Free (local) | ~8ms/text (GPU) |
Always use the same embedding model for both training and inference. Embeddings from different models live in different vector spaces and are not interchangeable. If you train your XGBoost classifier on OpenAI embeddings, you cannot switch to a local sentence-transformer model at inference time without retraining. Plan your embedding strategy before building the pipeline, considering both quality and long-term operational costs.
Key Takeaways
- LLM embeddings serve as powerful drop-in replacements for TF-IDF, capturing semantic meaning that sparse representations miss.
- The embedding pipeline (compute once, reuse many times) gives you LLM-quality text understanding at near-zero marginal inference cost.
- LLM-powered feature engineering extracts interpretable, structured attributes from text that can be combined with traditional features in a single model.
- Enriching sparse structured data with LLM-generated text descriptions and embeddings can significantly improve classical model performance.
- Local embedding models (sentence-transformers) eliminate API costs and latency for high-volume production systems.
- Always use the same embedding model for training and inference; vectors from different models are not compatible.