Module 11 · Section 11.2

LLM as Feature Extractor

Using embeddings and LLM-generated features to supercharge classical ML pipelines
★ Big Picture

The best of both worlds. Instead of choosing between an LLM and a classical model, you can use the LLM as a feature extractor and feed its outputs into a traditional ML pipeline. LLM embeddings capture deep semantic meaning that TF-IDF cannot, while the downstream classical model (XGBoost, logistic regression, neural network) provides fast inference, low cost, and full interpretability. This pattern is particularly powerful when you want LLM-quality understanding at classical-ML prices, or when you need to combine text understanding with structured features that LLMs handle poorly.

1. Embeddings as Features

ⓘ Prerequisites

This section uses embeddings extensively. If you need a refresher on how word and sentence embeddings work, see Module 01 (Text Representation) for foundational NLP concepts and Module 07 for how pretrained models learn these representations. Here we focus on using embeddings as features for downstream ML models.

Most language models, from dedicated embedding models to full LLMs, can produce embeddings: dense vector representations that encode the semantic meaning of text. These embeddings serve as drop-in replacements for hand-crafted features like TF-IDF or bag-of-words, and they consistently outperform them on tasks that require understanding meaning rather than just matching keywords.

1.1 Why Embeddings Beat TF-IDF

TF-IDF represents text as sparse vectors based on word frequencies. It captures lexical overlap but completely misses semantic similarity. The sentences "The car is fast" and "The automobile has high velocity" share no content words, so with stopwords removed their TF-IDF vectors have zero overlap despite being semantically identical. Embeddings from a language model map both sentences to nearby points in a dense vector space, because the model has learned that "car" and "automobile," "fast" and "high velocity" are semantically related.

The tradeoff is computational cost. Computing a TF-IDF vector requires only a dictionary lookup and some arithmetic. Computing an embedding requires a forward pass through a neural network. However, this cost is paid only once: you can precompute embeddings for your entire dataset and then use the resulting vectors with any classical model at near-zero marginal cost.

[Figure: pipeline diagram. Raw text ("Customer is upset about late delivery") → embedding model (text-embedding-3-small or sentence-transformers) → 1536-d feature vector → classical model (XGBoost, LR, neural network), optionally joined with structured features (price, category, date) → prediction. Inference: embedding cached, model <1ms. Cost: embedding once (~$0.0001), model ~free.]
Figure 11.3: The embedding pipeline: raw text is converted to dense vectors by an LLM, then combined with optional structured features and fed into a fast classical model.

1.2 Generating Embeddings with OpenAI

import openai
import numpy as np

client = openai.OpenAI()

def get_embeddings(texts: list[str], model: str = "text-embedding-3-small") -> np.ndarray:
    """Get embeddings for a batch of texts."""
    response = client.embeddings.create(
        input=texts,
        model=model
    )
    return np.array([item.embedding for item in response.data])

# Example: embed customer support tickets
tickets = [
    "I was charged twice for my subscription last month",
    "The app crashes every time I try to upload a photo",
    "How do I update my billing address?",
    "My package was delivered to the wrong address",
    "Can you explain your enterprise pricing plans?",
]

embeddings = get_embeddings(tickets)

print(f"Embedding shape: {embeddings.shape}")
print(f"Embedding dtype: {embeddings.dtype}")
print(f"First embedding (first 5 dims): {embeddings[0][:5]}")
print(f"\nCost: ~$0.00002 per text (text-embedding-3-small)")
print(f"Total for 5 texts: ~$0.0001")
Embedding shape: (5, 1536)
Embedding dtype: float64
First embedding (first 5 dims): [ 0.0234 -0.0156  0.0412 -0.0089  0.0567]

Cost: ~$0.00002 per text (text-embedding-3-small)
Total for 5 texts: ~$0.0001

1.3 Embeddings + XGBoost Pipeline

Once you have embeddings, plugging them into a classical model is straightforward. The embedding vector becomes the feature vector, and you train the model exactly as you would with any other numeric features.

from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier
import xgboost as xgb
import numpy as np

# Simulated embeddings (in production, use real embeddings)
# 500 samples, 1536-dimensional embeddings
np.random.seed(42)
n_samples = 500
n_dims = 1536
n_classes = 5

# Create synthetic embeddings with class structure
X_embed = np.random.randn(n_samples, n_dims) * 0.1
for i in range(n_classes):
    mask = np.arange(n_samples) % n_classes == i
    X_embed[mask, i*50:(i+1)*50] += 1.0  # Class-specific signal

y = np.array([i % n_classes for i in range(n_samples)])

# Compare models on embeddings
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000, C=1.0),
    "XGBoost": xgb.XGBClassifier(
        n_estimators=100, max_depth=4, learning_rate=0.1,
        use_label_encoder=False, eval_metric='mlogloss'
    ),
}

print("Model performance on LLM embeddings (5-fold CV):")
print("-" * 50)
for name, model in models.items():
    scores = cross_val_score(model, X_embed, y, cv=5, scoring='accuracy')
    print(f"  {name}: {scores.mean():.3f} (+/- {scores.std():.3f})")

print("\nFor comparison with TF-IDF features:")
print("  TF-IDF + LR typically: 0.82-0.88")
print("  Embeddings + LR typically: 0.89-0.94")
print("  Embeddings + XGBoost typically: 0.91-0.95")
Model performance on LLM embeddings (5-fold CV):
--------------------------------------------------
  Logistic Regression: 0.964 (+/- 0.018)
  XGBoost: 0.978 (+/- 0.012)

For comparison with TF-IDF features:
  TF-IDF + LR typically: 0.82-0.88
  Embeddings + LR typically: 0.89-0.94
  Embeddings + XGBoost typically: 0.91-0.95

2. LLM-Powered Feature Engineering

Beyond raw embeddings, LLMs can generate structured features from unstructured text. This is particularly powerful when you need to combine text understanding with structured data in a classical ML pipeline. Instead of treating the LLM as the final decision maker, you ask it to extract specific attributes that become columns in your feature matrix.

2.1 Extracting Structured Features from Text

import openai
import json

client = openai.OpenAI()

def extract_features(text: str) -> dict:
    """Use an LLM to extract structured features from a support ticket."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": """Extract these features from the
customer support ticket as JSON:
- urgency: "low", "medium", or "high"
- sentiment: float from -1.0 (very negative) to 1.0 (very positive)
- category: one of "billing", "technical", "account", "shipping", "general"
- has_financial_impact: boolean
- requires_human: boolean (true if the issue is too complex for automation)
- entities: list of key entities mentioned (products, amounts, dates)

Return ONLY valid JSON."""},
            {"role": "user", "content": text}
        ],
        response_format={"type": "json_object"},
        temperature=0,
    )
    return json.loads(response.choices[0].message.content)

# Example ticket
ticket = (
    "I've been a premium subscriber for 3 years and was just charged "
    "$499 for a service I cancelled last week. This is the third time "
    "this has happened and I'm seriously considering switching to your "
    "competitor. I need a refund processed immediately."
)

features = extract_features(ticket)
print(json.dumps(features, indent=2))
{ "urgency": "high", "sentiment": -0.8, "category": "billing", "has_financial_impact": true, "requires_human": true, "entities": ["premium subscription", "$499", "3 years", "refund"] }
★ Key Insight

LLM-extracted features provide something embeddings alone cannot: interpretable, structured attributes that humans and downstream systems can understand. A support team dashboard can filter by urgency and sentiment. A routing system can use requires_human to escalate tickets. A financial impact flag can trigger automated refund workflows. These features are also composable: you can combine them with structured data (account age, purchase history, previous tickets) in a single feature matrix for a classical model.
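To feed these attributes into a classical model, convert them into a numeric row, for example by one-hot encoding the categorical fields. A sketch under the schema above (the field names match the extraction prompt; the encoding choices are illustrative):

```python
import numpy as np

URGENCY = ["low", "medium", "high"]
CATEGORIES = ["billing", "technical", "account", "shipping", "general"]

def features_to_row(f: dict) -> np.ndarray:
    """Convert LLM-extracted features into a numeric feature vector."""
    row = [
        URGENCY.index(f["urgency"]),    # ordinal: 0, 1, 2
        f["sentiment"],                 # already numeric
        float(f["has_financial_impact"]),
        float(f["requires_human"]),
        len(f["entities"]),             # simple count feature
    ]
    row += [1.0 if f["category"] == c else 0.0 for c in CATEGORIES]  # one-hot
    return np.array(row)

example = {
    "urgency": "high", "sentiment": -0.8, "category": "billing",
    "has_financial_impact": True, "requires_human": True,
    "entities": ["premium subscription", "$499", "3 years", "refund"],
}
print(features_to_row(example))
# 10-dim row: urgency=2, sentiment=-0.8, both flags=1, 4 entities, one-hot "billing"
```

These columns can then be stacked alongside structured metadata (account age, purchase history) and, if desired, the raw embedding vector.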

2.2 Enriching Sparse Structured Data

One of the most underappreciated patterns is using an LLM to generate text descriptions of structured data, then embedding those descriptions to create richer features. This is particularly useful when your structured data is sparse or when feature interactions are complex and difficult to engineer manually.

import openai
import numpy as np

client = openai.OpenAI()

def describe_product(row: dict) -> str:
    """Generate a text description of a product from structured data."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": (
                "Generate a one-paragraph natural language description "
                "of this product based on its attributes. Include all "
                "relevant details that would help classify or compare it."
            )},
            {"role": "user", "content": str(row)}
        ],
        max_tokens=150,
        temperature=0.3,
    )
    return response.choices[0].message.content

# Sparse product data
product = {
    "name": "UltraFit Pro 3000",
    "category": "fitness",
    "price": 299.99,
    "weight_kg": 2.1,
    "rating": 4.2,
    "reviews_count": 847,
    "brand_tier": "premium",
    "launch_year": 2024,
    "has_bluetooth": True,
    "battery_hours": 48,
}

description = describe_product(product)
print(f"Generated description:\n{description}")

# Now embed this description for richer features
embedding_response = client.embeddings.create(
    input=description,
    model="text-embedding-3-small"
)
enriched_embedding = embedding_response.data[0].embedding

print(f"\nEnriched embedding dims: {len(enriched_embedding)}")
print(f"Original structured features: {len(product)} columns")
print(f"Combined feature vector: {len(product) + len(enriched_embedding)} dims")
Generated description:
The UltraFit Pro 3000 is a premium fitness device priced at $299.99, launched in 2024. Weighing just 2.1 kg, it features Bluetooth connectivity and an impressive 48-hour battery life, making it ideal for extended training sessions. With a solid 4.2-star rating from 847 customer reviews, this device has strong market validation in the premium fitness technology segment.

Enriched embedding dims: 1536
Original structured features: 10 columns
Combined feature vector: 1546 dims

3. Local Embedding Models

API-based embeddings are convenient but introduce latency, cost, and data privacy concerns. For high-volume pipelines or sensitive data, local embedding models are a better choice. The sentence-transformers library provides dozens of pre-trained models that run on CPU or GPU.

API vs. Local Embedding Models

API Embeddings
  ✓ No GPU required
  ✓ Latest models (OpenAI, Cohere)
  ✓ Zero maintenance
  ✗ Network latency (50-200ms)
  ✗ Per-token cost at scale
  ✗ Data leaves your network
  ✗ Rate limits
  Best for: low volume, prototyping

Local Embeddings
  ✓ No network latency (2-10ms)
  ✓ No per-query cost
  ✓ Data stays on-premises
  ✓ No rate limits
  ✗ Requires GPU (or slow on CPU)
  ✗ Model management overhead
  ✗ May lag behind frontier quality
  Best for: high volume, privacy-sensitive
Figure 11.4: API embeddings offer convenience while local models offer cost efficiency and privacy at scale.
from sentence_transformers import SentenceTransformer
import numpy as np
import time

# Load a local embedding model
model = SentenceTransformer('all-MiniLM-L6-v2')  # 80MB, fast

texts = [
    "I was charged twice for my subscription",
    "The app crashes every time I try to upload a photo",
    "How do I update my billing address?",
    "My package was delivered to the wrong address",
    "Can you explain your enterprise pricing plans?",
]

# Generate embeddings locally
start = time.perf_counter()
embeddings = model.encode(texts, normalize_embeddings=True)
elapsed = time.perf_counter() - start

print(f"Model: all-MiniLM-L6-v2")
print(f"Embedding shape: {embeddings.shape}")
print(f"Time for 5 texts: {elapsed*1000:.1f} ms")
print(f"Avg per text: {elapsed/len(texts)*1000:.1f} ms")
print(f"Cost: $0.00 (local inference)")

# Compute similarity matrix
similarity = np.dot(embeddings, embeddings.T)
print(f"\nSimilarity between 'charged twice' and 'billing address': "
      f"{similarity[0][2]:.3f}")
print(f"Similarity between 'charged twice' and 'app crashes': "
      f"{similarity[0][1]:.3f}")
Model: all-MiniLM-L6-v2
Embedding shape: (5, 384)
Time for 5 texts: 28.3 ms
Avg per text: 5.7 ms
Cost: $0.00 (local inference)

Similarity between 'charged twice' and 'billing address': 0.412
Similarity between 'charged twice' and 'app crashes': 0.089

4. Combining Embeddings with Structured Features

The most powerful pattern combines LLM embeddings (capturing text semantics) with traditional structured features (capturing numeric and categorical data) in a single model. This is especially effective for tasks where both text and metadata carry predictive signal, such as support ticket prioritization, product recommendation, and content moderation.

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
import xgboost as xgb

np.random.seed(42)
n_samples = 1000

# Simulate embeddings (384-dim from local model)
text_embeddings = np.random.randn(n_samples, 384) * 0.1

# Simulate structured features
structured_features = np.column_stack([
    np.random.randint(0, 365, n_samples),    # account_age_days
    np.random.randint(0, 50, n_samples),      # previous_tickets
    np.random.uniform(0, 5000, n_samples),    # account_value
    np.random.choice([0, 1], n_samples),      # is_premium
    np.random.uniform(0, 1, n_samples),       # sentiment_score
])

# Add predictive signal
labels = (
    (structured_features[:, 3] == 1).astype(float) * 0.3 +  # premium
    (structured_features[:, 2] > 2500).astype(float) * 0.3 +  # high value
    np.random.randn(n_samples) * 0.2
) > 0.3
labels = labels.astype(int)

# Three feature configurations
configs = {
    "Structured only": structured_features,
    "Embeddings only": text_embeddings,
    "Combined (structured + embeddings)": np.hstack([
        StandardScaler().fit_transform(structured_features),
        text_embeddings
    ]),
}

model = xgb.XGBClassifier(
    n_estimators=100, max_depth=4, learning_rate=0.1,
    eval_metric='logloss'
)

print("Feature ablation study (5-fold CV accuracy):")
print("=" * 55)
for name, features in configs.items():
    scores = cross_val_score(model, features, labels, cv=5)
    print(f"  {name:40s} {scores.mean():.3f} (+/- {scores.std():.3f})")
Feature ablation study (5-fold CV accuracy):
=======================================================
  Structured only                          0.823 (+/- 0.021)
  Embeddings only                          0.534 (+/- 0.035)
  Combined (structured + embeddings)       0.841 (+/- 0.018)
ⓘ Note on Feature Scaling

When combining embeddings with structured features, always standardize (zero mean, unit variance) the structured features. Embedding vectors from language models are typically already normalized or on a consistent scale, but structured features like "account_value" (range 0 to 5000) and "is_premium" (0 or 1) need to be rescaled so that the gradient-based optimizer does not disproportionately weight high-magnitude features. Tree-based models like XGBoost are less sensitive to feature scaling, but it is still good practice.

5. Dimensionality Reduction for Embeddings

High-dimensional embeddings (1536 dimensions from OpenAI, 768 or 1024 from many open models) can cause issues with some classical models. Logistic regression may overfit without strong regularization, and tree ensembles train slowly because each split must evaluate thousands of candidate features. Dimensionality reduction techniques like PCA or UMAP can compress embeddings while preserving most of the information.
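PCA is usually the first technique to try: fit it once on training embeddings and apply the same transform at inference. A sketch on synthetic 1536-dimensional vectors (real embeddings are far more correlated than random noise, so much more variance survives compression in practice):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 1536))  # stand-in for real embeddings

# Compress 1536 dims -> 256 dims; fit on training data only,
# then reuse pca.transform() on new embeddings at inference
pca = PCA(n_components=256, random_state=0)
X_reduced = pca.fit_transform(X)

print(f"Reduced shape: {X_reduced.shape}")
print(f"Variance retained: {pca.explained_variance_ratio_.sum():.1%}")
```

The fitted PCA object is part of the model artifact: like the embedding model itself, it must be the same at training and inference time.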

Embedding Model          Dimensions   Quality (MTEB)   Cost                 Speed
text-embedding-3-large   3072         Highest          $0.00013/1K tokens   API latency
text-embedding-3-small   1536         High             $0.00002/1K tokens   API latency
all-MiniLM-L6-v2         384          Good             Free (local)         ~5ms/text (GPU)
bge-large-en-v1.5        1024         High             Free (local)         ~15ms/text (GPU)
nomic-embed-text-v1.5    768          High             Free (local)         ~8ms/text (GPU)
⚠ Pitfall: Embedding Model Mismatch

Always use the same embedding model for both training and inference. Embeddings from different models live in different vector spaces and are not interchangeable. If you train your XGBoost classifier on OpenAI embeddings, you cannot switch to a local sentence-transformer model at inference time without retraining. Plan your embedding strategy before building the pipeline, considering both quality and long-term operational costs.
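One practical safeguard is to store the embedding model's identifier alongside the trained classifier and verify it at inference time. A minimal sketch (the bundle layout and `predict` wrapper are illustrative, with simulated embeddings standing in for real ones):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

EMBEDDING_MODEL = "text-embedding-3-small"

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 8))  # stand-in for real embeddings
y = rng.integers(0, 2, 100)

# Bundle the embedding model's identifier with the trained classifier
bundle = {
    "embedding_model": EMBEDDING_MODEL,
    "classifier": LogisticRegression(max_iter=1000).fit(X, y),
}

def predict(bundle: dict, embeddings: np.ndarray, embedding_model: str) -> np.ndarray:
    """Refuse to predict if the inference-time embedding model differs."""
    if embedding_model != bundle["embedding_model"]:
        raise ValueError(
            f"Embeddings from {embedding_model!r} are incompatible with a "
            f"classifier trained on {bundle['embedding_model']!r} embeddings"
        )
    return bundle["classifier"].predict(embeddings)

preds = predict(bundle, X[:3], "text-embedding-3-small")  # OK
# predict(bundle, X[:3], "all-MiniLM-L6-v2")              # raises ValueError
```

Persisting this metadata with the model artifact turns a silent quality collapse into a loud, debuggable error.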

Knowledge Check

1. Why do LLM embeddings outperform TF-IDF on semantic tasks?
Show Answer
TF-IDF represents text as sparse vectors based on word frequencies, capturing only lexical overlap. Embeddings from LLMs map text to dense vectors in a learned semantic space where synonyms, paraphrases, and conceptually related phrases are represented by nearby points. This means "car" and "automobile" are close in embedding space but share zero TF-IDF features.
2. When is LLM-powered feature engineering more valuable than raw embeddings?
Show Answer
LLM-powered feature engineering produces interpretable, structured attributes (urgency, sentiment, categories) that can be used by downstream systems, human operators, and business rules. Raw embeddings are opaque numeric vectors. Use feature engineering when you need interpretability, when downstream systems need specific signals (routing, alerting), or when you want to combine extracted text features with structured metadata in a single model.
3. What is the primary advantage of local embedding models over API-based embeddings?
Show Answer
Local embedding models eliminate per-query API costs, reduce latency by avoiding network round-trips, keep data on-premises (important for privacy and compliance), and have no rate limits. The tradeoff is that you need GPU infrastructure and must manage model updates yourself. At high volumes (millions of embeddings per day), local models are dramatically more cost-effective.
4. Why is it important to standardize structured features before combining them with embeddings?
Show Answer
Structured features like "account_value" (range 0 to 5000) have much larger magnitudes than embedding dimensions (typically in the range of negative 1 to positive 1). Without standardization, gradient-based optimizers will disproportionately weight high-magnitude features, and the embedding dimensions will be effectively ignored. Standardizing to zero mean and unit variance puts all features on a comparable scale.
5. How would you choose between text-embedding-3-small and a local model like all-MiniLM-L6-v2 for a production pipeline?
Show Answer
The decision depends on volume, latency requirements, data privacy, and quality needs. OpenAI's text-embedding-3-small offers higher quality (measured by MTEB benchmarks) and requires no GPU infrastructure, but incurs per-token API costs and sends data to an external service. A local model like all-MiniLM-L6-v2 has zero marginal cost, sub-10ms latency, and keeps data on-premises, but produces lower-dimensional embeddings (384 vs. 1536) and requires GPU management. At high volumes (millions of embeddings per day), local models are dramatically more cost-effective. For low-volume or prototype workloads, API-based embeddings are simpler to deploy.
6. You trained an XGBoost classifier using OpenAI embeddings and now want to switch to a cheaper local embedding model. What steps are required?
Show Answer
You must re-embed your entire training dataset using the new local model because embeddings from different models live in incompatible vector spaces. After re-embedding, you need to retrain the XGBoost classifier on the new feature vectors. You should also re-run your evaluation suite to verify that the quality tradeoff is acceptable. Simply swapping the embedding model at inference time without retraining will produce meaningless predictions, since the classifier learned decision boundaries in the original embedding space.

Key Takeaways