Module 11 · Section 11.1

When to Use LLM vs. Classical ML

A principled decision framework for choosing the right tool for each production task
★ Big Picture

Not every problem needs an LLM. The excitement around large language models has led many teams to reach for GPT-4 or Claude when a logistic regression model trained in five minutes would deliver the same accuracy at 1/1000th the cost and 1/100th the latency. Conversely, some teams stubbornly avoid LLMs for tasks where classical approaches require months of feature engineering to achieve mediocre results, while an LLM solves the problem out of the box. This section provides a structured decision framework for making this choice rigorously, grounded in cost modeling, latency analysis, and empirical benchmarks across common task types.

1. The Decision Framework

Choosing between an LLM and a classical ML model is not a binary decision. It is a multi-dimensional optimization across four axes: accuracy, latency, cost, and interpretability. The right choice depends on the specific requirements of your production system, the volume of requests you need to handle, and the tolerance your users have for errors, delays, and unexplainable outputs.

1.1 The Four Decision Axes

Each axis carries different weight depending on the application. A fraud detection system prioritizes accuracy and interpretability (regulators demand explanations). A customer chatbot prioritizes latency and naturalness. A batch document processing pipeline prioritizes cost per document. Understanding which axes matter most for your use case is the first step in making a good decision.
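The weighting idea can be made concrete with a toy scoring sketch. The axis weights and per-approach scores below are illustrative placeholders for a fraud-detection scenario, not measured values:

```python
# Toy weighted scoring across the four decision axes.
# All numbers are illustrative, not benchmarks.

def weighted_score(scores: dict, weights: dict) -> float:
    """Combine per-axis scores (0-1, higher is better) using axis weights."""
    total = sum(weights.values())
    return sum(scores[axis] * w for axis, w in weights.items()) / total

# Fraud detection: accuracy and interpretability dominate.
weights = {"accuracy": 0.4, "latency": 0.1, "cost": 0.1, "interpretability": 0.4}

candidates = {
    "xgboost": {"accuracy": 0.90, "latency": 0.95, "cost": 0.95, "interpretability": 0.8},
    "llm":     {"accuracy": 0.85, "latency": 0.30, "cost": 0.20, "interpretability": 0.3},
}

ranked = sorted(candidates,
                key=lambda name: weighted_score(candidates[name], weights),
                reverse=True)
print(ranked)
```

Re-weighting the axes for a different application (a chatbot that prioritizes latency, say) changes the scores; the value of the exercise is that it forces the tradeoff to be explicit rather than implicit.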

[Figure: flowchart titled "LLM vs. Classical ML Decision Framework." From "Evaluate Task Requirements," three branches: structured data / high volume → Classical ML (low cost, low latency: TF-IDF, XGBoost, LR; spaCy, regex, CRF); unstructured text / complex reasoning → LLM (high flexibility: GPT-4, Claude, Llama; few-shot, CoT); mixed complexity → Hybrid (best of both worlds: classifier triage + LLM escalation).]
Figure 11.1: The three-way decision between classical ML, LLM, and hybrid approaches depends on data structure, complexity, and volume requirements.

1.2 When Classical ML Wins

Classical machine learning models dominate when the data is structured, the task is well-defined, labeled data is available, and latency or cost constraints are tight. The primary scenarios where classical ML is the better choice:

  1. Tabular prediction over numeric and categorical features, where gradient-boosted trees remain state of the art.
  2. High-volume text classification with thousands of labeled examples, where TF-IDF + Logistic Regression or a fine-tuned BERT model is cheaper and faster.
  3. Extraction of regular, well-defined patterns, where regex and rules are deterministic and essentially free.
  4. Tight latency budgets or cost ceilings that rule out per-query API calls.

1.3 When LLMs Win

LLMs excel in scenarios that require understanding natural language semantics, handling ambiguity, generalizing from few examples, or producing fluent text output:

  1. Zero-shot or few-shot tasks where no labeled training data exists.
  2. Generation tasks: summarization, drafting, rewriting, open-ended question answering.
  3. Complex reasoning over long or ambiguous documents.
  4. Rapid prototyping, where time to a working system matters more than per-query cost.

2. Empirical Benchmarks: Classification

The best way to make the decision is to benchmark. Let us compare four approaches on a common text classification task: classifying customer support tickets into categories (billing, technical, account, shipping, general).

2.1 TF-IDF + Logistic Regression Baseline

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
import time

# Sample labeled data (in production: thousands of examples)
texts = [
    "I was charged twice for my subscription",
    "My app keeps crashing on login",
    "How do I change my email address",
    "Package hasn't arrived after 10 days",
    "What are your business hours",
    "Refund not showing in my account",
    "Error 500 when uploading files",
    "Reset my password please",
    "Tracking number not working",
    "Do you offer student discounts",
] * 100  # Simulate a larger dataset (duplicates leak across the split, so accuracy is optimistic)

labels = [
    "billing", "technical", "account", "shipping", "general",
    "billing", "technical", "account", "shipping", "general",
] * 100

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, random_state=42
)

# Train TF-IDF + Logistic Regression
vectorizer = TfidfVectorizer(max_features=5000, ngram_range=(1, 2))
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)

model = LogisticRegression(max_iter=1000, C=1.0)
model.fit(X_train_tfidf, y_train)

# Benchmark inference (vectorize the query once per prediction)
start = time.perf_counter()
for _ in range(1000):
    query_vec = vectorizer.transform(["I was charged twice"])
    model.predict(query_vec)
elapsed = time.perf_counter() - start

print(f"Accuracy: {model.score(X_test_tfidf, y_test):.3f}")
print(f"Avg latency: {elapsed / 1000 * 1000:.2f} ms")
print("Cost per query: ~$0.000001 (CPU inference)")
Accuracy: 1.000
Avg latency: 0.12 ms
Cost per query: ~$0.000001 (CPU inference)

On this straightforward classification task with clean, distinct categories and sufficient training data, the classical approach achieves perfect accuracy (as expected with these separable examples) at sub-millisecond latency and negligible cost. In production with noisier data, accuracy would be lower, but the cost and latency advantages remain massive.

2.2 LLM Few-Shot Classification

import openai
import time

client = openai.OpenAI()

def classify_with_llm(text: str) -> str:
    """Classify a support ticket using GPT-4o-mini."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": (
                "Classify the customer support ticket into exactly one "
                "category: billing, technical, account, shipping, general. "
                "Respond with only the category name."
            )},
            {"role": "user", "content": text}
        ],
        max_tokens=10,
        temperature=0
    )
    return response.choices[0].message.content.strip().lower()

# Benchmark
test_queries = [
    "I was charged twice for my subscription",
    "My app keeps crashing on login",
    "How do I change my email address",
    "Package hasn't arrived after 10 days",
    "What are your business hours",
]

start = time.perf_counter()
results = [classify_with_llm(q) for q in test_queries]
elapsed = time.perf_counter() - start

for query, result in zip(test_queries, results):
    print(f"  '{query[:40]}...' => {result}")

print(f"\nAvg latency: {elapsed / len(test_queries) * 1000:.0f} ms")
print("Cost per query: ~$0.0003 (GPT-4o-mini)")
'I was charged twice for my subscription...' => billing
'My app keeps crashing on login...' => technical
'How do I change my email address...' => account
'Package hasn't arrived after 10 days...' => shipping
'What are your business hours...' => general

Avg latency: 450 ms
Cost per query: ~$0.0003 (GPT-4o-mini)
★ Key Insight

The LLM achieves the same accuracy on these clear-cut examples, but at 3,750x the latency and 300x the cost. The LLM's advantage appears when categories are ambiguous, descriptions are complex, or labeled training data is unavailable. For a team that needs a classifier running in production tomorrow with no labeled data, the LLM approach is ready immediately, while the classical approach requires a labeling effort first.

3. Tabular Data: Where LLMs Struggle

One of the clearest cases for classical ML is tabular prediction. LLMs were designed to process sequential text, not structured rows of numeric and categorical features. While recent research has explored serializing tabular data into text prompts, the results consistently lag behind purpose-built tree models.

import xgboost as xgb
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import numpy as np
import time

# Generate a realistic tabular classification task
X, y = make_classification(
    n_samples=10000,
    n_features=20,
    n_informative=12,
    n_redundant=4,
    n_classes=3,
    random_state=42,
    class_sep=1.0
)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Train XGBoost
clf = xgb.XGBClassifier(
    n_estimators=100,
    max_depth=6,
    learning_rate=0.1,
    eval_metric='mlogloss'
)
clf.fit(X_train, y_train)

# Benchmark
y_pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

start = time.perf_counter()
for _ in range(10000):
    clf.predict(X_test[:1])
elapsed = time.perf_counter() - start

print(f"XGBoost accuracy: {accuracy:.3f}")
print(f"Avg latency: {elapsed / 10000 * 1000:.3f} ms")
print("Cost per query: ~$0.0000001")
print("\nFor comparison, serializing 20 features to text")
print("and sending to GPT-4o would cost ~$0.002 per query")
print("and take ~800ms, with lower accuracy on tabular data.")
XGBoost accuracy: 0.891
Avg latency: 0.008 ms
Cost per query: ~$0.0000001

For comparison, serializing 20 features to text
and sending to GPT-4o would cost ~$0.002 per query
and take ~800ms, with lower accuracy on tabular data.
⚠ Common Mistake

Do not serialize tabular data into text and send it to an LLM for prediction. LLMs lack the inductive biases that make tree models effective on tabular data: they cannot natively split on feature thresholds, handle missing values efficiently, or exploit feature interactions the way gradient-boosted trees do. Research consistently shows that XGBoost and LightGBM outperform LLMs on tabular benchmarks, often by significant margins.

4. Regex vs. LLM for Pattern Extraction

For extracting structured patterns from text, the decision often comes down to how regular the patterns are. Phone numbers, email addresses, dates in known formats, and currency amounts follow predictable patterns that regex handles perfectly. Free-form entity extraction (person names, product mentions, medical conditions) requires the semantic understanding that LLMs provide.

[Figure: "Pattern Regularity Spectrum." Highly regular → Regex / Rules (phone numbers, email addresses, ISO-format dates, currency amounts, URLs, IPs; latency <0.01 ms, accuracy 99.9%+, cost ~$0). Semi-structured → Hybrid (addresses in varied formats, product codes, natural-language dates, company names; regex pre-filter + LLM refinement). Unstructured → LLM required (person names in context, medical conditions, sentiment with sarcasm, implied relationships, ambiguous references; latency 200 ms+, accuracy 85-95%, cost $0.001+).]
Figure 11.2: The regularity of patterns determines whether regex, hybrid, or LLM-based extraction is appropriate.
import re
import time

text = """
Contact us at support@example.com or call +1 (555) 123-4567.
Invoice total: $1,234.56 due by 2025-03-15.
Reach Jane Smith at jane.smith@company.org for details.
"""

# Regex extraction: deterministic, fast, perfect for regular patterns
patterns = {
    "emails": r'[\w.+-]+@[\w-]+\.[\w.-]+',
    "phones": r'\+?1?\s*\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}',
    "amounts": r'\$[\d,]+\.\d{2}',
    "dates": r'\d{4}-\d{2}-\d{2}',
}

start = time.perf_counter()
for _ in range(10000):
    results = {k: re.findall(v, text) for k, v in patterns.items()}
elapsed = time.perf_counter() - start

for entity_type, matches in results.items():
    print(f"  {entity_type}: {matches}")

print(f"\nAvg latency: {elapsed / 10000 * 1000:.4f} ms")
print("False positives: 0 (deterministic)")
print("Cost: $0.00 (no API call)")
emails: ['support@example.com', 'jane.smith@company.org']
phones: ['+1 (555) 123-4567']
amounts: ['$1,234.56']
dates: ['2025-03-15']

Avg latency: 0.0035 ms
False positives: 0 (deterministic)
Cost: $0.00 (no API call)
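For the semi-structured middle of the spectrum, the hybrid strategy from Figure 11.2 applies: regex pulls the regular fields cheaply, and only the free-form entities escalate to an LLM. A minimal sketch — `extract_names_with_llm` is a hypothetical stand-in for an API call like the one in section 2.2:

```python
import re

def extract_structured(text: str) -> dict:
    """Cheap, deterministic pass for the regular patterns."""
    return {
        "emails": re.findall(r'[\w.+-]+@[\w-]+\.[\w.-]+', text),
        "amounts": re.findall(r'\$[\d,]+\.\d{2}', text),
    }

def extract_names_with_llm(text: str) -> list:
    """Hypothetical LLM call for free-form entities (person names)."""
    raise NotImplementedError("wrap a chat-completions call here")

def hybrid_extract(text: str, use_llm: bool = False) -> dict:
    result = extract_structured(text)
    # Escalate only the fields regex cannot handle.
    result["names"] = extract_names_with_llm(text) if use_llm else []
    return result

doc = "Invoice $99.00 due soon; contact ops@example.com"
print(hybrid_extract(doc))
```

The design point: every query pays the near-zero regex cost, but only the fraction needing semantic understanding pays the LLM cost.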

5. Cost Modeling at Scale

The per-query cost difference between approaches becomes dramatic at scale. A cost model must account for not just API or inference costs, but also the engineering time to build and maintain each approach, infrastructure costs for self-hosted models, and the opportunity cost of latency.

Approach          Per-Query Cost   1K/day     100K/day    10M/day        Latency
Regex / Rules     ~$0.000001       $0.03/mo   $3/mo       $300/mo        <0.01 ms
TF-IDF + LR       ~$0.00001        $0.30/mo   $30/mo      $3,000/mo      0.1 ms
Fine-tuned BERT   ~$0.0001         $3/mo      $300/mo     $30,000/mo     5 ms
GPT-4o-mini       ~$0.0003         $9/mo      $900/mo     $90,000/mo     300 ms
GPT-4o            ~$0.003          $90/mo     $9,000/mo   $900,000/mo    800 ms
ⓘ Note

These cost estimates assume typical input/output sizes for a classification task (approximately 100 input tokens, 10 output tokens). Costs for generation-heavy tasks (summarization, writing) will be significantly higher due to larger output token counts. Always calculate costs based on your actual token distributions, not averages from benchmarks.
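The monthly figures in the table are straightforward arithmetic (per-query cost × daily volume × days per month); a minimal sketch:

```python
def monthly_cost(per_query: float, queries_per_day: int, days: int = 30) -> float:
    """API/inference cost per month, ignoring engineering and infra overhead."""
    return per_query * queries_per_day * days

# GPT-4o-mini at 100K queries/day -> $900/month, matching the table.
print(f"${monthly_cost(0.0003, 100_000):,.0f}/mo")
# TF-IDF + LR at 10M queries/day -> $3,000/month.
print(f"${monthly_cost(0.00001, 10_000_000):,.0f}/mo")
```

Plug in your own token-weighted per-query cost rather than the table's classification-sized estimates.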

6. The Decision Matrix

Pulling together all the factors above, here is a practical decision matrix. For any new ML task, walk through these questions in order:

  1. Is the pattern regular and well-defined? Use regex or rules. Do not overthink it.
  2. Is the data tabular (structured rows and columns)? Use XGBoost or LightGBM. LLMs are not competitive on tabular data.
  3. Do you have thousands of labeled examples? Train a classical model (TF-IDF + LR for text, BERT for complex text). It will be cheaper and faster.
  4. Is the task zero-shot or few-shot? Use an LLM. It is the only option that works without labeled data.
  5. Does the task require generation or complex reasoning? Use an LLM. Classical models cannot generate coherent text or reason over long documents.
  6. Is volume extremely high (millions per day) and cost matters? Consider fine-tuning a smaller model (BERT or a small LLM) to replace the large LLM, or use a hybrid approach where a classifier handles the easy cases.
  7. None of the above clearly applies? Start with an LLM for rapid prototyping, then evaluate whether a cheaper model can match its performance once you have collected enough labeled data from the LLM's outputs.
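The checklist above can be condensed into an ordered rule chain where the first matching rule wins. The parameter names and return labels below are illustrative, not a library API:

```python
# Sketch of the decision matrix as an ordered rule chain.
# Inputs are the yes/no answers a team would give to questions 1-7.

def choose_approach(regular_pattern: bool = False,
                    tabular: bool = False,
                    labeled_examples: int = 0,
                    needs_generation: bool = False,
                    high_volume: bool = False) -> str:
    if regular_pattern:
        return "regex/rules"                          # question 1
    if tabular:
        return "xgboost/lightgbm"                     # question 2
    if labeled_examples >= 1000:
        return "classical (TF-IDF+LR or BERT)"        # question 3
    if needs_generation:
        return "llm"                                  # questions 4-5
    if high_volume:
        return "hybrid (small model + LLM fallback)"  # question 6
    return "llm (prototype, then distill)"            # question 7

print(choose_approach(tabular=True))
print(choose_approach(labeled_examples=50_000))
```

Note the ordering matters: a tabular task with 50,000 labels still routes to trees before the generic classical branch is ever reached.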
★ Key Insight: The LLM Bootstrap Pattern

A common and powerful pattern is to start with an LLM for a new task, use it to generate labeled data, then train a smaller classical or fine-tuned model to replace the LLM for production. This gives you the speed-to-market of LLMs with the cost efficiency of classical ML. The LLM continues to serve as a fallback for edge cases the smaller model is not confident about. We will explore this pattern in depth in Section 11.3.
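A minimal sketch of the serving side of this pattern, with a toy TF-IDF + Logistic Regression model standing in for the distilled classifier and a hypothetical `llm_classify` callable standing in for the LLM fallback:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def route(small_model, vectorizer, text, threshold=0.8, llm_classify=None):
    """Serve from the cheap model; escalate low-confidence cases to the LLM."""
    probs = small_model.predict_proba(vectorizer.transform([text]))[0]
    if probs.max() >= threshold or llm_classify is None:
        return small_model.classes_[int(np.argmax(probs))]
    return llm_classify(text)  # hypothetical LLM fallback for uncertain inputs

# Toy model standing in for the distilled production classifier.
texts = ["charged twice", "refund missing", "app crashing", "error on login"] * 25
labels = ["billing", "billing", "technical", "technical"] * 25
vec = TfidfVectorizer()
clf = LogisticRegression().fit(vec.fit_transform(texts), labels)

print(route(clf, vec, "I was charged twice"))
# An impossible threshold forces every query to the fallback:
print(route(clf, vec, "something unusual", threshold=1.1,
            llm_classify=lambda t: "llm-answer"))
```

The threshold is the operating knob: raise it and more traffic (and cost) goes to the LLM, lower it and the small model absorbs more risk.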

Knowledge Check

1. A team needs to classify 5 million customer emails per day into 10 categories. They have 50,000 labeled examples. Which approach is most cost-effective?
Show Answer
TF-IDF + Logistic Regression (or a fine-tuned BERT model if higher accuracy is needed). With 50,000 labeled examples, classical models will achieve strong accuracy. At 5 million queries per day, the cost of LLM API calls would be $1,500+ per day for GPT-4o-mini, compared to under $50 per day for a BERT model on a single GPU, or essentially free for TF-IDF + LR on CPU.
2. Why do LLMs perform poorly on tabular prediction tasks compared to XGBoost?
Show Answer
LLMs lack the inductive biases that make tree-based models effective on tabular data. Specifically, LLMs cannot natively split on feature thresholds, handle missing values efficiently, or exploit feature interactions the way gradient-boosted trees do. Serializing tabular rows to text introduces noise and destroys the structured relationships between features. Tree models are also orders of magnitude faster and cheaper for tabular inference.
3. A startup with no labeled data needs to extract product names and prices from competitor websites. What approach should they use?
Show Answer
A hybrid approach: use regex for prices (currency amounts follow regular patterns like $XX.XX) and an LLM for product names (which require semantic understanding of context). Over time, they can use the LLM outputs to build a labeled dataset and train a cheaper NER model to replace the LLM for the product name extraction.
4. What is the "LLM bootstrap pattern" and when should you use it?
Show Answer
The LLM bootstrap pattern uses an LLM to generate initial labels for a new task (where no labeled data exists), then trains a cheaper classical or fine-tuned model on those labels for production use. The LLM serves as a fallback for low-confidence cases. Use this pattern when you need to launch quickly (LLM gives immediate results), but volume will eventually make LLM costs unsustainable. It combines the speed-to-market of LLMs with the cost efficiency of classical ML.
5. At what daily query volume does the cost difference between GPT-4o-mini ($0.0003/query) and TF-IDF+LR ($0.00001/query) exceed $1,000 per month?
Show Answer
The cost difference per query is $0.00029. To reach $1,000/month: $1,000 / $0.00029 / 30 days ≈ 115,000 queries per day. At volumes above 115K queries/day, you save over $1,000/month by using the classical approach. At 1 million queries/day, the monthly savings are roughly $8,700 ($0.00029 × 1,000,000 × 30).
🛠 Modify and Observe

Experiment with the classification benchmark from this section:

  1. In the TF-IDF + Logistic Regression code, change max_features from 5000 to 500 and then to 50000. Observe how classification accuracy and training time change. This illustrates the vocabulary size tradeoff in traditional NLP.
  2. Replace LogisticRegression() with an XGBClassifier trained on the same TF-IDF features. Compare accuracy and inference latency. For many text classification tasks, logistic regression is surprisingly competitive with more complex models.
  3. In the cost comparison table, recalculate the daily cost assuming your volume is 100x higher (1 million queries per day). At what volume does the cost difference become prohibitive for an LLM approach?

Key Takeaways