Not every problem needs an LLM. The excitement around large language models has led many teams to reach for GPT-4 or Claude when a logistic regression model trained in five minutes would deliver the same accuracy at 1/1000th the cost and 1/100th the latency. Conversely, some teams stubbornly avoid LLMs for tasks where classical approaches require months of feature engineering to achieve mediocre results, while an LLM solves the problem out of the box. This section provides a structured decision framework for making this choice rigorously, grounded in cost modeling, latency analysis, and empirical benchmarks across common task types.
1. The Decision Framework
Choosing between an LLM and a classical ML model is not a binary decision. It is a multi-dimensional optimization across four axes: accuracy, latency, cost, and interpretability. The right choice depends on the specific requirements of your production system, the volume of requests you need to handle, and the tolerance your users have for errors, delays, and unexplainable outputs.
1.1 The Four Decision Axes
Each axis carries different weight depending on the application. A fraud detection system prioritizes accuracy and interpretability (regulators demand explanations). A customer chatbot prioritizes latency and naturalness. A batch document processing pipeline prioritizes cost per document. Understanding which axes matter most for your use case is the first step in making a good decision.
- Accuracy: How correct does the system need to be? Is 95% acceptable, or do you need 99.9%? Does the task have a clear ground truth, or is quality subjective?
- Latency: What is the acceptable response time? Real-time applications need sub-100ms responses. Interactive applications tolerate 1 to 3 seconds. Batch processing has no latency constraint.
- Cost: What is the per-query cost at your expected volume? A $0.01 per query cost is negligible at 100 queries per day but devastating at 10 million queries per day.
- Interpretability: Do you need to explain individual predictions? Regulatory requirements, debugging needs, and user trust all factor into this axis.
- Data Privacy: Can you send data to an external API? Organizations subject to GDPR, HIPAA, or financial regulations may be prohibited from transmitting sensitive data to third-party LLM providers. This constraint often tips the decision toward classical models or locally hosted open-source LLMs. Always verify your data governance policy before choosing an API-based approach.
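One rough way to make these tradeoffs explicit is to score each candidate approach against the axes with application-specific weights. The sketch below is illustrative only: the 0-to-1 axis scores and the weights are hypothetical placeholders that you would replace with your own assessments, not measurements from any benchmark.

```python
# Hypothetical weighted scoring across the decision axes.
# All scores and weights below are illustrative placeholders.
def score_approach(scores: dict, weights: dict) -> float:
    """Weighted sum of 0-to-1 axis scores; higher is better."""
    return sum(weights[axis] * scores[axis] for axis in weights)

# Example: a fraud-detection system that weights accuracy and
# interpretability heavily, per the discussion above.
weights = {"accuracy": 0.4, "latency": 0.1,
           "cost": 0.1, "interpretability": 0.4}

classical = {"accuracy": 0.90, "latency": 1.0,
             "cost": 1.0, "interpretability": 0.9}
llm = {"accuracy": 0.95, "latency": 0.3,
       "cost": 0.3, "interpretability": 0.2}

print(f"classical: {score_approach(classical, weights):.2f}")
print(f"llm:       {score_approach(llm, weights):.2f}")
```

The point is not the specific numbers but the exercise: writing down weights forces you to state which axes your application actually cares about.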
1.2 When Classical ML Wins
Classical machine learning models dominate when the data is structured, the task is well-defined, labeled data is available, and latency or cost constraints are tight. Here are the primary scenarios where classical ML is the better choice:
- Tabular prediction: For structured data with numeric and categorical features (credit scoring, churn prediction, demand forecasting), gradient-boosted trees (XGBoost, LightGBM) consistently outperform LLMs. LLMs cannot natively process tabular data without serializing it to text, which introduces noise and destroys the efficient representations that tree models exploit.
- High-volume classification with labeled data: When you have thousands of labeled examples for a text classification task, a TF-IDF plus logistic regression pipeline achieves competitive accuracy at a fraction of the cost. At 10 million queries per day, the cost difference between a $0.00001 per query classical model and a $0.01 per query LLM is the difference between $100 and $100,000 per day.
- Latency-critical applications: Classical models run inference in microseconds to low milliseconds on CPU. LLMs require tens of milliseconds to seconds depending on output length. For real-time bidding, fraud detection, or recommendation ranking where sub-10ms latency is required, classical ML is the only viable option.
- Deterministic extraction: When patterns are regular and well-defined (phone numbers, email addresses, dates, currency amounts), regex and rule-based systems are faster, cheaper, and more reliable than LLMs. They never hallucinate a phone number that does not exist in the input text.
1.3 When LLMs Win
LLMs excel in scenarios that require understanding natural language semantics, handling ambiguity, generalizing from few examples, or producing fluent text output:
- Zero-shot and few-shot tasks: When labeled data is scarce or unavailable, LLMs can perform classification, extraction, and summarization using only a task description and a handful of examples in the prompt.
- Complex reasoning over text: Tasks that require multi-step reasoning, combining information from different parts of a document, or understanding implicit context are natural fits for LLMs.
- Open-ended generation: Producing emails, summaries, creative text, or conversational responses requires the generative capabilities of LLMs. Classical models cannot generate coherent paragraphs.
- Ambiguous or subjective tasks: Sentiment analysis with sarcasm, intent detection with ambiguous phrasing, and content moderation with nuanced policy violations all benefit from the broad world knowledge and contextual understanding of LLMs.
2. Empirical Benchmarks: Classification
The best way to make the decision is to benchmark. Let us compare two approaches on a common text classification task: classifying customer support tickets into categories (billing, technical, account, shipping, general).
2.1 TF-IDF + Logistic Regression Baseline
```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
import time

# Sample labeled data (in production: thousands of examples)
texts = [
    "I was charged twice for my subscription",
    "My app keeps crashing on login",
    "How do I change my email address",
    "Package hasn't arrived after 10 days",
    "What are your business hours",
    "Refund not showing in my account",
    "Error 500 when uploading files",
    "Reset my password please",
    "Tracking number not working",
    "Do you offer student discounts",
] * 100  # Simulate larger dataset
labels = [
    "billing", "technical", "account", "shipping", "general",
    "billing", "technical", "account", "shipping", "general",
] * 100

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, random_state=42
)

# Train TF-IDF + Logistic Regression
vectorizer = TfidfVectorizer(max_features=5000, ngram_range=(1, 2))
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)
model = LogisticRegression(max_iter=1000, C=1.0)
model.fit(X_train_tfidf, y_train)

# Benchmark inference: vectorize once per query, then predict
start = time.perf_counter()
for _ in range(1000):
    features = vectorizer.transform(["I was charged twice"])
    model.predict(features)
elapsed = time.perf_counter() - start

print(f"Accuracy: {model.score(X_test_tfidf, y_test):.3f}")
print(f"Avg latency: {elapsed / 1000 * 1000:.2f} ms")
print("Cost per query: ~$0.000001 (CPU inference)")
```
On this straightforward classification task with clean, distinct categories and sufficient training data, the classical approach achieves perfect accuracy (as expected with these separable examples) at sub-millisecond latency and negligible cost. In production with noisier data, accuracy would be lower, but the cost and latency advantages remain massive.
2.2 LLM Few-Shot Classification
```python
import openai
import time

client = openai.OpenAI()

def classify_with_llm(text: str) -> str:
    """Classify a support ticket using GPT-4o-mini."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": (
                "Classify the customer support ticket into exactly one "
                "category: billing, technical, account, shipping, general. "
                "Respond with only the category name."
            )},
            {"role": "user", "content": text},
        ],
        max_tokens=10,
        temperature=0,
    )
    return response.choices[0].message.content.strip().lower()

# Benchmark
test_queries = [
    "I was charged twice for my subscription",
    "My app keeps crashing on login",
    "How do I change my email address",
    "Package hasn't arrived after 10 days",
    "What are your business hours",
]

start = time.perf_counter()
results = [classify_with_llm(q) for q in test_queries]
elapsed = time.perf_counter() - start

for query, result in zip(test_queries, results):
    print(f"  '{query[:40]}...' => {result}")
print(f"\nAvg latency: {elapsed / len(test_queries) * 1000:.0f} ms")
print("Cost per query: ~$0.0003 (GPT-4o-mini)")
```
The LLM achieves the same accuracy on these clear-cut examples, but at 3,750x the latency and 300x the cost. The LLM's advantage appears when categories are ambiguous, descriptions are complex, or labeled training data is unavailable. For a team that needs a classifier running in production tomorrow with no labeled data, the LLM approach is ready immediately, while the classical approach requires a labeling effort first.
3. Tabular Data: Where LLMs Struggle
One of the clearest cases for classical ML is tabular prediction. LLMs were designed to process sequential text, not structured rows of numeric and categorical features. While recent research has explored serializing tabular data into text prompts, the results consistently lag behind purpose-built tree models.
```python
import xgboost as xgb
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import time

# Generate a realistic tabular classification task
X, y = make_classification(
    n_samples=10000,
    n_features=20,
    n_informative=12,
    n_redundant=4,
    n_classes=3,
    random_state=42,
    class_sep=1.0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Train XGBoost
clf = xgb.XGBClassifier(
    n_estimators=100,
    max_depth=6,
    learning_rate=0.1,
    eval_metric='mlogloss',
)
clf.fit(X_train, y_train)

# Benchmark accuracy and single-row inference latency
y_pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

start = time.perf_counter()
for _ in range(10000):
    clf.predict(X_test[:1])
elapsed = time.perf_counter() - start

print(f"XGBoost accuracy: {accuracy:.3f}")
print(f"Avg latency: {elapsed / 10000 * 1000:.3f} ms")
print("Cost per query: ~$0.0000001")
print("\nFor comparison, serializing 20 features to text")
print("and sending to GPT-4o would cost ~$0.002 per query")
print("and take ~800ms, with lower accuracy on tabular data.")
```
Do not serialize tabular data into text and send it to an LLM for prediction. LLMs lack the inductive biases that make tree models effective on tabular data: they cannot natively split on feature thresholds, handle missing values efficiently, or exploit feature interactions the way gradient-boosted trees do. Research consistently shows that XGBoost and LightGBM outperform LLMs on tabular benchmarks, often by significant margins.
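To make the serialization cost concrete, here is a sketch of what turning one tabular row into a prompt looks like. The feature names are hypothetical, and the token count uses the common rough heuristic of ~4 characters per token, which is only an approximation:

```python
# Illustrative sketch: serializing one tabular row to text for an LLM.
# Feature names are hypothetical; the ~4 chars/token figure is a
# rough heuristic, not an exact tokenizer count.
def serialize_row(features: dict) -> str:
    return "Predict the class given: " + ", ".join(
        f"{name} = {value:.3f}" for name, value in features.items()
    )

row = {f"feature_{i}": 0.123 * i for i in range(20)}
prompt = serialize_row(row)
approx_tokens = len(prompt) // 4

print(prompt[:60] + "...")
print(f"~{approx_tokens} tokens per row, paid on every prediction")
```

The tree model consumes the same 20 floats directly, with no serialization cost, no per-token bill, and no risk of the model misreading a number rendered as text.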
4. Regex vs. LLM for Pattern Extraction
For extracting structured patterns from text, the decision often comes down to how regular the patterns are. Phone numbers, email addresses, dates in known formats, and currency amounts follow predictable patterns that regex handles perfectly. Free-form entity extraction (person names, product mentions, medical conditions) requires the semantic understanding that LLMs provide.
```python
import re
import time

text = """
Contact us at support@example.com or call +1 (555) 123-4567.
Invoice total: $1,234.56 due by 2025-03-15.
Reach Jane Smith at jane.smith@company.org for details.
"""

# Regex extraction: deterministic, fast, perfect for regular patterns
patterns = {
    "emails": r'[\w.+-]+@[\w-]+\.[\w.-]+',
    "phones": r'\+?1?\s*\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}',
    "amounts": r'\$[\d,]+\.\d{2}',
    "dates": r'\d{4}-\d{2}-\d{2}',
}

start = time.perf_counter()
for _ in range(10000):
    results = {k: re.findall(v, text) for k, v in patterns.items()}
elapsed = time.perf_counter() - start

for entity_type, matches in results.items():
    print(f"  {entity_type}: {matches}")
print(f"\nAvg latency: {elapsed / 10000 * 1000:.4f} ms")
print("False positives: 0 (deterministic)")
print("Cost: $0.00 (no API call)")
```
5. Cost Modeling at Scale
The per-query cost difference between approaches becomes dramatic at scale. A cost model must account for not just API or inference costs, but also the engineering time to build and maintain each approach, infrastructure costs for self-hosted models, and the opportunity cost of latency.
| Approach | Per-Query Cost | 1K/day | 100K/day | 10M/day | Latency |
|---|---|---|---|---|---|
| Regex / Rules | ~$0.000001 | $0.03/mo | $3/mo | $300/mo | <0.01 ms |
| TF-IDF + LR | ~$0.00001 | $0.30/mo | $30/mo | $3,000/mo | 0.1 ms |
| Fine-tuned BERT | ~$0.0001 | $3/mo | $300/mo | $30,000/mo | 5 ms |
| GPT-4o-mini | ~$0.0003 | $9/mo | $900/mo | $90,000/mo | 300 ms |
| GPT-4o | ~$0.003 | $90/mo | $9,000/mo | $900,000/mo | 800 ms |
These cost estimates assume typical input/output sizes for a classification task (approximately 100 input tokens, 10 output tokens). Costs for generation-heavy tasks (summarization, writing) will be significantly higher due to larger output token counts. Always calculate costs based on your actual token distributions, not averages from benchmarks.
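The table's arithmetic is easy to reproduce with a small cost model. The sketch below derives monthly spend from a per-query cost, and also shows a token-based per-query estimate; the per-1M-token prices in the example are illustrative placeholders, not current provider quotes:

```python
# Sketch of a cost model for the table above. The per-1M-token
# prices in the example are illustrative placeholders; check your
# provider's current pricing before relying on them.
def monthly_cost(per_query_usd: float, queries_per_day: int,
                 days: int = 30) -> float:
    """Monthly spend for a given per-query cost and daily volume."""
    return per_query_usd * queries_per_day * days

def per_query_cost(input_tokens: int, output_tokens: int,
                   usd_per_1m_input: float,
                   usd_per_1m_output: float) -> float:
    """Per-query cost from token counts and per-1M-token prices."""
    return (input_tokens * usd_per_1m_input
            + output_tokens * usd_per_1m_output) / 1_000_000

# Reproduce a table cell: $0.0003/query at 10M queries/day
print(f"${monthly_cost(0.0003, 10_000_000):,.0f}/month")

# Token-based estimate with hypothetical prices ($0.50 / $1.50 per 1M)
q = per_query_cost(input_tokens=100, output_tokens=10,
                   usd_per_1m_input=0.50, usd_per_1m_output=1.50)
print(f"${q:.6f} per query")
```

Running this model against your real token distributions (not the 100/10 profile assumed here) is the fastest way to see whether an LLM approach survives contact with your traffic volume.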
6. The Decision Matrix
Pulling together all the factors above, here is a practical decision matrix. For any new ML task, walk through these questions in order:
- Is the pattern regular and well-defined? Use regex or rules. Do not overthink it.
- Is the data tabular (structured rows and columns)? Use XGBoost or LightGBM. LLMs are not competitive on tabular data.
- Do you have thousands of labeled examples? Train a classical model (TF-IDF + LR for text, BERT for complex text). It will be cheaper and faster.
- Is the task zero-shot or few-shot? Use an LLM. It is the only option that works without labeled data.
- Does the task require generation or complex reasoning? Use an LLM. Classical models cannot generate coherent text or reason over long documents.
- Is volume extremely high (millions per day) and cost matters? Consider fine-tuning a smaller model (BERT or a small LLM) to replace the large LLM, or use a hybrid approach where a classifier handles the easy cases.
- None of the above clearly applies? Start with an LLM for rapid prototyping, then evaluate whether a cheaper model can match its performance once you have collected enough labeled data from the LLM's outputs.
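The walk-through above can be encoded as a first-pass triage function. This is only a sketch: the boolean flags are human judgments about the task that you supply, and the returned labels are a starting recommendation, not a substitute for benchmarking:

```python
# Sketch: the decision matrix as an ordered series of checks.
# Each flag is a judgment call about the task; the function just
# encodes the ordering of the questions in the text above.
def recommend_approach(regular_pattern: bool = False,
                       tabular: bool = False,
                       has_labels: bool = False,
                       needs_generation: bool = False,
                       high_volume: bool = False) -> str:
    if regular_pattern:
        return "regex/rules"
    if tabular:
        return "gradient-boosted trees (XGBoost/LightGBM)"
    if has_labels:
        return "classical text model (TF-IDF + LR, or BERT)"
    if needs_generation:
        return "LLM"
    if high_volume:
        return "fine-tuned small model or hybrid"
    return "LLM prototype, then distill once labels accumulate"

print(recommend_approach(tabular=True))
print(recommend_approach())
```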
A common and powerful pattern is to start with an LLM for a new task, use it to generate labeled data, then train a smaller classical or fine-tuned model to replace the LLM for production. This gives you the speed-to-market of LLMs with the cost efficiency of classical ML. The LLM continues to serve as a fallback for edge cases the smaller model is not confident about. We will explore this pattern in depth in Section 11.3.
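The fallback half of this pattern can be sketched as a confidence-based router: the small model answers when it is confident, and the LLM handles the rest. In this sketch the 0.8 threshold is an illustrative value you would tune on a validation set, and both `small_model_classify` and `llm_classify` are stubs standing in for the trained classifier from section 2.1 and a real LLM API call:

```python
# Sketch of the hybrid routing pattern. The threshold and both
# classifiers are placeholders: small_model_classify stands in for
# a trained cheap model, llm_classify for a real LLM API call.
CONFIDENCE_THRESHOLD = 0.8  # illustrative; tune on a validation set

def small_model_classify(text: str) -> tuple[str, float]:
    """Stub cheap model: returns (label, confidence)."""
    keywords = {"charged": "billing", "refund": "billing",
                "crash": "technical", "error": "technical"}
    hits = [label for word, label in keywords.items()
            if word in text.lower()]
    if hits:
        return hits[0], 0.95  # fake high confidence on a keyword hit
    return "general", 0.40    # fake low confidence otherwise

def llm_classify(text: str) -> str:
    """Stub standing in for an actual LLM call (the fallback path)."""
    return "general"

def hybrid_classify(text: str) -> tuple[str, str]:
    """Return (label, route): small model when confident, else LLM."""
    label, confidence = small_model_classify(text)
    if confidence >= CONFIDENCE_THRESHOLD:
        return label, "small_model"
    return llm_classify(text), "llm"  # low confidence: escalate

print(hybrid_classify("I was charged twice"))
print(hybrid_classify("Tell me about your company"))
```

Because most traffic is easy, the expensive LLM path handles only the long tail, which is what makes the blended per-query cost attractive at high volume.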
Experiment with the classification benchmark from this section:
- In the TF-IDF + logistic regression code, change `max_features` from 5000 to 500 and then to 50000. Observe how classification accuracy and training time change. This illustrates the vocabulary size tradeoff in traditional NLP.
- Replace the logistic regression with an XGBoost classifier (`xgb.XGBClassifier`). Compare accuracy and inference latency. For many text classification tasks, logistic regression is surprisingly competitive.
- In the cost comparison table, recalculate the daily cost assuming your volume is 100x higher (1 million queries per day). At what volume does the cost difference become prohibitive for an LLM approach?
Key Takeaways
- The choice between LLM and classical ML is a multi-dimensional optimization across accuracy, latency, cost, and interpretability; there is no universal best option.
- Classical ML wins decisively for tabular data, high-volume classification with labeled data, latency-critical applications, and regular pattern extraction.
- LLMs win for zero-shot tasks, complex reasoning, open-ended generation, and ambiguous or subjective classification.
- Cost scales linearly with volume: a 300x per-query cost difference becomes $900,000/month vs. $3,000/month at 10 million daily queries.
- The LLM bootstrap pattern (start with LLM, collect labels, train cheaper model) combines fast prototyping with long-term cost efficiency.
- Always benchmark your specific task. General guidelines help frame the decision, but empirical results on your data are what matter.