Information extraction (IE) turns unstructured text into structured data. For decades, IE relied on rule-based patterns, statistical models (CRFs, BiLSTMs), and curated ontologies. LLMs have transformed this landscape by enabling zero-shot extraction with natural language instructions. However, LLMs introduce new challenges: inconsistent output formats, hallucinated entities, and high per-token costs. A hybrid approach combines the speed and precision of classical NLP for well-defined entity types with the flexibility of LLMs for complex, open-ended extraction tasks. Libraries like Instructor, BAML, and Pydantic provide the structured output guarantees that production systems require.
1. The Information Extraction Landscape
Information extraction encompasses several related tasks that transform free text into structured records. Named Entity Recognition (NER) identifies and classifies spans of text into categories such as persons, organizations, locations, and dates. Relation extraction identifies semantic connections between entities (e.g., "Alice works at Acme Corp"). Event extraction captures structured representations of what happened, when, where, and to whom. Each task can be approached with classical NLP tools, LLM prompting, or a combination of both.
1.1 Classical IE vs. LLM-Based IE
| Dimension | Classical IE (spaCy, CRF) | LLM-Based IE |
|---|---|---|
| Setup cost | High: labeled data, training pipelines | Low: prompt engineering, few examples |
| Entity types | Fixed at training time | Flexible, defined in the prompt |
| Latency | Sub-millisecond per document | 100ms to 2s per document |
| Cost per doc | Negligible (CPU inference) | $0.001 to $0.05 per document |
| Accuracy (common entities) | 95%+ F1 on trained types | 85-92% F1 zero-shot |
| Accuracy (novel types) | 0% (needs retraining) | 75-90% F1 zero-shot |
| Output format | Deterministic, typed spans | Requires structured output enforcement |
| Hallucination risk | None (span-based) | Moderate (can invent entities) |
| Context window | Unlimited (streaming) | Limited by model context length |
2. Classical IE with spaCy
spaCy remains the gold standard for production NER when you need speed and reliability on well-defined entity types. Its transformer-based models achieve state-of-the-art accuracy on standard benchmarks, and its pipeline architecture makes it easy to add custom entity types through training or rule-based matching.
```python
import spacy
from collections import defaultdict

# Load a pre-trained transformer model
nlp = spacy.load("en_core_web_trf")

text = """
Apple Inc. announced today that CEO Tim Cook will present the company's
quarterly earnings at their headquarters in Cupertino, California on
January 30, 2025. Revenue is expected to exceed $120 billion, driven
by strong iPhone 16 sales across Europe and Asia.
"""

doc = nlp(text)

# Extract entities with their labels and positions
entities = []
for ent in doc.ents:
    entities.append({
        "text": ent.text,
        "label": ent.label_,
        "start": ent.start_char,
        "end": ent.end_char,
    })

# Group by entity type
by_type = defaultdict(list)
for e in entities:
    by_type[e["label"]].append(e["text"])

print("Extracted Entities:")
print("=" * 50)
for label, values in sorted(by_type.items()):
    print(f"  {label:12s}: {', '.join(values)}")
print(f"\nTotal: {len(entities)} entities across {len(by_type)} types")
```
spaCy's transformer models (like en_core_web_trf) use RoBERTa under the hood and achieve 90%+ F1 on OntoNotes 5.0. For production systems processing millions of documents, the smaller en_core_web_sm model trades a few accuracy points for 10x faster inference and minimal memory footprint. Choose based on your latency and accuracy requirements.
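For custom entity types that follow predictable surface patterns, spaCy's rule-based `EntityRuler` can be attached to a pipeline without any training. A minimal sketch (the `TICKER` label and its patterns are invented for illustration; a blank pipeline is used here so the snippet runs without a model download — in production you would attach the ruler to a loaded model, e.g. `nlp.add_pipe("entity_ruler", before="ner")`):

```python
import spacy

# Blank English pipeline: tokenizer only, no statistical components
nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")

# Illustrative patterns for a hypothetical TICKER entity type:
# an exact string match, and a token pattern for "$" + 4 uppercase letters
ruler.add_patterns([
    {"label": "TICKER", "pattern": "AAPL"},
    {"label": "TICKER", "pattern": [{"TEXT": "$"}, {"IS_UPPER": True, "LENGTH": 4}]},
])

doc = nlp("AAPL closed up 2% while $MSFT was flat.")
print([(ent.text, ent.label_) for ent in doc.ents])
```

Rule-based entities from the ruler and statistical entities from the NER component coexist in the same `doc.ents`, which makes this a cheap way to extend coverage for domain vocabulary.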
3. LLM-Based Extraction with Structured Output
LLMs can extract entities and relations that classical models were never trained on. The key challenge is ensuring that the output conforms to a predictable schema. Three libraries have emerged as production standards for this problem: Pydantic for schema definition, Instructor for OpenAI/Anthropic structured output, and BAML for type-safe LLM function definitions.
3.1 Pydantic Schemas for Extraction
Pydantic models define the exact shape of the data you want to extract. By declaring your output schema as a Python class, you get automatic validation, type coercion, and clear error messages when the LLM produces malformed output.
```python
from enum import Enum
from typing import Optional

from pydantic import BaseModel, Field

class EntityType(str, Enum):
    PERSON = "person"
    ORGANIZATION = "organization"
    LOCATION = "location"
    DATE = "date"
    MONEY = "money"
    PRODUCT = "product"
    EVENT = "event"

class Entity(BaseModel):
    text: str = Field(description="The entity text as it appears in the source")
    entity_type: EntityType = Field(description="The semantic type of the entity")
    confidence: float = Field(ge=0.0, le=1.0, description="Extraction confidence")

class Relation(BaseModel):
    subject: str = Field(description="The subject entity text")
    predicate: str = Field(description="The relationship type (e.g., works_at, located_in)")
    object: str = Field(description="The object entity text")
    confidence: float = Field(ge=0.0, le=1.0)

class ExtractionResult(BaseModel):
    """Complete structured extraction from a document."""
    entities: list[Entity] = Field(default_factory=list)
    relations: list[Relation] = Field(default_factory=list)
    summary: Optional[str] = Field(
        None, description="One-sentence summary of the document"
    )

# Validate a sample extraction
result = ExtractionResult(
    entities=[
        Entity(text="Tim Cook", entity_type="person", confidence=0.98),
        Entity(text="Apple Inc.", entity_type="organization", confidence=0.99),
        Entity(text="Cupertino", entity_type="location", confidence=0.95),
    ],
    relations=[
        Relation(
            subject="Tim Cook", predicate="ceo_of",
            object="Apple Inc.", confidence=0.97,
        ),
        Relation(
            subject="Apple Inc.", predicate="headquartered_in",
            object="Cupertino", confidence=0.94,
        ),
    ],
    summary="Apple CEO Tim Cook to present quarterly earnings in Cupertino.",
)
print(result.model_dump_json(indent=2))
```
3.2 Instructor: Structured Output from LLMs
Instructor patches OpenAI and Anthropic clients to return Pydantic objects directly, handling JSON schema generation, response parsing, and automatic retries on validation failure. This eliminates the manual prompt engineering needed to coerce LLMs into producing valid JSON.
```python
import instructor
from openai import OpenAI
from pydantic import BaseModel, Field

# Patch the OpenAI client for structured output
client = instructor.from_openai(OpenAI())

class MedicalEntity(BaseModel):
    name: str = Field(description="Entity name as it appears in text")
    category: str = Field(description="One of: condition, medication, procedure, anatomy")
    negated: bool = Field(description="True if the entity is negated (e.g., 'no fever')")

class ClinicalExtraction(BaseModel):
    entities: list[MedicalEntity]
    icd_codes: list[str] = Field(
        description="Likely ICD-10 codes based on the extracted conditions"
    )

note = """
Patient presents with acute chest pain radiating to the left arm.
No fever or shortness of breath. History of hypertension managed
with lisinopril 10mg daily. ECG shows ST-segment elevation.
Recommend immediate cardiac catheterization.
"""

# Instructor handles schema injection, parsing, and validation
extraction = client.chat.completions.create(
    model="gpt-4o",
    response_model=ClinicalExtraction,
    messages=[
        {"role": "system", "content": "Extract medical entities from clinical notes."},
        {"role": "user", "content": note},
    ],
    max_retries=2,  # Auto-retry on validation failure
)

print(f"Entities found: {len(extraction.entities)}")
for ent in extraction.entities:
    neg = " [NEGATED]" if ent.negated else ""
    print(f"  {ent.category:12s}: {ent.name}{neg}")
print(f"\nICD-10 codes: {', '.join(extraction.icd_codes)}")
```
Instructor's max_retries parameter is crucial for production reliability. When the LLM returns JSON that fails Pydantic validation (missing fields, wrong types, out-of-range values), Instructor automatically sends the validation error back to the LLM and asks it to fix the response. This retry loop typically resolves 95%+ of parsing failures without human intervention.
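The retry loop becomes more powerful when the schema encodes semantic constraints, because the validator's error message is exactly what the LLM sees on retry. A sketch (the model and category set are illustrative; the validation itself runs locally, no LLM call needed):

```python
from pydantic import BaseModel, ValidationError, field_validator

# Illustrative closed set of categories; an unknown value should force a retry
ALLOWED_CATEGORIES = {"condition", "medication", "procedure", "anatomy"}

class StrictMedicalEntity(BaseModel):
    name: str
    category: str

    @field_validator("category")
    @classmethod
    def category_must_be_known(cls, v: str) -> str:
        if v not in ALLOWED_CATEGORIES:
            # With Instructor, this message is fed back to the LLM on retry
            raise ValueError(
                f"category must be one of {sorted(ALLOWED_CATEGORIES)}, got {v!r}"
            )
        return v

# A well-formed record validates normally
ok = StrictMedicalEntity(name="lisinopril", category="medication")
print(ok.category)

# A bad category raises ValidationError; under Instructor this triggers a retry
try:
    StrictMedicalEntity(name="scalpel", category="tool")
except ValidationError as err:
    print("rejected:", err.errors()[0]["msg"])
```

Writing the error message for the LLM (naming the allowed values, quoting the offending input) noticeably improves the chance that the retry succeeds.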
3.3 BAML: Type-Safe LLM Functions
BAML (Basically, A Made-up Language) takes a different approach by defining LLM functions in a dedicated schema language that compiles to type-safe client code. This separates prompt logic from application code and provides compile-time guarantees about the expected input/output types.
To get started with BAML, install it and initialize a project:
```shell
# Install BAML and its compiler
pip install baml-py

# Initialize a project (creates the baml_src/ directory)
npx @boundaryml/baml init

# After defining your .baml files in baml_src/, compile them
# (generates the baml_client/ Python package)
npx @boundaryml/baml generate
```

The generated `baml_client/` package contains type-safe Python classes for all your BAML types, a client object `b` with a method for each BAML function, and async variants of every function.

```python
# BAML definition file: extract_events.baml
# This compiles to a type-safe Python client exposing types equivalent to:
#
# class EventType(str, Enum):
#     ACQUISITION = "acquisition"
#     PARTNERSHIP = "partnership"
#     PRODUCT_LAUNCH = "product_launch"
#     EARNINGS = "earnings"
#     LEGAL = "legal"
#
# class ExtractedEvent(BaseModel):
#     event_type: EventType
#     description: str
#     participants: list[str]
#     date: Optional[str]
#     monetary_value: Optional[str]

# Usage with the compiled BAML client:
from baml_client import b
from baml_client.types import ExtractedEvent

article = """
Microsoft announced on March 15, 2025, that it has completed its
$2.1 billion acquisition of cybersecurity startup CyberShield AI.
The deal, first reported in January, brings 450 employees and
several enterprise security products into Microsoft's Azure division.
CEO Satya Nadella called the acquisition transformative for the
company's cloud security strategy.
"""

# BAML handles prompt construction, the LLM call, and type validation
events: list[ExtractedEvent] = b.ExtractEvents(article)

for event in events:
    print(f"Type: {event.event_type}")
    print(f"Description: {event.description}")
    print(f"Participants: {', '.join(event.participants)}")
    print(f"Date: {event.date}")
    print(f"Value: {event.monetary_value}")
```
LLMs can hallucinate entities that do not appear in the source text. Always implement a grounding check that verifies extracted entities against the original document. A simple substring match catches most hallucinations. For more robust grounding, use fuzzy matching or semantic similarity to handle paraphrases and abbreviations.
4. Hybrid IE Architectures
The most effective production IE systems combine classical and LLM-based extraction in a layered architecture. Classical models handle the high-volume, well-defined entity types (persons, organizations, dates, locations) at near-zero cost, while LLMs are called selectively for complex, domain-specific extraction tasks that require reasoning or world knowledge.
4.1 Building the Hybrid Pipeline
```python
import spacy
from pydantic import BaseModel, Field

# Assume 'client' is an Instructor-patched OpenAI client:
# client = instructor.from_openai(OpenAI())

nlp = spacy.load("en_core_web_trf")

class DomainEntity(BaseModel):
    text: str
    entity_type: str
    source: str = Field(description="'classical' or 'llm'")
    confidence: float

class RelationTriple(BaseModel):
    subject: str
    predicate: str
    object: str

class HybridExtractionResult(BaseModel):
    entities: list[DomainEntity]
    relations: list[RelationTriple]

# Mapping from spaCy labels to our unified schema
SPACY_LABEL_MAP = {
    "PERSON": "person", "ORG": "organization",
    "GPE": "location", "LOC": "location",
    "DATE": "date", "MONEY": "money",
    "PRODUCT": "product",
}

# Domain-specific types that require LLM extraction
DOMAIN_TYPES = {"medical_condition", "legal_clause", "financial_instrument"}

def needs_llm_extraction(text: str, classical_entities: list[DomainEntity]) -> bool:
    """Decide whether to invoke the LLM for deeper extraction."""
    # Heuristic: call the LLM if the document contains domain keywords
    # that classical NER cannot handle
    domain_keywords = [
        "diagnosis", "plaintiff", "defendant", "derivative",
        "ct scan", "mri", "statute", "breach of contract",
    ]
    text_lower = text.lower()
    return any(kw in text_lower for kw in domain_keywords)

def extract_classical(text: str) -> list[DomainEntity]:
    """Fast, cheap extraction using spaCy."""
    doc = nlp(text)
    entities = []
    for ent in doc.ents:
        if ent.label_ in SPACY_LABEL_MAP:
            entities.append(DomainEntity(
                text=ent.text,
                entity_type=SPACY_LABEL_MAP[ent.label_],
                source="classical",
                confidence=0.95,
            ))
    return entities

def extract_with_llm(text: str, existing: list[DomainEntity]) -> HybridExtractionResult:
    """LLM extraction for domain-specific types and relations."""
    existing_summary = ", ".join(f"{e.text} ({e.entity_type})" for e in existing)
    return client.chat.completions.create(
        model="gpt-4o",
        response_model=HybridExtractionResult,
        messages=[
            {"role": "system", "content": (
                "Extract domain-specific entities and relations from the text. "
                f"These entities were already found by NER: {existing_summary}. "
                "Focus on entities and relations NOT already captured. "
                "Mark all entities with source='llm'."
            )},
            {"role": "user", "content": text},
        ],
        max_retries=2,
    )

def hybrid_extract(text: str) -> HybridExtractionResult:
    """Two-layer hybrid extraction pipeline."""
    # Layer 1: classical NER (always runs, near-zero cost)
    classical = extract_classical(text)

    # Layer 2: LLM extraction (conditional, only when needed)
    if needs_llm_extraction(text, classical):
        llm_result = extract_with_llm(text, classical)
        # Merge: classical entities + LLM entities + LLM relations
        return HybridExtractionResult(
            entities=classical + llm_result.entities,
            relations=llm_result.relations,
        )

    # Simple case: return classical entities only
    return HybridExtractionResult(entities=classical, relations=[])

# Example usage
text = """
Dr. Sarah Chen at Massachusetts General Hospital diagnosed the patient
with Stage II non-small cell lung cancer based on the CT scan results
from January 15, 2025. Treatment with pembrolizumab was initiated.
"""

result = hybrid_extract(text)

print(f"Entities ({len(result.entities)}):")
for e in result.entities:
    print(f"  [{e.source:9s}] {e.entity_type:20s}: {e.text}")
print(f"\nRelations ({len(result.relations)}):")
for r in result.relations:
    print(f"  {r.subject} -> {r.predicate} -> {r.object}")
```
The hybrid architecture delivers large cost savings because the routing heuristic (`needs_llm_extraction` in the example above) filters out 60-80% of documents at the classical layer. Only documents that contain domain-specific signals (medical terms, legal language, financial instruments) trigger the more expensive LLM call. For a pipeline processing 100K documents/day, the LLM handles only 20-40K documents, reducing API costs by 60-80% compared to an LLM-only approach.
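The savings follow directly from the routing rate. A back-of-the-envelope check, assuming $0.01 per LLM-processed document (within the range in the comparison table) and treating classical-layer cost as negligible:

```python
# Assumed figures: 100K docs/day, $0.01 per LLM-processed document,
# classical-layer cost approximated as $0.
docs_per_day = 100_000
llm_cost_per_doc = 0.01

llm_only_cost = docs_per_day * llm_cost_per_doc
for routed_fraction in (0.2, 0.4):
    hybrid_cost = docs_per_day * routed_fraction * llm_cost_per_doc
    savings = 1 - hybrid_cost / llm_only_cost
    print(f"route {routed_fraction:.0%} to LLM: ${hybrid_cost:,.0f}/day "
          f"vs ${llm_only_cost:,.0f}/day LLM-only ({savings:.0%} saved)")
```

The savings equal the fraction of documents the router keeps away from the LLM, which is why tuning the routing heuristic is the highest-leverage cost optimization in the pipeline.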
5. Production Deployment Patterns
Deploying IE systems to production requires attention to grounding, deduplication, and graceful degradation. These patterns ensure that extraction results are reliable even when individual components fail.
5.1 Grounding Verification
Every entity extracted by an LLM should be verified against the source text. This prevents hallucinated entities from entering your structured data store.
- Exact substring check: verify that the entity text appears verbatim in the source document. Fast and simple, but misses abbreviations and paraphrases.
- Fuzzy matching: use edit distance or token overlap to handle minor variations (e.g., "Dr. Chen" vs. "Sarah Chen"). Set a threshold of 0.8 similarity.
- Semantic grounding: compute embedding similarity between the extracted entity and all noun phrases in the source. Most robust, but adds latency.
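The first two checks can be sketched with the standard library alone; `difflib.SequenceMatcher` stands in here for a dedicated fuzzy-matching package, and the 0.8 threshold follows the guideline above (a sketch, not a tuned implementation):

```python
from difflib import SequenceMatcher

def is_grounded(entity_text: str, source: str, fuzzy_threshold: float = 0.8) -> bool:
    """Return True if the extracted entity can be traced back to the source text."""
    # Tier 1: exact substring match (case-insensitive)
    if entity_text.lower() in source.lower():
        return True
    # Tier 2: fuzzy match against a sliding window of source tokens
    tokens = source.lower().split()
    n = max(1, len(entity_text.split()))
    for i in range(len(tokens) - n + 1):
        window = " ".join(tokens[i:i + n])
        if SequenceMatcher(None, entity_text.lower(), window).ratio() >= fuzzy_threshold:
            return True
    return False

source = "Dr. Sarah Chen diagnosed the patient at Massachusetts General Hospital."
print(is_grounded("Sarah Chen", source))                   # exact substring
print(is_grounded("Massachusetts Gen. Hospital", source))  # fuzzy: abbreviated mention
print(is_grounded("Mayo Clinic", source))                  # hallucinated, fails both tiers
```

Entities that fail both tiers would then go to the semantic-grounding step, or be dropped or flagged for review.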
5.2 Graceful Degradation
When the LLM is unavailable or returns invalid output after retries, the system should fall back to classical extraction rather than failing entirely. This means your pipeline always returns at least the entities that spaCy can identify, even during LLM outages. Log all fallback events so you can measure how often they occur and what extraction quality looks like without the LLM layer.
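The pattern reduces to a thin wrapper around the two layers; in this sketch, `classical_fn` and `llm_fn` are hypothetical stand-ins for the layer functions of a hybrid pipeline, and the outage is simulated so the control flow is visible:

```python
import logging

logger = logging.getLogger("ie_pipeline")

def extract_with_fallback(text, classical_fn, llm_fn):
    """Always run the cheap classical layer; degrade gracefully if the LLM fails."""
    classical = classical_fn(text)
    try:
        return llm_fn(text, classical)
    except Exception as exc:  # outage, timeout, or validation failure after retries
        # Log every fallback so degraded-mode frequency can be measured
        logger.warning("LLM extraction failed, serving classical-only: %s", exc)
        return {"entities": classical, "relations": [], "degraded": True}

def fake_classical(text):
    return [{"text": "Acme Corp", "entity_type": "organization"}]

def failing_llm(text, existing):
    raise TimeoutError("LLM unavailable")

# Simulated outage: the pipeline still returns the classical entities
result = extract_with_fallback("some document", fake_classical, failing_llm)
print(result["degraded"], len(result["entities"]))  # True 1
```

Marking the result as degraded (rather than silently returning fewer entities) lets downstream consumers and monitoring distinguish a quiet document from a failed LLM layer.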
Never store LLM-extracted entities at the same confidence level as classical entities unless they pass grounding verification. Downstream consumers of your structured data need to distinguish between high-confidence, span-grounded entities and lower-confidence, LLM-inferred entities. Include the source and confidence fields in every entity record.
6. End-to-End Example: Financial Event Extraction
To illustrate a complete production pipeline, consider extracting structured financial events from news articles. This requires recognizing standard entities (companies, dates, monetary values) and domain-specific events (acquisitions, IPOs, earnings reports) with their associated attributes.
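A sketch of the target schema such a pipeline might use, following the Pydantic pattern from Section 3.1 (all type and field names here are illustrative, not from a real system):

```python
from enum import Enum
from typing import Optional

from pydantic import BaseModel, Field

class FinancialEventType(str, Enum):
    ACQUISITION = "acquisition"
    IPO = "ipo"
    EARNINGS = "earnings"

class FinancialEvent(BaseModel):
    event_type: FinancialEventType
    companies: list[str] = Field(description="Companies involved, as named in the text")
    amount: Optional[str] = Field(None, description="Monetary value, verbatim from the source")
    date: Optional[str] = Field(None, description="Event date, verbatim from the source")

# A validated record of the kind the LLM layer would return
event = FinancialEvent(
    event_type="acquisition",
    companies=["Microsoft", "CyberShield AI"],
    amount="$2.1 billion",
    date="March 15, 2025",
)
print(event.model_dump_json())
```

Keeping `amount` and `date` as verbatim strings (rather than parsed numbers and dates) makes grounding verification straightforward; normalization can happen in a later, deterministic step.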
Cross-document entity resolution (deduplication) is critical for IE systems that process streams of news articles. The same company may appear as "Microsoft," "Microsoft Corp.," "MSFT," or "the Redmond-based tech giant." Use a combination of string normalization, alias dictionaries, and embedding similarity to link these mentions to a canonical entity ID.
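The first two techniques (string normalization and alias dictionaries) can be sketched in a few lines; the alias table and suffix list below are illustrative, and embedding similarity would be layered on as a fallback for unseen variants:

```python
import re

# Illustrative alias dictionary mapping known variants to a canonical ID
ALIASES = {
    "msft": "microsoft",
    "the redmond-based tech giant": "microsoft",
}

# Strip common legal suffixes after lowercasing
LEGAL_SUFFIXES = re.compile(r"\b(inc|corp|corporation|ltd|llc|co)\.?$")

def canonicalize(mention: str) -> str:
    """Map an entity mention to a canonical ID via normalization + aliases."""
    norm = mention.lower().strip().rstrip(".")
    norm = LEGAL_SUFFIXES.sub("", norm).strip()
    return ALIASES.get(norm, norm)

for mention in ["Microsoft", "Microsoft Corp.", "MSFT", "the Redmond-based tech giant"]:
    print(f"{mention!r} -> {canonicalize(mention)!r}")  # all resolve to 'microsoft'
```

In production, the canonical ID would key into an entity store so that attributes extracted from different articles accumulate on the same record.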
Key Takeaways
- Information extraction turns unstructured text into structured records through NER, relation extraction, and event extraction. Classical and LLM approaches each have distinct strengths.
- Classical NER (spaCy, CRF) delivers sub-millisecond latency and 95%+ F1 on trained entity types with zero hallucination risk, making it ideal for high-volume production extraction.
- LLM-based extraction enables zero-shot extraction of novel entity types, relations, and events, but requires structured output enforcement (Pydantic, Instructor, BAML) to produce reliable schemas.
- Hybrid pipelines run classical NER on every document, then selectively invoke LLMs only for documents requiring domain-specific extraction. This reduces API costs by 60-80% while maintaining broad coverage.
- Grounding verification is essential for LLM-extracted entities. Always check that extracted text can be traced back to the source document before storing it in production databases.
- Production IE systems must implement graceful degradation (falling back to classical extraction during LLM outages), entity resolution (deduplication across documents), and confidence-aware storage (distinguishing high-confidence classical entities from lower-confidence LLM extractions).
Module 11 has covered the core patterns for combining classical ML with LLMs. The emerging frontier is compound AI systems: multi-component architectures where retrieval, classification, generation, and verification modules work together as a coordinated pipeline. Frameworks like DSPy (covered in Section 10.3) are evolving to support production deployment of these compound systems. The broader trend is "AI engineering" as a distinct discipline, combining ML engineering, prompt engineering, and systems design. Part IV covers training and fine-tuning, which is the next lever you can pull when prompt engineering and hybrid architectures reach their limits.
Most LLM courses teach you how to use LLMs. This module taught you when NOT to use them, and how to combine them with traditional ML for production efficiency. The triage routing, cascade, and Pareto frontier analysis patterns covered here are rarely found in textbooks but are standard practice in cost-conscious production systems. The consistent pattern across all five sections: start cheap and simple, escalate to expensive and powerful only when needed.