Module 11 · Section 11.5

Structured Information Extraction

Combining classical NLP pipelines with LLM reasoning to extract named entities, relations, and events into structured, validated schemas
★ Big Picture

Information extraction (IE) turns unstructured text into structured data. For decades, IE relied on rule-based patterns, statistical models (CRFs, BiLSTMs), and curated ontologies. LLMs have transformed this landscape by enabling zero-shot extraction with natural language instructions. However, LLMs introduce new challenges: inconsistent output formats, hallucinated entities, and high per-token costs. The hybrid approach combines the speed and precision of classical NLP for well-defined entity types with the flexibility of LLMs for complex, open-ended extraction tasks. Libraries like Instructor, BAML, and Pydantic provide the structured output guarantees that production systems require.

1. The Information Extraction Landscape

Information extraction encompasses several related tasks that transform free text into structured records. Named Entity Recognition (NER) identifies and classifies spans of text into categories such as persons, organizations, locations, and dates. Relation extraction identifies semantic connections between entities (e.g., "Alice works at Acme Corp"). Event extraction captures structured representations of what happened, when, where, and to whom. Each task can be approached with classical NLP tools, LLM prompting, or a combination of both.
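To make the three task outputs concrete, here is a minimal sketch of the record shapes each one produces (the class and field names are illustrative, not a standard API):

```python
from dataclasses import dataclass, field

@dataclass
class NerSpan:
    """NER output: a typed, character-offset span in the source text."""
    text: str
    label: str   # e.g., "PERSON", "ORG"
    start: int
    end: int

@dataclass
class Triple:
    """Relation extraction output: a subject-predicate-object triple."""
    subject: str
    predicate: str
    object: str

@dataclass
class EventRecord:
    """Event extraction output: an event type plus role-labeled arguments."""
    event_type: str
    arguments: dict = field(default_factory=dict)  # role -> entity text

# "Alice works at Acme Corp" yields one span per entity plus a triple:
rel = Triple("Alice", "works_at", "Acme Corp")
```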

1.1 Classical IE vs. LLM-Based IE

| Dimension | Classical IE (spaCy, CRF) | LLM-Based IE |
|---|---|---|
| Setup cost | High: labeled data, training pipelines | Low: prompt engineering, few examples |
| Entity types | Fixed at training time | Flexible, defined in the prompt |
| Latency | Sub-millisecond per document | 100 ms to 2 s per document |
| Cost per doc | Negligible (CPU inference) | $0.001 to $0.05 per document |
| Accuracy (common entities) | 95%+ F1 on trained types | 85-92% F1 zero-shot |
| Accuracy (novel types) | 0% (needs retraining) | 75-90% F1 zero-shot |
| Output format | Deterministic, typed spans | Requires structured output enforcement |
| Hallucination risk | None (span-based) | Moderate (can invent entities) |
| Context window | Unlimited (streaming) | Limited by model context length |
Figure 11.10: Classical NER pipelines offer deterministic, sub-millisecond inference on trained entity types, while LLM pipelines provide flexible schema extraction at higher latency and cost.

2. Classical IE with spaCy

spaCy remains the gold standard for production NER when you need speed and reliability on well-defined entity types. Its transformer-based models achieve state-of-the-art accuracy on standard benchmarks, and its pipeline architecture makes it easy to add custom entity types through training or rule-based matching.

import spacy
from spacy import displacy
from collections import defaultdict

# Load a pre-trained transformer model
nlp = spacy.load("en_core_web_trf")

text = """
Apple Inc. announced today that CEO Tim Cook will present the company's
quarterly earnings at their headquarters in Cupertino, California on
January 30, 2025. Revenue is expected to exceed $120 billion, driven
by strong iPhone 16 sales across Europe and Asia.
"""

doc = nlp(text)

# Extract entities with their labels and positions
entities = []
for ent in doc.ents:
    entities.append({
        "text": ent.text,
        "label": ent.label_,
        "start": ent.start_char,
        "end": ent.end_char,
    })

# Group by entity type
by_type = defaultdict(list)
for e in entities:
    by_type[e["label"]].append(e["text"])

print("Extracted Entities:")
print("=" * 50)
for label, values in sorted(by_type.items()):
    print(f"  {label:12s}: {', '.join(values)}")

print(f"\nTotal: {len(entities)} entities across {len(by_type)} types")
Extracted Entities:
==================================================
  CARDINAL    : 120 billion, 16
  DATE        : today, January 30, 2025
  GPE         : Cupertino, California, Europe, Asia
  MONEY       : $120 billion
  ORG         : Apple Inc.
  PERSON      : Tim Cook
  PRODUCT     : iPhone 16

Total: 12 entities across 7 types
📚 Note

spaCy's transformer models (like en_core_web_trf) use RoBERTa under the hood and achieve 90%+ F1 on OntoNotes 5.0. For production systems processing millions of documents, the smaller en_core_web_sm model trades a few accuracy points for 10x faster inference and minimal memory footprint. Choose based on your latency and accuracy requirements.

3. LLM-Based Extraction with Structured Output

LLMs can extract entities and relations that classical models were never trained on. The key challenge is ensuring that the output conforms to a predictable schema. Three libraries have emerged as production standards for this problem: Pydantic for schema definition, Instructor for OpenAI/Anthropic structured output, and BAML for type-safe LLM function definitions.

3.1 Pydantic Schemas for Extraction

Pydantic models define the exact shape of the data you want to extract. By declaring your output schema as a Python class, you get automatic validation, type coercion, and clear error messages when the LLM produces malformed output.

from pydantic import BaseModel, Field
from enum import Enum
from typing import Optional

class EntityType(str, Enum):
    PERSON = "person"
    ORGANIZATION = "organization"
    LOCATION = "location"
    DATE = "date"
    MONEY = "money"
    PRODUCT = "product"
    EVENT = "event"

class Entity(BaseModel):
    text: str = Field(description="The entity text as it appears in the source")
    entity_type: EntityType = Field(description="The semantic type of the entity")
    confidence: float = Field(ge=0.0, le=1.0, description="Extraction confidence")

class Relation(BaseModel):
    subject: str = Field(description="The subject entity text")
    predicate: str = Field(description="The relationship type (e.g., works_at, located_in)")
    object: str = Field(description="The object entity text")
    confidence: float = Field(ge=0.0, le=1.0)

class ExtractionResult(BaseModel):
    """Complete structured extraction from a document."""
    entities: list[Entity] = Field(default_factory=list)
    relations: list[Relation] = Field(default_factory=list)
    summary: Optional[str] = Field(
        None, description="One-sentence summary of the document"
    )

# Validate a sample extraction
result = ExtractionResult(
    entities=[
        Entity(text="Tim Cook", entity_type="person", confidence=0.98),
        Entity(text="Apple Inc.", entity_type="organization", confidence=0.99),
        Entity(text="Cupertino", entity_type="location", confidence=0.95),
    ],
    relations=[
        Relation(
            subject="Tim Cook", predicate="ceo_of",
            object="Apple Inc.", confidence=0.97
        ),
        Relation(
            subject="Apple Inc.", predicate="headquartered_in",
            object="Cupertino", confidence=0.94
        ),
    ],
    summary="Apple CEO Tim Cook to present quarterly earnings in Cupertino."
)

print(result.model_dump_json(indent=2))
{
  "entities": [
    {"text": "Tim Cook", "entity_type": "person", "confidence": 0.98},
    {"text": "Apple Inc.", "entity_type": "organization", "confidence": 0.99},
    {"text": "Cupertino", "entity_type": "location", "confidence": 0.95}
  ],
  "relations": [
    {"subject": "Tim Cook", "predicate": "ceo_of", "object": "Apple Inc.", "confidence": 0.97},
    {"subject": "Apple Inc.", "predicate": "headquartered_in", "object": "Cupertino", "confidence": 0.94}
  ],
  "summary": "Apple CEO Tim Cook to present quarterly earnings in Cupertino."
}

3.2 Instructor: Structured Output from LLMs

Instructor patches OpenAI and Anthropic clients to return Pydantic objects directly, handling JSON schema generation, response parsing, and automatic retries on validation failure. This eliminates the manual prompt engineering needed to coerce LLMs into producing valid JSON.

import instructor
from openai import OpenAI
from pydantic import BaseModel, Field

# Patch the OpenAI client for structured output
client = instructor.from_openai(OpenAI())

class MedicalEntity(BaseModel):
    name: str = Field(description="Entity name as it appears in text")
    category: str = Field(description="One of: condition, medication, procedure, anatomy")
    negated: bool = Field(description="True if the entity is negated (e.g., 'no fever')")

class ClinicalExtraction(BaseModel):
    entities: list[MedicalEntity]
    icd_codes: list[str] = Field(
        description="Likely ICD-10 codes based on the extracted conditions"
    )

note = """
Patient presents with acute chest pain radiating to the left arm.
No fever or shortness of breath. History of hypertension managed
with lisinopril 10mg daily. ECG shows ST-segment elevation.
Recommend immediate cardiac catheterization.
"""

# Instructor handles schema injection, parsing, and validation
extraction = client.chat.completions.create(
    model="gpt-4o",
    response_model=ClinicalExtraction,
    messages=[
        {"role": "system", "content": "Extract medical entities from clinical notes."},
        {"role": "user", "content": note},
    ],
    max_retries=2,  # Auto-retry on validation failure
)

print(f"Entities found: {len(extraction.entities)}")
for ent in extraction.entities:
    neg = " [NEGATED]" if ent.negated else ""
    print(f"  {ent.category:12s}: {ent.name}{neg}")
print(f"\nICD-10 codes: {', '.join(extraction.icd_codes)}")
Entities found: 7
  condition   : acute chest pain
  anatomy     : left arm
  condition   : fever [NEGATED]
  condition   : shortness of breath [NEGATED]
  condition   : hypertension
  medication  : lisinopril 10mg
  procedure   : cardiac catheterization

ICD-10 codes: I21.9, I10, Z79.899
★ Key Insight

Instructor's max_retries parameter is crucial for production reliability. When the LLM returns JSON that fails Pydantic validation (missing fields, wrong types, out-of-range values), Instructor automatically sends the validation error back to the LLM and asks it to fix the response. This retry loop typically resolves 95%+ of parsing failures without human intervention.
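Validation-driven retries work best when your schema encodes real constraints, because the validation error text becomes the correction hint sent back to the model. A minimal sketch (the unit whitelist and field names are illustrative) using a custom Pydantic validator:

```python
from pydantic import BaseModel, Field, field_validator

ALLOWED_UNITS = {"mg", "ml", "mcg"}  # illustrative whitelist, not a standard

class Dose(BaseModel):
    value: float = Field(ge=0, description="Numeric dose amount")
    unit: str = Field(description="Dose unit, e.g. 'mg'")

    @field_validator("unit")
    @classmethod
    def check_unit(cls, v: str) -> str:
        # The ValueError message is what gets fed back to the LLM on
        # retry, so make it actionable rather than generic.
        if v.lower() not in ALLOWED_UNITS:
            raise ValueError(
                f"unit {v!r} not recognized; use one of {sorted(ALLOWED_UNITS)}"
            )
        return v.lower()
```

Used as `response_model=Dose` with `max_retries > 0`, a response like `{"value": 10, "unit": "milligrams"}` fails validation and triggers a retry carrying the error message above.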

3.3 BAML: Type-Safe LLM Functions

BAML (Basically, A Made-up Language) takes a different approach by defining LLM functions in a dedicated schema language that compiles to type-safe client code. This separates prompt logic from application code and provides compile-time guarantees about the expected input/output types.

To get started with BAML, install it and initialize a project:

# Install BAML (for Python, the CLI ships with the baml-py package)
# pip install baml-py
# baml-cli init        (creates the baml_src/ directory)

# After defining your .baml files in baml_src/, compile them:
# baml-cli generate    (generates the baml_client/ Python package)

# The generated baml_client/ package contains:
#   - Type-safe Python classes for all your BAML types
#   - A sync client object 'b' with a method for each BAML function
#   - An async client (from baml_client.async_client import b)
# BAML definition file: baml_src/extract_events.baml (sketch; BAML uses
# its own schema syntax, not Pydantic)
#
# enum EventType {
#   Acquisition
#   Partnership
#   ProductLaunch
#   Earnings
#   Legal
# }
#
# class ExtractedEvent {
#   event_type EventType
#   description string
#   participants string[]
#   date string?
#   monetary_value string?
# }
#
# function ExtractEvents(article: string) -> ExtractedEvent[] {
#   client "openai/gpt-4o"
#   prompt #"
#     Extract all business events from the article below.
#     {{ ctx.output_format }}
#     {{ article }}
#   "#
# }
#
# Usage with the compiled BAML client:

from baml_client import b
from baml_client.types import ExtractedEvent

article = """
Microsoft announced on March 15, 2025, that it has completed its
$2.1 billion acquisition of cybersecurity startup CyberShield AI.
The deal, first reported in January, brings 450 employees and
several enterprise security products into Microsoft's Azure division.
CEO Satya Nadella called the acquisition transformative for the
company's cloud security strategy.
"""

# BAML handles prompt construction, LLM call, and type validation
events: list[ExtractedEvent] = b.ExtractEvents(article)

for event in events:
    print(f"Type:         {event.event_type}")
    print(f"Description:  {event.description}")
    print(f"Participants: {', '.join(event.participants)}")
    print(f"Date:         {event.date}")
    print(f"Value:        {event.monetary_value}")
Type:         acquisition
Description:  Microsoft completed acquisition of CyberShield AI
Participants: Microsoft, CyberShield AI, Satya Nadella
Date:         March 15, 2025
Value:        $2.1 billion
⚠ Warning

LLMs can hallucinate entities that do not appear in the source text. Always implement a grounding check that verifies extracted entities against the original document. A simple substring match catches most hallucinations. For more robust grounding, use fuzzy matching or semantic similarity to handle paraphrases and abbreviations.

4. Hybrid IE Architectures

The most effective production IE systems combine classical and LLM-based extraction in a layered architecture. Classical models handle the high-volume, well-defined entity types (persons, organizations, dates, locations) at near-zero cost, while LLMs are called selectively for complex, domain-specific extraction tasks that require reasoning or world knowledge.

Figure 11.11: A hybrid IE architecture routes documents through classical NER first, then selectively invokes LLM extraction only for complex documents requiring domain-specific entity types, relations, or event detection.

4.1 Building the Hybrid Pipeline

import spacy
from pydantic import BaseModel, Field
from typing import Optional
from dataclasses import dataclass

# Assume 'client' is an Instructor-patched OpenAI client
# client = instructor.from_openai(OpenAI())

nlp = spacy.load("en_core_web_trf")

class DomainEntity(BaseModel):
    text: str
    entity_type: str
    source: str = Field(description="'classical' or 'llm'")
    confidence: float

class RelationTriple(BaseModel):
    subject: str
    predicate: str
    object: str

class HybridExtractionResult(BaseModel):
    entities: list[DomainEntity]
    relations: list[RelationTriple]

# Mapping from spaCy labels to our unified schema
SPACY_LABEL_MAP = {
    "PERSON": "person", "ORG": "organization",
    "GPE": "location", "LOC": "location",
    "DATE": "date", "MONEY": "money",
    "PRODUCT": "product",
}

# Domain-specific types that require LLM extraction
DOMAIN_TYPES = {"medical_condition", "legal_clause", "financial_instrument"}

def needs_llm_extraction(text: str, classical_entities: list) -> bool:
    """Decide whether to invoke the LLM for deeper extraction."""
    # Heuristic: call LLM if the document contains domain keywords
    # that classical NER cannot handle
    domain_keywords = [
        "diagnosis", "plaintiff", "defendant", "derivative",
        "ct scan", "mri", "statute", "breach of contract",
    ]
    text_lower = text.lower()
    return any(kw in text_lower for kw in domain_keywords)

def extract_classical(text: str) -> list[DomainEntity]:
    """Fast, cheap extraction using spaCy."""
    doc = nlp(text)
    entities = []
    for ent in doc.ents:
        if ent.label_ in SPACY_LABEL_MAP:
            entities.append(DomainEntity(
                text=ent.text,
                entity_type=SPACY_LABEL_MAP[ent.label_],
                source="classical",
                confidence=0.95,
            ))
    return entities

def extract_with_llm(text: str, existing: list[DomainEntity]) -> HybridExtractionResult:
    """LLM extraction for domain-specific types and relations."""
    existing_summary = ", ".join(f"{e.text} ({e.entity_type})" for e in existing)

    return client.chat.completions.create(
        model="gpt-4o",
        response_model=HybridExtractionResult,
        messages=[
            {"role": "system", "content": (
                "Extract domain-specific entities and relations from the text. "
                f"These entities were already found by NER: {existing_summary}. "
                "Focus on entities and relations NOT already captured. "
                "Mark all entities with source='llm'."
            )},
            {"role": "user", "content": text},
        ],
        max_retries=2,
    )

def hybrid_extract(text: str) -> HybridExtractionResult:
    """Two-layer hybrid extraction pipeline."""
    # Layer 1: Classical NER (always runs, near-zero cost)
    classical = extract_classical(text)

    # Layer 2: LLM extraction (conditional, only when needed)
    if needs_llm_extraction(text, classical):
        llm_result = extract_with_llm(text, classical)
        # Merge: classical entities + LLM entities + LLM relations
        all_entities = classical + llm_result.entities
        return HybridExtractionResult(
            entities=all_entities,
            relations=llm_result.relations,
        )

    # Simple case: return classical entities only
    return HybridExtractionResult(entities=classical, relations=[])

# Example usage
text = """
Dr. Sarah Chen at Massachusetts General Hospital diagnosed the patient
with Stage II non-small cell lung cancer based on the CT scan results
from January 15, 2025. Treatment with pembrolizumab was initiated.
"""

result = hybrid_extract(text)
print(f"Entities ({len(result.entities)}):")
for e in result.entities:
    print(f"  [{e.source:9s}] {e.entity_type:20s}: {e.text}")
print(f"\nRelations ({len(result.relations)}):")
for r in result.relations:
    print(f"  {r.subject} -> {r.predicate} -> {r.object}")
Entities (7):
  [classical] person              : Dr. Sarah Chen
  [classical] organization        : Massachusetts General Hospital
  [classical] date                : January 15, 2025
  [llm      ] medical_condition   : Stage II non-small cell lung cancer
  [llm      ] medical_procedure   : CT scan
  [llm      ] medication          : pembrolizumab
  [llm      ] medical_procedure   : treatment initiation

Relations (3):
  Dr. Sarah Chen -> diagnosed -> Stage II non-small cell lung cancer
  Dr. Sarah Chen -> works_at -> Massachusetts General Hospital
  pembrolizumab -> treats -> Stage II non-small cell lung cancer
★ Key Insight

The hybrid architecture delivers large cost savings because the complexity router filters out 60-80% of documents at the classical layer. Only documents that contain domain-specific signals (medical terms, legal language, financial instruments) trigger the more expensive LLM call. For a pipeline processing 100K documents/day, this means the LLM handles only 20-40K documents, reducing API costs by 60-80% compared to an LLM-only approach.
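The arithmetic is worth making explicit. A quick back-of-the-envelope sketch, with assumed figures (daily volume, routing fraction, and per-call price are all illustrative):

```python
docs_per_day = 100_000
router_llm_fraction = 0.30   # share of documents the router escalates
cost_per_llm_doc = 0.01      # assumed average USD per LLM call

# LLM-only pipeline: every document incurs the API cost
llm_only_cost = docs_per_day * cost_per_llm_doc

# Hybrid pipeline: only routed documents hit the LLM layer
hybrid_cost = docs_per_day * router_llm_fraction * cost_per_llm_doc

savings = 1 - hybrid_cost / llm_only_cost
print(f"LLM-only: ${llm_only_cost:,.0f}/day, hybrid: ${hybrid_cost:,.0f}/day, "
      f"savings: {savings:.0%}")
```

With these assumptions the hybrid pipeline spends $300/day instead of $1,000/day, a 70% reduction; the savings scale linearly with the fraction of documents the router keeps at the classical layer.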

5. Production Deployment Patterns

Deploying IE systems to production requires attention to grounding, deduplication, and graceful degradation. These patterns ensure that extraction results are reliable even when individual components fail.

5.1 Grounding Verification

Every entity extracted by an LLM should be verified against the source text. This prevents hallucinated entities from entering your structured data store.
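A minimal grounding checker along these lines: exact (case-insensitive) substring match as the fast path, with a fuzzy window comparison as a fallback for minor rewording. The 0.85 threshold is an assumption to tune on your own data:

```python
from difflib import SequenceMatcher

def is_grounded(entity_text: str, source: str, fuzzy_threshold: float = 0.85) -> bool:
    """Check that an extracted entity can be traced back to the source text."""
    ent = entity_text.lower()
    src = source.lower()

    # Fast path: substring match catches most non-hallucinated entities
    if ent in src:
        return True

    # Fallback: fuzzy-match the entity against same-length word windows
    # to tolerate small surface differences (punctuation, inflection)
    words = src.split()
    n = max(len(ent.split()), 1)
    for i in range(len(words) - n + 1):
        window = " ".join(words[i:i + n])
        if SequenceMatcher(None, ent, window).ratio() >= fuzzy_threshold:
            return True
    return False
```

Entities that fail this check should be dropped or flagged for review rather than written to the structured store.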

5.2 Graceful Degradation

When the LLM is unavailable or returns invalid output after retries, the system should fall back to classical extraction rather than failing entirely. This means your pipeline always returns at least the entities that spaCy can identify, even during LLM outages. Log all fallback events so you can measure how often they occur and what extraction quality looks like without the LLM layer.
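A schema-agnostic sketch of this fallback pattern, written against injected extraction callables so it stands alone (the function signatures are illustrative, not tied to any particular pipeline):

```python
import logging
from typing import Callable

logger = logging.getLogger("ie_pipeline")

def extract_with_fallback(
    text: str,
    classical_fn: Callable[[str], list],
    llm_fn: Callable[[str, list], tuple[list, list]],
) -> tuple[list, list]:
    """Return (entities, relations); degrade to classical-only on LLM failure."""
    classical = classical_fn(text)
    try:
        llm_entities, relations = llm_fn(text, classical)
        return classical + llm_entities, relations
    except Exception as exc:  # outage, exhausted retries, invalid output
        # Log every fallback so degradation frequency can be monitored
        logger.warning("LLM extraction failed; returning classical only: %s", exc)
        return classical, []
```

The caller always receives a usable result: the full hybrid extraction when the LLM succeeds, and the classical entities with an empty relation list when it does not.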

⚠ Warning

Never store LLM-extracted entities at the same confidence level as classical entities unless they pass grounding verification. Downstream consumers of your structured data need to distinguish between high-confidence, span-grounded entities and lower-confidence, LLM-inferred entities. Include the source and confidence fields in every entity record.

6. End-to-End Example: Financial Event Extraction

To illustrate a complete production pipeline, consider extracting structured financial events from news articles. This requires recognizing standard entities (companies, dates, monetary values) and domain-specific events (acquisitions, IPOs, earnings reports) with their associated attributes.

Figure 11.12: A four-stage financial event extraction pipeline that combines classical NER, LLM-based event typing, schema validation, and cross-document entity resolution.
📚 Note

Cross-document entity resolution (deduplication) is critical for IE systems that process streams of news articles. The same company may appear as "Microsoft," "Microsoft Corp.," "MSFT," or "the Redmond-based tech giant." Use a combination of string normalization, alias dictionaries, and embedding similarity to link these mentions to a canonical entity ID.

Knowledge Check

1. What is the primary advantage of classical NER (spaCy/CRF) over LLM-based extraction for well-defined entity types?
Show Answer
Classical NER offers sub-millisecond latency, near-zero marginal cost, deterministic output, and 95%+ F1 accuracy on entity types it was trained on. It produces span-based extractions grounded directly in the source text, eliminating hallucination risk. These properties make it the preferred choice for high-volume extraction of standard entity types like persons, organizations, dates, and locations.
2. How does Instructor handle LLM responses that fail Pydantic validation?
Show Answer
Instructor implements an automatic retry loop controlled by the max_retries parameter. When the LLM returns JSON that fails Pydantic validation (missing required fields, wrong types, or values outside specified ranges), Instructor sends the validation error message back to the LLM and asks it to produce a corrected response. This approach resolves the vast majority of parsing failures without manual intervention. If all retries are exhausted, Instructor raises a validation exception that the calling code can handle.
3. Why is grounding verification essential for LLM-extracted entities?
Show Answer
LLMs can hallucinate entities that do not appear in the source text. Unlike classical NER, which extracts contiguous text spans that are by definition present in the document, LLMs generate text that may include inferred or fabricated entities. Grounding verification checks that each extracted entity text can be traced back to the source document through exact substring matching, fuzzy matching, or semantic similarity. Without grounding checks, hallucinated entities can corrupt downstream structured data stores and analytics.
4. How does the complexity router in a hybrid IE pipeline reduce costs?
Show Answer
The complexity router examines each document after classical NER and determines whether LLM extraction is needed. Documents that contain only standard entity types (persons, organizations, dates) are resolved entirely by the classical layer at near-zero cost. Only documents containing domain-specific signals (medical terms, legal language, complex financial events) are routed to the LLM layer. In practice, 60-80% of documents can be handled by the classical layer alone, reducing LLM API costs by a corresponding amount compared to an LLM-only pipeline.
5. What distinguishes BAML from Instructor as an approach to structured LLM output?
Show Answer
Instructor works by patching an existing LLM client (OpenAI, Anthropic) to accept Pydantic models as response schemas, handling JSON schema injection and response parsing at runtime. BAML takes a fundamentally different approach: it defines LLM functions in a dedicated schema language that compiles to type-safe client code. This means type errors are caught at compile time rather than runtime, prompt logic is separated from application code, and the schema definitions serve as documentation. BAML is better suited for large teams that need strict type safety across multiple services, while Instructor is more lightweight and integrates naturally into existing Python codebases.

Key Takeaways

🎓 Where This Leads Next

Module 11 has covered the core patterns for combining classical ML with LLMs. The emerging frontier is compound AI systems: multi-component architectures where retrieval, classification, generation, and verification modules work together as a coordinated pipeline. Frameworks like DSPy (covered in Section 10.3) are evolving to support production deployment of these compound systems. The broader trend is "AI engineering" as a distinct discipline, combining ML engineering, prompt engineering, and systems design. Part IV covers training and fine-tuning, which is the next lever you can pull when prompt engineering and hybrid architectures reach their limits.

💡 What Makes This Module Distinctive

Most LLM courses teach you how to use LLMs. This module taught you when NOT to use them, and how to combine them with traditional ML for production efficiency. The triage routing, cascade, and Pareto frontier analysis patterns covered here are rarely found in textbooks but are standard practice in cost-conscious production systems. The consistent pattern across all five sections: start cheap and simple, escalate to expensive and powerful only when needed.