Behind every closed-source frontier model is a technical report that tells you everything except the part you actually wanted to know.
— A Redacted Research Scientist

Why study closed-source models? Although their weights and training details remain proprietary, frontier closed-source models set the benchmark for what is possible with large language models. Understanding their capabilities, architectural hints, and positioning helps practitioners choose the right tool for each task, anticipate where the field is headed, and recognize the gap (or lack thereof) between proprietary and open alternatives. This section maps the landscape as of early 2025, with notes on rapidly evolving developments.
This section assumes familiarity with the transformer architecture from Module 04 and the pre-training concepts from Section 6.1 (landmark models). Understanding of RLHF and alignment from Section 6.1 (InstructGPT discussion) provides context for the post-training techniques mentioned here.
1. The Frontier Model Landscape
The term "frontier model" refers to the most capable AI systems available at any given time. As of 2025, three companies consistently define the frontier: OpenAI, Anthropic, and Google DeepMind. Several other organizations, including xAI, Cohere, and Mistral, compete in specific capability niches. The competitive dynamics are intense, with new model releases arriving every few months and benchmark leads changing hands regularly.
What distinguishes these frontier models from their predecessors is not merely scale. They incorporate architectural refinements (mixture of experts, extended context mechanisms), sophisticated post-training alignment procedures (RLHF, constitutional AI, RLAIF), and increasingly, native multimodal capabilities that allow a single model to process text, images, audio, and video within a unified architecture.
2. OpenAI: GPT-4o and the o-Series
GPT-4o: Multimodal Unification
GPT-4o (the "o" stands for "omni") represents OpenAI's push toward native multimodality. Unlike earlier systems that bolted separate vision encoders onto a text model, GPT-4o processes text, images, and audio within a single end-to-end architecture. This unification means the model can respond to a spoken question about an image without passing through separate speech-to-text and image-captioning pipelines, reducing latency and enabling richer cross-modal reasoning.
Key technical characteristics of GPT-4o include:
- Context window: 128K tokens input, 16K tokens output
- Multimodal input: Text, images, audio natively; video through frame sampling
- Latency: Average response times of 320ms for audio, substantially faster than the GPT-4 Turbo predecessor
- Pricing: Significantly lower per-token costs than GPT-4 Turbo, making it the default choice for most applications
The o-Series: Reasoning Models
OpenAI's o1 and o3 models represent a fundamentally different approach to capability scaling. Rather than simply making the model larger or training it on more data, the o-series models spend additional compute at inference time by generating extended internal chains of thought before producing a final answer. This "thinking" process is hidden from the user but can consume thousands of tokens internally.
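The cost implications of this hidden thinking are easy to underestimate: reasoning tokens are billed at the output rate even though the user never sees them. A minimal sketch, using the o1 list prices from the pricing table later in this section (the 5,000-token reasoning trace is an assumed figure for illustration):

```python
# Sketch: how hidden reasoning tokens inflate o-series costs.
# Reasoning tokens are billed as output tokens even though they
# never appear in the visible response.

def effective_cost(input_tokens, visible_output, reasoning_tokens,
                   price_in_per_m=15.00, price_out_per_m=60.00):
    """Cost in USD; reasoning tokens are charged at the output rate."""
    billed_output = visible_output + reasoning_tokens
    return (input_tokens / 1e6) * price_in_per_m + \
           (billed_output / 1e6) * price_out_per_m

# A 500-token question with a 300-token visible answer...
cheap = effective_cost(500, 300, reasoning_tokens=0)
# ...costs over 12x more once a hypothetical 5,000-token hidden
# chain of thought is billed:
real = effective_cost(500, 300, reasoning_tokens=5000)
print(f"without reasoning: ${cheap:.4f}, with reasoning: ${real:.4f}")
```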
The o1 model demonstrated dramatic improvements on tasks requiring multi-step reasoning: competitive mathematics, formal logic, complex code generation, and scientific problem-solving. The o3 model extended these capabilities further, achieving scores on benchmarks like ARC-AGI that had previously been considered out of reach for language models. We will explore the technical mechanisms behind these reasoning models in detail in Section 7.3.
OpenAI employs a tiered pricing structure. GPT-4o mini serves as the cost-effective option for high-volume, lower-complexity tasks. GPT-4o handles general-purpose work. The o-series models command premium pricing because their extended reasoning consumes substantially more compute per query. For production applications, choosing the right tier involves balancing task complexity against cost constraints.
3. Anthropic: The Claude Family
Claude 3.5 Sonnet and Constitutional AI
Anthropic's Claude models are distinguished by two core design principles: safety through Constitutional AI (CAI) and strong performance on long-context tasks. Constitutional AI works by training the model against a set of explicitly stated principles (a "constitution") rather than relying solely on human preference data. During training, the model critiques its own outputs against these principles and revises them, creating a self-improving alignment loop.
Claude 3.5 Sonnet, released in mid-2024, achieved frontier-level performance across coding, analysis, and reasoning benchmarks while maintaining a 200K token context window. Its success demonstrated that safety-focused training need not come at the cost of raw capability.
The Claude 4 Family
The Claude 4 generation introduced a model family spanning multiple capability and cost tiers:
- Claude 4 Opus: The most capable model in the family, optimized for complex reasoning, nuanced analysis, and extended agentic workflows
- Claude 4 Sonnet: The balanced workhorse, offering strong performance at moderate cost; designed for the majority of production use cases
- Claude 4 Haiku: The speed-optimized variant for high-throughput, latency-sensitive applications
A notable architectural feature across the Claude family is the extended context window. With support for up to 200K tokens of input context, Claude can process entire codebases, lengthy legal documents, or multi-chapter manuscripts in a single pass. This capability is not merely about accepting long inputs; Anthropic has invested in ensuring that retrieval accuracy remains high even when relevant information is buried deep within the context.
The "needle in a haystack" problem: Many models accept long context windows but fail to reliably retrieve and use information from arbitrary positions within that context. Anthropic's Claude models have consistently scored well on "needle in a haystack" evaluations, where a specific fact is inserted at a random position within a long document and the model must locate and use it accurately. This capability matters enormously for real-world applications like document analysis and codebase understanding.
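The structure of such an evaluation is simple to sketch. A minimal, self-contained version of a needle-in-a-haystack case generator (the filler text, needle, and question below are placeholders, not the prompts any lab actually uses):

```python
# Sketch of a minimal "needle in a haystack" test case generator:
# insert a known fact at a random depth in filler text, then ask the
# model to retrieve it and score the answer for the planted value.
import random

def make_niah_case(filler_sentence, needle, n_sentences=1000, seed=0):
    rng = random.Random(seed)
    sentences = [filler_sentence] * n_sentences
    depth = rng.randrange(n_sentences)      # random insertion position
    sentences.insert(depth, needle)
    haystack = " ".join(sentences)
    question = "What is the secret number mentioned in the document?"
    return haystack, question, depth / n_sentences  # depth as a fraction

haystack, question, depth = make_niah_case(
    "The sky was grey over the harbor that morning.",
    "The secret number is 7481.",
)
print(f"needle at {depth:.0%} depth, {len(haystack.split())} words total")
```

Sweeping `depth` and the haystack length produces the familiar retrieval-accuracy heatmaps that long-context evaluations report.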
4. Google DeepMind: The Gemini Series
Gemini 2.0 and 2.5: Native Multimodality at Scale
Google's Gemini models were designed from the ground up as natively multimodal systems. While GPT-4o also handles multiple modalities, Gemini's architecture was built for this purpose from the initial pre-training stage, jointly training on text, images, audio, and video data simultaneously. This approach, Google argues, produces deeper cross-modal understanding than retrofitting multimodal capabilities onto a text-first model.
The Gemini family includes several tiers:
| Model | Context Window | Strengths | Use Case |
|---|---|---|---|
| Gemini 2.5 Pro | 1M tokens | Deep reasoning, "thinking" mode, code | Complex analysis, agentic tasks |
| Gemini 2.0 Flash | 1M tokens | Speed, cost efficiency, multimodal | High-throughput production |
| Gemini 2.0 Pro | 1M tokens | Balanced capability, world knowledge | General-purpose, coding |
| Gemini Ultra | 1M tokens | Highest raw capability | Research, frontier tasks |
The million-token context window is Gemini's signature feature. Processing up to 1 million tokens (approximately 700,000 words) in a single prompt enables use cases that were previously impossible: analyzing entire codebases, processing hours of video with audio, or reasoning over complete book-length documents. Gemini 2.5 also introduced a "thinking" mode that, similar to OpenAI's o-series, allows the model to spend additional inference compute on complex reasoning tasks.
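Before sending a book-length input, it is worth estimating whether it fits. A common rule of thumb for English text is roughly 4 characters (about 0.75 words) per token; this is only a heuristic, and the provider's tokenizer should be used for exact counts:

```python
# Rough fit check against a long-context window. The ~4 characters
# per token heuristic is an approximation for English text; use the
# provider's tokenizer for exact counts.

def fits_in_context(text, context_window=1_000_000, chars_per_token=4):
    est_tokens = len(text) / chars_per_token
    return est_tokens <= context_window, int(est_tokens)

book = "word " * 700_000            # ~700K words, ~3.5M characters
ok, est = fits_in_context(book)
print(f"estimated {est:,} tokens, fits: {ok}")
```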
Integration Advantages
Google's unique position as both an AI lab and a massive cloud/consumer platform gives Gemini integration advantages that pure-play AI companies cannot match. Gemini is embedded in Google Search, Google Workspace (Docs, Sheets, Gmail), Android, and the Google Cloud Vertex AI platform. For organizations already committed to the Google ecosystem, these integrations reduce friction significantly.
5. Second-Tier Frontier Models
xAI Grok
Elon Musk's xAI developed Grok with a distinctive positioning: real-time access to data from the X (formerly Twitter) platform and a more permissive content policy than competitors. Grok 2 and Grok 3 have shown competitive benchmark performance, particularly in reasoning and mathematical tasks. The Grok 3 release demonstrated impressive results on coding and scientific reasoning benchmarks, placing it alongside the Tier 1 models on several evaluations.
Cohere Command R+
Cohere's Command R+ is optimized for enterprise retrieval-augmented generation (RAG) workflows. It includes built-in citation generation, grounded responses with source attribution, and strong multilingual support across 10+ languages. Command R+ is not designed to compete head-to-head on general benchmarks; instead, it targets the specific needs of enterprise document processing and knowledge management.
Mistral Large
Mistral AI occupies a unique position as a European frontier lab with both open-source and commercial offerings. Mistral Large 2 competes with GPT-4o on many benchmarks while offering deployment options that comply with European data sovereignty requirements. Mistral's hybrid strategy (open-weight smaller models plus proprietary frontier models) gives it credibility in both the open-source community and the enterprise market.
6. Comparing the Frontier
Capability Dimensions
Comparing frontier models requires examining multiple capability dimensions, as no single model dominates across all tasks:
| Dimension | Leader(s) | Notes |
|---|---|---|
| Mathematical reasoning | o3, Gemini 2.5 Pro | Extended thinking modes excel here |
| Code generation | Claude 4 Sonnet, o3 | Agentic coding workflows emerging |
| Long context fidelity | Gemini, Claude | 1M vs 200K, both strong retrieval |
| Multimodal understanding | Gemini 2.5, GPT-4o | Native multimodal architectures |
| Safety and alignment | Claude | Constitutional AI approach |
| Cost efficiency | Gemini Flash, GPT-4o mini | 10x cheaper than flagship models |
| Enterprise RAG | Cohere Command R+ | Built-in citation, grounding |
| Latency | Gemini Flash, Claude Haiku | Sub-second for simple queries |
Pricing Comparison
Note: Pricing as of early 2025. LLM API pricing changes frequently; check provider websites for current rates.
Pricing for frontier models varies dramatically based on the model tier, input vs. output tokens, and whether batch or real-time processing is used. As a rough guide for input/output pricing per million tokens (as of early 2025):
```python
# Approximate pricing comparison (per million tokens, USD)
# These prices change frequently; check provider documentation
pricing = {
    "GPT-4o":            {"input": 2.50,  "output": 10.00},
    "GPT-4o mini":       {"input": 0.15,  "output": 0.60},
    "o1":                {"input": 15.00, "output": 60.00},
    "Claude 3.5 Sonnet": {"input": 3.00,  "output": 15.00},
    "Claude 4 Opus":     {"input": 15.00, "output": 75.00},
    "Gemini 2.0 Flash":  {"input": 0.10,  "output": 0.40},
    "Gemini 2.5 Pro":    {"input": 1.25,  "output": 10.00},
}

# Cost to process a 50K token document with a 2K token response
def estimate_cost(model, input_tokens=50_000, output_tokens=2_000):
    p = pricing[model]
    cost = (input_tokens / 1_000_000) * p["input"] + \
           (output_tokens / 1_000_000) * p["output"]
    return f"{model}: ${cost:.4f}"

for model in pricing:
    print(estimate_cost(model))
```
The cost differences are striking: for the same workload, Gemini 2.0 Flash costs $0.006 while Claude 4 Opus costs $0.90, a 150x difference. Choosing the right model tier is one of the highest-leverage decisions in production LLM deployment.
```python
# Example: making an API call to compare providers.
# Several major providers now expose OpenAI-compatible chat endpoints,
# so the same client library can target more than one of them.
import os
from openai import OpenAI

# OpenAI
client = OpenAI()  # uses the OPENAI_API_KEY environment variable
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "What is 25 * 37?"}],
    max_tokens=50,
)
print(f"GPT-4o: {response.choices[0].message.content}")
print(f"Tokens: {response.usage.prompt_tokens} in, "
      f"{response.usage.completion_tokens} out")

# Anthropic (via its OpenAI-compatible endpoint)
anthropic_client = OpenAI(
    base_url="https://api.anthropic.com/v1/",
    api_key=os.environ["ANTHROPIC_API_KEY"],  # or use the anthropic SDK directly
)

# A similar pattern works for Google (Vertex AI) and other providers.
```
The frontier model landscape evolves rapidly. Rather than memorizing current benchmarks, use this evaluation framework when assessing new model releases: (1) Check standardized benchmarks (MMLU, HumanEval, MATH) for broad capability. (2) Test on your specific use case with a held-out evaluation set. (3) Compare cost per quality point, not raw scores. (4) Verify rate limits and latency requirements. (5) Consider the provider's data privacy and retention policies. The best model for your application may not be the one topping the leaderboard.
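Step (3) of this framework, comparing cost per quality point, can be made concrete with a small sketch. The models and numbers below are invented for illustration; in practice, the accuracy column comes from your own held-out evaluation set and the cost column from the pricing tables above:

```python
# Sketch of "cost per quality point": normalize each candidate's
# accuracy on YOUR held-out eval set by the cost of running that
# eval. All figures below are made up for illustration.

candidates = {
    # model tier: (accuracy on your eval, USD to run the full eval set)
    "flagship": (0.92, 4.50),
    "mid-tier": (0.89, 0.60),
    "small":    (0.81, 0.07),
}

for name, (acc, cost) in candidates.items():
    print(f"{name}: {acc:.0%} accuracy, ${cost:.2f}, "
          f"{acc / cost:.1f} accuracy points per dollar")
```

On these invented numbers, the mid-tier model delivers nearly all of the flagship's accuracy at a fraction of the cost, which is exactly the kind of trade-off raw leaderboard scores hide.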
7. Rate Limits and Practical Constraints
Beyond raw capability and pricing, production deployments must account for rate limits, throughput caps, and availability. Each provider imposes limits on tokens per minute (TPM), requests per minute (RPM), and sometimes requests per day (RPD). These limits vary by pricing tier and can be negotiated for enterprise accounts.
Key practical considerations include:
- Rate limit tiers: Free tier users face strict limits (often 10K-60K TPM). Paid API users get 10x-100x higher limits. Enterprise agreements can remove most caps.
- Batch processing: OpenAI and others offer batch APIs at 50% discounts with 24-hour turnaround, ideal for non-real-time workloads like evaluation, data labeling, or content generation.
- Regional availability: Not all models are available in all regions. European organizations may prefer Mistral or locally hosted alternatives for data residency compliance.
- Model deprecation: Providers regularly deprecate older model versions. Production systems must plan for model migration, which can affect output quality and consistency.
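Rate limits in practice mean handling transient 429-style failures gracefully. A common pattern is exponential backoff with jitter; the sketch below assumes nothing about any particular SDK, so the exception type and the wrapped call are placeholders:

```python
# Sketch of exponential backoff with jitter for rate-limited APIs.
# `call` is any zero-argument function; the retryable exception
# types depend on the SDK you actually use.
import random
import time

def with_backoff(call, max_retries=5, base_delay=1.0,
                 retryable=(Exception,)):
    for attempt in range(max_retries):
        try:
            return call()
        except retryable:
            if attempt == max_retries - 1:
                raise
            # exponential delay with jitter to avoid retry stampedes
            delay = base_delay * (2 ** attempt) * random.uniform(0.5, 1.5)
            time.sleep(delay)

# Usage with a real client would look like:
# result = with_backoff(lambda: client.chat.completions.create(...))
```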
Public benchmark scores (MMLU, HumanEval, MATH, etc.) provide useful but imperfect comparisons. Model providers optimize for known benchmarks, leading to potential overfitting. Real-world performance on your specific task may differ substantially from benchmark rankings. Always evaluate candidate models on your own data and use cases before committing to a provider.
8. Architectural Insights from the Outside
Although frontier model architectures are proprietary, various signals allow us to infer architectural details:
What We Can Infer
- Mixture of Experts (MoE): GPT-4 is widely believed to use an MoE architecture based on analysis of its behavior and leaked reports. This would explain its strong multi-task performance while keeping inference costs manageable.
- Tokenizer design: API-accessible tokenizers reveal vocabulary size and subword strategies. GPT-4o uses the o200k_base encoding with a roughly 200K-token vocabulary (earlier GPT-4 models used cl100k_base with ~100K tokens); Claude uses a custom tokenizer; Gemini's tokenizer handles multimodal inputs natively.
- Context window implementation: The jump from 8K to 128K to 1M contexts suggests different underlying attention mechanisms, likely including some form of sparse attention, sliding window, or hierarchical attention for the longest contexts.
- Post-training pipeline: Anthropic has published details about Constitutional AI. OpenAI has discussed RLHF. Google has described RLAIF (RL from AI Feedback). Each approach produces measurably different model behaviors in terms of safety, helpfulness, and stylistic tendencies.
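Since the actual long-context mechanisms are not public, the sliding-window hypothesis above is best understood through a toy illustration. The mask below shows why restricting each position to a fixed window makes attention cost linear rather than quadratic in sequence length; it is a didactic sketch, not a claim about any specific model:

```python
# Illustrative sliding-window attention mask: each query position may
# attend only to itself and the previous `window - 1` keys. With a
# fixed window, attention work per token is O(window), not O(seq_len).
import numpy as np

def sliding_window_mask(seq_len, window):
    i = np.arange(seq_len)[:, None]   # query positions
    j = np.arange(seq_len)[None, :]   # key positions
    # causal (j <= i) and within the window (i - j < window)
    return (j <= i) & (i - j < window)

mask = sliding_window_mask(seq_len=8, window=3)
print(mask.astype(int))  # each row contains at most 3 ones
```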
# Inspecting tokenizer behavior across providers
import tiktoken
# OpenAI GPT-4o tokenizer
enc = tiktoken.encoding_for_model("gpt-4o")
text = "The quick brown fox jumps over the lazy dog."
tokens = enc.encode(text)
print(f"GPT-4o tokens: {len(tokens)}")
print(f"Token IDs: {tokens}")
print(f"Decoded tokens: {[enc.decode([t]) for t in tokens]}")
# Compare: How different models tokenize the same multilingual text
multilingual = "Hello world. Bonjour le monde. Hola mundo."
print(f"\nMultilingual text tokens: {len(enc.encode(multilingual))}")
# Different tokenizers will produce different token counts,
# reflecting their training data distribution
9. The Convergence Trend
A striking trend in the frontier model landscape is convergence. In 2023, GPT-4 held a commanding lead on most benchmarks. By 2025, Claude, Gemini, and even some open-weight models have closed much of the gap. On many practical tasks, the differences between frontier models are smaller than the differences caused by prompt engineering choices or task-specific fine-tuning.
This convergence has several implications for practitioners:
- Multi-provider strategies reduce risk: If your application's prompts work well across Claude, GPT-4o, and Gemini, you gain resilience against outages, pricing changes, and deprecation.
- Differentiation moves to the edges: The competitive advantage increasingly comes from context length, multimodal capabilities, tool use, latency, or specialized enterprise features rather than raw text generation quality.
- Open models are catching up: As we will see in Section 7.2, open-weight models now approach frontier capability on many tasks, raising questions about the long-term value proposition of closed-source alternatives.
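A multi-provider strategy usually takes the shape of a thin abstraction layer: the application codes against one interface, and providers are swapped by configuration or tried in order on failure. The classes below are placeholders standing in for real SDK clients, sketched to show the pattern rather than any particular library:

```python
# Sketch of a provider-agnostic wrapper supporting a multi-provider
# strategy. EchoProvider stands in for a real client (OpenAI,
# Anthropic, Gemini, ...); only the pattern matters here.

class ChatProvider:
    def complete(self, prompt: str) -> str:
        raise NotImplementedError

class EchoProvider(ChatProvider):
    """Stand-in for a real provider client."""
    def __init__(self, name):
        self.name = name
    def complete(self, prompt):
        return f"[{self.name}] {prompt}"

def route(providers, prompt):
    """Try providers in order; fall through on failure for resilience."""
    for p in providers:
        try:
            return p.complete(prompt)
        except Exception:
            continue
    raise RuntimeError("all providers failed")

print(route([EchoProvider("primary"), EchoProvider("fallback")], "hi"))
```

Because the application only depends on `ChatProvider`, outages, pricing changes, and deprecations become configuration changes rather than rewrites.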
Key Takeaways
- Three Tier-1 players define the frontier: OpenAI (GPT-4o, o-series), Anthropic (Claude family), and Google DeepMind (Gemini). Each has distinct strengths in reasoning, safety, multimodality, or context length.
- Reasoning models (o1/o3, Gemini "thinking" mode) represent a paradigm shift: spending more compute at inference time rather than only at training time. This enables dramatic improvements on mathematical and logical reasoning tasks.
- Native multimodality is replacing modular pipelines. GPT-4o and Gemini process text, images, and audio in unified architectures, improving cross-modal reasoning and reducing latency.
- Context windows have expanded dramatically: from 4K tokens in 2022 to 1M tokens in 2025. Long-context fidelity (not just capacity) is a key differentiator.
- Convergence is real: the gap between frontier models has narrowed substantially, making provider choice more about integration, pricing, and specific capability needs than overall quality.
- Always evaluate on your own tasks. Benchmarks are useful signals, not guarantees. The best model for your application depends on your specific data, latency requirements, and budget.