A team at a fintech startup shipped their first LLM feature using the OpenAI API. It worked beautifully in testing. Then the bill arrived: $12,000 for one month, ten times their estimate. They had missed that their retry logic resubmitted full conversation histories on every 429 error, that their prompt included 800 tokens of unused system instructions, and that Anthropic's prompt caching could have cut costs by 90% for their repetitive workload. The API you choose, and how you call it, is not a footnote. It is a core architectural decision that shapes your cost structure, latency profile, and reliability posture. This section builds the fluency you need to make that decision well: we survey the major providers, compare their architectures side by side, and establish the patterns that will carry through the rest of Part III.
1. The LLM API Ecosystem
All pricing figures in this module reflect approximate rates as of early 2025. LLM API prices change frequently, often decreasing by 50% or more within a year. Always check provider documentation for current pricing before making architectural decisions based on cost. Where possible, we express costs as ratios (e.g., "10x cheaper") rather than absolute dollar amounts to extend shelf life.
The modern LLM API ecosystem is built around a surprisingly simple pattern: you send a sequence of messages (a conversation) to an HTTP endpoint, and the model returns a completion. Despite this conceptual simplicity, each provider has evolved distinct conventions, capabilities, and pricing models that shape how you build applications.
The major commercial providers include OpenAI, Anthropic, and Google. Each offers hosted inference with usage-based pricing, typically measured in tokens. Beyond these, cloud platforms like AWS Bedrock and Azure OpenAI provide enterprise wrappers that add compliance, networking, and billing features on top of the same underlying models. For open-source models, serving frameworks like vLLM, TGI, and Ollama expose OpenAI-compatible endpoints, creating a de facto standard.
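Under the hood, every one of these calls is a plain HTTP POST with a JSON body and a bearer token. A minimal stdlib-only sketch against the OpenAI endpoint (the URL and header names follow OpenAI's documented conventions; other providers differ):

```python
import json
import os
import urllib.request

API_URL = "https://api.openai.com/v1/chat/completions"

def build_request(model, messages, **params):
    """Assemble the JSON body that chat-style LLM APIs expect."""
    return {"model": model, "messages": messages, **params}

def post_chat(body):
    """Send the request with stdlib urllib; requires OPENAI_API_KEY to be set."""
    req = urllib.request.Request(
        API_URL,
        data=json.dumps(body).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

body = build_request(
    "gpt-4o",
    [{"role": "user", "content": "Say hello."}],
    temperature=0.0,
)
# result = post_chat(body)  # uncomment with a valid API key
```

Everything the SDKs add on top of this (retries, typing, streaming helpers) is convenience, not protocol.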
2. OpenAI Chat Completions API
The OpenAI Chat Completions API established the dominant pattern for LLM APIs. You send a list of messages, each with a role (system, user, or assistant) and content, along with generation parameters. The API returns a completion object containing the model's response, token usage counts, and metadata.
2.1 Core Request Structure
The key parameters that control generation behavior are:
- `model`: The model identifier (e.g., `gpt-4o`, `o4-mini`)
- `messages`: The conversation history as an array of role/content objects
- `temperature`: Controls randomness (0.0 = deterministic, 2.0 = maximum randomness)
- `top_p`: Nucleus sampling threshold (alternative to temperature)
- `max_tokens`: Maximum number of tokens to generate
- `frequency_penalty`: Penalizes tokens based on how often they have appeared (range: -2.0 to 2.0)
- `presence_penalty`: Penalizes tokens that have appeared at all (range: -2.0 to 2.0)
```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "You are a helpful coding assistant."},
        {"role": "user", "content": "Explain Python list comprehensions in two sentences."},
    ],
    temperature=0.7,
    max_tokens=150,
)

print(response.choices[0].message.content)
print(f"Tokens used: {response.usage.prompt_tokens} input, "
      f"{response.usage.completion_tokens} output")
```
Temperature vs. top_p: OpenAI recommends adjusting one or the other, not both simultaneously. Temperature scales the logits before softmax, while top_p truncates the distribution after softmax. Using both can create unpredictable interactions. For maximally deterministic output, set temperature=0 (outputs may still vary slightly across calls). For creative tasks, try temperature=0.8 to 1.2.
2.2 Streaming Responses with Server-Sent Events
For interactive applications, waiting for the entire response to complete before displaying anything creates a poor user experience. Streaming delivers tokens as they are generated using the Server-Sent Events (SSE) protocol. Each chunk arrives as a small JSON object containing the delta (the new content since the last chunk).
```python
from openai import OpenAI

client = OpenAI()

stream = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "user", "content": "Write a haiku about APIs."}
    ],
    stream=True,
)

collected_content = ""
for chunk in stream:
    delta = chunk.choices[0].delta
    if delta.content:
        collected_content += delta.content
        print(delta.content, end="", flush=True)

print(f"\n\nFull response: {collected_content}")
```
When to stream: Streaming is essential for chatbots and interactive UIs where time-to-first-token (TTFT) matters more than total latency. For batch processing or backend pipelines where you need the complete response before proceeding, non-streaming calls are simpler and provide the full usage object in a single response.
2.3 The Batch API
OpenAI's Batch API allows you to submit large collections of requests (up to 50,000) that are processed asynchronously within a 24-hour window. The key advantage is a 50% cost reduction compared to synchronous API calls. This makes it ideal for tasks like dataset labeling, bulk classification, or evaluation pipelines where real-time responses are unnecessary.
```python
from openai import OpenAI
import json

client = OpenAI()

# Step 1: Create a JSONL file with batch requests
requests = [
    {
        "custom_id": f"review-{i}",
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": "gpt-4o-mini",
            "messages": [
                {"role": "system", "content": "Classify the sentiment as positive, negative, or neutral."},
                {"role": "user", "content": review},
            ],
            "max_tokens": 10,
        },
    }
    for i, review in enumerate(["Great product!", "Terrible service.", "It was okay."])
]

with open("batch_input.jsonl", "w") as f:
    for req in requests:
        f.write(json.dumps(req) + "\n")

# Step 2: Upload and submit the batch
batch_file = client.files.create(file=open("batch_input.jsonl", "rb"), purpose="batch")
batch_job = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)
print(f"Batch submitted: {batch_job.id}, status: {batch_job.status}")
```
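Submitting is only half the workflow: you then poll the job until it reaches a terminal status and parse the output file, which is itself JSONL with one result per `custom_id`. A sketch of both halves (the polling interval and the set of terminal statuses are assumptions based on the documented batch lifecycle; verify against current docs):

```python
import json
import time

TERMINAL = {"completed", "failed", "expired", "cancelled"}

def poll_batch(client, batch_id, interval=60):
    """Block until the batch job finishes, then return the job object."""
    while True:
        job = client.batches.retrieve(batch_id)
        if job.status in TERMINAL:
            return job
        time.sleep(interval)

def parse_batch_output(jsonl_text):
    """Map custom_id -> assistant message content from a batch output file."""
    results = {}
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue
        rec = json.loads(line)
        body = rec["response"]["body"]
        results[rec["custom_id"]] = body["choices"][0]["message"]["content"]
    return results

# Usage (sketch):
#   job = poll_batch(client, batch_job.id)
#   raw = client.files.content(job.output_file_id).text
#   labels = parse_batch_output(raw)
```

Because results arrive out of order, the `custom_id` you set at submission time is the only reliable way to join outputs back to inputs.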
3. Anthropic Messages API
Anthropic's Messages API follows a similar conversational pattern but introduces several distinctive design choices. System prompts are a separate top-level parameter rather than a message role. The API supports prompt caching (which can reduce costs by up to 90% for repeated prefixes), extended thinking for complex reasoning, and a robust tool use system.
3.1 Core Differences from OpenAI
The most notable architectural differences include:
- System prompt: Passed as a top-level `system` parameter, not inside the messages array
- Max tokens: Required (not optional), specified as `max_tokens`
- Content blocks: Responses use content blocks (an array of typed objects) rather than a single string
- Prompt caching: Explicit cache control markers let you cache expensive prefixes
- Extended thinking: Built-in chain-of-thought reasoning that exposes the model's thinking process
```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=200,
    system="You are a helpful coding assistant. Be concise.",
    messages=[
        {"role": "user", "content": "Explain Python list comprehensions in two sentences."}
    ],
)

print(response.content[0].text)
print(f"Tokens: {response.usage.input_tokens} input, "
      f"{response.usage.output_tokens} output")
print(f"Stop reason: {response.stop_reason}")
```
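Because responses are a list of typed content blocks rather than a single string, robust code should filter by block type instead of assuming `content[0]` is text. A small sketch operating on dict-shaped blocks (e.g., from `response.model_dump()["content"]`; the SDK's own block objects expose `.type`/`.text` attributes instead):

```python
def extract_text(blocks):
    """Join the text of 'text'-type content blocks, skipping other
    block types such as tool_use or thinking."""
    return "".join(b["text"] for b in blocks if b.get("type") == "text")
```

This becomes important once tool use or extended thinking is enabled, since the first block may not be text at all.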
3.2 Prompt Caching
Anthropic's prompt caching feature is particularly valuable when you repeatedly send requests with a shared prefix (for example, a long system prompt or a large document in context). By marking content blocks with cache_control, you tell the API to cache that prefix. Subsequent requests that share the same cached prefix receive a significant discount on input token costs and reduced latency.
```python
response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=300,
    system=[
        {
            "type": "text",
            "text": "You are an expert on the Python standard library. " * 50,  # long system prompt
            "cache_control": {"type": "ephemeral"},  # cache this prefix
        }
    ],
    messages=[
        {"role": "user", "content": "What is the difference between os.path and pathlib?"}
    ],
)

# Check cache performance
print(f"Cache creation tokens: {response.usage.cache_creation_input_tokens}")
print(f"Cache read tokens: {response.usage.cache_read_input_tokens}")
```
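Whether caching pays off is simple arithmetic. This sketch assumes cache writes are billed at roughly 1.25x and cache reads at roughly 0.1x the base input rate (Anthropic's published multipliers at the time of writing; treat them as parameters and verify current pricing):

```python
def cached_input_cost(prefix_tokens, suffix_tokens, n_requests,
                      base_rate, write_mult=1.25, read_mult=0.10):
    """Estimated total input cost with prompt caching.

    The first request writes the cache (surcharge on the prefix);
    the remaining n_requests - 1 read it at a steep discount.
    base_rate is the per-token input price.
    """
    first = (prefix_tokens * write_mult + suffix_tokens) * base_rate
    rest = (n_requests - 1) * (prefix_tokens * read_mult + suffix_tokens) * base_rate
    return first + rest

def uncached_input_cost(prefix_tokens, suffix_tokens, n_requests, base_rate):
    return n_requests * (prefix_tokens + suffix_tokens) * base_rate
```

The break-even point comes quickly: with a large shared prefix, even a second request usually recoups the write surcharge.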
4. Google Gemini API
Google's Gemini API uses a generateContent endpoint that shares the same conversational pattern but employs different terminology. Messages are called "contents," roles are "user" and "model" (not "assistant"), and the API supports unique features like grounding (connecting to Google Search), code execution, and native multimodal input.
```python
from google import genai

client = genai.Client()  # reads GOOGLE_API_KEY from the environment

response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents="Explain Python list comprehensions in two sentences.",
    config=genai.types.GenerateContentConfig(
        temperature=0.7,
        max_output_tokens=150,
        system_instruction="You are a helpful coding assistant.",
    ),
)

print(response.text)
print(f"Tokens: {response.usage_metadata.prompt_token_count} input, "
      f"{response.usage_metadata.candidates_token_count} output")
```
5. Enterprise Wrappers: AWS Bedrock and Azure OpenAI
Enterprise cloud providers wrap the underlying models with additional infrastructure for security, compliance, and billing. AWS Bedrock provides access to models from Anthropic, Meta, Mistral, and others through a unified API with IAM authentication, VPC endpoints, and consolidated AWS billing. Azure OpenAI deploys the same OpenAI models within Microsoft's cloud, adding features like content filtering, private networking, and regional data residency.
The following example shows how to call Claude on AWS Bedrock using boto3. The key difference from the direct Anthropic API is authentication (IAM credentials instead of an API key) and the endpoint configuration:
```python
import boto3
import json

# Bedrock uses IAM credentials (configured via AWS CLI or environment variables)
bedrock = boto3.client(
    service_name="bedrock-runtime",
    region_name="us-east-1",
)

body = json.dumps({
    "anthropic_version": "bedrock-2023-05-31",
    "max_tokens": 256,
    "messages": [
        {"role": "user", "content": "Explain what an API gateway does in two sentences."}
    ],
})

response = bedrock.invoke_model(
    modelId="anthropic.claude-sonnet-4-20250514-v1:0",
    body=body,
    contentType="application/json",
    accept="application/json",
)

result = json.loads(response["body"].read())
print(result["content"][0]["text"])
```
API version drift: Enterprise wrappers sometimes lag behind the direct provider APIs by days or weeks. A feature available on api.openai.com may not yet be available on Azure OpenAI (or may require a specific API version string). Always check the enterprise wrapper's documentation for supported features and versions before building against them.
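Azure's request shape illustrates this drift concretely: requests are routed by deployment name rather than model name, authenticated with an `api-key` header rather than a bearer token, and pinned to an explicit `api-version` query parameter. A sketch of the URL construction (the default version string here is an assumption; check which versions your resource supports):

```python
def azure_chat_url(resource, deployment, api_version="2024-06-01"):
    """Build the Azure OpenAI chat completions URL.

    Unlike api.openai.com, Azure addresses a *deployment* you created
    (which maps to a model) and requires an api-version query parameter.
    """
    return (
        f"https://{resource}.openai.azure.com/openai/deployments/"
        f"{deployment}/chat/completions?api-version={api_version}"
    )

# Usage (sketch): POST to this URL with an "api-key" header
# instead of "Authorization: Bearer ...".
```

The `api-version` pin is exactly where drift shows up: a feature may require a newer version string than the one your code has baked in.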
All three major providers (GPT-4o, Claude, Gemini) now support image input alongside text. You can send images as base64-encoded data or URLs in the messages array. While this section focuses on text APIs, be aware that these same endpoints handle multimodal input. Consult each provider's documentation for image formatting details, token counting for images, and supported image formats.
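For reference, the OpenAI-style message shape for image input interleaves typed content parts. A small helper that wraps raw image bytes as a base64 data URL (the part structure follows OpenAI's documented image format; Anthropic and Gemini use different block shapes):

```python
import base64

def image_message(text, image_bytes, mime="image/png"):
    """Build a user message pairing a text prompt with an inline image."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": text},
            {"type": "image_url",
             "image_url": {"url": f"data:{mime};base64,{b64}"}},
        ],
    }

# Usage (sketch):
#   msg = image_message("What is in this image?", open("chart.png", "rb").read())
#   client.chat.completions.create(model="gpt-4o", messages=[msg])
```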
6. Open-Source Serving: The OpenAI-Compatible Pattern
A powerful convention has emerged in the open-source ecosystem: serving frameworks expose an API that mimics the OpenAI Chat Completions format. This means you can point the OpenAI Python SDK at any compatible server by changing the base_url, and your existing code works without modification. Frameworks like vLLM, Text Generation Inference (TGI), and Ollama all support this pattern.
```python
from openai import OpenAI

# Point the OpenAI client at a local vLLM server
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed",  # local servers often skip auth
)

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[
        {"role": "user", "content": "Explain Python list comprehensions in two sentences."}
    ],
    temperature=0.7,
    max_tokens=150,
)

print(response.choices[0].message.content)
```
The portability benefit: Because the OpenAI-compatible API has become a de facto standard, code written against the OpenAI SDK can run against local models, cloud providers, or even custom fine-tuned models with zero changes to the application logic. Only the base_url and model name change. This is a deliberate design choice by the open-source community to reduce switching costs.
7. Provider Comparison
The following table summarizes the key differences across the major providers. Understanding these differences helps you choose the right provider for your use case and design portable abstractions.
| Feature | OpenAI | Anthropic | Google Gemini |
|---|---|---|---|
| Endpoint | `/chat/completions` | `/messages` | `generateContent` |
| System prompt | Message with role "system" | Top-level `system` param | `system_instruction` config |
| Assistant role name | `assistant` | `assistant` | `model` |
| Max tokens | Optional | Required | Optional |
| Streaming | SSE (`stream=True`) | SSE (`stream=True`) | SSE (`generate_content_stream`) |
| Prompt caching | Automatic | Explicit markers | Context caching API |
| Batch API | Yes (50% discount) | Yes (Message Batches) | No dedicated batch |
| Tool use | Function calling | Tool use | Function declarations |
8. Authentication and Rate Limits
Every provider requires authentication, typically via an API key passed in the Authorization header. Enterprise providers like Azure OpenAI also support Azure Active Directory tokens. Rate limits are enforced at multiple levels: requests per minute (RPM), tokens per minute (TPM), and sometimes requests per day (RPD). Exceeding these limits results in HTTP 429 (Too Many Requests) responses.
Never hardcode API keys. Store them in environment variables, secrets managers (like AWS Secrets Manager or HashiCorp Vault), or .env files that are excluded from version control. A leaked API key can result in unauthorized usage and significant charges on your account.
Rate limit headers returned with each response tell you how much capacity remains. Monitoring these headers allows you to implement proactive throttling rather than waiting for 429 errors:
- `x-ratelimit-remaining-requests`: Requests remaining in the current window
- `x-ratelimit-remaining-tokens`: Tokens remaining in the current window
- `x-ratelimit-reset-requests`: Time until the request limit resets
- `x-ratelimit-reset-tokens`: Time until the token limit resets
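Even with proactive throttling, some 429s will slip through, so production callers pair header monitoring with exponential backoff and jitter. A generic retry sketch (the retry counts and delays are illustrative defaults, not provider recommendations):

```python
import random
import time

def with_backoff(call, max_retries=5, base_delay=1.0, max_delay=60.0,
                 retry_on=(Exception,)):
    """Invoke call() with exponential backoff plus jitter on failure.

    In practice, narrow retry_on to rate-limit and transient-network
    errors (e.g., the SDK's RateLimitError) so real bugs fail fast.
    """
    for attempt in range(max_retries + 1):
        try:
            return call()
        except retry_on:
            if attempt == max_retries:
                raise  # out of retries: surface the error
            delay = min(max_delay, base_delay * 2 ** attempt)
            time.sleep(delay * random.uniform(0.5, 1.5))  # jitter avoids thundering herds

# Usage (sketch):
#   response = with_backoff(lambda: client.chat.completions.create(...))
```

Note that retrying a chat call resubmits the entire prompt, so retries multiply token spend; cap `max_retries` accordingly.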
9. Choosing Between Providers
The right provider depends on your specific requirements. Consider these factors when making your decision:
- Model capability: For the most complex reasoning tasks, compare the latest models from each provider on your specific use case. Benchmarks help, but real evaluation on your data is essential.
- Pricing: Input and output tokens are priced differently. Anthropic and Google tend to offer competitive pricing for large-context workloads. OpenAI's Batch API cuts costs by 50% for non-real-time tasks.
- Context window: Gemini supports up to 1M tokens; Anthropic's Claude supports 200K; OpenAI's GPT-4o supports 128K. Longer contexts enable retrieval-free processing of large documents.
- Compliance: Enterprise wrappers (Bedrock, Azure) offer HIPAA, SOC 2, and data residency guarantees that direct API access may not.
- Latency: Test time-to-first-token and total generation time from your deployment region. Geographic proximity to provider data centers matters.
Multi-provider strategy: Many production systems use multiple providers. A common pattern is to route simple tasks to smaller, cheaper models and reserve expensive frontier models for complex queries. We will explore this routing pattern in detail in Section 9.3.
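A first-cut router can be as crude as prompt length plus a few trigger words. A hypothetical sketch (the model names, threshold, and keyword list are all placeholders, not a recommended policy):

```python
def pick_model(prompt, cheap="gpt-4o-mini", frontier="gpt-4o",
               hard_words=("prove", "analyze", "debug", "step by step")):
    """Route long or reasoning-heavy prompts to the frontier model,
    everything else to the cheap one."""
    if len(prompt) > 2000 or any(w in prompt.lower() for w in hard_words):
        return frontier
    return cheap
```

Real routers typically replace the keyword heuristic with a small classifier, but the cost logic is the same: pay frontier prices only where capability demonstrably matters.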
Knowledge Check

Q: How does Anthropic's Messages API handle system prompts differently from OpenAI's?
A: Anthropic passes the system prompt as a top-level system parameter in the API call, not as a message with the "system" role inside the messages array. This is a key architectural difference from the OpenAI API.

Q: Why can the OpenAI Python SDK be reused with open-source serving frameworks like vLLM?
A: These frameworks expose an OpenAI-compatible endpoint, so pointing the SDK at them only requires changing the base_url parameter. This means existing application code works without modification when switching between providers, reducing switching costs and making local and cloud deployments interchangeable.
Pick any code example from this section and try these experiments:
- Change the `temperature` from 0.0 to 1.0 and run the same prompt five times. Compare how the outputs vary. At what temperature do you start seeing meaningfully different responses?
- Set `max_tokens` to 10 on a question that requires a long answer. Observe how the model truncates its response mid-sentence. This is why production code must handle incomplete responses gracefully.
- Send the same prompt to two different providers (e.g., OpenAI and the vLLM local example). Compare the outputs, latency, and token counts. Note which differences are cosmetic and which are substantive.
Key Takeaways
- Universal pattern: All major LLM APIs follow the same core pattern: send a conversation (list of messages) with generation parameters via HTTP POST, receive a JSON response with the completion and token usage.
- Provider-specific details matter: Despite the shared pattern, each provider uses different field names, places system prompts differently, and offers unique capabilities like prompt caching (Anthropic), grounding (Google), or batch processing (OpenAI).
- Streaming is essential for UX: Server-Sent Events allow token-by-token delivery, which is critical for interactive applications where time-to-first-token matters.
- The OpenAI-compatible format is the de facto standard: Open-source serving frameworks (vLLM, TGI, Ollama) all expose this format, making the OpenAI SDK a universal client.
- Enterprise wrappers add compliance, not models: AWS Bedrock and Azure OpenAI wrap the same underlying models with enterprise features like private networking, IAM, and data residency.
- Cost optimization starts with API choice: Batch APIs, prompt caching, and model routing can each reduce costs by 50% or more for appropriate workloads.