A well-designed LLM application architecture separates model inference, business logic, and API layers so that each can scale, version, and fail independently. Whether you deploy on a dedicated GPU cluster, a managed cloud endpoint, or a serverless function, the same architectural principles apply: stateless request handling, streaming for low perceived latency, health checks for reliability, and containerization for reproducibility. This section covers the full deployment stack from local development with FastAPI through production deployment on AWS, GCP, Azure, and serverless platforms.
1. API Layer with FastAPI
FastAPI is the dominant framework for serving LLM applications in Python. Its native support for asynchronous request handling, automatic OpenAPI documentation, and Pydantic validation makes it ideal for building production-grade inference endpoints. The key design pattern is to separate the API layer from the model inference layer, allowing you to swap models without changing the API contract.
Basic Chat Completion Endpoint
```python
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from pydantic import BaseModel
from openai import AsyncOpenAI
import json

app = FastAPI(title="LLM Chat API", version="1.0")
client = AsyncOpenAI()

class ChatRequest(BaseModel):
    messages: list[dict[str, str]]
    model: str = "gpt-4o-mini"
    temperature: float = 0.7
    stream: bool = False

@app.post("/v1/chat")
async def chat(req: ChatRequest):
    if req.stream:
        return StreamingResponse(
            stream_response(req), media_type="text/event-stream"
        )
    response = await client.chat.completions.create(
        model=req.model,
        messages=req.messages,
        temperature=req.temperature,
    )
    return {"content": response.choices[0].message.content}

async def stream_response(req: ChatRequest):
    stream = await client.chat.completions.create(
        model=req.model,
        messages=req.messages,
        temperature=req.temperature,
        stream=True,
    )
    async for chunk in stream:
        delta = chunk.choices[0].delta.content
        if delta:
            yield f"data: {json.dumps({'content': delta})}\n\n"
    yield "data: [DONE]\n\n"
```
Always use AsyncOpenAI (not the synchronous client) inside FastAPI to avoid blocking the event loop. A synchronous call in an async endpoint will serialize all concurrent requests, destroying throughput under load.
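If a synchronous SDK call is truly unavoidable, offload it to a worker thread so the event loop stays free to serve other requests. A minimal sketch of the effect, using a `time.sleep` stand-in for a blocking SDK call (the function names are illustrative):

```python
import asyncio
import time

def blocking_call(delay: float = 0.2) -> str:
    """Stand-in for a synchronous SDK call (e.g. a blocking HTTP request)."""
    time.sleep(delay)
    return "done"

async def handler() -> str:
    # asyncio.to_thread runs the blocking call in a thread pool,
    # keeping the event loop free for other requests.
    return await asyncio.to_thread(blocking_call)

async def main() -> float:
    start = time.monotonic()
    # Five "requests" overlap instead of serializing to ~1.0s total.
    await asyncio.gather(*(handler() for _ in range(5)))
    return time.monotonic() - start

elapsed = asyncio.run(main())
```

Calling `blocking_call` directly inside an `async def` would make these five requests take roughly five times as long, which is exactly the failure mode described above.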
2. Streaming Protocols: SSE and WebSockets
Streaming is essential for LLM applications because token generation is sequential. Without streaming, users stare at a blank screen for several seconds before seeing any output. Two protocols dominate LLM streaming: Server-Sent Events (SSE) for unidirectional streaming and WebSockets for bidirectional communication.
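The SSE wire format is line-oriented: each event is a `data:` line terminated by a blank line, which is what the streaming endpoint above emits. A small helper (the function names are illustrative, not part of any library) makes the framing explicit:

```python
import json

def sse_event(payload: dict) -> str:
    """Frame one JSON payload as a Server-Sent Events message."""
    return f"data: {json.dumps(payload)}\n\n"

def sse_done() -> str:
    """Terminal sentinel commonly emitted to signal end of stream."""
    return "data: [DONE]\n\n"
```

Because SSE runs over plain HTTP, it passes through proxies and load balancers without the upgrade handshake WebSockets require, which is why it is the default choice for one-way token streaming.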
LitServe for High-Performance Serving
```python
import litserve as ls
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

class LLMServingAPI(ls.LitAPI):
    def setup(self, device):
        """Load model once during server startup."""
        self.tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2")
        self.model = AutoModelForCausalLM.from_pretrained(
            "microsoft/phi-2", torch_dtype=torch.float16
        ).to(device)

    def decode_request(self, request):
        return request["prompt"]

    def predict(self, prompt):
        inputs = self.tokenizer(prompt, return_tensors="pt").to(self.model.device)
        outputs = self.model.generate(**inputs, max_new_tokens=256)
        return self.tokenizer.decode(outputs[0], skip_special_tokens=True)

    def encode_response(self, output):
        return {"generated_text": output}

# Launch with: python server.py
if __name__ == "__main__":
    api = LLMServingAPI()
    server = ls.LitServer(api, accelerator="gpu", devices=1)
    server.run(port=8000)
```
3. Containerization with Docker Compose
```yaml
# docker-compose.yml
version: "3.9"

services:
  llm-api:
    build: .
    ports:
      - "8000:8000"
    environment:
      - OPENAI_API_KEY=${OPENAI_API_KEY}
      - MODEL_NAME=gpt-4o-mini
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 30s
      timeout: 10s
      retries: 3

  redis:
    image: redis:7-alpine
    ports:
      - "6379:6379"

  nginx:
    image: nginx:alpine
    ports:
      - "80:80"
    volumes:
      - ./nginx.conf:/etc/nginx/nginx.conf:ro
    depends_on:
      - llm-api
```
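The compose healthcheck probes a `/health` route that the API itself must expose. A minimal sketch of the logic behind such a route, kept framework-free here (in the FastAPI app it would be wired up with `@app.get("/health")`; the field names and the `MODEL_LOADED` flag are assumptions for illustration):

```python
import time

SERVER_START = time.monotonic()
MODEL_LOADED = True  # set by the model-loading code at startup

def health_status() -> dict:
    """Report liveness plus model readiness for the container healthcheck."""
    status = "ok" if MODEL_LOADED else "starting"
    return {
        "status": status,
        "model_loaded": MODEL_LOADED,
        "uptime_seconds": round(time.monotonic() - SERVER_START, 1),
    }
```

Returning a non-200 status until the model weights are loaded prevents the load balancer from routing traffic to a container that would only produce errors.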
4. Cloud Platform Deployment
AWS Bedrock Integration
```python
import json

import boto3

def bedrock_chat(prompt: str, model_id: str = "anthropic.claude-3-sonnet-20240229-v1:0"):
    """Call a model through AWS Bedrock."""
    client = boto3.client("bedrock-runtime", region_name="us-east-1")
    body = json.dumps({
        "anthropic_version": "bedrock-2023-05-31",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 1024,
        "temperature": 0.7,
    })
    response = client.invoke_model(modelId=model_id, body=body)
    result = json.loads(response["body"].read())
    return result["content"][0]["text"]

# Example
answer = bedrock_chat("Explain transformer attention in one paragraph.")
print(answer)
```
5. Serverless Deployment
| Platform | Cold Start | GPU Support | Best For | Pricing Model |
|---|---|---|---|---|
| Modal | ~1s (warm containers) | A10G, A100, H100 | Custom model inference | Per-second GPU billing |
| Replicate | ~5-15s | A40, A100 | Open-source model hosting | Per-second compute |
| HF Inference Endpoints | ~30s (scaling from 0) | T4, A10G, A100 | HuggingFace model zoo | Per-hour instance |
| AWS Lambda | ~1-3s | None (CPU only) | Lightweight API proxies | Per-request + duration |
| Cloud Run | ~2-5s | L4 GPUs (preview) | Container-based serving | Per-request + vCPU/s |
Modal Deployment Example
```python
import modal

app = modal.App("llm-inference")

image = modal.Image.debian_slim().pip_install(
    "transformers", "torch", "accelerate"
)

@app.cls(gpu="A10G", image=image, container_idle_timeout=300)
class LLMInference:
    @modal.enter()
    def load_model(self):
        from transformers import pipeline
        self.pipe = pipeline(
            "text-generation",
            model="meta-llama/Llama-3.1-8B-Instruct",
            device_map="auto",
            torch_dtype="auto",
        )

    @modal.method()
    def generate(self, prompt: str, max_tokens: int = 512):
        result = self.pipe(prompt, max_new_tokens=max_tokens)
        return result[0]["generated_text"]

@app.local_entrypoint()
def main():
    llm = LLMInference()
    print(llm.generate.remote("What is attention in transformers?"))
```
Serverless GPU platforms charge per second of GPU time. A misconfigured container_idle_timeout can keep expensive GPUs running idle. Always set explicit timeouts and implement scale-to-zero for development workloads.
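Idle time adds up quickly at per-second GPU billing. A back-of-envelope calculator makes the tradeoff concrete (the $1.10/hour rate is a placeholder, not a quoted price; check your provider's current pricing):

```python
def idle_cost_per_day(idle_timeout_s: float, cold_starts_per_day: int,
                      gpu_rate_per_hour: float = 1.10) -> float:
    """Rough idle-GPU spend: each burst of traffic keeps the container
    warm for idle_timeout_s after the last request finishes."""
    idle_hours = cold_starts_per_day * idle_timeout_s / 3600
    return round(idle_hours * gpu_rate_per_hour, 2)

# 300s idle timeout, 50 traffic bursts/day, placeholder $1.10/hr GPU
print(idle_cost_per_day(300, 50))  # → 4.58
```

The same workload with a 60-second timeout costs about a fifth as much in idle time, at the price of more frequent cold starts; tune the timeout against your latency tolerance.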
The best deployment strategy depends on your traffic pattern. Steady, high-volume traffic favors dedicated instances (SageMaker, Vertex AI). Bursty or low-volume traffic favors serverless (Modal, Replicate). API-only applications with no custom models should use managed endpoints (OpenAI, Bedrock, Azure OpenAI) to avoid infrastructure management entirely.
Knowledge Check
1. Why should you use AsyncOpenAI instead of the synchronous client inside FastAPI endpoints?
2. What is the main advantage of SSE over WebSockets for LLM chat streaming?
3. When would you choose Modal over AWS SageMaker for model deployment?
4. What does a health check endpoint verify in a containerized LLM deployment?
5. Why is Docker Compose useful for local LLM application development?
Key Takeaways
- Separate API, business logic, and inference layers so each can scale and version independently.
- Use SSE for standard chat streaming; reserve WebSockets for bidirectional use cases like real-time collaboration or voice.
- LitServe provides a minimal, high-performance framework for serving custom models with batching and GPU management.
- AWS Bedrock, GCP Vertex AI, and Azure OpenAI offer managed model endpoints that eliminate infrastructure management for API-based models.
- Serverless platforms (Modal, Replicate, HF Inference Endpoints) are ideal for bursty workloads, while dedicated endpoints suit steady production traffic.
- Containerize with Docker Compose for local development, then deploy to cloud container orchestrators (ECS, GKE, ACA) for production.