Module 26 · Section 26.1

Application Architecture & Deployment

FastAPI, LitServe, streaming protocols, Docker Compose, cloud platforms, and serverless deployment for production LLM applications
★ Big Picture

A well-designed LLM application architecture separates model inference, business logic, and API layers so that each can scale, version, and fail independently. Whether you deploy on a dedicated GPU cluster, a managed cloud endpoint, or a serverless function, the same architectural principles apply: stateless request handling, streaming for low perceived latency, health checks for reliability, and containerization for reproducibility. This section covers the full deployment stack from local development with FastAPI through production deployment on AWS, GCP, Azure, and serverless platforms.

1. API Layer with FastAPI

FastAPI is the dominant framework for serving LLM applications in Python. Its native support for asynchronous request handling, automatic OpenAPI documentation, and Pydantic validation make it ideal for building production-grade inference endpoints. The key design pattern is to separate the API layer from the model inference layer, allowing you to swap models without changing the API contract.

Basic Chat Completion Endpoint

from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from pydantic import BaseModel
from openai import AsyncOpenAI
import json

app = FastAPI(title="LLM Chat API", version="1.0")
client = AsyncOpenAI()

@app.get("/health")
async def health():
    """Liveness probe used by load balancers and container healthchecks."""
    return {"status": "ok"}

class ChatRequest(BaseModel):
    messages: list[dict[str, str]]
    model: str = "gpt-4o-mini"
    temperature: float = 0.7
    stream: bool = False

@app.post("/v1/chat")
async def chat(req: ChatRequest):
    if req.stream:
        return StreamingResponse(
            stream_response(req), media_type="text/event-stream"
        )
    response = await client.chat.completions.create(
        model=req.model,
        messages=req.messages,
        temperature=req.temperature,
    )
    return {"content": response.choices[0].message.content}

async def stream_response(req: ChatRequest):
    stream = await client.chat.completions.create(
        model=req.model, messages=req.messages,
        temperature=req.temperature, stream=True
    )
    async for chunk in stream:
        delta = chunk.choices[0].delta.content
        if delta:
            yield f"data: {json.dumps({'content': delta})}\n\n"
    yield "data: [DONE]\n\n"

📝 Note

Always use AsyncOpenAI (not the synchronous client) inside FastAPI to avoid blocking the event loop. A synchronous call in an async endpoint will serialize all concurrent requests, destroying throughput under load.

2. Streaming Protocols: SSE and WebSockets

Streaming is essential for LLM applications because token generation is sequential. Without streaming, users stare at a blank screen for several seconds before seeing any output. Two protocols dominate LLM streaming: Server-Sent Events (SSE) for unidirectional streaming and WebSockets for bidirectional communication.

| | Server-Sent Events (SSE) | WebSockets |
|---|---|---|
| Transport | HTTP/1.1 or HTTP/2 | Persistent TCP connection (`ws://` or `wss://`) |
| Direction | Unidirectional (server to client) | Bidirectional (both ways) |
| Reconnection | Auto-reconnect built in | Manual reconnect logic |
| Content type | `text/event-stream` | Framed messages |
| Best for | Chat completions | Real-time collab, voice |
Figure 26.1.1: SSE provides simpler one-way streaming ideal for chat; WebSockets enable full-duplex communication for interactive applications.

LitServe for High-Performance Serving

import litserve as ls
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

class LLMServingAPI(ls.LitAPI):
    def setup(self, device):
        """Load model once during server startup."""
        self.tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2")
        self.model = AutoModelForCausalLM.from_pretrained(
            "microsoft/phi-2", torch_dtype=torch.float16
        ).to(device)

    def decode_request(self, request):
        return request["prompt"]

    def predict(self, prompt):
        inputs = self.tokenizer(prompt, return_tensors="pt").to(self.model.device)
        outputs = self.model.generate(**inputs, max_new_tokens=256)
        return self.tokenizer.decode(outputs[0], skip_special_tokens=True)

    def encode_response(self, output):
        return {"generated_text": output}

# Launch with: python server.py
if __name__ == "__main__":
    api = LLMServingAPI()
    server = ls.LitServer(api, accelerator="gpu", devices=1)
    server.run(port=8000)
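Once the server is running, LitServe exposes a /predict route by default. A client sketch, assuming the server above is listening on localhost:8000 (extract_text is a hypothetical helper mirroring the encode_response payload):

```python
import requests

def extract_text(response_json: dict) -> str:
    """Pull the generated text out of the payload shape built by encode_response."""
    return response_json["generated_text"]

def generate(prompt: str, base_url: str = "http://localhost:8000") -> str:
    """Send a prompt to the LitServe /predict endpoint and return the completion."""
    resp = requests.post(f"{base_url}/predict", json={"prompt": prompt}, timeout=120)
    resp.raise_for_status()
    return extract_text(resp.json())
```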

3. Containerization with Docker Compose

# docker-compose.yml
version: "3.9"
services:
  llm-api:
    build: .
    ports:
      - "8000:8000"
    environment:
      - OPENAI_API_KEY=${OPENAI_API_KEY}
      - MODEL_NAME=gpt-4o-mini
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 30s
      timeout: 10s
      retries: 3

  redis:
    image: redis:7-alpine
    ports:
      - "6379:6379"

  nginx:
    image: nginx:alpine
    ports:
      - "80:80"
    volumes:
      - ./nginx.conf:/etc/nginx/nginx.conf:ro
    depends_on:
      - llm-api
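The `build: .` line above assumes a Dockerfile next to the compose file. A minimal sketch, assuming the FastAPI app lives in main.py and dependencies are pinned in requirements.txt:

```dockerfile
# Dockerfile (referenced by `build: .` in docker-compose.yml)
FROM python:3.11-slim

# curl is required by the compose healthcheck
RUN apt-get update && apt-get install -y --no-install-recommends curl \
    && rm -rf /var/lib/apt/lists/*

WORKDIR /app

# Install dependencies first so this layer is cached across code changes
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY . .
EXPOSE 8000

CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]
```

Copying requirements.txt before the application code keeps the dependency layer cached, so rebuilding after a code change skips the slow pip install step.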

4. Cloud Platform Deployment

[Figure: Cloud Deployment Landscape]
- AWS: Bedrock (managed APIs), SageMaker Endpoints, ECS / EKS (containers), Lambda (serverless), EC2 + GPU instances
- Google Cloud: Vertex AI Endpoints, Cloud Run (containers), GKE (Kubernetes), Cloud Functions, Compute Engine + TPUs
- Azure: Azure OpenAI Service, Azure ML Endpoints, Container Apps, Azure Functions, VMs + NCasT4 / A100
Figure 26.1.2: Each major cloud provider offers managed model endpoints, container orchestration, serverless functions, and raw GPU compute.

AWS Bedrock Integration

import boto3, json

def bedrock_chat(prompt: str, model_id: str = "anthropic.claude-3-sonnet-20240229-v1:0"):
    """Call a model through AWS Bedrock."""
    client = boto3.client("bedrock-runtime", region_name="us-east-1")

    body = json.dumps({
        "anthropic_version": "bedrock-2023-05-31",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 1024,
        "temperature": 0.7,
    })

    response = client.invoke_model(modelId=model_id, body=body)
    result = json.loads(response["body"].read())
    return result["content"][0]["text"]

# Example
answer = bedrock_chat("Explain transformer attention in one paragraph.")
print(answer)
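For long answers, the blocking call above can be replaced with Bedrock's streaming API. A sketch using invoke_model_with_response_stream; boto3 is imported lazily so the chunk parser can be used on its own, and extract_delta is a hypothetical helper for the Anthropic-on-Bedrock chunk format:

```python
import json

def extract_delta(chunk: dict) -> str:
    """Return the text delta from one streamed chunk, or an empty string."""
    if chunk.get("type") == "content_block_delta":
        return chunk.get("delta", {}).get("text", "")
    return ""

def bedrock_chat_stream(prompt: str,
                        model_id: str = "anthropic.claude-3-sonnet-20240229-v1:0"):
    """Yield text deltas from a Bedrock streaming invocation."""
    import boto3  # imported here so extract_delta works without boto3 installed
    client = boto3.client("bedrock-runtime", region_name="us-east-1")
    body = json.dumps({
        "anthropic_version": "bedrock-2023-05-31",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 1024,
    })
    response = client.invoke_model_with_response_stream(modelId=model_id, body=body)
    for event in response["body"]:
        text = extract_delta(json.loads(event["chunk"]["bytes"]))
        if text:
            yield text
```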

5. Serverless Deployment

| Platform | Cold Start | GPU Support | Best For | Pricing Model |
|---|---|---|---|---|
| Modal | ~1s (warm containers) | A10G, A100, H100 | Custom model inference | Per-second GPU billing |
| Replicate | ~5-15s | A40, A100 | Open-source model hosting | Per-second compute |
| HF Inference Endpoints | ~30s (scaling from 0) | T4, A10G, A100 | HuggingFace model zoo | Per-hour instance |
| AWS Lambda | ~1-3s | None (CPU only) | Lightweight API proxies | Per-request + duration |
| Cloud Run | ~2-5s | L4 GPUs (preview) | Container-based serving | Per-request + vCPU/s |
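The "lightweight API proxies" use case for Lambda can be made concrete with a dependency-free handler that forwards chat requests to the OpenAI API. A sketch using only the standard library; build_openai_payload is a hypothetical helper, and the handler assumes OPENAI_API_KEY is set in the function's environment:

```python
import json
import os
import urllib.request

OPENAI_URL = "https://api.openai.com/v1/chat/completions"

def build_openai_payload(body: dict) -> dict:
    """Translate an incoming request body into an OpenAI chat payload."""
    return {
        "model": body.get("model", "gpt-4o-mini"),
        "messages": body["messages"],
    }

def lambda_handler(event, context):
    """Minimal Lambda proxy: forward a chat request and return the completion."""
    payload = build_openai_payload(json.loads(event["body"]))
    req = urllib.request.Request(
        OPENAI_URL,
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        result = json.loads(resp.read())
    return {
        "statusCode": 200,
        "body": json.dumps({"content": result["choices"][0]["message"]["content"]}),
    }
```

Keeping the function dependency-free avoids packaging a deployment layer and keeps cold starts near the ~1s end of the range in the table.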

Modal Deployment Example

import modal

app = modal.App("llm-inference")

image = modal.Image.debian_slim().pip_install(
    "transformers", "torch", "accelerate"
)

@app.cls(gpu="A10G", image=image, container_idle_timeout=300)
class LLMInference:
    @modal.enter()
    def load_model(self):
        from transformers import pipeline
        self.pipe = pipeline(
            "text-generation",
            model="meta-llama/Llama-3.1-8B-Instruct",
            device_map="auto",
            torch_dtype="auto",
        )

    @modal.method()
    def generate(self, prompt: str, max_tokens: int = 512):
        result = self.pipe(prompt, max_new_tokens=max_tokens)
        return result[0]["generated_text"]

@app.local_entrypoint()
def main():
    llm = LLMInference()
    print(llm.generate.remote("What is attention in transformers?"))

[Figure: deployment decision tree]
- Custom model weights?
  - Yes → Need auto-scaling?
    - Yes → SageMaker / Vertex AI
    - No → Modal / Replicate
  - No → Data residency needs?
    - Yes → Azure OpenAI / Bedrock
    - No → Direct API (OpenAI, etc.)
Figure 26.1.3: Deployment decision tree based on model ownership and infrastructure requirements.
⚠ Warning

Serverless GPU platforms charge per second of GPU time. A misconfigured container_idle_timeout can keep expensive GPUs running idle. Always set explicit timeouts and implement scale-to-zero for development workloads.

★ Key Insight

The best deployment strategy depends on your traffic pattern. Steady, high-volume traffic favors dedicated instances (SageMaker, Vertex AI). Bursty or low-volume traffic favors serverless (Modal, Replicate). API-only applications with no custom models should use managed endpoints (OpenAI, Bedrock, Azure OpenAI) to avoid infrastructure management entirely.

Knowledge Check

1. Why should you use AsyncOpenAI instead of the synchronous client inside FastAPI endpoints?

Show Answer
Synchronous calls block the event loop, serializing all concurrent requests. AsyncOpenAI allows FastAPI to handle multiple requests concurrently while waiting for API responses, maintaining throughput under load.

2. What is the main advantage of SSE over WebSockets for LLM chat streaming?

Show Answer
SSE is simpler to implement, works over standard HTTP, has built-in automatic reconnection, and is sufficient for the unidirectional server-to-client streaming pattern that LLM chat requires. WebSockets add unnecessary complexity when bidirectional communication is not needed.

3. When would you choose Modal over AWS SageMaker for model deployment?

Show Answer
Modal is better for bursty or development workloads because it offers per-second billing, fast cold starts, and simpler configuration. SageMaker is better for steady production workloads that need auto-scaling policies, A/B testing of model variants, and deep AWS ecosystem integration.

4. What does a health check endpoint verify in a containerized LLM deployment?

Show Answer
A health check verifies that the service is responsive and able to serve requests. For LLM applications, this typically means confirming that the model is loaded in memory, the API server is accepting connections, and any downstream dependencies (databases, API keys) are reachable.

5. Why is Docker Compose useful for local LLM application development?

Show Answer
Docker Compose orchestrates multi-service architectures (API server, Redis cache, nginx reverse proxy, vector database) with a single command. It provides reproducible environments, GPU passthrough, health checks, and dependency ordering, closely mirroring production infrastructure during development.

Key Takeaways