A well-designed LLM application architecture separates model inference, business logic, and API layers so that each can scale, version, and fail independently. Whether you deploy on a dedicated GPU cluster, a managed cloud endpoint, or a serverless function, the same architectural principles apply: stateless request handling, streaming for low perceived latency, health checks for reliability, and containerization for reproducibility. This section covers the full deployment stack from local development with FastAPI through production deployment on AWS, GCP, Azure, and serverless platforms.
1. API Layer with FastAPI
FastAPI is the dominant framework for serving LLM applications in Python. Its native support for asynchronous request handling, automatic OpenAPI documentation, and Pydantic validation makes it ideal for building production-grade inference endpoints. The key design pattern is to separate the API layer from the model inference layer, allowing you to swap models without changing the API contract.
Basic Chat Completion Endpoint
```python
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from pydantic import BaseModel
from openai import AsyncOpenAI
import json

app = FastAPI(title="LLM Chat API", version="1.0")
client = AsyncOpenAI()

class ChatRequest(BaseModel):
    messages: list[dict[str, str]]
    model: str = "gpt-4o-mini"
    temperature: float = 0.7
    stream: bool = False

@app.post("/v1/chat")
async def chat(req: ChatRequest):
    if req.stream:
        return StreamingResponse(
            stream_response(req), media_type="text/event-stream"
        )
    response = await client.chat.completions.create(
        model=req.model,
        messages=req.messages,
        temperature=req.temperature,
    )
    return {"content": response.choices[0].message.content}

async def stream_response(req: ChatRequest):
    stream = await client.chat.completions.create(
        model=req.model,
        messages=req.messages,
        temperature=req.temperature,
        stream=True,
    )
    async for chunk in stream:
        delta = chunk.choices[0].delta.content
        if delta:
            yield f"data: {json.dumps({'content': delta})}\n\n"
    yield "data: [DONE]\n\n"
```
Always use AsyncOpenAI (not the synchronous client) inside FastAPI to avoid blocking the event loop. A synchronous call in an async endpoint will serialize all concurrent requests, destroying throughput under load.
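If a synchronous SDK call is truly unavoidable, offload it to a worker thread so the event loop stays free to serve other requests. A minimal sketch of the effect, using a `time.sleep` stand-in for a blocking SDK call (the function names are illustrative):

```python
import asyncio
import time

def blocking_call(delay: float = 0.2) -> str:
    """Stand-in for a synchronous SDK call (e.g. a blocking HTTP request)."""
    time.sleep(delay)
    return "done"

async def handler() -> str:
    # asyncio.to_thread runs the blocking call in a thread pool,
    # keeping the event loop free for other requests.
    return await asyncio.to_thread(blocking_call)

async def main() -> float:
    start = time.monotonic()
    # Five "requests" overlap instead of serializing to ~1.0s total.
    await asyncio.gather(*(handler() for _ in range(5)))
    return time.monotonic() - start

elapsed = asyncio.run(main())
```

Calling `blocking_call` directly inside an `async def` would make these five requests take roughly five times as long, which is exactly the failure mode described above.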
2. Streaming Protocols: SSE and WebSockets
Streaming is essential for LLM applications because token generation is sequential. Without streaming, users stare at a blank screen for several seconds before seeing any output. Two protocols dominate LLM streaming: Server-Sent Events (SSE) for unidirectional streaming and WebSockets for bidirectional communication.
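The SSE wire format is line-oriented: each event is a `data:` line terminated by a blank line, which is what the streaming endpoint above emits. A small helper (the function names are illustrative, not part of any library) makes the framing explicit:

```python
import json

def sse_event(payload: dict) -> str:
    """Frame one JSON payload as a Server-Sent Events message."""
    return f"data: {json.dumps(payload)}\n\n"

def sse_done() -> str:
    """Terminal sentinel commonly emitted to signal end of stream."""
    return "data: [DONE]\n\n"
```

Because SSE runs over plain HTTP, it passes through proxies and load balancers without the upgrade handshake WebSockets require, which is why it is the default choice for one-way token streaming.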
LitServe for High-Performance Serving
```python
import litserve as ls
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

class LLMServingAPI(ls.LitAPI):
    def setup(self, device):
        """Load model once during server startup."""
        self.tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2")
        self.model = AutoModelForCausalLM.from_pretrained(
            "microsoft/phi-2", torch_dtype=torch.float16
        ).to(device)

    def decode_request(self, request):
        return request["prompt"]

    def predict(self, prompt):
        inputs = self.tokenizer(prompt, return_tensors="pt").to(self.model.device)
        outputs = self.model.generate(**inputs, max_new_tokens=256)
        return self.tokenizer.decode(outputs[0], skip_special_tokens=True)

    def encode_response(self, output):
        return {"generated_text": output}

# Launch with: python server.py
if __name__ == "__main__":
    api = LLMServingAPI()
    server = ls.LitServer(api, accelerator="gpu", devices=1)
    server.run(port=8000)
```
3. Containerization with Docker Compose
```yaml
# docker-compose.yml
version: "3.9"

services:
  llm-api:
    build: .
    ports:
      - "8000:8000"
    environment:
      - OPENAI_API_KEY=${OPENAI_API_KEY}
      - MODEL_NAME=gpt-4o-mini
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 30s
      timeout: 10s
      retries: 3

  redis:
    image: redis:7-alpine
    ports:
      - "6379:6379"

  nginx:
    image: nginx:alpine
    ports:
      - "80:80"
    volumes:
      - ./nginx.conf:/etc/nginx/nginx.conf:ro
    depends_on:
      - llm-api
```
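The compose healthcheck probes a `/health` route that the API itself must expose. A minimal sketch of the logic behind such a route, kept framework-free here (in the FastAPI app it would be wired up with `@app.get("/health")`; the field names and the `MODEL_LOADED` flag are assumptions for illustration):

```python
import time

SERVER_START = time.monotonic()
MODEL_LOADED = True  # set by the model-loading code at startup

def health_status() -> dict:
    """Report liveness plus model readiness for the container healthcheck."""
    status = "ok" if MODEL_LOADED else "starting"
    return {
        "status": status,
        "model_loaded": MODEL_LOADED,
        "uptime_seconds": round(time.monotonic() - SERVER_START, 1),
    }
```

Returning a non-200 status until the model weights are loaded prevents the load balancer from routing traffic to a container that would only produce errors.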
4. Cloud Platform Deployment
AWS Bedrock Integration
```python
import json

import boto3

def bedrock_chat(prompt: str, model_id: str = "anthropic.claude-3-sonnet-20240229-v1:0"):
    """Call a model through AWS Bedrock."""
    client = boto3.client("bedrock-runtime", region_name="us-east-1")
    body = json.dumps({
        "anthropic_version": "bedrock-2023-05-31",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 1024,
        "temperature": 0.7,
    })
    response = client.invoke_model(modelId=model_id, body=body)
    result = json.loads(response["body"].read())
    return result["content"][0]["text"]

# Example
answer = bedrock_chat("Explain transformer attention in one paragraph.")
print(answer)
```
5. Serverless Deployment
| Platform | Cold Start | GPU Support | Best For | Pricing Model |
|---|---|---|---|---|
| Modal | ~1s (warm containers) | A10G, A100, H100 | Custom model inference | Per-second GPU billing |
| Replicate | ~5-15s | A40, A100 | Open-source model hosting | Per-second compute |
| HF Inference Endpoints | ~30s (scaling from 0) | T4, A10G, A100 | HuggingFace model zoo | Per-hour instance |
| AWS Lambda | ~1-3s | None (CPU only) | Lightweight API proxies | Per-request + duration |
| Cloud Run | ~2-5s | L4 GPUs (preview) | Container-based serving | Per-request + vCPU/s |
Modal Deployment Example
```python
import modal

app = modal.App("llm-inference")

image = modal.Image.debian_slim().pip_install(
    "transformers", "torch", "accelerate"
)

@app.cls(gpu="A10G", image=image, container_idle_timeout=300)
class LLMInference:
    @modal.enter()
    def load_model(self):
        from transformers import pipeline
        self.pipe = pipeline(
            "text-generation",
            model="meta-llama/Llama-3.1-8B-Instruct",
            device_map="auto",
            torch_dtype="auto",
        )

    @modal.method()
    def generate(self, prompt: str, max_tokens: int = 512):
        result = self.pipe(prompt, max_new_tokens=max_tokens)
        return result[0]["generated_text"]

@app.local_entrypoint()
def main():
    llm = LLMInference()
    print(llm.generate.remote("What is attention in transformers?"))
```
Serverless GPU platforms charge per second of GPU time. A misconfigured container_idle_timeout can keep expensive GPUs running idle. Always set explicit timeouts and implement scale-to-zero for development workloads.
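Idle time adds up quickly at per-second GPU billing. A back-of-envelope calculator makes the tradeoff concrete (the $1.10/hour rate is a placeholder, not a quoted price; check your provider's current pricing):

```python
def idle_cost_per_day(idle_timeout_s: float, cold_starts_per_day: int,
                      gpu_rate_per_hour: float = 1.10) -> float:
    """Rough idle-GPU spend: each burst of traffic keeps the container
    warm for idle_timeout_s after the last request finishes."""
    idle_hours = cold_starts_per_day * idle_timeout_s / 3600
    return round(idle_hours * gpu_rate_per_hour, 2)

# 300s idle timeout, 50 traffic bursts/day, placeholder $1.10/hr GPU
print(idle_cost_per_day(300, 50))  # → 4.58
```

The same workload with a 60-second timeout costs about a fifth as much in idle time, at the price of more frequent cold starts; tune the timeout against your latency tolerance.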
The best deployment strategy depends on your traffic pattern. Steady, high-volume traffic favors dedicated instances (SageMaker, Vertex AI). Bursty or low-volume traffic favors serverless (Modal, Replicate). API-only applications with no custom models should use managed endpoints (OpenAI, Bedrock, Azure OpenAI) to avoid infrastructure management entirely.
Knowledge Check
1. Why should you use AsyncOpenAI instead of the synchronous client inside FastAPI endpoints?
2. What is the main advantage of SSE over WebSockets for LLM chat streaming?
3. When would you choose Modal over AWS SageMaker for model deployment?
4. What does a health check endpoint verify in a containerized LLM deployment?
5. Why is Docker Compose useful for local LLM application development?
Key Takeaways
- Separate API, business logic, and inference layers so each can scale and version independently.
- Use SSE for standard chat streaming; reserve WebSockets for bidirectional use cases like real-time collaboration or voice.
- LitServe provides a minimal, high-performance framework for serving custom models with batching and GPU management.
- AWS Bedrock, GCP Vertex AI, and Azure OpenAI offer managed model endpoints that eliminate infrastructure management for API-based models.
- Serverless platforms (Modal, Replicate, HF Inference Endpoints) are ideal for bursty workloads, while dedicated endpoints suit steady production traffic.
- Containerize with Docker Compose for local development, then deploy to cloud container orchestrators (ECS, GKE, ACA) for production.