Serving an LLM in production is 10% machine learning and 90% convincing the infrastructure that thousands of users can share one very expensive GPU without anyone noticing the wait.
[Figure: A Load-Balanced Request Queue]

From model weights to production endpoint. A trained model is just a collection of tensors on disk. Turning it into a responsive, scalable API requires specialized serving infrastructure that handles continuous batching, KV cache management, request scheduling, model parallelism, and hardware-specific kernel optimization. This section surveys the major serving frameworks available today, explains their architectural differences, and provides practical guidance for choosing the right tool for your deployment scenario. We conclude with a benchmarking methodology so you can make data-driven decisions for your own workloads.
This section ties together all previous Module 08 concepts: quantization (Section 8.1), KV cache management (Section 8.2), and speculative decoding (Section 8.3); serving frameworks combine all of these techniques into a single system. Understanding continuous batching also requires familiarity with autoregressive generation from Section 5.1.
1. The Serving Stack
An LLM serving system sits between the raw model weights and the HTTP API that clients consume. It manages several critical responsibilities that are absent from a naive model.generate() call:
- Request scheduling: Ordering incoming requests, managing priority queues, and enforcing rate limits.
- Continuous batching: Dynamically adding and removing sequences from the active batch at each iteration (as covered in Section 8.2).
- KV cache management: Allocating, sharing, and evicting KV cache blocks using PagedAttention or similar techniques.
- Model parallelism: Distributing model weights across multiple GPUs using tensor parallelism (splitting layers) or pipeline parallelism (splitting stages).
- Kernel optimization: Using hardware-specific fused kernels (FlashAttention, custom GEMM) to maximize throughput.
- API layer: Exposing an OpenAI-compatible HTTP endpoint with streaming support.
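To make iteration-level scheduling concrete, here is a toy simulation of continuous batching. The function name and structure are illustrative only, not any framework's API; real schedulers also account for prefill cost, KV cache capacity, and priorities.

```python
from collections import deque

def simulate_continuous_batching(request_lengths, max_batch_size):
    """Toy iteration-level scheduler: each step decodes one token per
    active sequence, admits waiting requests as soon as batch slots free
    up, and retires sequences that reach their target output length.

    Returns the number of decode iterations needed to finish everything.
    """
    waiting = deque(request_lengths)  # output lengths still to schedule
    active = []                       # remaining tokens per active sequence
    steps = 0
    while waiting or active:
        # Admit new requests into free batch slots (continuous batching).
        while waiting and len(active) < max_batch_size:
            active.append(waiting.popleft())
        # One decode iteration: every active sequence emits one token;
        # finished sequences leave the batch immediately.
        active = [n - 1 for n in active if n - 1 > 0]
        steps += 1
    return steps

# Static batching would pad the batch [8, 1, 1, 1] to its longest member
# (8 steps) before admitting anything; continuous batching backfills the
# freed slots, so the fifth request (4 tokens) finishes within 8 steps too.
print(simulate_continuous_batching([8, 1, 1, 1, 4], max_batch_size=4))  # -> 8
```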
2. vLLM
vLLM (Kwon et al., 2023) is the most widely adopted open-source LLM serving framework. Its core innovation is PagedAttention (covered in Section 8.2), which enables near-zero-waste KV cache memory management. Beyond PagedAttention, vLLM provides continuous batching, tensor parallelism, prefix caching, speculative decoding support, and an OpenAI-compatible API server.
Key features:
- PagedAttention: Block-based KV cache allocation with copy-on-write for shared prefixes.
- Continuous batching: Iteration-level scheduling that adds new requests as soon as GPU slots become available.
- Quantization support: GPTQ, AWQ, FP8, and GGUF formats.
- Tensor parallelism: Splits model layers across GPUs for large models.
- OpenAI-compatible API: Drop-in replacement for OpenAI endpoints with /v1/completions and /v1/chat/completions.
# Example 1: Launch vLLM server and benchmark throughput
# Terminal: Start the vLLM server
# $ python -m vllm.entrypoints.openai.api_server \
#     --model meta-llama/Llama-3.1-8B-Instruct \
#     --dtype float16 \
#     --max-model-len 8192 \
#     --gpu-memory-utilization 0.90 \
#     --enable-prefix-caching \
#     --port 8000

# Python: Benchmark with concurrent requests
import asyncio
import time

import aiohttp

async def send_request(session, url, prompt, max_tokens=128):
    payload = {
        "model": "meta-llama/Llama-3.1-8B-Instruct",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": 0.7,
    }
    t0 = time.perf_counter()
    async with session.post(url, json=payload) as resp:
        result = await resp.json()
    elapsed = time.perf_counter() - t0
    n_tokens = result["usage"]["completion_tokens"]
    return n_tokens, elapsed

async def benchmark(concurrency=16, total_requests=64):
    url = "http://localhost:8000/v1/chat/completions"
    prompts = [
        "Explain quantum entanglement simply.",
        "Write a Python function to merge two sorted lists.",
        "What causes the northern lights?",
        "Describe the water cycle in four steps.",
    ]
    connector = aiohttp.TCPConnector(limit=concurrency)
    async with aiohttp.ClientSession(connector=connector) as session:
        tasks = [
            send_request(session, url, prompts[i % len(prompts)])
            for i in range(total_requests)
        ]
        t_start = time.perf_counter()
        results = await asyncio.gather(*tasks)
        t_total = time.perf_counter() - t_start

    total_tokens = sum(r[0] for r in results)
    latencies = sorted(r[1] for r in results)
    print(f"Concurrency: {concurrency}")
    print(f"Total requests: {total_requests}")
    print(f"Total time: {t_total:.2f}s")
    print(f"Total tokens: {total_tokens}")
    print(f"Throughput: {total_tokens/t_total:.1f} tok/s")
    print(f"Avg latency: {sum(latencies)/len(latencies):.2f}s")
    print(f"P50 latency: {latencies[len(latencies)//2]:.2f}s")
    print(f"P99 latency: {latencies[int(0.99*len(latencies))]:.2f}s")

asyncio.run(benchmark(concurrency=16, total_requests=64))
3. SGLang
SGLang (Zheng et al., 2024) is a serving framework built around the concept of RadixAttention, a tree-based data structure for efficient prefix sharing. While vLLM supports prefix caching, SGLang makes it a first-class design principle. The RadixAttention tree stores all cached KV blocks in a radix tree indexed by token sequences, enabling automatic longest-prefix matching across all active and recently-completed requests.
SGLang also introduces a programming model for structured generation. Its frontend DSL allows users to define complex generation patterns (multi-turn conversations, branching logic, constrained decoding) as Python programs, and the runtime optimizes execution across these patterns. This is particularly effective for workloads that share long system prompts or few-shot examples across many requests.
Key advantages over vLLM:
- RadixAttention: Automatic, fine-grained prefix sharing with LRU eviction. No manual prefix configuration needed.
- Structured generation: Native support for constrained output (JSON schema, regex) with minimal overhead.
- Multi-modal support: Built-in handling of vision-language models.
- Faster scheduling: Lightweight C++ scheduler with lower per-request overhead.
For most deployments, both vLLM and SGLang deliver comparable throughput. SGLang tends to outperform when workloads involve heavy prefix sharing (many requests with the same long system prompt) or structured output constraints. vLLM has broader ecosystem support, more quantization options, and a larger community. In production, it is worth benchmarking both on your specific workload before committing to one.
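The longest-prefix matching at the heart of RadixAttention can be sketched as a token-level trie. This illustration (class names are mine) keeps only the matching logic; the real implementation uses compressed radix edges, stores pointers to KV cache blocks, and evicts entries with an LRU policy.

```python
class RadixNode:
    def __init__(self):
        self.children = {}   # token id -> RadixNode
        self.cached = False  # True if KV entries for this prefix are cached

class PrefixCacheIndex:
    """Token-level trie illustrating automatic longest-prefix matching."""

    def __init__(self):
        self.root = RadixNode()

    def insert(self, tokens):
        """Record that KV entries for every prefix of `tokens` are cached."""
        node = self.root
        for t in tokens:
            node = node.children.setdefault(t, RadixNode())
            node.cached = True

    def longest_cached_prefix(self, tokens):
        """Return how many leading tokens can reuse cached KV entries."""
        node, matched = self.root, 0
        for t in tokens:
            if t in node.children and node.children[t].cached:
                node = node.children[t]
                matched += 1
            else:
                break
        return matched

index = PrefixCacheIndex()
system_prompt = [101, 7, 7, 42, 13]      # shared system-prompt tokens
index.insert(system_prompt + [5, 9])     # a completed request
# A new request sharing the system prompt skips prefill for 5 tokens:
print(index.longest_cached_prefix(system_prompt + [88, 3]))  # -> 5
```

This is why prefix-heavy workloads benefit: every request sharing a long system prompt pays the prefill cost only once.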
4. TGI (Text Generation Inference)
Hugging Face Text Generation Inference (TGI) is a production-ready serving solution tightly integrated with the Hugging Face ecosystem. It is the engine behind the Hugging Face Inference API and Inference Endpoints. TGI implements continuous batching, FlashAttention, tensor parallelism, and quantization (GPTQ, AWQ, BitsAndBytes, EETQ, FP8).
Distinguishing features:
- Hugging Face integration: Load any model from the Hub with zero configuration. Tokenizer handling, chat template application, and prompt formatting are automatic.
- Guidance/Outlines support: Built-in constrained decoding for JSON schemas and regex patterns.
- Docker-first deployment: Official Docker images with GPU support, simplifying production deployment.
- Watermarking: Built-in support for text watermarking (A Watermark for Large Language Models, Kirchenbauer et al., 2023).
# Example 2: Deploy TGI with Docker and query it
# Terminal: Launch TGI container
# $ docker run --gpus all --shm-size 1g -p 8080:80 \
#     -v $PWD/data:/data \
#     ghcr.io/huggingface/text-generation-inference:latest \
#     --model-id meta-llama/Llama-3.1-8B-Instruct \
#     --max-input-tokens 4096 \
#     --max-total-tokens 8192 \
#     --max-batch-prefill-tokens 8192 \
#     --quantize awq

# Python: Query TGI with the text-generation client
import time

from huggingface_hub import InferenceClient

client = InferenceClient("http://localhost:8080")
prompt = "Explain how PagedAttention works in three sentences."

# Non-streaming: measure total latency
t0 = time.perf_counter()
response = client.text_generation(
    prompt,
    max_new_tokens=150,
    temperature=0.7,
    details=True,
)
elapsed = time.perf_counter() - t0
print(f"Generated text: {response.generated_text[:120]}...")
print(f"Tokens generated: {len(response.details.tokens)}")
print(f"Total latency: {elapsed:.2f}s")
print(f"Tokens/sec: {len(response.details.tokens)/elapsed:.1f}")

# Streaming: measure time-to-first-token
t0 = time.perf_counter()
first_token_time = None
token_count = 0
for token in client.text_generation(
    prompt, max_new_tokens=100, temperature=0.7, stream=True
):
    if first_token_time is None:
        first_token_time = time.perf_counter() - t0
    token_count += 1
total_time = time.perf_counter() - t0

print("\nStreaming metrics:")
print(f"Time to first token (TTFT): {first_token_time*1000:.0f}ms")
print(f"Total tokens: {token_count}")
print(f"Tokens per second (TPS): {token_count/total_time:.1f}")
5. TensorRT-LLM
TensorRT-LLM is NVIDIA's inference optimization library, purpose-built for NVIDIA GPUs. Unlike the Python-first frameworks above, TensorRT-LLM compiles model graphs into highly optimized CUDA kernels using NVIDIA's TensorRT compiler. This compilation step produces hardware-specific code that exploits features like FP8 Transformer Engines on Hopper GPUs, custom GEMM kernels tuned for specific matrix shapes, and kernel fusion patterns that eliminate memory round-trips.
Performance characteristics:
- Throughput: 30% to 50% higher than vLLM at high concurrency on H100 GPUs, due to hardware-specific kernel optimization.
- FP8 support with minimal accuracy loss, leveraging Hopper's native FP8 Tensor Cores.
- In-flight batching: NVIDIA's term for continuous batching with iteration-level scheduling.
- Multi-GPU: Tensor and pipeline parallelism across multiple GPUs and nodes.
The tradeoff is complexity. TensorRT-LLM requires a model compilation step that can take 10 to 30 minutes, model support must be explicitly added (not all Hugging Face models are supported out of the box), and debugging is more difficult than with Python-based frameworks. It is best suited for production deployments on NVIDIA hardware where maximum throughput justifies the setup cost.
# TensorRT-LLM: Build and run (requires NVIDIA GPU + tensorrt_llm installed)
# Step 1: Convert the checkpoint to TensorRT-LLM format
# $ python convert_checkpoint.py --model_dir meta-llama/Llama-3.1-8B \
#     --output_dir ./trt_ckpt --dtype float16

# Step 2: Build the engine (compilation step, ~15 min)
# $ trtllm-build --checkpoint_dir ./trt_ckpt \
#     --output_dir ./trt_engine \
#     --gemm_plugin float16 \
#     --max_batch_size 32 \
#     --max_input_len 4096 \
#     --max_seq_len 8192

# Step 3: Run inference with the compiled engine
from tensorrt_llm.runtime import ModelRunner

runner = ModelRunner.from_dir("./trt_engine")
outputs = runner.generate(
    ["Explain the advantage of compiled inference engines:"],
    max_new_tokens=100,
    temperature=0.7,
)
print(outputs[0])
6. LMDeploy
LMDeploy (developed by the InternLM team at Shanghai AI Lab) is a serving framework with a strong focus on quantization and a custom inference backend called TurboMind. TurboMind implements its own attention kernels and KV cache management, with particularly strong support for W4A16 quantization (4-bit weights, 16-bit activations). LMDeploy achieves competitive throughput with vLLM and TGI while offering a simpler deployment experience for quantized models.
Notable features:
- TurboMind backend: Custom C++ inference engine with optimized attention and GEMM kernels.
- Strong quantization: W4A16 (AWQ), W8A8, and KV cache INT8/INT4 quantization.
- VLM support: First-class support for vision-language models (InternVL, LLaVA).
- Pipeline parallelism: Efficient multi-GPU serving with pipeline stages.
7. Ollama and llama.cpp: Local Inference
Not every deployment requires a GPU cluster. For local development, prototyping, and privacy-sensitive applications, Ollama and llama.cpp provide efficient inference on consumer hardware, including CPU-only machines and Apple Silicon Macs.
7.1 llama.cpp
llama.cpp (Gerganov, 2023) is a C/C++ implementation of LLM inference with minimal dependencies. It supports a wide range of hardware through multiple backends: CUDA (NVIDIA GPUs), Metal (Apple Silicon), Vulkan (cross-platform GPU), and optimized CPU paths (AVX2, ARM NEON). Its GGUF quantization format supports 2-bit through 8-bit quantization with per-block scaling, enabling 7B models to run at interactive speeds on a laptop CPU.
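The per-block scaling idea behind GGUF can be sketched as symmetric block-wise quantization. The real GGUF formats (Q4_K_M and friends) are considerably more elaborate, with nested super-block scales and mixed precisions, so treat this as an illustration of the principle only.

```python
import numpy as np

def quantize_blocks(weights, block_size=32, bits=4):
    """Symmetric per-block quantization: each block of weights gets its
    own scale, so outliers in one block don't degrade the others."""
    qmax = 2 ** (bits - 1) - 1            # e.g. 7 for 4-bit symmetric
    blocks = weights.reshape(-1, block_size)
    scales = np.abs(blocks).max(axis=1, keepdims=True) / qmax
    scales[scales == 0] = 1.0             # avoid division by zero
    q = np.clip(np.round(blocks / scales), -qmax - 1, qmax).astype(np.int8)
    return q, scales

def dequantize_blocks(q, scales):
    return (q.astype(np.float32) * scales).reshape(-1)

rng = np.random.default_rng(0)
w = rng.standard_normal(1024).astype(np.float32)
q, s = quantize_blocks(w)
w_hat = dequantize_blocks(q, s)
print(f"mean abs error: {np.abs(w - w_hat).mean():.4f}")  # small vs unit-scale weights
```

Storage drops from 32 bits per weight to roughly 4 bits plus a shared scale per block, which is how an 8B model fits in about 5 GB.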
7.2 Ollama
Ollama wraps llama.cpp in a user-friendly interface with a model registry, automatic model downloading, and a simple REST API. It manages model lifecycle (downloading, loading, unloading) and exposes an API compatible with the OpenAI format. Ollama is the easiest path from zero to running an LLM locally.
# Example 3: Local inference with Ollama
# Terminal: Pull and run a model
# $ ollama pull llama3.1:8b-instruct-q4_K_M
# $ ollama serve   # starts the API server on port 11434

# Python: Query Ollama's OpenAI-compatible API
import time

import requests

def ollama_generate(prompt, model="llama3.1:8b-instruct-q4_K_M", max_tokens=128):
    """Query Ollama's local OpenAI-compatible API."""
    url = "http://localhost:11434/v1/chat/completions"
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": 0.7,
        "stream": False,
    }
    t0 = time.perf_counter()
    resp = requests.post(url, json=payload)
    elapsed = time.perf_counter() - t0
    return resp.json(), elapsed

# Benchmark two quantization levels on the same prompts
prompts = [
    "Write a haiku about machine learning.",
    "Explain the difference between TCP and UDP in two sentences.",
    "What is the capital of Australia and why was it chosen?",
]
models = [
    "llama3.1:8b-instruct-q4_K_M",  # 4-bit, ~4.9 GB
    "llama3.1:8b-instruct-q8_0",    # 8-bit, ~8.5 GB
]

print(f"{'Model':<38} {'Tokens':>7} {'Time':>7} {'Tok/s':>7}")
print("-" * 62)
for model in models:
    total_tokens, total_time = 0, 0.0
    for prompt in prompts:
        data, elapsed = ollama_generate(prompt, model=model)
        total_tokens += data["usage"]["completion_tokens"]
        total_time += elapsed
    avg_tps = total_tokens / total_time
    print(f"{model:<38} {total_tokens:>7} {total_time:>6.1f}s {avg_tps:>6.1f}")
Running a Q4 model locally on a MacBook Pro M3 achieves roughly 30 to 40 tokens per second for an 8B model. A cloud vLLM deployment on an A100 achieves 600+ tokens per second for the same model at high concurrency. The 15x to 20x gap reflects the fundamental hardware difference, but for single-user applications (coding assistants, private document analysis, prototyping), local inference offers no network latency, complete privacy, and no API costs.
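A rough roofline argument explains these numbers: single-stream decode is memory-bandwidth-bound, because generating each token streams the full weights through memory once. The bandwidth and model-size figures below are ballpark assumptions for illustration, not measurements.

```python
def decode_tokens_per_sec(mem_bandwidth_gbs, model_bytes_gb):
    """Roofline estimate for single-stream decode: bounded by how many
    times per second the weights can be read from memory (KV cache
    reads and compute are ignored, so this is an upper bound)."""
    return mem_bandwidth_gbs / model_bytes_gb

# Assumed figures: laptop unified memory ~150 GB/s, 8B model at Q4 ~4.5 GB
print(f"Laptop, Q4:  ~{decode_tokens_per_sec(150, 4.5):.0f} tok/s")
# Assumed figures: A100 HBM ~2000 GB/s, 8B model at FP16 ~16 GB
print(f"A100, FP16: ~{decode_tokens_per_sec(2000, 16):.0f} tok/s single-stream")
```

Batching is what lifts the server far beyond its single-stream bound: one weight read is amortized across every sequence in the batch, which is how an A100 reaches 600+ tok/s of aggregate throughput.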
8. Triton Inference Server
NVIDIA Triton Inference Server is a production-grade model serving platform designed for multi-model, multi-framework deployments. Unlike the LLM-specific frameworks above, Triton is a general-purpose inference server that can host any model format (TensorRT, ONNX, PyTorch, TensorFlow) behind a unified gRPC/HTTP API. For LLM workloads, Triton integrates with TensorRT-LLM as a backend, combining Triton's production features (model versioning, A/B testing, ensemble pipelines, health checks, metrics) with TensorRT-LLM's optimized inference.
Triton is the right choice when you need enterprise-grade operational features: dynamic model loading/unloading, GPU sharing across multiple models, Prometheus metrics export, Kubernetes-native health probes, and support for pre/post-processing pipelines. For simpler LLM-only deployments, vLLM or SGLang's built-in API servers are sufficient.
9. Framework Comparison
| Framework | Best For | KV Cache | Quantization | Ease of Use |
|---|---|---|---|---|
| vLLM | General-purpose GPU serving | PagedAttention | GPTQ, AWQ, FP8, GGUF | High |
| SGLang | Prefix-heavy, structured output | RadixAttention | GPTQ, AWQ, FP8 | High |
| TGI | HF ecosystem, quick deployment | FlashAttention | GPTQ, AWQ, BnB, EETQ, FP8 | Very High |
| TensorRT-LLM | Max throughput on NVIDIA HW | Custom paged | FP8, INT8, INT4 (native) | Low |
| LMDeploy | Quantized models, VLMs | TurboMind | W4A16, W8A8, KV INT4/8 | Medium |
| Ollama | Local dev, privacy, prototyping | llama.cpp | GGUF (Q2 through Q8) | Very High |
| llama.cpp | Minimal footprint, edge/CPU | Custom | GGUF (Q2 through Q8) | Medium |
| Triton Server | Enterprise, multi-model | Via TRT-LLM backend | All TRT-LLM formats | Low |
10. Benchmarking Methodology
Choosing a serving framework requires benchmarking on your specific workload. The key metrics for LLM serving are:
Latency Metrics:
- Time to First Token (TTFT): The delay from request submission to the first token being returned. This is dominated by the prefill phase (processing the input prompt). TTFT matters most for interactive applications.
- Time Per Output Token (TPOT): The average time between consecutive output tokens during the decode phase. Also called inter-token latency. This determines the perceived "typing speed" for streaming applications.
- End-to-end latency: Total time from request to completion, approximately TTFT + (output_length × TPOT).
Throughput Metrics:
- Tokens per second (tok/s): Total output tokens generated across all concurrent requests per unit time. This is the primary capacity metric.
- Requests per second (RPS): Number of completed requests per second. Varies with output length, so tok/s is generally more informative.
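These definitions can be pinned down with a few lines of arithmetic over per-token arrival timestamps. The function name is illustrative, not from any particular benchmarking tool.

```python
def serving_metrics(token_timestamps, t_submit):
    """Compute TTFT, TPOT, and end-to-end latency for one request from
    per-token arrival timestamps (seconds)."""
    ttft = token_timestamps[0] - t_submit        # prefill-dominated delay
    e2e = token_timestamps[-1] - t_submit        # total request latency
    n = len(token_timestamps)
    # TPOT averages the gaps between consecutive tokens (decode phase).
    tpot = (e2e - ttft) / (n - 1) if n > 1 else 0.0
    return ttft, tpot, e2e

# A request submitted at t=0.0: first token at 0.25s, then one every 40ms.
stamps = [0.25 + 0.04 * i for i in range(5)]
ttft, tpot, e2e = serving_metrics(stamps, t_submit=0.0)
print(f"TTFT={ttft:.2f}s TPOT={tpot*1000:.0f}ms E2E={e2e:.2f}s")
# -> TTFT=0.25s TPOT=40ms E2E=0.41s
# Note E2E = TTFT + (n_tokens - 1) * TPOT holds exactly under these
# definitions; the TTFT + output_length * TPOT form is the usual approximation.
```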
Always benchmark under realistic conditions. Common mistakes include: (1) testing with uniform prompt/output lengths when your real traffic is variable, (2) failing to warm up the server before measuring (cold caches and JIT compilation distort the first requests), (3) ignoring the latency-throughput tradeoff (a system that maximizes tok/s at 16 concurrent requests may have unacceptable TTFT at 128 concurrent requests), and (4) comparing frameworks at different quantization levels. Use tools like vLLM's built-in benchmark scripts, genai-perf, or custom load generators that replay realistic traffic patterns.
# Example 4: Comprehensive benchmarking script
import asyncio
import time
from dataclasses import dataclass

import aiohttp
import numpy as np

@dataclass
class RequestResult:
    ttft: float          # time to first token (seconds)
    total_time: float    # end-to-end latency
    output_tokens: int   # number of generated tokens
    tpot: float          # time per output token (after the first)

async def benchmark_streaming(url, model, prompt, max_tokens=128):
    """Measure TTFT and TPOT via a streaming (SSE) response."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": 0.7,
        "stream": True,
    }
    t0 = time.perf_counter()
    ttft = None
    token_count = 0
    async with aiohttp.ClientSession() as session:
        async with session.post(url, json=payload) as resp:
            async for line in resp.content:
                decoded = line.decode().strip()
                # Each SSE data chunk carries (approximately) one token.
                if decoded.startswith("data: ") and decoded != "data: [DONE]":
                    if ttft is None:
                        ttft = time.perf_counter() - t0
                    token_count += 1
    total = time.perf_counter() - t0
    decode_time = total - (ttft or total)
    tpot = decode_time / max(token_count - 1, 1)
    return RequestResult(ttft=ttft or 0.0, total_time=total,
                         output_tokens=token_count, tpot=tpot)

async def run_benchmark(concurrency_levels=(1, 4, 8, 16, 32)):
    url = "http://localhost:8000/v1/chat/completions"
    model = "meta-llama/Llama-3.1-8B-Instruct"
    prompt = "Describe the process of photosynthesis."
    print(f"{'Concur':>7} {'Throughput':>11} {'Avg TTFT':>10} "
          f"{'P50 TTFT':>10} {'Avg TPOT':>10} {'P99 Lat':>10}")
    print("-" * 62)
    for c in concurrency_levels:
        n_requests = max(c * 4, 16)
        sem = asyncio.Semaphore(c)

        async def bounded_req():
            async with sem:
                return await benchmark_streaming(url, model, prompt)

        tasks = [bounded_req() for _ in range(n_requests)]
        t0 = time.perf_counter()
        results = await asyncio.gather(*tasks)
        wall_time = time.perf_counter() - t0

        total_tok = sum(r.output_tokens for r in results)
        ttfts = [r.ttft for r in results]
        lats = [r.total_time for r in results]
        tpots = [r.tpot for r in results]
        print(f"{c:>7} {total_tok/wall_time:>10.1f}/s "
              f"{np.mean(ttfts)*1000:>9.0f}ms "
              f"{np.percentile(ttfts, 50)*1000:>9.0f}ms "
              f"{np.mean(tpots)*1000:>9.0f}ms "
              f"{np.percentile(lats, 99):>9.2f}s")

asyncio.run(run_benchmark())
In one representative run of this benchmark, throughput increased from 48 tok/s at concurrency 1 to 894 tok/s at concurrency 32 (an 18x improvement), while TTFT degraded from 87ms to 687ms (an 8x increase). This is the fundamental latency-throughput tradeoff in LLM serving. Higher concurrency fills the GPU more efficiently, improving throughput, but each individual request waits longer as it shares GPU cycles with more peers. Production systems must be tuned to maintain acceptable TTFT under expected load.
Check Your Understanding
1. What is the key architectural innovation that differentiates SGLang from vLLM?
2. Why does TensorRT-LLM achieve higher throughput than Python-based frameworks at high concurrency?
3. Explain the difference between TTFT and TPOT, and which optimization strategies target each.
4. When would you choose Ollama over vLLM for serving an 8B model?
Key Takeaways
- The LLM serving stack comprises request scheduling, continuous batching, KV cache management, model parallelism, kernel optimization, and an API layer. Each framework makes different tradeoffs across these layers.
- vLLM is the most widely adopted framework, built on PagedAttention, with broad model and quantization support. It is the default choice for most GPU-based deployments.
- SGLang excels at prefix-heavy workloads through RadixAttention and offers native structured generation support.
- TGI integrates tightly with the Hugging Face ecosystem and offers the simplest Docker-based deployment path.
- TensorRT-LLM delivers the highest throughput on NVIDIA hardware (30% to 50% over vLLM) at the cost of compilation complexity and reduced flexibility.
- Ollama and llama.cpp enable local inference on consumer hardware (CPU, Apple Silicon, consumer GPUs) with GGUF quantization, ideal for development and privacy-sensitive applications.
- Benchmark on your workload using realistic traffic patterns. Measure TTFT, TPOT, and throughput at multiple concurrency levels to understand the latency-throughput tradeoff for your specific deployment.