Module 08 · Section 8.4

Serving Infrastructure

vLLM, SGLang, TGI, TensorRT-LLM, LMDeploy, Ollama, llama.cpp, and production benchmarking

Serving an LLM in production is 10% machine learning and 90% convincing the infrastructure that thousands of users can share one very expensive GPU without anyone noticing the wait.

[Illustration: a load-balanced request queue]
★ Big Picture

From model weights to production endpoint. A trained model is just a collection of tensors on disk. Turning it into a responsive, scalable API requires specialized serving infrastructure that handles continuous batching, KV cache management, request scheduling, model parallelism, and hardware-specific kernel optimization. This section surveys the major serving frameworks available today, explains their architectural differences, and provides practical guidance for choosing the right tool for your deployment scenario. We conclude with a benchmarking methodology so you can make data-driven decisions for your own workloads.

⚙ Prerequisites

This section ties together all previous Module 08 concepts: quantization (Section 8.1), KV cache management (Section 8.2), and speculative decoding (Section 8.3). Serving frameworks combine these techniques. Understanding of continuous batching requires familiarity with autoregressive generation from Section 5.1.

1. The Serving Stack

An LLM serving system sits between the raw model weights and the HTTP API that clients consume. It manages several critical responsibilities that are absent from a naive model.generate() call:

- Continuous batching: admitting and retiring requests without draining the whole batch
- Request scheduling: queueing, priorities, and preemption under memory pressure
- KV cache management: allocating, sharing, and evicting cache blocks
- Model parallelism: splitting weights across GPUs (tensor and pipeline parallel)
- Hardware-specific kernel optimization: fused, tuned GPU kernels
- API serving: tokenization, streaming, and an OpenAI-compatible interface

[Figure: LLM Serving Stack Architecture. Layers, top to bottom: HTTP Clients (OpenAI-compatible API) → API Server (FastAPI/gRPC, streaming, tokenization) → Request Scheduler (continuous batching, priority, preemption) → KV Cache Manager (PagedAttention) → Model Executor (TP/PP parallelism) → GPU Kernels (FlashAttention, FP8 GEMM, fused ops, CUDA/Triton) → GPU Hardware (H100, A100, L40S, RTX 4090, Apple Silicon)]
Figure 8.7: The layers of an LLM serving stack, from HTTP clients down to GPU hardware. Each framework implements these layers with different design choices.
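The request scheduler layer is worth pausing on. Continuous batching admits new requests into the running batch as soon as others finish, rather than waiting for the whole batch to drain. The toy simulation below (illustrative only; the request lengths and batch size are made up, and real schedulers also model prefill cost and memory limits) shows why this raises GPU utilization:

```python
import collections

def simulate(requests, max_batch, continuous):
    """Simulate decode steps. `requests` maps request id -> tokens to generate.
    Returns (total steps, total idle slots), where an idle slot is unused
    batch capacity during one decode step."""
    pending = collections.deque(sorted(requests))
    remaining = dict(requests)
    active, steps, idle = [], 0, 0
    while pending or active:
        # Admit new requests: every step if continuous, else only when the batch drains
        if continuous or not active:
            while pending and len(active) < max_batch:
                active.append(pending.popleft())
        steps += 1
        idle += max_batch - len(active)
        for rid in list(active):
            remaining[rid] -= 1
            if remaining[rid] == 0:
                active.remove(rid)          # finished: frees a slot immediately
    return steps, idle

reqs = {0: 8, 1: 2, 2: 2, 3: 2, 4: 8, 5: 2}   # mix of long and short generations
print(simulate(reqs, max_batch=2, continuous=False))  # static batching:     (18, 12)
print(simulate(reqs, max_batch=2, continuous=True))   # continuous batching: (14, 4)
```

With static batching, a short request that finishes early leaves its slot idle until the longest request in the batch completes; continuous batching backfills that slot on the very next step.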

2. vLLM

vLLM (Kwon et al., 2023) is the most widely adopted open-source LLM serving framework. Its core innovation is PagedAttention (covered in Section 8.2), which enables near-zero-waste KV cache memory management. Beyond PagedAttention, vLLM provides continuous batching, tensor parallelism, prefix caching, speculative decoding support, and an OpenAI-compatible API server.

Key features:

- PagedAttention-based KV cache management with near-zero memory waste (Section 8.2)
- Continuous batching of incoming requests
- Tensor parallelism for multi-GPU deployments
- Automatic prefix caching
- Speculative decoding support
- An OpenAI-compatible API server

# Example 1: Launch vLLM server and benchmark throughput
# Terminal: Start the vLLM server
# $ python -m vllm.entrypoints.openai.api_server \
#     --model meta-llama/Llama-3.1-8B-Instruct \
#     --dtype float16 \
#     --max-model-len 8192 \
#     --gpu-memory-utilization 0.90 \
#     --enable-prefix-caching \
#     --port 8000

# Python: Benchmark with concurrent requests
import asyncio
import aiohttp
import time

async def send_request(session, url, prompt, max_tokens=128):
    payload = {
        "model": "meta-llama/Llama-3.1-8B-Instruct",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": 0.7,
    }
    t0 = time.perf_counter()
    async with session.post(url, json=payload) as resp:
        result = await resp.json()
        elapsed = time.perf_counter() - t0
        n_tokens = result["usage"]["completion_tokens"]
        return n_tokens, elapsed

async def benchmark(concurrency=16, total_requests=64):
    url = "http://localhost:8000/v1/chat/completions"
    prompts = [
        "Explain quantum entanglement simply.",
        "Write a Python function to merge two sorted lists.",
        "What causes the northern lights?",
        "Describe the water cycle in four steps.",
    ]

    connector = aiohttp.TCPConnector(limit=concurrency)
    async with aiohttp.ClientSession(connector=connector) as session:
        tasks = [
            send_request(session, url, prompts[i % len(prompts)])
            for i in range(total_requests)
        ]

        t_start = time.perf_counter()
        results = await asyncio.gather(*tasks)
        t_total = time.perf_counter() - t_start

        total_tokens = sum(r[0] for r in results)
        latencies = [r[1] for r in results]

        print(f"Concurrency:       {concurrency}")
        print(f"Total requests:    {total_requests}")
        print(f"Total time:        {t_total:.2f}s")
        print(f"Total tokens:      {total_tokens}")
        print(f"Throughput:        {total_tokens/t_total:.1f} tok/s")
        print(f"Avg latency:       {sum(latencies)/len(latencies):.2f}s")
        print(f"P50 latency:       {sorted(latencies)[len(latencies)//2]:.2f}s")
        print(f"P99 latency:       {sorted(latencies)[int(0.99*len(latencies))]:.2f}s")

asyncio.run(benchmark(concurrency=16, total_requests=64))
Concurrency:       16
Total requests:    64
Total time:        12.47s
Total tokens:      7891
Throughput:        632.7 tok/s
Avg latency:       3.02s
P50 latency:       2.87s
P99 latency:       4.91s

3. SGLang

SGLang (Zheng et al., 2024) is a serving framework built around the concept of RadixAttention, a tree-based data structure for efficient prefix sharing. While vLLM supports prefix caching, SGLang makes it a first-class design principle. The RadixAttention tree stores all cached KV blocks in a radix tree indexed by token sequences, enabling automatic longest-prefix matching across all active and recently-completed requests.

SGLang also introduces a programming model for structured generation. Its frontend DSL allows users to define complex generation patterns (multi-turn conversations, branching logic, constrained decoding) as Python programs, and the runtime optimizes execution across these patterns. This is particularly effective for workloads that share long system prompts or few-shot examples across many requests.
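The longest-prefix matching at the heart of RadixAttention can be sketched with a toy radix tree over token IDs. This is a deliberate simplification (the real structure stores KV cache blocks, handles LRU eviction, and compresses edges; the class and method names here are invented for illustration):

```python
class RadixNode:
    def __init__(self):
        self.children = {}   # token id -> RadixNode
        self.cached = False  # True if KV for the path ending here is cached

class RadixCache:
    """Toy token-level radix tree: insert cached sequences, match longest prefix."""
    def __init__(self):
        self.root = RadixNode()

    def insert(self, tokens):
        """Record that KV for every prefix of `tokens` is now cached."""
        node = self.root
        for t in tokens:
            node = node.children.setdefault(t, RadixNode())
            node.cached = True

    def longest_prefix(self, tokens):
        """Return how many leading tokens of `tokens` already have cached KV."""
        node, matched = self.root, 0
        for t in tokens:
            if t in node.children and node.children[t].cached:
                node = node.children[t]
                matched += 1
            else:
                break
        return matched

cache = RadixCache()
system_prompt = [101, 7, 7, 42, 9]            # shared system prompt tokens
cache.insert(system_prompt + [3, 14])         # KV from a completed request
print(cache.longest_prefix(system_prompt + [99, 5]))  # 5: prefill skips 5 tokens
print(cache.longest_prefix([1, 2, 3]))                # 0: no shared prefix
```

A new request that shares the system prompt only needs prefill for its unmatched suffix, which is why prefix-heavy workloads benefit so much.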

Key advantages over vLLM:

- RadixAttention: automatic, fine-grained prefix sharing across all active and recently completed requests
- A frontend DSL for structured generation: multi-turn conversations, branching logic, and constrained decoding expressed as Python programs
- Strong throughput on workloads that share long system prompts or few-shot examples

Note: Choosing Between vLLM and SGLang

For most deployments, both vLLM and SGLang deliver comparable throughput. SGLang tends to outperform when workloads involve heavy prefix sharing (many requests with the same long system prompt) or structured output constraints. vLLM has broader ecosystem support, more quantization options, and a larger community. In production, it is worth benchmarking both on your specific workload before committing to one.

4. TGI (Text Generation Inference)

Hugging Face Text Generation Inference (TGI) is a production-ready serving solution tightly integrated with the Hugging Face ecosystem. It is the engine behind the Hugging Face Inference API and Inference Endpoints. TGI implements continuous batching, FlashAttention, tensor parallelism, and quantization (GPTQ, AWQ, BitsAndBytes, EETQ, FP8).

Distinguishing features:

- Tight Hugging Face Hub integration: deploy any supported model by its Hub ID
- Battle-tested in production as the engine behind the Hugging Face Inference API and Inference Endpoints
- Broad quantization support (GPTQ, AWQ, BitsAndBytes, EETQ, FP8)
- Single-command Docker deployment

# Example 2: Deploy TGI with Docker and query it
# Terminal: Launch TGI container
# $ docker run --gpus all --shm-size 1g -p 8080:80 \
#     -v $PWD/data:/data \
#     ghcr.io/huggingface/text-generation-inference:latest \
#     --model-id meta-llama/Llama-3.1-8B-Instruct \
#     --max-input-tokens 4096 \
#     --max-total-tokens 8192 \
#     --max-batch-prefill-tokens 8192 \
#     --quantize awq

# Python: Query TGI with the text-generation client
from huggingface_hub import InferenceClient
import time

client = InferenceClient("http://localhost:8080")

# Single request with streaming
prompt = "Explain how PagedAttention works in three sentences."

# Non-streaming: measure time-to-first-token and total latency
t0 = time.perf_counter()
response = client.text_generation(
    prompt,
    max_new_tokens=150,
    temperature=0.7,
    details=True,
)
elapsed = time.perf_counter() - t0

print(f"Generated text: {response.generated_text[:120]}...")
print(f"Tokens generated: {len(response.details.tokens)}")
print(f"Total latency:    {elapsed:.2f}s")
print(f"Tokens/sec:       {len(response.details.tokens)/elapsed:.1f}")

# Streaming: measure time-to-first-token
t0 = time.perf_counter()
first_token_time = None
token_count = 0
for token in client.text_generation(
    prompt, max_new_tokens=100, temperature=0.7, stream=True
):
    if first_token_time is None:
        first_token_time = time.perf_counter() - t0
    token_count += 1
total_time = time.perf_counter() - t0

print(f"\nStreaming metrics:")
print(f"Time to first token (TTFT): {first_token_time*1000:.0f}ms")
print(f"Total tokens:               {token_count}")
print(f"Tokens per second (TPS):    {token_count/total_time:.1f}")
Generated text: PagedAttention divides the KV cache into fixed-size blocks that are allocated on demand, similar to virtual memory p...
Tokens generated: 87
Total latency:    1.94s
Tokens/sec:       44.8

Streaming metrics:
Time to first token (TTFT): 142ms
Total tokens:               67
Tokens per second (TPS):    41.3

5. TensorRT-LLM

TensorRT-LLM is NVIDIA's inference optimization library, purpose-built for NVIDIA GPUs. Unlike the Python-first frameworks above, TensorRT-LLM compiles model graphs into highly optimized CUDA kernels using NVIDIA's TensorRT compiler. This compilation step produces hardware-specific code that exploits features like FP8 Transformer Engines on Hopper GPUs, custom GEMM kernels tuned for specific matrix shapes, and kernel fusion patterns that eliminate memory round-trips.

Performance characteristics:

- Typically 30% to 50% higher throughput than Python-based frameworks at high concurrency
- Native FP8 execution on Hopper GPUs via the Transformer Engine
- Shape-specialized GEMM kernels and aggressive kernel fusion, with no Python interpreter overhead in the hot path

The tradeoff is complexity. TensorRT-LLM requires a model compilation step that can take 10 to 30 minutes, model support must be explicitly added (not all Hugging Face models are supported out of the box), and debugging is more difficult than with Python-based frameworks. It is best suited for production deployments on NVIDIA hardware where maximum throughput justifies the setup cost.

# TensorRT-LLM: Build and run (requires NVIDIA GPU + tensorrt_llm installed)
# Step 1: Convert model to TensorRT-LLM format
# python convert_checkpoint.py --model_dir meta-llama/Llama-3.1-8B \
#     --output_dir ./trt_ckpt --dtype float16

# Step 2: Build the engine (compilation step, ~15 min)
# trtllm-build --checkpoint_dir ./trt_ckpt \
#     --output_dir ./trt_engine \
#     --gemm_plugin float16 \
#     --max_batch_size 32 \
#     --max_input_len 4096 \
#     --max_seq_len 8192

# Step 3: Run inference with the compiled engine
import tensorrt_llm
from tensorrt_llm.runtime import ModelRunner

runner = ModelRunner.from_dir("./trt_engine")
outputs = runner.generate(
    ["Explain the advantage of compiled inference engines:"],
    max_new_tokens=100,
    temperature=0.7
)
print(outputs[0])
Explain the advantage of compiled inference engines: Compiled inference engines like TensorRT convert model operations into hardware-specific optimized code at build time. This eliminates the overhead of Python interpretation, enables kernel fusion (combining multiple operations into single GPU launches), and allows the compiler to select the best CUDA kernels for the specific GPU architecture and matrix dimensions being used.

6. LMDeploy

LMDeploy (developed by the InternLM team at Shanghai AI Lab) is a serving framework with a strong focus on quantization and a custom inference backend called TurboMind. TurboMind implements its own attention kernels and KV cache management, with particularly strong support for W4A16 quantization (4-bit weights, 16-bit activations). LMDeploy achieves competitive throughput with vLLM and TGI while offering a simpler deployment experience for quantized models.

Notable features:

- TurboMind backend with custom attention kernels and its own KV cache management
- First-class W4A16 quantization (4-bit weights, 16-bit activations), plus W8A8 and INT4/INT8 KV cache quantization
- Support for vision-language models (VLMs)

7. Ollama and llama.cpp: Local Inference

Not every deployment requires a GPU cluster. For local development, prototyping, and privacy-sensitive applications, Ollama and llama.cpp provide efficient inference on consumer hardware, including CPU-only machines and Apple Silicon Macs.

7.1 llama.cpp

llama.cpp (Gerganov, 2023) is a C/C++ implementation of LLM inference with minimal dependencies. It supports a wide range of hardware through multiple backends: CUDA (NVIDIA GPUs), Metal (Apple Silicon), Vulkan (cross-platform GPU), and optimized CPU paths (AVX2, ARM NEON). Its GGUF quantization format supports 2-bit through 8-bit quantization with per-block scaling, enabling 7B models to run at interactive speeds on a laptop CPU.
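The per-block scaling idea behind GGUF quantization can be sketched in a few lines. This is a simplified absmax scheme: real GGUF formats such as Q4_K use nested super-block scales and minimum offsets, so treat this as an illustration of the principle, not the actual format:

```python
def quantize_block(values, bits=4):
    """Absmax-quantize one block of floats to signed ints with one shared scale.
    Simplified sketch of per-block quantization; not the actual GGUF layout."""
    qmax = 2 ** (bits - 1) - 1                       # 7 for 4-bit signed
    scale = max(abs(v) for v in values) / qmax or 1.0
    q = [round(v / scale) for v in values]           # integers in [-qmax-1, qmax]
    return q, scale

def dequantize_block(q, scale):
    """Reconstruct approximate floats from the quantized block."""
    return [qi * scale for qi in q]

block = [0.12, -0.80, 0.33, 0.05, 0.71, -0.26, 0.0, 0.44]
q, scale = quantize_block(block)
recon = dequantize_block(q, scale)
err = max(abs(a - b) for a, b in zip(block, recon))
print(f"scale={scale:.4f}  max reconstruction error={err:.4f}")
# scale=0.1143  max reconstruction error=0.0500
```

Because each block gets its own scale, one outlier weight only degrades precision within its own block of a few dozen values, not across the whole tensor.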

7.2 Ollama

Ollama wraps llama.cpp in a user-friendly interface with a model registry, automatic model downloading, and a simple REST API. It manages model lifecycle (downloading, loading, unloading) and exposes an API compatible with the OpenAI format. Ollama is the easiest path from zero to running an LLM locally.

# Example 3: Local inference with Ollama
# Terminal: Pull and run a model
# $ ollama pull llama3.1:8b-instruct-q4_K_M
# $ ollama serve    # starts the API server on port 11434

# Python: Query Ollama's OpenAI-compatible API
import requests
import time

def ollama_generate(prompt, model="llama3.1:8b-instruct-q4_K_M", max_tokens=128):
    """Query Ollama's local API."""
    url = "http://localhost:11434/v1/chat/completions"
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": 0.7,
        "stream": False,
    }
    t0 = time.perf_counter()
    resp = requests.post(url, json=payload)
    elapsed = time.perf_counter() - t0
    data = resp.json()
    return data, elapsed

# Benchmark on different quantizations
prompts = [
    "Write a haiku about machine learning.",
    "Explain the difference between TCP and UDP in two sentences.",
    "What is the capital of Australia and why was it chosen?",
]

models = [
    "llama3.1:8b-instruct-q4_K_M",   # 4-bit, ~4.9 GB
    "llama3.1:8b-instruct-q8_0",     # 8-bit, ~8.5 GB
]

print(f"{'Model':<38} {'Tokens':>7} {'Time':>7} {'Tok/s':>7}")
print("-" * 62)
for model in models:
    total_tokens, total_time = 0, 0
    for prompt in prompts:
        data, elapsed = ollama_generate(prompt, model=model)
        n_tok = data["usage"]["completion_tokens"]
        total_tokens += n_tok
        total_time += elapsed
    avg_tps = total_tokens / total_time
    print(f"{model:<38} {total_tokens:>7} {total_time:>6.1f}s {avg_tps:>6.1f}")
Model                                   Tokens    Time   Tok/s
--------------------------------------------------------------
llama3.1:8b-instruct-q4_K_M                287    8.2s    35.0
llama3.1:8b-instruct-q8_0                  291   14.7s    19.8
Key Insight: Local vs. Cloud Tradeoffs

Running a Q4 model locally on a MacBook Pro M3 achieves roughly 30 to 40 tokens per second for an 8B model. A cloud vLLM deployment on an A100 achieves 600+ tokens per second for the same model at high concurrency. The 15x to 20x gap reflects the fundamental hardware difference, but for single-user applications (coding assistants, private document analysis, prototyping), local inference offers zero latency overhead, complete privacy, and no API costs.
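Much of the gap described above is a memory-bandwidth story: in single-stream decode, every generated token must stream the full weight tensor through the memory system. A back-of-the-envelope upper bound makes this concrete (the bandwidth figures below are rough assumptions, and real systems also read KV cache and pay kernel overhead; batching is how cloud deployments exceed the single-stream bound):

```python
def decode_tokens_per_sec(params_billion, bytes_per_param, bandwidth_gb_s):
    """Memory-bandwidth upper bound on single-stream decode speed:
    tokens/s ~= bandwidth / bytes read per token (weights only)."""
    model_bytes = params_billion * 1e9 * bytes_per_param
    return bandwidth_gb_s * 1e9 / model_bytes

# 8B model at 4-bit (~0.5 bytes/param); assumed bandwidths:
# ~150 GB/s for an Apple Silicon laptop, ~2000 GB/s for datacenter HBM
print(f"laptop  (~150 GB/s): {decode_tokens_per_sec(8, 0.5, 150):.0f} tok/s bound")
print(f"HBM GPU (~2000 GB/s): {decode_tokens_per_sec(8, 0.5, 2000):.0f} tok/s bound")
```

The laptop bound lands right around the 30 to 40 tok/s observed above; it also explains the Ollama table, where the Q8 model (twice the bytes per token) runs at roughly half the Q4 speed.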

8. Triton Inference Server

NVIDIA Triton Inference Server is a production-grade model serving platform designed for multi-model, multi-framework deployments. Unlike the LLM-specific frameworks above, Triton is a general-purpose inference server that can host any model format (TensorRT, ONNX, PyTorch, TensorFlow) behind a unified gRPC/HTTP API. For LLM workloads, Triton integrates with TensorRT-LLM as a backend, combining Triton's production features (model versioning, A/B testing, ensemble pipelines, health checks, metrics) with TensorRT-LLM's optimized inference.

Triton is the right choice when you need enterprise-grade operational features: dynamic model loading/unloading, GPU sharing across multiple models, Prometheus metrics export, Kubernetes-native health probes, and support for pre/post-processing pipelines. For simpler LLM-only deployments, vLLM or SGLang's built-in API servers are sufficient.
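Triton discovers models from a filesystem "model repository". A minimal, illustrative layout for a TensorRT-LLM-backed model is sketched below; the model name and `max_batch_size` value are placeholders, and a real TRT-LLM deployment requires additional backend parameters that its packaging scripts generate:

```
model_repository/
└── llama/
    ├── 1/                  # version directory holding the compiled engine
    └── config.pbtxt        # Triton model configuration

# config.pbtxt (sketch)
name: "llama"
backend: "tensorrtllm"
max_batch_size: 32
```

Versioned subdirectories are what enable Triton's model versioning and A/B testing: dropping a `2/` directory next to `1/` lets the server load the new version without downtime.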

9. Framework Comparison

| Framework | Best For | KV Cache | Quantization | Ease of Use |
|---|---|---|---|---|
| vLLM | General-purpose GPU serving | PagedAttention | GPTQ, AWQ, FP8, GGUF | High |
| SGLang | Prefix-heavy, structured output | RadixAttention | GPTQ, AWQ, FP8 | High |
| TGI | HF ecosystem, quick deployment | FlashAttention | GPTQ, AWQ, BnB, EETQ, FP8 | Very High |
| TensorRT-LLM | Max throughput on NVIDIA HW | Custom paged | FP8, INT8, INT4 (native) | Low |
| LMDeploy | Quantized models, VLMs | TurboMind | W4A16, W8A8, KV INT4/8 | Medium |
| Ollama | Local dev, privacy, prototyping | llama.cpp | GGUF (Q2 through Q8) | Very High |
| llama.cpp | Minimal footprint, edge/CPU | Custom | GGUF (Q2 through Q8) | Medium |
| Triton Server | Enterprise, multi-model | Via TRT-LLM backend | All TRT-LLM formats | Low |

10. Benchmarking Methodology

Choosing a serving framework requires benchmarking on your specific workload. The key metrics for LLM serving are:

Latency Metrics:

- TTFT (Time to First Token): delay from request submission to the first streamed token; dominated by the prefill phase
- TPOT (Time Per Output Token): average inter-token gap during decode; dominated by memory bandwidth
- End-to-end latency: TTFT plus the decode time for all output tokens

Throughput Metrics:

- Output tokens per second, aggregated across all concurrent requests
- Requests completed per second at a given concurrency level

[Figure: LLM Latency Anatomy. The prefill phase determines TTFT (Time to First Token); the token-by-token decode phase determines TPOT (Time Per Output Token). End-to-End Latency = TTFT + (N_output × TPOT)]
Figure 8.8: LLM request latency comprises the prefill phase (TTFT) and the decode phase (N tokens, each taking TPOT). Optimizing these independently requires different strategies.
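The decomposition in Figure 8.8 makes capacity planning a one-line calculation. Plugging in the concurrency-16 numbers from the benchmark later in this section (TTFT ≈ 342 ms, TPOT ≈ 25 ms) for a 128-token response:

```python
def end_to_end_latency(ttft_s, n_output, tpot_s):
    """End-to-end latency per Figure 8.8: TTFT + (N_output x TPOT)."""
    return ttft_s + n_output * tpot_s

# Approximate concurrency-16 figures from the benchmark in this section
lat = end_to_end_latency(0.342, 128, 0.025)
print(f"expected latency for 128 output tokens: {lat:.2f}s")
# expected latency for 128 output tokens: 3.54s
```

Note this is a steady-state estimate: measured tail latencies (like the 4.91s P99 at concurrency 16) also include queueing delay and scheduler preemption, which the formula does not capture.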
Warning: Benchmarking Pitfalls

Always benchmark under realistic conditions. Common mistakes include: (1) testing with uniform prompt/output lengths when your real traffic is variable, (2) failing to warm up the server before measuring (cold caches and JIT compilation distort the first requests), (3) ignoring the latency-throughput tradeoff (a system that maximizes tok/s at 16 concurrent requests may have unacceptable TTFT at 128 concurrent requests), and (4) comparing frameworks at different quantization levels. Use tools like vLLM's bundled benchmark scripts, NVIDIA's genai-perf, or custom load generators that replay realistic traffic patterns.

# Example 4: Comprehensive benchmarking script
import asyncio
import aiohttp
import time
import numpy as np
from dataclasses import dataclass

@dataclass
class RequestResult:
    ttft: float           # time to first token (seconds)
    total_time: float     # end-to-end latency
    output_tokens: int    # number of generated tokens
    tpot: float           # time per output token (after first)

async def benchmark_streaming(url, model, prompt, max_tokens=128):
    """Measure TTFT and TPOT via streaming response."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": 0.7,
        "stream": True,
    }

    t0 = time.perf_counter()
    ttft = None
    token_count = 0

    async with aiohttp.ClientSession() as session:
        async with session.post(url, json=payload) as resp:
            async for line in resp.content:
                decoded = line.decode().strip()
                if decoded.startswith("data: ") and decoded != "data: [DONE]":
                    if ttft is None:
                        ttft = time.perf_counter() - t0
                    token_count += 1

    total = time.perf_counter() - t0
    decode_time = total - (ttft or total)
    tpot = decode_time / max(token_count - 1, 1)

    return RequestResult(ttft=ttft or 0, total_time=total,
                         output_tokens=token_count, tpot=tpot)

async def run_benchmark(concurrency_levels=[1, 4, 8, 16, 32]):
    url = "http://localhost:8000/v1/chat/completions"
    model = "meta-llama/Llama-3.1-8B-Instruct"
    prompt = "Describe the process of photosynthesis."

    print(f"{'Concur':>7} {'Throughput':>11} {'Avg TTFT':>10} "
          f"{'P50 TTFT':>10} {'Avg TPOT':>10} {'P99 Lat':>10}")
    print("-" * 62)

    for c in concurrency_levels:
        n_requests = max(c * 4, 16)
        sem = asyncio.Semaphore(c)

        async def bounded_req():
            async with sem:
                return await benchmark_streaming(url, model, prompt)

        tasks = [bounded_req() for _ in range(n_requests)]
        t0 = time.perf_counter()
        results = await asyncio.gather(*tasks)
        wall_time = time.perf_counter() - t0

        total_tok = sum(r.output_tokens for r in results)
        ttfts = [r.ttft for r in results]
        lats = [r.total_time for r in results]
        tpots = [r.tpot for r in results]

        print(f"{c:>7} {total_tok/wall_time:>10.1f}/s "
              f"{np.mean(ttfts)*1000:>9.0f}ms "
              f"{np.percentile(ttfts, 50)*1000:>9.0f}ms "
              f"{np.mean(tpots)*1000:>9.0f}ms "
              f"{np.percentile(lats, 99):>9.2f}s")

asyncio.run(run_benchmark())
 Concur  Throughput   Avg TTFT   P50 TTFT   Avg TPOT    P99 Lat
--------------------------------------------------------------
      1      48.3/s       87ms       85ms       21ms      2.78s
      4     187.2/s      112ms      108ms       22ms      3.01s
      8     351.6/s      189ms      175ms       23ms      3.42s
     16     632.7/s      342ms      315ms       25ms      4.91s
     32     894.1/s      687ms      621ms       30ms      7.23s
Key Insight: The Latency-Throughput Tradeoff

Notice how throughput increases from 48 tok/s at concurrency 1 to 894 tok/s at concurrency 32 (an 18x improvement), while TTFT degrades from 87ms to 687ms (an 8x increase). This is the fundamental latency-throughput tradeoff in LLM serving. Higher concurrency fills the GPU more efficiently, improving throughput, but each individual request waits longer as it shares GPU cycles with more peers. Production systems must be tuned to maintain acceptable TTFT under expected load.

Check Your Understanding

1. What is the key architectural innovation that differentiates SGLang from vLLM?

Show Answer
SGLang is built around RadixAttention, a radix-tree-based KV cache management system that enables automatic, fine-grained prefix sharing across all requests. While vLLM supports prefix caching as an optional feature, SGLang makes it a core design principle. The radix tree indexes cached KV blocks by their token sequences, allowing automatic longest-prefix matching with LRU eviction. This provides the largest benefit when many requests share common long prefixes (system prompts, few-shot examples).

2. Why does TensorRT-LLM achieve higher throughput than Python-based frameworks at high concurrency?

Show Answer
TensorRT-LLM compiles model graphs into hardware-specific CUDA kernels that are optimized for the exact GPU architecture (e.g., Hopper's FP8 Tensor Cores). This compilation step produces fused kernels tuned for specific matrix shapes, eliminates Python interpreter overhead, and exploits hardware features that generic kernels cannot access. The result is 30% to 50% higher throughput at high concurrency, where per-kernel efficiency matters most. The tradeoff is a longer setup time (10 to 30 minutes for compilation) and reduced flexibility.

3. Explain the difference between TTFT and TPOT, and which optimization strategies target each.

Show Answer
TTFT (Time to First Token) measures the delay from request submission until the first output token is returned. It is dominated by the prefill phase, which processes the entire input prompt. TTFT is improved by faster prefill computation (FlashAttention, chunked prefill, prefix caching). TPOT (Time Per Output Token) measures the inter-token latency during the decode phase and is dominated by memory bandwidth (reading model weights and KV cache for each token). TPOT is improved by quantization (smaller weights to read), GQA (smaller KV cache), and speculative decoding (multiple tokens per forward pass). In streaming applications, TTFT determines how quickly users see a response begin, while TPOT determines the perceived "typing speed."

4. When would you choose Ollama over vLLM for serving an 8B model?

Show Answer
Ollama is the better choice for: (1) local development and prototyping, where simplicity and zero setup time matter more than throughput; (2) privacy-sensitive applications where data must not leave the local machine; (3) CPU-only or Apple Silicon deployments where CUDA is unavailable; (4) single-user scenarios where high concurrency throughput is unnecessary. vLLM is better when serving multiple concurrent users, when running on NVIDIA GPUs, when maximizing throughput is critical, or when features like prefix caching and speculative decoding are needed.

Key Takeaways