Module 08

Inference Optimization & Efficient Serving

Part II: Understanding LLMs

Chapter Overview

Training a large language model is only half the challenge. The other half is making inference fast enough and affordable enough to serve real users. A 70-billion-parameter model consumes roughly 140 GB of GPU memory just for its weights at 16-bit precision (and twice that at full FP32 precision), generates tokens one at a time, and must maintain an ever-growing cache of key/value tensors for each active request. Without optimization, serving LLMs at scale is prohibitively expensive.
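The weight-memory figures above follow from simple arithmetic: parameter count times bytes per parameter. The sketch below works through this for a 70B model at several common precisions. It is a back-of-envelope estimate only; a real deployment also needs memory for activations, the KV cache, and framework overhead.

```python
# Back-of-envelope GPU memory needed just for model weights.
# Real serving systems need additional memory for activations,
# the KV cache, and framework overhead.

def weight_memory_gb(n_params: float, bytes_per_param: float) -> float:
    """Weight footprint in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bytes_per_param / 1e9

N = 70e9  # 70-billion-parameter model
for name, nbytes in [("FP32", 4), ("FP16/BF16", 2), ("INT8", 1), ("INT4", 0.5)]:
    print(f"{name:10s} {weight_memory_gb(N, nbytes):6.0f} GB")
# FP32 → 280 GB, FP16/BF16 → 140 GB, INT8 → 70 GB, INT4 → 35 GB
```

This is also why quantization matters so much in practice: dropping from 16-bit to 4-bit weights shrinks a 70B model from two high-end GPUs' worth of memory to one.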

This chapter covers the four pillars of inference optimization. First, quantization reduces the precision of model weights (and sometimes activations) so that models fit on fewer GPUs and run faster. Second, KV cache and memory optimization techniques such as PagedAttention, grouped-query attention, and prefix caching eliminate memory waste and boost throughput. Third, speculative decoding breaks the sequential token-generation bottleneck by drafting multiple tokens at once and verifying them in parallel. Finally, serving infrastructure frameworks like vLLM, SGLang, TGI, and TensorRT-LLM tie everything together into production-ready systems that handle thousands of concurrent requests.
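To make the second pillar concrete, the sketch below estimates the per-token KV cache footprint and shows how grouped-query attention (GQA) shrinks it by sharing key/value heads across query heads. The architecture numbers (80 layers, 128-dim heads, 64 query heads vs. 8 KV heads) are illustrative assumptions, not any specific model's published configuration.

```python
# Hedged sketch: per-token KV cache size under full multi-head attention
# vs. grouped-query attention (GQA). The layer/head counts below are
# illustrative assumptions, not a specific model's architecture.

def kv_bytes_per_token(n_layers: int, n_kv_heads: int,
                       head_dim: int, bytes_per_elem: int = 2) -> int:
    """Bytes of KV cache appended per generated token (FP16 by default)."""
    # Factor of 2: each layer stores a key tensor and a value tensor.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem

mha = kv_bytes_per_token(n_layers=80, n_kv_heads=64, head_dim=128)  # full MHA
gqa = kv_bytes_per_token(n_layers=80, n_kv_heads=8, head_dim=128)   # GQA, 8 KV heads
print(f"MHA: {mha / 1e6:.2f} MB/token, GQA: {gqa / 1e6:.2f} MB/token, "
      f"{mha // gqa}x reduction")
```

At these assumed settings, full multi-head attention stores about 2.6 MB of cache per token, so a single 4,000-token request holds over 10 GB; GQA cuts that by 8x, which is one reason modern serving stacks lean on it alongside PagedAttention.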

By the end of this module, you will understand the math behind each technique, know when to apply each one, and have hands-on experience quantizing models, profiling memory, implementing speculative decoding, and deploying high-throughput inference servers.

Learning Objectives

Prerequisites

Sections