Building Conversational AI using LLM and Agents

A comprehensive, hands-on course covering the full stack of modern Large Language Model technology: from foundational NLP to production-grade AI agent systems.

📚 28 Modules + Capstone + 10 Appendices ⏱ ~190 Hours 💻 Project-Based Beginner to Advanced 🤖 Writing Team (36 AI Agents)

Prerequisites

Learning Outcomes

🟢 Basic
🟡 Intermediate
🔴 Advanced
📐 Fundamentals
⚙️ Engineering
🔬 Research
🔧 Lab

Table of Contents

Part I: Foundations
00 ML & PyTorch Foundations 01 Foundations of NLP & Text Representation 02 Tokenization & Subword Models 03 Sequence Models & the Attention Mechanism 04 The Transformer Architecture 05 Decoding Strategies & Text Generation
Part II: Understanding LLMs
06 Pre-training, Scaling Laws & Data Curation 07 Modern LLM Landscape & Model Internals 08 Inference Optimization & Efficient Serving
Part III: Working with LLMs
09 Working with LLM APIs 10 Prompt Engineering & Advanced Techniques 11 Hybrid ML+LLM Architectures & Decision Frameworks
Part IV: Training & Adapting
12 Synthetic Data Generation & LLM Simulation 13 Fine-Tuning Fundamentals 14 Parameter-Efficient Fine-Tuning (PEFT) 15 Knowledge Distillation & Model Merging 16 Alignment: RLHF, DPO & Preference Tuning 17 Interpretability & Mechanistic Understanding
Part V: Retrieval & Conversation
18 Embeddings, Vector Databases & Semantic Search 19 Retrieval-Augmented Generation (RAG) 20 Building Conversational AI Systems
Part VI: Agents & Applications
21 AI Agents: Tool Use, Planning & Reasoning 22 Multi-Agent Systems & Orchestration 23 Multimodal Generation 24 LLM Applications: Vibe-Coding, Finance, Healthcare & Beyond 25 Evaluation, Experiment Design & Observability
Part VII: Production & Strategy
26 Production Deployment, Safety & Ethics 27 LLM Strategy, Product Management & ROI
Appendices
A Mathematical Foundations B Machine Learning Essentials C Python for LLM Development D Environment Setup & Cloud Provisioning E Git & Collaboration for ML Projects F Glossary of Terms G Hardware & Compute Reference H Model Card Quick Reference I Prompt Template Library J Dataset & Benchmark Reference
Part I: Foundations
Module 00

ML & PyTorch Foundations

Prerequisite refresher covering core machine learning concepts and hands-on PyTorch programming. Ensures all students share a common foundation before diving into NLP and LLMs.

  • Feature engineering and representation
  • Supervised learning: classification and regression fundamentals
  • Loss functions and optimization: gradient descent, SGD, mini-batch SGD
  • Overfitting, underfitting, and regularization (L1, L2, dropout)
  • Bias-variance tradeoff and generalization theory
  • Cross-validation and model selection strategies
  • Neural network fundamentals: perceptrons, MLPs, activation functions
  • Backpropagation and the chain rule
  • Batch normalization, dropout, and weight initialization
  • Convolutional neural networks (CNNs) overview
  • Training best practices: learning rate scheduling, early stopping, gradient clipping
0.3 PyTorch Tutorial 🟢⚙️🔧
  • Comprehensive PyTorch introduction
  • Tensors: creation, indexing, broadcasting, device management (CPU/GPU)
  • Autograd: automatic differentiation, computational graphs, gradient accumulation
  • Building models with nn.Module: layers, parameters, forward pass
  • Data loading: Dataset, DataLoader, transforms, batching
  • Training loop pattern: forward, loss, backward, optimizer step
  • Saving and loading models: state_dict, checkpoints
  • Debugging: hooks, gradient inspection, profiling with torch.profiler
  • Lab: Build and train an image classifier in PyTorch from scratch; practice tensor operations, custom datasets, and the full training loop
  • The RL framework: agent, environment, state, action, reward, episode
  • Policy: mapping states to actions; deterministic vs. stochastic policies
  • Value functions: state-value V(s), action-value Q(s,a); the Bellman equation (intuition, not derivation)
  • Policy gradient theorem (intuition): adjusting the policy to increase the probability of actions that led to high rewards
  • PPO intuition: clipping the policy update to prevent destructive large changes; why this matters for LLM training
  • How RL connects to LLM training: the LLM is the policy, generating a token is an action, the reward model scores the output
  • This lesson provides the foundations for Module 16 (RLHF, DPO, RLVR)

Build intuition for how machines understand text: from bag-of-words to dense vector spaces. Covers classical and neural word representations that underpin all modern LLM work.

  • History of NLP: rule-based → statistical → neural → LLM era
  • Comprehensive NLP task taxonomy:
    • Text classification: sentiment analysis, intent detection, topic categorization, spam filtering
    • Sequence labeling: NER, POS tagging, chunking
    • Text generation: summarization (extractive vs. abstractive), machine translation, paraphrase generation
    • Question answering: extractive QA, generative QA, open-domain QA
    • Information extraction: relation extraction, event detection, slot filling
    • Semantic tasks: textual entailment, semantic similarity, natural language inference
    • Conversational AI: dialogue systems, task-oriented dialogue, open-domain chat
    • How LLMs are changing each task: from specialized models to unified generative approaches
  • Why language is hard: ambiguity, context, compositionality
  • Course roadmap and project overview
  • Text cleaning: Unicode normalization, regex, stop words, stemming, lemmatization
  • Bag-of-Words, TF-IDF, n-grams
  • Term vectors and TF-IDF in depth: term frequency saturation, inverse document frequency weighting, document length normalization, vector space model for retrieval
  • One-hot encoding and its limitations
  • Lab: Build a text preprocessing pipeline with spaCy and NLTK
  • Distributional hypothesis: "you shall know a word by the company it keeps"
  • Word2Vec: CBOW and Skip-gram architectures, negative sampling
  • GloVe: global matrix factorization approach
  • FastText: subword-level embeddings
  • Visualizing embeddings: t-SNE, UMAP, analogy tasks
  • Lab: Train Word2Vec on a custom corpus using Gensim; explore word analogies
  • Limitations of static embeddings (polysemy)
  • ELMo: bi-directional LSTM-based contextualized representations
  • Transfer learning in NLP: why pre-train?
  • Setting the stage for BERT and GPT

Tokenization is the critical first step of every LLM pipeline. Understand the algorithms behind BPE, WordPiece, and SentencePiece, and learn how tokenizer choice affects model behavior, cost, and multilingual capability.

  • From characters to tokens: the vocabulary tradeoff
  • Impact on context window, cost, and model performance
  • Tokenization artifacts and edge cases (numbers, code, CJK, emoji)
  • Byte Pair Encoding (BPE): algorithm, merge rules, vocab construction
  • BPE internals: merge table as priority queue: encoding uses greedy left-to-right merging; the merge tree data structure maps byte sequences to token IDs with O(n·log(n)) encoding complexity
  • WordPiece (BERT's tokenizer): MaxMatch algorithm, likelihood-based merging vs. frequency-based (BPE)
  • Unigram model (SentencePiece): probabilistic tokenization: Viterbi decoding finds most likely segmentation; EM training to prune vocabulary from large initial set
  • Byte-level BPE (GPT-2/GPT-4 style): base-256 vocabulary, no unknown tokens, universal UTF-8 coverage
  • Comparing tokenizers: tiktoken (Rust/Python, fast), Hugging Face tokenizers (Rust core), SentencePiece (C++)
  • Tokenizer-free / byte-level models: ByT5 (byte-to-byte), MegaByte (patch-based byte model), character-level approaches: tradeoffs: longer sequences but no vocabulary mismatch
  • Lab: Train a BPE tokenizer from scratch; visualize the merge tree; implement encoding step-by-step; compare token counts and fertility across models
  • Special tokens: [CLS], [SEP], [PAD], <|endoftext|>, chat templates
  • Tokenizer configuration for chat models (chat_template, apply_chat_template)
  • Multilingual tokenization: fertility rates, script coverage, cross-lingual vocab sharing
  • Multimodal tokenization: how vision and audio tokens work
  • Estimating token counts for API cost optimization
  • Lab: Inspect and compare tokenization of the same text (English, Chinese, Arabic, code) across GPT-4, Claude, Llama 3, and Gemma tokenizers

Trace the evolution from RNNs to the attention mechanism: the key breakthrough that enabled transformers. Build deep intuition for how attention works mathematically and conceptually.

  • RNN fundamentals: hidden state, sequential processing
  • LSTM and GRU: gating mechanisms
  • Bidirectional RNNs
  • Vanishing/exploding gradients and the long-range dependency problem
  • Encoder-decoder architecture for seq2seq tasks
3.2 The Attention Mechanism 🟡⚙️🔧
  • Intuition: "where to look" when generating output
  • Bahdanau (additive) attention
  • Luong (multiplicative / dot-product) attention
  • Attention weights as soft alignment
  • Backpropagation through attention: gradient flow through softmax (Jacobian structure), why attention gradients are dense; gradient of scaled dot-product w.r.t. Q, K, V
  • Attention as differentiable dictionary lookup: soft retrieval from value memory indexed by key similarity
  • Lab: Implement Bahdanau attention from scratch in PyTorch; manually compute gradients and verify with autograd; visualize attention heatmaps
  • Query, Key, Value formulation: linear projections WQ, WK, WV and their learned subspaces
  • Scaled dot-product attention: why scale by √dk: variance analysis of dot products in high dimensions
  • Softmax temperature and attention entropy: sharp vs. diffuse attention distributions
  • Multi-head attention: parallel subspace projections and the concatenation/projection output
  • Self-attention vs. cross-attention: when Q and KV come from different sequences
  • Causal (masked) attention: lower-triangular mask for autoregressive models
  • Attention complexity: O(n²d) compute, O(n²) memory: understanding the quadratic bottleneck
  • Lab: Implement multi-head self-attention from scratch with explicit matrix operations; verify against PyTorch nn.MultiheadAttention; visualize attention weight distributions and entropy

Deep dive into the full transformer architecture: the foundation of every modern LLM. Understand every component, from positional encoding to layer normalization, and implement one from scratch.

  • "Attention Is All You Need" paper walkthrough: original architecture and design rationale
  • Encoder and Decoder stacks: layer composition, information flow, residual stream hypothesis
  • Positional encoding internals: sinusoidal (frequency basis, rotation interpretation), learned embeddings, RoPE (rotation matrices, relative position via complex multiplication), ALiBi (linear bias slopes)
  • Feed-forward networks: expansion ratio (4x → 8/3x for SwiGLU), role as key-value memories (Geva et al.)
  • Activation functions: ReLU → GELU → SwiGLU: ablation evidence and why SwiGLU wins
  • Normalization: LayerNorm vs RMSNorm (computation, gradient flow); Pre-LN vs Post-LN (training stability analysis)
  • Weight initialization: Xavier/He schemes, scaled initialization for deep transformers, μP (maximal update parametrization)
  • Loss function: cross-entropy for next-token prediction, label smoothing, auxiliary losses
  • 📐 Information theory for language modeling: entropy as the theoretical lower bound on compression; cross-entropy loss is an upper bound on true entropy; perplexity = 2^(cross-entropy) measures how "surprised" the model is; KL divergence measures distribution mismatch between model and data; mutual information quantifies how much context reduces uncertainty about the next token
  • Residual connections as gradient highways: why transformers train better than deep RNNs
  • Implement a complete decoder-only transformer in PyTorch (~300 lines)
  • Token embeddings + RoPE positional encoding
  • Multi-head causal self-attention layer
  • Feed-forward layer with SwiGLU
  • Training loop on a small text corpus
  • Lab: Train a BPE-level mini-GPT; generate text samples; profile memory and compute
  • Encoder-only (BERT), Decoder-only (GPT), Encoder-Decoder (T5, BART): architectural comparison and when to use each
  • Encoder-decoder deep dive: cross-attention mechanism (queries from decoder, keys/values from encoder); T5 text-to-text framework; BART denoising pre-training; seq2seq fine-tuning for summarization, translation, and structured generation; why encoder-decoder excels at conditional generation tasks vs. decoder-only
  • Efficient attention: Flash Attention, Multi-Query Attention (MQA), Grouped-Query Attention (GQA)
  • Multi-head Latent Attention (MLA) as a general efficient attention technique: projects keys and values into a low-rank latent space before caching; reduces KV cache by 10x+ compared to standard MHA; mathematically: K_cache = W_down @ K (compress), K_restored = W_up @ K_cache (decompress at attention time); comparison with GQA and MQA: MLA achieves better quality at similar cache sizes
  • Sparse attention: Longformer, BigBird patterns
  • Linear attention and state-space models (Mamba, RWKV, Jamba)
  • 🔬 State Space Models in depth: the S4 lineage (S4, S5, S6/Mamba); continuous-time ODE formulation discretized into linear recurrences; the HiPPO framework for long-range dependency initialization; Mamba's selection mechanism (input-dependent state transitions, making the model data-dependent unlike fixed SSMs); Mamba-2 structured state space duality (connecting SSMs to attention); hybrid architectures: Jamba (Mamba + attention layers), Zamba; when SSMs match or beat transformers (long sequences, low latency) and when they fall short (in-context learning, complex retrieval)
  • Mixture of Experts (MoE) internals: expert FFN layers, gating network (top-k routing), auxiliary load-balancing loss, expert capacity factor; DeepSeek MoE: shared experts + routed experts, fine-grained expert segmentation
  • 🔬 Computational complexity of attention: O(n2d) time and O(n2) memory for standard attention; theoretical lower bounds (Fine-Grained Complexity perspective, Strong Exponential Time Hypothesis implications); whether sub-quadratic attention can be provably equivalent to full attention; IO complexity analysis underlying FlashAttention
  • 🔬 RWKV architecture internals: the WKV (Weighted Key-Value) mechanism replacing attention with linear-complexity recurrence; time-decay factors creating position-aware token mixing; RWKV-5/6 improvements (multi-headed WKV, better gating); transformer-quality training parallelism with RNN-efficiency inference; comparison with Mamba on standard benchmarks
  • 🔬 Recent sparse attention advances: ring attention for distributed long-context across multiple GPUs (sequence parallelism); blockwise parallel decoding; learned sparse patterns vs. fixed patterns; theoretical framework for which sparse patterns preserve expressiveness vs. lose information
  • 🔬 Gated Attention (NeurIPS 2025 Best Paper): applying a learnable sigmoid gate after scaled dot-product attention; enables non-linearity, sparsity, and attention-sink-free inference; deployed in Qwen3-Next; Gated DeltaNet combines gated linear attention with gated softmax attention for hybrid architectures
  • 🔬 Attention architecture evolution: MHA (original) → GQA (shared KV heads for cache reduction) → MLA (low-rank KV projection) → Gated Attention (sigmoid gate for sparsity) → hybrid architectures combining softmax attention with linear attention (Gated DeltaNet, Jamba); each step trades expressiveness for efficiency
  • GPU architecture internals: streaming multiprocessors (SMs), warps (32 threads), thread blocks
  • Memory hierarchy: registers → SRAM (shared memory, ~20MB) → HBM (global, ~80GB) → host DRAM
  • Memory bandwidth vs. compute: arithmetic intensity and the roofline model
  • Why attention is memory-bound: the IO complexity analysis
  • FlashAttention internals: tiling the QKT computation, online softmax algorithm, avoiding materialization of the n×n attention matrix
  • FlashAttention-2/3: warp-level optimizations, FP8 support
  • Kernel fusion: combining operations to reduce memory round-trips
  • Triton: writing custom GPU kernels in Python: matrix multiply, fused attention
  • Resource accounting: FLOPs per token = 6ND (forward+backward), memory = 2P (params) + optimizer states + activations + KV cache (KV cache: stored key/value tensors from previous tokens that avoid recomputation during generation; covered in depth in Module 8.2)
  • Lab: Write a simple Triton kernel for fused softmax; benchmark against PyTorch native; calculate FLOPs and memory for a 7B model training run
  • 🔬 Universal approximation results for transformers
  • 🔬 What fixed-depth transformers can compute (bounded-depth threshold circuits, TC^0)
  • 🔬 Why chain-of-thought extends computational power (chain-of-thought: prompting the model to show intermediate reasoning steps before the final answer; covered in depth in Module 10.2) (transformers + CoT can simulate arbitrary Turing machines)
  • 🔬 Depth vs. width tradeoffs for expressiveness
  • 🔬 Implications: some problems provably require CoT (they exceed the computational class of single-pass transformers)

Understand how LLMs generate text token-by-token. Master the algorithms that control quality, diversity, and speed of generation: from greedy search to speculative decoding. Identified as a gap vs. Stanford CS336 and CMU ANLP.

  • Greedy decoding: simplest but suboptimal
  • Beam search: exploring multiple hypotheses
  • Length normalization and length penalty
  • Constrained beam search: forcing specific tokens/patterns
  • Lab: Implement greedy and beam search from scratch; compare output quality on summarization
5.2 Stochastic Sampling Methods 🟡⚙️🔧
  • Temperature scaling: sharpening and flattening distributions
  • Top-k sampling
  • Nucleus (top-p) sampling
  • Min-p sampling: adaptive threshold
  • Typical decoding and eta sampling
  • Repetition penalty, frequency penalty, presence penalty
  • Lab: Implement all sampling methods; generate text at various temperatures and visualize token probability distributions
  • 🔬 Contrastive decoding: amateur vs. expert model
  • 🔬 Classifier-free guidance for language models
  • Grammar-constrained decoding (Outlines, Guidance, LMQL)
  • JSON schema enforcement at the logit level
  • Watermarking generated text: detection and robustness
  • 🔬 Minimum Bayes Risk (MBR) decoding: sample N candidates, select the one minimizing expected risk under a utility metric (e.g., LLM-judge score, ROUGE, BERTScore); outperforms greedy and best-of-N decoding (ICLR 2025); practical tradeoff: N samples × utility evaluation cost vs. quality gain
  • 🔬 Discrete diffusion for text: MDLM, SEDD, LLaDA, Dream
  • The forward process adds noise to token embeddings; reverse process denoises to generate
  • Parallel token generation (all tokens simultaneously, not autoregressive)
  • 🔬 Gemini Diffusion paradigm
  • Advantages: order-of-magnitude latency reduction for long outputs
  • Limitations: quality gap vs. autoregressive for complex reasoning
  • 🔬 TraceRL (ICLR 2026): RL post-training for diffusion LLMs
Part II: Understanding LLMs

Understand how LLMs are trained at scale: pre-training objectives, data curation pipelines, scaling laws, and the computational infrastructure behind modern foundation models. Expanded with deeper treatment of scaling laws and data curation per Stanford CS336.

6.1 The Landmark Models 🟢🔬
  • BERT and its variants (RoBERTa, DeBERTa, ALBERT)
  • GPT series: GPT-1 → GPT-2 → GPT-3 → InstructGPT → GPT-4
  • T5 and the text-to-text framework
  • Emergence: in-context learning, chain-of-thought reasoning
  • Causal language modeling (CLM): next-token prediction (GPT family)
  • Masked language modeling (MLM): BERT, RoBERTa
  • Span corruption / denoising: T5, UL2
  • Prefix LM: PaLM, GLM
  • Fill-in-the-middle (FIM) for code models
  • 🔬 Multi-token prediction: training models to predict multiple future tokens simultaneously (Meta, 2024); architecture: shared trunk with N independent prediction heads; benefits: improved sample efficiency, better representations of long-range dependencies, natural fit for speculative decoding; used in DeepSeek V3 training; challenges: increased memory during training, diminishing returns beyond 4 tokens
  • Kaplan scaling laws: loss as a function of N, D, C
  • Chinchilla laws: compute-optimal data/parameter ratios
  • Data-constrained scaling: what happens when you run out of data?
  • Over-training small models (Llama approach): trading compute for inference cost
  • Predicting loss from compute budget: practical use of scaling laws
  • Emergent capabilities and phase transitions
  • 🔬 The emergent abilities debate: Schaeffer et al. (2023) argued emergent abilities are a "mirage" caused by nonlinear metric choices (switching from accuracy to log-likelihood makes transitions smooth); counterarguments: some capabilities genuinely appear discontinuously on continuous metrics; implications for AI safety: if capabilities are unpredictable, governance is harder; implications for scaling decisions: if capabilities are smooth, we can extrapolate
  • Lab: Fit scaling law curves to mini-model training runs; predict loss for a target model size
6.4 Data Curation at Scale 🔴⚙️🔧
  • Pre-training data sources: Common Crawl, The Pile, RedPajama, FineWeb, DCLM
  • Web crawling and text extraction pipelines
  • Deduplication: exact (hash), near-duplicate (MinHash/SimHash), fuzzy
  • Quality filtering: heuristic rules, perplexity scoring, classifier-based
  • Data mixing: domain proportions and their impact on capabilities
  • Toxicity and PII removal at scale
  • 🔬 Data pruning: removing low-value training examples to reduce compute without quality loss; influence functions: tracing model predictions back to specific training examples (which training points most affect this output?); TRAK and datamodels for efficient attribution at scale; membership inference attacks as attribution tools; connection to copyright litigation (NYT v. OpenAI) and GDPR data subject requests; practical use: debugging model failures by identifying problematic training data
  • Lab: Build a mini data curation pipeline: crawl → extract → deduplicate → filter → quality-score using FineWeb tools
  • Adam optimizer internals: first/second moment estimation, bias correction, memory cost (2× model params)
  • AdamW: decoupled weight decay: why it matters for transformers
  • Memory-efficient optimizers: Adafactor (factored second moments), 8-bit Adam, LION (sign-based)
  • Learning rate schedules: warmup necessity (preventing early divergence), cosine decay with restarts
  • Gradient accumulation: simulating large batch sizes: interaction with batch norm and LR
  • Training dynamics: loss landscape geometry, sharp vs. flat minima, grokking phenomenon
  • Training instabilities: loss spikes, NaN gradients: root causes and mitigations (z-loss, gradient clipping)
  • Collective communication primitives: all-reduce, all-gather, reduce-scatter: ring vs. tree topologies
  • Data parallelism (DDP): replicated model, gradient all-reduce, synchronized SGD
  • Fully Sharded Data Parallelism (FSDP): parameter sharding, forward/backward gather/scatter lifecycle
  • ZeRO optimization stages: Stage 1 (optimizer states) → Stage 2 (+gradients) → Stage 3 (+parameters)
  • Tensor parallelism: column/row splitting of linear layers, all-reduce placement
  • Pipeline parallelism: micro-batching, 1F1B schedule, pipeline bubbles
  • Mixed precision: FP16 (loss scaling needed), BF16 (range preserved), FP8 (Hopper GPUs)
  • FP8 training at scale: DeepSeek V3 demonstrated successful FP8 mixed-precision training at 671B parameters (first large-scale demonstration); E4M3 for forward pass activations, E5M2 for gradients; per-tensor dynamic scaling to prevent overflow; 2x memory reduction and higher throughput vs. BF16 with minimal quality loss
  • Gradient checkpointing: recomputing activations to trade compute for memory: optimal checkpoint placement
  • Data loading pipeline: tokenized data sharding, weighted sampling across domains, curriculum
  • Lab: Train a small model with FSDP across multiple GPUs; compare DDP vs. FSDP memory footprint; profile communication overhead
  • 🔬 The mystery: how do transformers learn from examples in the prompt without gradient updates?
  • 🔬 Transformers as implicit meta-learners: the Bayesian interpretation (Xie et al. 2022)
  • 🔬 In-context learning as implicit gradient descent (Akyurek et al. 2023, Von Oswald et al. 2023)
  • 🔬 Task vectors: how in-context examples shift internal representations toward task-relevant subspaces
  • 🔬 Mesa-optimization: are transformers learning optimization algorithms internally?
  • Limitations: when in-context learning fails (distribution shift, complex reasoning, long contexts)
  • Connection to few-shot prompting practice: why example selection and ordering matter

Survey the current state of LLMs, both closed and open-source, and understand the architectural innovations, reasoning capabilities, and multilingual dimensions of modern models.

  • OpenAI: GPT-4o, o1/o3: reasoning models and chain-of-thought
  • Anthropic: Claude 3.5 Sonnet, Claude 4 family: constitutional AI, long context
  • Google: Gemini 2.0 / 2.5: native multimodality, million-token context, "thinking" mode
  • xAI Grok, Cohere Command R+, Mistral Large: second-tier frontier models
  • Comparing capabilities, pricing tiers, rate limits, and context windows
  • Meta Llama 3 / 3.1 / 4: architecture, training, chat fine-tuning
  • Mistral, Mixtral (MoE), Mistral Large
  • Google Gemma 2 / 3
  • Qwen 2.5, DeepSeek-V3 / R1: MoE and reasoning
  • DeepSeek V3 architecture innovations: Multi-head Latent Attention (MLA) compresses KV cache by projecting keys/values into a low-rank latent space, reducing cache by 10x+; FP8 mixed-precision training at 671B parameters (first successful large-scale FP8 training); auxiliary-loss-free MoE load balancing using bias terms instead of loss penalties; multi-token prediction training objective (predict next N tokens simultaneously)
  • Microsoft Phi-3 / Phi-4: small but capable models via knowledge distillation
  • Recent: Llama 4 (MoE, native multimodal), Gemma 3 (vision), DeepSeek-R1 (open reasoning)
  • Specialized: CodeLlama, StarCoder2, Whisper, LLaVA
  • The Hugging Face ecosystem: Model Hub, Transformers, Datasets, Spaces
  • Lab: Download and run Llama 3 8B locally; compare output quality with a 70B model via API
  • Inference-time scaling: the paradigm shift from train-time to test-time compute
  • Chain-of-thought at scale: o1/o3, DeepSeek-R1 internals
  • Process reward models (PRMs) vs. outcome reward models (ORMs)
  • Best-of-N sampling with reward-guided selection
  • Monte Carlo Tree Search for language: LATS, AlphaProof approach
  • Compute-optimal inference: when to think longer vs. use a bigger model
  • Lab: Implement best-of-N with a reward model; compare accuracy vs. compute on math reasoning tasks
  • Multilingual pre-training: cross-lingual transfer, curse of multilinguality
  • Low-resource language challenges and solutions
  • Cultural bias in LLMs: Western-centric defaults, evaluation across cultures
  • Multilingual evaluation benchmarks and metrics
  • Adapting English-centric models to new languages (continued pre-training, vocabulary extension)

Master the techniques that make LLM inference fast and affordable: from quantization and KV cache optimization to speculative decoding and high-throughput serving. Identified as a gap vs. Stanford CS336 and CMU ANLP.

8.1 Model Quantization 🟡⚙️🔧
  • Quantization math: mapping float → int: absmax (symmetric: q = round(x / max|x| × 2n-1)), zero-point (asymmetric: shift + scale), per-tensor vs. per-channel vs. per-group granularity
  • Data types: INT8, INT4, FP8 (E4M3, E5M2), NF4 (normal-float: quantile-based 4-bit, optimal for normally-distributed weights)
  • Calibration strategies: how to choose quantization parameters: min/max, percentile, MSE-minimizing, cross-entropy-minimizing
  • Post-training quantization: GPTQ (layer-wise Hessian-based optimal rounding), AWQ (activation-aware: protect salient weight channels), GGUF (llama.cpp format, mixed-precision per tensor)
  • Quantization-aware training: simulated quantization during forward pass, straight-through estimator for gradients
  • bitsandbytes: 4-bit and 8-bit loading with automatic mixed-precision; NF4 + double quantization for QLoRA
  • Quality degradation analysis: perplexity vs. bit width curves, task-specific sensitivity, outlier features
  • Lab: Quantize a 7B model to 4-bit with GPTQ and AWQ; compare perplexity, generation quality, inference speed, and memory at INT8/INT4/NF4
  • The KV cache explained: storing key/value tensors from all previous tokens to avoid recomputation
  • KV cache data structure: tensor of shape [batch, num_heads, seq_len, head_dim] per layer: memory formula: 2 × layers × heads × seq_len × head_dim × dtype_size
  • Why inference is memory-bandwidth-bound: low arithmetic intensity during generation
  • PagedAttention internals: virtual memory analogy: block tables map logical KV positions to physical GPU memory blocks; eliminates fragmentation and enables memory sharing across sequences
  • KV cache compression: INT8/INT4 quantization of cached values, H2O eviction (Heavy-Hitter Oracle), sliding window attention, StreamingLLM (attention sinks)
  • MQA vs. GQA vs. MHA: sharing K,V heads reduces cache by N×; GQA as the modern compromise (Llama 2/3)
  • Prefix caching: RadixAttention tree for sharing cached prefixes across requests: data structure and lookup
  • Continuous batching: dynamically adding/removing sequences mid-batch: iteration-level vs. request-level scheduling
  • 🔬 Test-Time Training (TTT): compressing long context into model weights via continued next-token-prediction at inference time; TTT layers replace attention with a learned update rule applied during inference; achieves 35x speedup over full attention at 2M context; blurs the line between training and inference
  • 🔬 DeepSeek Sparse Attention (DSA): hierarchical two-stage sparse attention pipeline (Lightning indexer for coarse selection, then fine-grained token selection); reduces inference cost by approximately 70% for long contexts; introduced in DeepSeek V3.2
  • Lab: Calculate KV cache size for Llama 3 8B/70B at various context lengths; profile memory with vLLM; implement prefix caching and measure throughput gain
8.3 Speculative Decoding 🔴⚙️🔧
  • Speculative decoding principle: draft γ tokens with fast model, verify all γ in a single forward pass of target model: mathematically guaranteed to match target distribution
  • Acceptance/rejection: compare draft token probabilities p(x) with target q(x); accept with probability min(1, q(x)/p(x)); reject and resample from adjusted distribution
  • Draft model selection: separate small model, self-speculative (layer skipping), n-gram lookup, retrieval-based
  • EAGLE: feature-level autoregression: predicting hidden states, not tokens; tree-structured verification for parallel candidate evaluation
  • Medusa: multiple prediction heads on top of target model: each head predicts k-th future token
  • Token tree verification: batched verification of multiple candidate sequences in a single forward pass using tree attention masks
  • When speculative decoding helps: high draft acceptance rate (>70%), latency-sensitive single-request, target model is bandwidth-bound
  • Lab: Implement speculative decoding from scratch with rejection sampling; benchmark speedup with different draft models; measure acceptance rates
8.4 Serving Infrastructure 🟡⚙️🔧
  • vLLM: high-throughput serving with continuous batching
  • TGI (Text Generation Inference) by Hugging Face
  • SGLang: optimized runtime with RadixAttention
  • TensorRT-LLM: NVIDIA's inference engine with hardware-level GPU optimization; 30-50% higher throughput than vLLM at high concurrency
  • LMDeploy: inference engine with TurboMind backend; competitive quantization support
  • Ollama and llama.cpp for local inference
  • Triton Inference Server for production
  • Benchmarking: throughput (tokens/sec), latency (TTFT, TPS), concurrency
  • Lab: Deploy vLLM and TGI side-by-side; benchmark throughput and latency under load
Part III: Working with LLMs
Module 09

Working with LLM APIs

Master the practical skills of calling, configuring, and optimizing LLM APIs from all major providers.

9.1 OpenAI API Deep Dive 🟢⚙️🔧
  • Chat Completions API: messages, roles (system/user/assistant), parameters
  • Temperature, top_p, max_tokens, frequency/presence penalty
  • Streaming responses with SSE
  • Function calling / tool use
  • Structured Outputs (JSON mode, response_format)
  • Batch API for cost reduction
  • Lab: Build a multi-turn chatbot with function calling using the OpenAI Python SDK
  • Anthropic Messages API: system prompts, prompt caching, tool use, extended thinking
  • Google Gemini API: generateContent, grounding, code execution
  • AWS Bedrock: unified access to multiple model providers
  • Azure OpenAI: enterprise deployment patterns
  • API comparison: feature parity, pricing, rate limits
  • Lab: Implement the same task across OpenAI, Anthropic, and Gemini APIs; compare results and cost
  • LiteLLM: unified interface for 100+ LLM providers
  • OpenAI-compatible APIs: standardization pattern
  • OpenRouter: model routing and fallback
  • Cost tracking, rate limiting, and retry strategies
  • Production LLM error handling patterns: circuit breaker pattern (failover when provider returns errors for extended periods); timeout management (separate TTFT timeout from total generation timeout); error taxonomy: 429 (rate limit, exponential backoff with jitter), context length exceeded (truncate and retry), content filter triggered (rephrase), malformed tool call JSON (retry with stricter schema); graceful degradation (cached responses, simpler model fallback, static FAQ when LLM unavailable)
  • Caching strategies: semantic caching, prompt caching
  • Semantic cache implementation: embed incoming query, similarity search against cached query-response pairs (cosine threshold typically 0.95+), return cached response if match found; cache invalidation strategies (TTL, source document change detection); tools: GPTCache, Redis with vector search
  • Token budget enforcement: per-user/organization token tracking, hard/soft spending limits, cost alerting on anomalous usage spikes, per-feature cost attribution dashboards
  • AI gateways for production: Portkey (routing, fallbacks, spend tracking, caching, guardrails across 1600+ LLMs), Helicone (open-source observability proxy with request logging and cost tracking)
  • Lab: Build a provider-agnostic LLM client with automatic fallback and cost tracking

Prompting is programming with natural language. Learn systematic techniques from basic few-shot to advanced reasoning chains, reflection patterns, and automated prompt optimization.

  • Zero-shot, one-shot, and few-shot prompting
  • System prompts and role assignment
  • Instruction clarity: specificity, constraints, output format
  • Prompt templates and variable injection
  • Handling edge cases: refusals, hallucinations, verbosity
  • Lab: Iteratively refine prompts for a classification task; measure accuracy improvements
10.2 Advanced Reasoning Strategies 🟡⚙️🔧
  • Chain-of-Thought (CoT) prompting and its variants
  • Self-consistency: sampling multiple reasoning paths and majority voting
  • Tree-of-Thought (ToT): structured exploration with backtracking
  • Step-back prompting: abstraction before reasoning
  • Meta-prompting and prompt chaining
  • Lab: Implement CoT, self-consistency, and ToT for math reasoning; compare accuracy
  • Reflection as a first-class design pattern (per Andrew Ng's framework) (see also Module 21.1 for reflection as an agentic architecture pattern)
  • Self-evaluation: having the LLM critique its own output
  • Iterative refinement loops: generate → critique → revise
  • Constitutional AI-style self-checks at prompt-time
  • Reflexion: memory-augmented self-reflection over multiple attempts
  • When reflection helps vs. when it's compute-wasteful
  • Lab: Build a reflection loop for code generation: generate → test → reflect on errors → fix; measure pass@1 improvement
  • JSON mode and schema enforcement
  • Pydantic models for output validation (Instructor library)
  • Automatic prompt optimization: DSPy, OPRO
  • Prompt versioning, A/B testing, and regression testing
  • Lab: Use Instructor + Pydantic to extract structured data; then use DSPy to auto-optimize a multi-step prompt pipeline

In production, LLMs rarely work alone. Learn when to use an LLM vs. classical ML, how to combine them in hybrid architectures, and how to make principled cost-performance tradeoffs. Addresses the #1 gap identified across all three executive perspectives.

11.1 When NOT to Use an LLM 🟡⚙️🔧
  • The LLM decision framework: accuracy vs. latency vs. cost vs. interpretability: when classical ML wins
  • Classification: TF-IDF + logistic regression at 0.001x cost vs. GPT-4: when each is appropriate
  • Named Entity Recognition: spaCy/CRF vs. LLM extraction: speed and accuracy tradeoffs (see Module 11.5 for full IE treatment)
  • Tabular prediction: XGBoost/LightGBM vs. LLM: structured data is still king for classical ML
  • Regex and rule-based extraction: when deterministic rules beat stochastic LLM outputs
  • Cost modeling: calculating per-query cost at scale for LLM vs. classical approaches ($0.001 vs. $0.00001)
  • Lab: Benchmark the same classification task with TF-IDF+LR, fine-tuned BERT, GPT-4 few-shot, and fine-tuned Llama: compare accuracy, latency, cost, and reliability
11.2 Hybrid ML+LLM Architectures 🔴⚙️🔧
  • Pattern: LLM as feature extractor: use LLM to generate embeddings or structured features, feed into XGBoost/neural net for final prediction
  • Pattern: Classical triage → LLM escalation: cheap model handles 80% of cases, LLM handles the complex 20%
  • Pattern: LLM-powered feature engineering: generate text descriptions of structured data, enrich sparse features with LLM reasoning
  • Pattern: Ensemble: classical model + LLM vote, confidence-weighted combination
  • Pattern: LLM → structured pipeline: LLM extracts entities/intent, downstream classical system executes (e.g., NLU → slot-filling → API call)
  • Pattern: Classical NLP pre-filter + LLM: regex/keyword filter reduces candidates, LLM does semantic analysis on survivors
  • Cascading model architectures: small model → medium model → large model with confidence-based routing
  • Lab: Build a customer support system where a classifier routes tickets, an LLM extracts structured info from complex ones, and a rules engine executes the resolution
  • 🔬 LLM-native time series models: TimeGPT, Chronos (Amazon), Lag-Llama, Moirai: architectures and capabilities
  • 🔬 Zero-shot forecasting: pre-trained time series foundation models vs. ARIMA/Prophet
  • LLM-powered anomaly explanation: detecting anomalies with classical methods, explaining them with LLMs
  • Multimodal time series: combining numerical data with text context (news, reports) for enriched forecasting
  • Limitations: when statistical models still dominate (short series, simple seasonality, high-frequency data)
  • Total Cost of Ownership (TCO) modeling: API costs + infrastructure + engineering time + maintenance
  • LLM cost optimization patterns: prompt caching, semantic caching, model routing (small→large), batch processing
  • Latency budgets: decomposing end-to-end latency across retrieval, LLM inference, and post-processing
  • Quality-cost Pareto frontier: plotting accuracy vs. cost for different model configurations
  • Build vs. buy analysis: self-hosted open-source vs. API provider: breakeven calculations based on volume
  • Lab: Build a model router that sends simple queries to a small model and complex queries to GPT-4; measure cost savings vs. quality loss
  • The IE task landscape: Named Entity Recognition (NER), relation extraction, event extraction, coreference resolution, slot filling
  • Classical IE pipeline: rule-based → CRF/BiLSTM-CRF → fine-tuned BERT for NER; spaCy, Flair, Stanza
  • LLM-based IE: zero-shot and few-shot extraction with structured output (JSON mode); prompt design for entity and relation extraction
  • Hybrid IE: classical NER for high-recall extraction, LLM for disambiguation, normalization, and complex relations
  • Structured output enforcement: Pydantic models, JSON schema constraints, Instructor library, BAML
  • Evaluation: entity-level F1 (strict vs. partial match), relation extraction metrics, error analysis patterns
  • Production IE patterns: batch extraction from document corpora, incremental knowledge base population, quality monitoring
  • Lab: Build an IE pipeline that extracts entities and relationships from the project dataset using both spaCy NER and LLM few-shot extraction; compare precision, recall, and cost
Part IV: Training & Adapting

Synthetic data is the backbone of this course's project. Learn to generate high-quality, diverse, and domain-specific datasets, and use LLMs as simulators for evaluation and testing.

  • Why synthetic data: cost, privacy, coverage, scale
  • Types: instruction data, conversation data, preference pairs, domain data
  • Quality dimensions: diversity, accuracy, consistency, naturalness
  • Risks: model collapse, bias amplification, data contamination
  • 🔬 LLM output homogeneity problem (NeurIPS 2025): studies across 70+ models reveal pronounced intra-model and inter-model homogenization of creative content; implications for synthetic data (model collapse risk when training on LLM-generated data); mitigation: diversity-promoting decoding, temperature tuning, persona-driven generation
  • Legal and ethical considerations
  • Self-Instruct and Evol-Instruct (WizardLM) approaches
  • Generating instruction-response pairs with seed tasks
  • Multi-turn conversation synthesis
  • Persona-driven generation for diversity
  • Domain-specific data generation strategies
  • Using LLMs to generate preference/ranking data (for RLHF/DPO, covered in Module 16)
  • Lab: Build a pipeline to generate 10K synthetic customer support conversations using persona templates and quality filters
  • Simulating users: generating realistic interaction patterns
  • Synthetic test set generation for RAG evaluation
  • Red-teaming data generation: adversarial prompt synthesis
  • Synthetic A/B test scenarios for LLM applications
  • LLM-based evaluation harness generation
  • Lab: Generate a synthetic evaluation suite for the project: test questions, expected answers, edge cases, and adversarial inputs
  • Automated quality scoring with LLM-as-judge
  • Deduplication: exact, near-duplicate (MinHash), semantic
  • Filtering: length, language, toxicity, topic relevance
  • Argilla for data labeling and review
  • Distilabel for scalable synthetic data pipelines
  • Lab: Build a quality-scored synthetic data pipeline using Distilabel; curate a fine-tuning dataset
  • LLM pre-labeling: using LLMs to generate initial labels for human review: 5-10x annotation speedup
  • Confidence-based routing: LLM labels high-confidence samples automatically, humans label uncertain ones
  • Active learning with LLMs: selecting the most informative samples for human annotation using uncertainty sampling and diversity sampling
  • Annotation tools: Label Studio, Prodigy, Argilla: LLM integration patterns
  • Annotation guideline generation: using LLMs to draft and iterate on labeling instructions
  • Quality control: inter-annotator agreement (Cohen's κ), LLM-vs-human agreement tracking, label noise detection
  • Lab: Build an LLM-in-the-loop labeling pipeline: LLM pre-labels → confidence routing → human review in Argilla → fine-tuning dataset
  • Weak supervision fundamentals: labeling functions, noise-aware models, and the Snorkel paradigm
  • Writing labeling functions: heuristics, pattern matching, knowledge bases, pre-trained models as weak sources
  • Label aggregation: majority voting, generative label models, handling conflicts and abstentions
  • Combining weak supervision with LLM-generated labels for scalable annotation
  • When to use weak supervision vs. LLM labeling vs. human annotation: cost and quality tradeoffs
Project Milestone: Generate the synthetic conversational dataset (10K+ examples) that will be used throughout the rest of the course for fine-tuning, RAG, and agent building. Include multi-turn dialogues, tool-use examples, preference pairs, and evaluation test sets.
Module 13

Fine-Tuning Fundamentals

Learn the complete workflow of fine-tuning LLMs: from data preparation and formatting to training, monitoring, and evaluating adapted models.

  • Prompting vs. RAG vs. fine-tuning: decision framework
  • Use cases: style/tone, domain knowledge, output format, latency, cost
  • Full fine-tuning vs. parameter-efficient methods
  • Catastrophic forgetting and how to mitigate it
  • Continual pre-training vs. instruction fine-tuning
  • Dataset formats: Alpaca, ShareGPT, ChatML, conversational
  • Chat templates and tokenizer configuration
  • Train/validation/test splits for LLMs
  • Data mixing and balancing strategies
  • Packing sequences for efficient training
  • Lab: Prepare the synthetic dataset from Module 12 into Hugging Face Datasets format with proper chat templates
13.3 Supervised Fine-Tuning (SFT) 🟡⚙️🔧
  • Full fine-tuning with Hugging Face Trainer / TRL
  • Hyperparameters: learning rate, batch size, warmup, weight decay, epochs
  • Learning rate schedulers: cosine, linear, constant with warmup
  • Gradient accumulation for large effective batch sizes
  • Monitoring with Weights & Biases, TensorBoard
  • Lab: Fine-tune a Llama 3 8B model on synthetic data using TRL's SFTTrainer; track metrics in W&B
13.4 Fine-Tuning via Provider APIs 🟡⚙️🔧
  • OpenAI fine-tuning API: data format, training, deployment
  • Google Vertex AI model tuning
  • Trade-offs: ease vs. control vs. cost
  • Lab: Fine-tune GPT-4o-mini via OpenAI API on synthetic data; compare with locally fine-tuned model
  • Why fine-tune for representations: domain shift, specialized similarity, clustering quality
  • Choosing the base model: encoder-only (BERT family) vs. decoder-only (LLM2Vec approach) for embeddings
  • When to fine-tune embeddings vs. use off-the-shelf: domain specificity thresholds
  • Full treatment of embedding training (losses, hard negatives, Sentence-Transformers API, labs) is in Module 18.1
  • Adding classification heads to pre-trained models: linear probe vs. full fine-tuning
  • Single-label classification: sentiment, intent, topic; multi-label classification: tagging, multi-intent detection
  • Token classification: NER, POS tagging; adding per-token classification heads
  • Sequence-pair tasks: entailment, similarity, question-answer relevance
  • Practical considerations: class imbalance (weighted loss, oversampling), threshold tuning for multi-label, calibration
  • Hugging Face AutoModelForSequenceClassification, AutoModelForTokenClassification: practical API walkthrough
  • Lab: Fine-tune BERT for intent classification and a decoder model (Llama) for the same task; compare accuracy, latency, and cost
13.7 Adapting Models for Long Text 🔴⚙️🔧
  • The long context challenge: why models trained on 4K tokens struggle at 32K+
  • Context extension techniques: RoPE scaling (linear, NTK-aware, YaRN), position interpolation, dynamic NTK
  • Continued pre-training for long context: LongRoPE, LongLoRA approaches (LoRA is introduced in Module 14.1)
  • Chunking strategies for long documents: hierarchical processing, map-reduce summarization, sliding window with overlap
  • Lost-in-the-middle phenomenon: why models attend poorly to middle context; mitigation strategies (reordering, recursive summarization)
  • Practical tradeoffs: memory scaling (O(n^2) attention), inference latency at long contexts, quality degradation curves
  • Llama 4 Scout 10M token context window: architectural innovations enabling extreme context (iRoPE: interleaved RoPE with some layers using no positional encoding, enabling infinite context extrapolation); early-fusion multimodal approach processing images and text jointly from the first layer
  • Lab: Compare model performance on a QA task at 4K, 16K, and 64K context lengths; implement chunking and map-reduce as alternatives to long-context models

Train large models on consumer hardware by only updating a fraction of parameters. Master LoRA, QLoRA, and other PEFT methods that democratize fine-tuning.

14.1 LoRA & QLoRA 🟡⚙️🔧
  • Low-Rank Adaptation math: W' = W + BA where B ∈ ℝd×r, A ∈ ℝr×d: freezing W, training only B and A
  • Why low-rank works: weight update matrices during fine-tuning have low intrinsic rank (Aghajanyan et al.)
  • Rank (r): tradeoff between capacity and efficiency: typically r=8-64 vs. d=4096
  • Alpha (α) and scaling: α/r scaling factor: why it matters for learning rate transfer across ranks
  • Target modules: which linear layers to adapt (q_proj, v_proj, k_proj, o_proj, gate_proj, up_proj, down_proj)
  • QLoRA internals: NF4 data type (quantile-based 4-bit), double quantization (quantizing the quantization constants), paged optimizers for memory spikes
  • Merging: W_merged = W + (α/r) × BA: lossless for inference, no additional latency
  • Hugging Face PEFT library: config, model wrapping, saving/loading adapters
  • Lab: Fine-tune Llama 3 8B with QLoRA on a single GPU; inspect adapter weight matrices; merge and compare quality with full fine-tune
14.2 Advanced PEFT Methods 🔴⚙️
  • DoRA: Weight-Decomposed Low-Rank Adaptation
  • LoRA+: improved learning rate scheduling for LoRA
  • Prefix Tuning, P-Tuning: prepending trainable embeddings to hidden states
  • Prompt Tuning in depth: learning soft prompt tokens that are prepended to the input; comparison with discrete prompt search; scaling behavior showing prompt tuning matches fine-tuning as model size grows
  • Adapter layers (Houlsby, Pfeiffer)
  • IA3 (Infused Adapter by Inhibiting and Amplifying)
  • Multi-adapter serving: LoRAX, S-LoRA
  • Choosing the right PEFT method for your use case
14.3 Training Platforms & Tools 🟡⚙️🔧
  • Unsloth: 2x faster fine-tuning with memory optimization
  • Axolotl: configuration-driven fine-tuning
  • LLaMA-Factory: web UI for fine-tuning
  • torchtune: PyTorch-native fine-tuning library with memory-efficient recipes for LoRA/QLoRA on consumer GPUs (24GB VRAM)
  • TRL (Transformer Reinforcement Learning) library
  • Cloud training: Google Colab, Lambda Labs, RunPod, Modal
  • Lab: Use Unsloth to fine-tune Mistral 7B with QLoRA in under 30 minutes on a free Colab GPU

Create smaller, faster models that retain the capabilities of larger ones. Learn distillation techniques and model merging strategies that are widely used in the open-source LLM community. Identified as a gap: core technique behind Phi, Orca, distilled DeepSeek-R1.

  • Classical distillation: teacher-student framework, soft targets, temperature
  • Black-box distillation: distilling from API-only models via synthetic data
  • White-box distillation: logit matching, intermediate layer matching
  • Case studies: Orca (progressive learning from GPT-4), Phi (textbook-quality data), distilled DeepSeek-R1
  • Speculative knowledge distillation: training draft models for speculative decoding
  • Legal and licensing considerations of distillation
  • ⚙️ Small-but-capable model research: the Phi series (Microsoft) demonstrating data quality over quantity; key innovations: synthetic data curriculum, targeted capability training, careful data mixing; Gemma 3 (Google), SmolLM (Hugging Face), Qwen2.5-Coder: similar principles at different scales; implications: 4B models rivaling 70B on specific tasks when trained with the right data; practical relevance: deployment on edge devices, mobile, and cost-constrained environments
  • Lab: Distill a 70B model's reasoning capabilities into a 7B model via synthetic data generation and SFT
15.2 Model Merging & Composition 🔴⚙️🔧
  • Model merging intuition: combining strengths of multiple fine-tunes
  • Merging methods: Linear, SLERP, TIES, DARE, Model Stock
  • Task arithmetic: adding and subtracting task vectors
  • Model soups: averaging multiple checkpoints
  • MergeKit: practical model merging toolkit
  • Evolutionary model merging: Sakana AI's approach
  • Lab: Merge two LoRA fine-tunes (one for code, one for chat) using SLERP and TIES; evaluate the combined model
  • Continual pre-training on domain-specific corpora
  • Vocabulary extension for new domains/languages
  • Replay-based methods to prevent catastrophic forgetting
  • Elastic Weight Consolidation (EWC) and related techniques
  • Progressive training: curriculum and staged approaches
Project Milestone: Fine-tune a model on the synthetic dataset using QLoRA. Optionally distill or merge with a reasoning adapter. Upload the adapter to Hugging Face Hub.

Align LLMs with human preferences using reinforcement learning and direct optimization methods.

  • 📐 The three-stage alignment pipeline: SFT → Reward Model → PPO
  • Reward model architecture: same transformer backbone with scalar head; trained on preference pairs (chosen, rejected)
  • 📐 Bradley-Terry model: P(y₁ ≻ y₂) = σ(r(y₁) - r(y₂)): converting preferences to reward signal
  • PPO for LLMs: policy = the LM, action = next token, reward = RM score; clipped objective to prevent large policy updates
  • KL divergence penalty: D_KL(π_θ || π_ref): preventing reward hacking and maintaining base model capabilities
  • Process Reward Models (PRMs): reward per reasoning step vs. Outcome Reward Models (ORMs): reward on final answer only
  • GRPO (Group Relative Policy Optimization): DeepSeek's approach: no separate reward model, group-relative advantages
  • RLHF infrastructure: separate processes for generation, reward scoring, and training: distributed architecture
  • DPO derivation: reparametrizing the RLHF objective to eliminate the reward model: loss = -log σ(β(log π(y_w)/π_ref(y_w) - log π(y_l)/π_ref(y_l)))
  • DPO internals: implicit reward model, reference model frozen, β controls deviation from reference policy
  • KTO (Kahneman-Tversky Optimization): works with binary feedback (good/bad) instead of preference pairs: loss-averse weighting
  • ORPO (monolithic preference tuning without reference model), SimPO (length-normalized rewards), IPO (identity preference optimization)
  • Creating preference datasets: chosen vs. rejected pairs
  • Using synthetic preferences from stronger models
  • Lab: Train a DPO adapter on synthetic preference data using TRL's DPOTrainer
  • Anthropic's Constitutional AI (CAI) approach
  • RLAIF: AI feedback instead of human feedback
  • Self-play and iterative self-improvement
  • Alignment tax and capability-alignment tradeoffs
  • 🔬 Shallow safety alignment (ICLR 2025 Outstanding Paper): safety training adapts only the first few output tokens of LLM responses; this explains why fine-tuning attacks, prefilling attacks, and adversarial suffix attacks succeed at bypassing safety; implications: need for deepened alignment across all generation steps, regularized fine-tuning objectives
  • The RLVR paradigm: training reasoning models using automatically verifiable rewards (math correctness, code execution, formal proofs) instead of human feedback
  • Why RLVR works without human annotators: verifiable reward signals provide exact supervision
  • GRPO (Group Relative Policy Optimization) as the core algorithm: relative advantage within sampled groups
  • ⚙️ DeepSeek-R1 training pipeline: cold start SFT, then RLVR on math/code, then rejection sampling, then full SFT, then final RLVR
  • Extension beyond math/code: RLVR for chemistry, biology, structured reasoning tasks
  • RLVR extensions: AlphaProof for mathematical proof verification, DeepSeek-Prover-V2 for formal theorem proving, code execution feedback for programming tasks; emerging: RLVR for chemistry (molecular property verification), biology (protein structure validation), and multi-step tool use (action outcome verification)
  • ⚙️ The open reasoning model ecosystem: QwQ, Sky-T1, open reproductions of R1 distillation
  • Theoretical analysis: RLVR implicitly incentivizes correct intermediate reasoning steps
  • Lab: Train a small model with RLVR on math problems using verifiable rewards; compare with DPO on the same task

Peer inside the black box. Understand how and why LLMs produce their outputs using probing, attention analysis, and mechanistic interpretability techniques. Identified as a gap vs. Berkeley CS294-267 Understanding LLMs.

17.1 Attention Analysis & Probing 🟡⚙️🔧
  • Attention visualization: what do attention heads look at?
  • Attention patterns: induction heads, previous-token heads, positional heads
  • Probing classifiers: what information is encoded in hidden states?
  • Probing classifiers methodology: linear vs. nonlinear probes (linear probes test what is linearly accessible, nonlinear probes may learn the task themselves); control tasks and selectivity (Hewitt and Liang, 2019), ensuring probes measure representation quality, not probe capacity; the "probing is not understanding" critique; practical applications: probing for syntactic structure, world knowledge, factual associations in transformer layers
  • Logit lens and tuned lens: reading the residual stream
  • Lab: Visualize attention patterns in a GPT-2 model; use probing to detect syntactic information in hidden layers
17.2 Mechanistic Interpretability 🔴⚙️🔧
  • Circuits and features: the mechanistic interpretability framework: features as directions in activation space, circuits as computational subgraphs
  • Sparse autoencoders (SAEs) architecture: encoder W_enc maps activation → high-dimensional sparse code (e.g., 4096 → 65536); ReLU + L1 sparsity penalty forces monosemantic features; decoder W_dec reconstructs activation; trained on cached activations from a target layer
  • Superposition: why neurons are polysemantic: more features than dimensions; the toy model of superposition (Elhage et al.); feature splitting at scale
  • Activation patching and causal tracing
  • TransformerLens and nnsight tooling
  • Anthropic's interpretability research: scaling monosemanticity
  • Lab: Use TransformerLens to find and analyze a simple circuit (e.g., indirect object identification) in a small model
  • Feature attribution: which input tokens matter most?
  • Integrated Gradients and SHAP for LLMs
  • ⚙️ Representation engineering: steering model behavior via activation vectors
  • ⚙️ Concept erasure and model editing (ROME, MEMIT)
  • ⚙️ Interpretability for debugging: understanding model failures
  • Explaining transformer predictions: attribution methods tailored for attention-based models
  • Attention rollout and attention flow: propagating attention through layers
  • Gradient-weighted attention: combining gradient signals with attention weights
  • Layer-wise relevance propagation (LRP) for transformers
  • Perturbation-based explanations: token removal, token substitution, and occlusion
  • Comparing explanation methods: faithfulness, plausibility, and consistency metrics
Part V: Retrieval & Conversation

Master the retrieval infrastructure that powers RAG systems.

  • From word embeddings to sentence embeddings: CLS token, mean pooling, [EOS] pooling
  • Training sentence embeddings end-to-end: Sentence-BERT (SBERT) architecture with siamese/triplet networks; SimCSE (unsupervised: dropout as augmentation, supervised: NLI pairs); contrastive loss, triplet loss with margin, and multiple negatives ranking loss
  • Contrastive learning for embeddings: InfoNCE loss, in-batch negatives, temperature parameter
  • Training pipeline: hard negative mining strategies (BM25 negatives, cross-encoder mined negatives, in-batch hard negatives); positive pair construction (anchor, positive, negative)
  • Multi-stage training: pre-training on weak pairs → fine-tuning on curated pairs (E5, GTE approach)
  • Sentence-BERT, E5, GTE, Nomic Embed: architecture comparison
  • Matryoshka embeddings: training with multiple dimensionality loss terms for flexible truncation
  • Late interaction models: ColBERT architecture: per-token embeddings with MaxSim scoring
  • API embeddings: OpenAI, Cohere Embed v3, Voyage AI, Jina: pricing and dimension choices
  • MTEB benchmark internals: task categories, score aggregation, choosing the right model
  • 🔬 Embedding space geometry: curse of dimensionality (distances concentrate in high-d), anisotropy problem (embeddings cluster in narrow cone), isotropy regularization techniques
  • 🔬 Similarity pitfalls: why cosine similarity can mislead: hubness problem (some vectors are near-neighbors of many others), importance of normalization
  • Fine-tuning embeddings: Sentence-Transformers library, domain-specific training data strategies
  • Lab: Fine-tune an embedding model on domain-specific data with contrastive loss; visualize embedding space isotropy before/after; compare recall@k
  • Exact nearest neighbor: brute-force O(nd): why it doesn't scale
  • Similarity metrics internals: cosine (normalized dot product), dot product (magnitude-sensitive), L2 distance: when to use which
  • HNSW internals: hierarchical navigable small world graph: multi-layer skip-list structure; greedy search from top layer; construction: insert node, connect to M nearest neighbors per layer; parameters: M (connections), efConstruction (build quality), efSearch (query quality); O(log n) search time
  • IVF internals: inverted file index: k-means clustering of vectors into nlist partitions; at query time, probe nprobe nearest centroids; tradeoff: more probes = higher recall, slower search
  • Product Quantization (PQ): split d-dimensional vector into m subvectors; quantize each to 256 centroids (1 byte); compress 768-dim float32 (3KB) to m bytes; asymmetric distance computation for query
  • Composite indexes: IVF-PQ (cluster then compress), HNSW+PQ (graph with compressed storage), IVF-HNSW (HNSW as coarse quantizer)
  • ScaNN: anisotropic vector quantization for inner product search
  • Index build time, memory footprint, and recall-latency tradeoffs
  • Lab: Build HNSW and IVF-PQ indexes in FAISS; benchmark recall@10 vs. query latency vs. memory; tune M, efSearch, nprobe parameters
18.3 Vector Database Systems 🟡⚙️🔧
  • Vector DB architecture: write-ahead log, segment-based storage, background index building
  • Managed: Pinecone (serverless architecture, pod-based scaling), Weaviate (module system, hybrid search built-in)
  • Self-hosted: Qdrant (Rust, gRPC, segment architecture), Milvus (distributed, segment-sealed architecture), ChromaDB (lightweight, SQLite backend)
  • Embedded: FAISS (C++ with Python bindings), LanceDB (columnar format, zero-copy)
  • pgvector: vector search inside PostgreSQL: IVFFlat and HNSW index types, when to use vs. dedicated vector DB
  • Metadata filtering: pre-filter vs. post-filter strategies, payload indexes in Qdrant
  • Hybrid search internals: combining BM25 keyword scores with vector similarity via reciprocal rank fusion (RRF) or linear combination
  • Lab: Index the synthetic dataset in Qdrant and pgvector; implement hybrid search; compare latency, recall, and operational complexity
18.4 Document Processing & Chunking 🟡⚙️🔧
  • Document loaders: PDF, HTML, Markdown, DOCX, code
  • Chunking strategies: fixed-size, recursive, semantic, document-structure-aware
  • Overlap, parent-child chunking, sentence-window approach
  • Unstructured.io, LlamaParse, Docling for document parsing
  • RAG data pipeline engineering: scheduled ETL with orchestration tools (Airflow, Prefect, Dagster); document versioning and staleness detection (content hashing, last-modified tracking); incremental indexing via change-data-capture patterns; data lineage tracking (which source documents contributed to each answer)
  • Embedding model version migration: parallel indexes during transition, lazy re-embedding, gradual cutover when switching embedding models
  • Lab: Build a document ingestion pipeline: parse → chunk → embed → index

Build production-quality RAG systems: from naive implementations to advanced architectures with re-ranking, query transformation, knowledge graphs, and deep research agents.

  • RAG system architecture: ingestion pipeline (parse → chunk → embed → index) + query pipeline (embed query → retrieve → rerank → augment prompt → generate)
  • Data flow: document store ↔ vector index ↔ retriever ↔ prompt builder ↔ LLM ↔ output parser
  • Naive RAG: single-stage retrieval with top-k context injection
  • Context window management: token budgeting, context ordering (lost-in-the-middle phenomenon), citation injection
  • When RAG beats fine-tuning (and vice versa): decision framework based on knowledge type, update frequency, and latency
  • Indexing strategies: full re-index vs. incremental; versioning documents; handling deletions and updates
  • Lab: Build a full RAG pipeline from scratch (no framework): chunker → embedder → FAISS index → retriever → prompt template → LLM; then rebuild with LangChain and compare
19.2 Advanced RAG Techniques 🔴⚙️🔧
  • Query transformation: HyDE, multi-query, step-back prompting
  • BM25 internals: TF saturation (k1 parameter), IDF smoothing (log((N-df+0.5)/(df+0.5))), document length normalization (b parameter): why it remains a strong baseline
  • Re-ranking with cross-encoders: architecture (BERT takes [query, SEP, document] as single input → relevance score); why cross-attention between query and doc tokens is more powerful than bi-encoder dot product; tradeoff: O(n×d) cost per query-doc pair vs. O(1) for bi-encoder
  • Cohere Rerank, ColBERT reranking, BGE-reranker: API and open-source options
  • 🔬 Contextual retrieval: prepending LLM-generated context to chunks before embedding (Anthropic approach)
  • 🔬 Corrective RAG (CRAG): LLM self-evaluates retrieval quality → triggers web search fallback if low confidence
  • 🔬 Self-RAG: model learns special tokens to decide when to retrieve, what to cite, and whether output is supported
  • Fusion retrieval: BM25 + dense vectors combined via Reciprocal Rank Fusion (RRF: score = Σ 1/(k + rank)) or linear interpolation
  • Multi-modal RAG: images, tables, and charts
  • LLM-ready web ingestion: Firecrawl, Crawl4AI for converting web pages to clean markdown for RAG pipelines
  • Lab: Upgrade the basic RAG with HyDE, re-ranking, and multi-query; measure improvements with RAGAS
  • Knowledge graphs: entities, relations, triples (subject, predicate, object); RDF, OWL, and property graph models; construction from unstructured text using NER and relation extraction
  • Graph embeddings: TransE, TransR, DistMult, ComplEx; representing entities and relations as vectors for link prediction and knowledge base completion
  • GraphRAG: combining knowledge graph traversal with LLM generation; structured queries over graph databases (Neo4j, Amazon Neptune) to augment LLM context; Microsoft GraphRAG architecture with community detection and hierarchical summarization
  • LLM-powered knowledge graph construction: entity extraction, relationship mapping, entity resolution, coreference, and relation canonicalization
19.4 Deep Research & Agentic RAG 🔴⚙️🔧
  • Deep research pattern: multi-step autonomous web research
  • Query decomposition → parallel search → synthesis → follow-up
  • OpenAI Deep Research, Perplexity, Google Deep Research paradigms
  • Iterative refinement: search → read → evaluate → search again
  • Source credibility assessment and citation verification
  • Combining web search, document retrieval, and database queries
  • Lab: Build a deep research agent that autonomously researches a topic across multiple sources, synthesizes findings, and produces a cited report
19.5 Structured Data & Text-to-SQL 🟡⚙️🔧
  • LLM on tabular data: serialization strategies (row-by-row, markdown tables, JSON); table understanding and reasoning; LLM-based feature engineering for structured data; comparison with XGBoost/LightGBM on tabular benchmarks
  • Text-to-SQL in depth: translating natural language to database queries; schema linking and column selection; multi-table joins and complex aggregation; error correction via execution feedback; schema representation and context injection strategies
  • Benchmarks: Spider, Bird, WikiSQL
  • Table understanding: reading CSVs, spreadsheets, and structured documents
  • Combining structured (SQL) and unstructured (vector) retrieval
  • Lab: Build a natural-language-to-SQL interface over a sample database; chain with RAG for hybrid answers
19.6 RAG Frameworks & Orchestration 🟡⚙️🔧
  • LangChain: chains, retrievers, memory, LCEL
  • LlamaIndex: index types, query engines, routers
  • Haystack by deepset
  • Lab: Implement the same RAG pipeline in LangChain and LlamaIndex; compare developer experience
Project Milestone: Build a production-grade RAG system with hybrid search, re-ranking, text-to-SQL, and citation tracking. Evaluate using RAGAS metrics.

Design and implement robust conversational AI: from simple chatbots to complex multi-turn dialogue systems with state management, memory, personas, and personality.

  • Types: task-oriented, open-domain, hybrid
  • Dialogue state tracking and slot filling
  • Turn management and context handling
  • System prompts as behavioral specification
  • Persona design: personality, tone, brand voice, backstory
  • AI companionship: Character.AI patterns, emotional engagement
  • AI creative writing assistants: ideation, co-writing, style transfer
  • Consistency challenges: maintaining persona over long conversations
  • Ethical considerations of parasocial AI relationships
20.3 Memory & Context Management 🟡⚙️🔧
  • Short-term memory: conversation buffer, sliding window
  • Long-term memory: summarization, vector store, entity extraction
  • 🔬 MemGPT / Letta architecture in depth: virtual context management with a hierarchical memory system; main context (working memory) vs. archival storage (long-term) vs. recall storage (conversation search); self-directed memory operations (push/pop/search); OS-inspired paging between memory tiers
  • Session persistence and user profiles
  • Lab: Build a chatbot with both short-term and long-term memory using a vector store
  • Handling clarifications, corrections, and topic switches
  • Guided conversations: form-filling, onboarding, intake flows
  • Fallback strategies and graceful degradation
  • Human handoff: when and how to escalate
  • Runtime context window overflow: when assembled prompt (system + history + retrieved docs + user query) exceeds context limit; priority-based content eviction (trim oldest conversation turns first, then reduce retrieved chunks, never trim system prompt); dynamic context budgeting: allocate percentages (system 10%, history 30%, retrieval 40%, generation 20%); truncation strategies: sentence-boundary truncation, summarize-then-truncate for conversation history
  • Lab: Build a customer support bot with guided flows, RAG-backed knowledge, and human handoff triggers
  • Speech-to-text: Whisper, Deepgram, AssemblyAI
  • Text-to-speech: ElevenLabs, PlayHT, Cartesia
  • Real-time voice AI: LiveKit, Vapi, Pipecat
  • Vision in conversations: processing images and screenshots
Part VI: Agents & Applications

Build autonomous AI agents that reason, plan, use tools, and take actions. Covers the four core agentic patterns: reflection, tool use, planning, and multi-agent collaboration.

21.1 Foundations of AI Agents 🟡⚙️
  • What is an agent: perception → reasoning → action loop: formal definition vs. practical usage
  • Agent vs. chain vs. workflow: definitions, tradeoffs, and decision criteria
  • The four agentic design patterns (Ng framework): Reflection, Tool Use, Planning, Multi-Agent
  • ReAct pattern internals: interleaved Thought/Action/Observation traces in the prompt: how reasoning tokens guide tool selection
  • Agent state machine: states (thinking, tool_calling, waiting_for_result, responding), transitions, termination conditions
  • 🔬 Cognitive architectures: System 1 (fast, single-pass) vs. System 2 (deliberate, multi-step) agent designs
  • Agent memory data structures: conversation buffer (deque), episodic memory (vector store + metadata), working memory (structured state dict), semantic memory (knowledge graph)
  • Token budget management: context window allocation across system prompt, memory, retrieved docs, conversation history, and generation
21.2 Tool Use & Function Calling 🟡⚙️🔧
  • OpenAI function calling / tool use
  • Anthropic tool use (Claude)
  • Designing effective tool schemas
  • Tool result handling and multi-step tool use
  • MCP (Model Context Protocol): Anthropic's open standard for tool integration: servers, resources, prompts
  • A2A (Agent-to-Agent Protocol): Google's protocol for inter-agent communication
  • Building custom tools: APIs, databases, file systems, code execution
  • Browser automation agents: Browser Use (Python, 50K+ stars, turns any LLM into a browser agent), Stagehand (TypeScript SDK with act/extract/observe primitives)
  • LLM-ready web scraping: Firecrawl (API converting websites to clean markdown for LLM consumption), Crawl4AI (open-source alternative, 58K+ stars)
  • 🔬 Native tool use training: how frontier models (GPT-4, Claude, Gemini) are trained with tool-calling in the training data, not just prompted; the Toolformer approach (self-supervised tool-use annotation); training data format: interleaved text and tool calls with execution results; reward shaping for tool selection accuracy and efficiency; fine-tuning for domain-specific tools; the gap between prompted tool use and natively trained tool use (reliability, latency, hallucinated calls)
  • 🔬 Agentic training at scale: DeepSeek V3.2 trained on 85,000+ agentic tasks spanning web search, coding, file operations, and multi-step tool use; represents a shift from "tool-use as prompting" to "tool-use as a core training objective"; enables reliable multi-step tool chains without explicit orchestration
  • Lab: Build an agent with 5+ tools (web search, calculator, database, file I/O, API calls)
21.3 Planning & Agentic Reasoning 🔴⚙️🔧
  • Plan-and-execute: upfront planning with iterative execution
  • Agentic reflection loops: detect failure → diagnose → retry with different strategy
  • LATS (Language Agent Tree Search): Monte Carlo tree search for agents
  • LLM Compiler: parallel function calling
  • Human-in-the-loop: when to ask for help
  • Lab: Implement a plan-and-execute agent that breaks down complex tasks, executes steps, and self-corrects
  • Code interpreters: sandboxed execution (E2B, Modal)
  • Data analysis agents: natural language to pandas/SQL
  • Code generation and self-debugging patterns
  • Software engineering agents: Devin-style coding assistants
  • Security: sandboxing, permission models, resource limits
  • Lab: Build a data analysis agent that writes and executes Python code in a sandbox

Scale from single agents to multi-agent architectures. Learn modern agent frameworks, orchestration patterns, and how to build complex systems where multiple agents collaborate.

22.1 Agent Frameworks 🟡⚙️🔧
  • LangGraph internals: directed graph of nodes (functions) and edges (routing); TypedDict state channels passed between nodes; conditional edges for branching; checkpoint serialization (SQLite/Postgres) for pause/resume; built-in persistence for conversation threads
  • CrewAI in depth: role-based multi-agent collaboration; agent definition (role, goal, backstory, tools); task objects with expected outputs; sequential and hierarchical process types; delegation and inter-agent communication patterns
  • AutoGen / AG2 in depth: conversational multi-agent patterns; AssistantAgent, UserProxyAgent, GroupChat, and GroupChatManager; code execution in Docker sandboxes; human-in-the-loop integration; conversation termination strategies; multi-agent debate and reflection patterns
  • OpenAI Agents SDK
  • Anthropic Claude Agent SDK
  • Smolagents (Hugging Face): lightweight agent framework
  • PydanticAI: type-safe agent development
  • Google ADK (Agent Development Kit): multi-agent orchestration
  • Lab: Build the same agent in LangGraph, CrewAI, and native SDK; compare patterns
  • Supervisor pattern: orchestrator delegates to specialists
  • Debate pattern: agents argue for better answers
  • Pipeline pattern: sequential processing stages
  • Hierarchical agents: manager → workers
  • Shared memory and message passing between agents
  • 🔬 Conformity effects in multi-agent LLM systems (ICLR 2025): agents tend to converge on similar outputs (groupthink); factors: model homogeneity, communication structure, majority influence; mitigation: diverse model mixtures, structured debate protocols, devil's advocate roles, independent reasoning before consensus
22.3 Agentic Workflows & Pipelines 🔴⚙️🔧
  • Workflow engines: LangGraph state machines, Temporal for durable execution
  • Conditional branching, loops, and parallel execution
  • Error handling, retries, and compensation logic
  • Checkpointing and resumability
  • Streaming intermediate results to users
  • Lab: Build a multi-agent research system: Planner → Researcher → Writer → Reviewer with human-in-the-loop approval
Project Milestone: Build the full conversational AI agent combining fine-tuned model, RAG, tools, deep research, multi-step planning, reflection, and memory.
Module 23

Multimodal Generation

Extend LLMs beyond text into image, audio, video, and 3D generation. Understand the architectures behind the most impactful generative AI systems.

  • Diffusion models: DDPM fundamentals, denoising process, latent diffusion (Stable Diffusion architecture)
  • Flow matching: rectified flows, Flux architecture: the post-diffusion paradigm
  • Stable Diffusion 3/XL, DALL-E 3, Midjourney, Imagen 3: architecture comparison
  • Image editing: inpainting, outpainting, ControlNet, IP-Adapter, reference-based generation
  • Vision Transformer (ViT): patch-based image tokenization, position embeddings for 2D, classification with [CLS] token; comparison with CNNs on data efficiency and scaling
  • CLIP: contrastive language-image pre-training; dual encoder architecture (image encoder + text encoder); InfoNCE contrastive loss over batch of image-text pairs; zero-shot image classification via text prompts; CLIP as a backbone for downstream vision tasks
  • BLIP / BLIP-2: bootstrapping language-image pre-training; image captioning, visual QA, and image-text retrieval; Q-Former architecture bridging frozen image encoder and frozen LLM; three-stage pre-training strategy
  • Vision-language understanding: GPT-4V, LLaVA, Qwen-VL, PaliGemma: visual encoder + LLM fusion
  • Gemini-style native multimodal: interleaved image-text generation
  • Lab: Build a product image generation pipeline with Stable Diffusion + ControlNet; integrate GPT-4V for quality assessment
  • Text-to-speech: VITS, Bark, F5-TTS: modern zero-shot voice synthesis
  • Voice cloning and voice design: speaker embeddings, voice conversion
  • Real-time conversational audio: GPT-4o native audio, Moshi (Kyutai)
  • Music generation: MusicLM, Suno, Udio, Stable Audio: commercial applications
  • Text-to-video: Sora (DiT architecture), Runway Gen-3, Kling 2, Veo 2: architectures and limitations
  • 3D generation: text-to-3D, image-to-3D: emerging approaches
  • Multimodal composition pipelines: chaining text → image → video → audio
  • TrOCR: transformer-based optical character recognition; encoder-decoder architecture with ViT encoder and text decoder; pre-training on synthetic data, fine-tuning on handwritten and printed text
  • LayoutLM / LayoutLMv2 / LayoutLMv3: pre-training for document understanding; jointly modeling text, layout (2D position), and image; document classification, key-value extraction, and table detection; applications in invoice processing, form understanding, and receipt parsing
  • Document AI pipeline: OCR → layout analysis → entity extraction → structured output
  • Comparison of document understanding approaches: LayoutLM family vs. multimodal LLMs (GPT-4V) vs. specialized OCR pipelines

Survey the most impactful real-world applications of LLMs across industries. For each domain, understand the architecture patterns, unique challenges, risks, and the current state of the art.

  • The "vibe-coding" paradigm: building software via natural language intent rather than manual code
  • Code completion engines: Copilot (Codex/GPT-4), Cursor (multi-file context), Cline, Windsurf: how they work under the hood
  • Fill-in-the-middle (FIM) architecture for inline code completion: prefix/suffix/middle prompting
  • Agentic coding: Claude Code, Devin, OpenHands, SWE-Agent: autonomous multi-file editing, test-driven development loops
  • Code generation from specs: natural language → working application (Bolt, v0, Lovable, Replit Agent)
  • SWE-bench: evaluating coding agents on real GitHub issues
  • Context engineering for code: repo maps, AST parsing, dependency graphs, file ranking
  • Risks: hallucinated APIs, security vulnerabilities in generated code, over-reliance, licensing concerns
  • Impact on software engineering: productivity data, skill shifts, junior vs. senior developer effects
  • Lab: Build a mini "vibe-coding" agent that takes a feature description, generates code, writes tests, runs them, and iterates until passing: using tool use and reflection
24.2 LLMs in Finance & Trading 🟡⚙️🔧
  • Financial NLP: sentiment analysis on earnings calls, SEC filings, news, social media
  • FinGPT, BloombergGPT: domain-specific financial models: training data and architecture
  • Automated report generation: earnings summaries, market research, risk assessments
  • Trading signal extraction: event detection, entity recognition in financial text
  • LLM-powered financial advisors: robo-advisory with conversational interface
  • Regulatory compliance: automated KYC/AML text analysis, regulatory change monitoring
  • Fraud detection: anomaly detection in transaction narratives
  • Risks: hallucinated financial data, market manipulation potential, regulatory concerns (SEC, FINRA)
  • Lab: Build a financial news sentiment analyzer with RAG over SEC filings; generate an automated earnings summary
  • Medical LLMs: Med-PaLM 2, BioMistral, Meditron: training on clinical corpora
  • Clinical NLP: ICD coding, clinical note summarization, patient intake automation
  • Medical Q&A and differential diagnosis assistance
  • Drug discovery: molecular generation, property prediction, literature mining
  • Protein and genomics: AlphaFold, ESM, DNA language models
  • Radiology and pathology: multimodal models for medical imaging
  • Mental health applications: therapy chatbots, crisis detection, ethical boundaries
  • Regulatory: HIPAA compliance, FDA software-as-medical-device (SaMD), CE marking
  • Safety-critical considerations: hallucination risk in medical advice, liability, clinician-in-the-loop requirements
  • LLMs as recommendation engines: replacing/augmenting collaborative filtering with semantic understanding
  • Conversational recommendation: dialogue-driven product/content discovery
  • LLM-powered search: from keyword matching to semantic understanding (Perplexity, Google AI Overviews, SearchGPT)
  • User preference modeling: extracting interests from natural language interactions
  • Cold start solution: LLMs for zero-shot recommendation via item description understanding
  • E-commerce: product description generation, review summarization, personalized shopping assistants
  • Content recommendation: news, video, music: LLM-based content understanding and matching
  • Evaluation: beyond click-through: measuring recommendation quality with LLM judges
  • Lab: Build a conversational movie recommender using LLM + embedding-based retrieval; compare with traditional collaborative filtering
24.5 Cybersecurity & LLMs 🟡⚙️🔧
  • Defensive applications: threat intelligence summarization, log analysis, anomaly explanation
  • Vulnerability detection: LLM-powered static analysis, code audit automation
  • Phishing and social engineering: LLM-generated attacks and LLM-based detection
  • Malware analysis: binary reverse engineering assistance, decompiled code explanation
  • Security Operations Center (SOC) automation: alert triage, incident summarization, playbook generation
  • CTF and penetration testing: LLM agents for automated security testing (authorized contexts only)
  • Adversarial uses: deepfake text, automated disinformation, voice cloning for fraud
  • Defense: AI-generated content detection, watermarking, provenance tracking
  • Lab: Build a log analysis agent that ingests security logs, detects anomalies, explains findings, and suggests remediation
  • Education: AI tutoring (Khanmigo, Duolingo Max), personalized learning paths, automated grading, Socratic dialogue
  • Legal: contract analysis, case law research, legal document drafting, e-discovery automation
  • Creative writing & content: AI co-writing tools, screenplay generation, marketing copy, localization
  • Customer support: automated ticket resolution, sentiment-aware routing, knowledge base generation
  • Enterprise search & knowledge management: Glean, internal chatbots over corporate documents
  • Gaming: NPC dialogue generation, dynamic storylines, procedural quest design
  • Real estate, HR, insurance: industry-specific applications and their LLM architectures
  • LLMs as robot planners: translating natural language goals into action sequences
  • SayCan, RT-2, PaLM-E: grounding language in physical actions and observations
  • Web automation agents: browser control, form filling, UI testing (WebArena, Anthropic computer use)
  • OS-level agents: desktop automation, multi-application workflows
  • AI for mathematics: formal reasoning, theorem proving (Lean, AlphaProof, DeepSeek-Prover)
  • Scientific literature: automated meta-analysis, hypothesis generation, experiment design
  • Materials science, chemistry: molecular property prediction, retrosynthesis planning
  • Domain adaptation strategies: when to fine-tune vs. RAG vs. prompt for each vertical

You can't improve what you can't measure. Learn systematic approaches to evaluating LLM outputs, designing rigorous experiments, testing agent behavior, and monitoring production systems.

25.1 LLM Evaluation Fundamentals 🟡⚙️🔧
  • Information-theoretic foundations: cross-entropy loss = -E[log p(x)]; perplexity = 2H(p) = exp(loss): why perplexity is the standard LLM metric; bits-per-byte (BPB) for tokenizer-agnostic comparison
  • Classical NLP metrics: BLEU (n-gram precision + brevity penalty), ROUGE (recall-oriented), METEOR (alignment-based), BERTScore (contextual embedding similarity)
  • LLM-as-Judge: using models to evaluate models: pairwise comparison, pointwise scoring, reference-free grading; position bias and self-preference bias
  • Human evaluation: inter-annotator agreement (Cohen's κ, Fleiss' κ), ranking (ELO/Bradley-Terry from pairwise comparisons, as in Chatbot Arena), Likert scales
  • Task-specific metrics: accuracy, F1, pass@k for code (unbiased estimator from n samples)
  • Benchmarks: MMLU, HumanEval, MT-Bench, AlpacaEval, Chatbot Arena (crowdsourced ELO), GPQA, MATH, ARC: what each measures and their limitations
  • Lab: Evaluate the fine-tuned model on a custom benchmark; compare with base model and API models
  • Statistical significance testing for LLM comparisons: bootstrap, paired tests
  • Confidence intervals and effect sizes
  • Controlling for randomness: seed management, temperature=0 vs. sampling
  • Ablation study design: isolating the impact of each component
  • Common pitfalls: data contamination, benchmark gaming, cherry-picking
  • 🔬 Benchmark contamination detection: methods for identifying when test data leaked into training; n-gram overlap analysis between training corpus and benchmark; membership inference (model confidence on seen vs. unseen examples); canary string insertion (embed unique strings in data, check if model memorizes them); perturbation-based detection (rephrase questions, check if accuracy drops); the scale of the problem: many popular benchmarks are partially contaminated in frontier models
  • Reproducibility: documenting hyperparameters, data versions, compute
  • Lab: Design and execute a rigorous ablation study comparing RAG strategies with proper statistical analysis
25.3 RAG & Agent Evaluation 🟡⚙️🔧
  • RAG metrics: RAGAS (faithfulness, answer relevancy, context precision/recall)
  • Agent evaluation: task completion, tool accuracy, efficiency
  • Trajectory evaluation: evaluating the path, not just the outcome
  • Evaluation frameworks: DeepEval, Ragas, Phoenix
  • Lab: Run RAGAS evaluation on the project RAG system; evaluate agent trajectories
25.4 Testing LLM Applications 🟡⚙️
  • Unit testing with mocked LLM responses
  • Integration testing with real models
  • Regression testing: detecting quality degradation
  • Red teaming and adversarial testing
  • Prompt injection testing
  • CI/CD integration for LLM evaluations
  • Testing non-deterministic LLM outputs: assertion-based testing (check for required elements, not exact match); embedding-similarity thresholds for output validation; property-based testing (output always valid JSON, always contains required fields); LLM-judge-in-CI (automated quality gates)
  • Golden-file/snapshot testing with drift alerting; contract testing for tool call schemas; load testing LLM endpoints (Locust, k6, GuideLLM); promptfoo for prompt regression testing in CI/CD
  • CI/CD pipeline design: run evals on PR, compare to baseline, gate deployment on quality thresholds
25.5 Observability & Tracing 🟡⚙️🔧
  • LLM tracing: capturing the full chain of calls
  • LangSmith: tracing, evaluation, and prompt management
  • Langfuse: open-source LLM observability
  • Phoenix by Arize: traces, evals, and debugging
  • LangWatch (unified observability + evaluations + prompt optimization), TruLens (RAG-focused evaluation: faithfulness, relevance, groundedness feedback functions)
  • Logging: prompt/completion pairs, latency, token usage, costs
  • Alerting on quality degradation and anomalies
  • Lab: Instrument the project agent with Langfuse tracing; build a monitoring dashboard
  • Prompt drift: how evolving user behavior degrades prompt effectiveness over time
  • Provider version drift: detecting quality changes when OpenAI/Anthropic silently update models
  • Embedding drift in RAG: documents change, embeddings become stale: re-indexing strategies
  • Output quality monitoring: automated LLM-judge scoring on production traffic samples, statistical process control charts
  • Data quality monitoring for LLM pipelines: detecting stale/corrupted documents in knowledge bases, schema validation for structured LLM outputs (Great Expectations, Soda)
  • Retraining and re-tuning triggers: when production data signals fine-tuning refresh
  • Lab: Build a monitoring pipeline that detects embedding drift and output quality degradation; set up automated alerts and re-indexing triggers
  • The reproducibility challenge: stochastic LLM outputs, provider API changes, non-deterministic retrieval
  • Versioning the full stack: prompt templates + retrieval config + model version + system prompt: as a single reproducible artifact
  • Seed management: temperature=0 vs. sampling, provider-specific determinism options
  • Configuration management: Hydra, OmegaConf, or YAML configs for LLM pipeline parameters
  • Dataset versioning: DVC for tracking training data, evaluation sets, and RAG corpora
  • LLMOps within broader MLOps: unified platforms (MLflow, W&B) that track both classical ML experiments and LLM pipeline runs
  • Environment reproducibility: Docker for LLM serving, pinning library versions, model snapshot management
Part VII: Production & Strategy

Take LLM applications from notebook to production. Cover deployment, scaling, security hardening, and the ethical and regulatory frameworks for responsible AI systems.

  • Backend frameworks: FastAPI, LitServe for LLM APIs
  • Streaming responses: SSE, WebSockets
  • Containerization: Docker, Docker Compose
  • Cloud deployment: AWS (Bedrock, SageMaker), GCP (Vertex AI), Azure
  • Serverless: Modal, Replicate, Hugging Face Inference Endpoints
  • Lab: Deploy the project agent as a FastAPI service with Docker; add streaming and health checks
26.2 Frontend & User Interfaces 🟡⚙️🔧
  • Gradio: rapid prototyping for AI demos
  • Streamlit: interactive dashboards with LLM integration
  • Chainlit: production chat interfaces
  • Open WebUI: self-hosted ChatGPT-like interface
  • Vercel AI SDK for Next.js applications
  • Lab: Build a polished chat interface for the project agent using Chainlit
  • Production latency optimization: streaming responses, request batching, queue management (for model-level optimization see Module 8; for cost-performance tradeoffs see Module 11.4)
  • Rate limiting, queuing, and backpressure patterns
  • Auto-scaling strategies for LLM workloads: GPU provisioning, serverless inference
  • Guardrails: NeMo Guardrails, Guardrails AI, Lakera: input/output filtering in production
  • Open-source safety classifiers: Llama Guard 3/4 (Meta's content safety model for input/output moderation), Prompt Guard (dedicated prompt injection detector), ShieldGemma (Google's safety classifier)
  • Prompt versioning and management
  • A/B testing LLM configurations
  • Online evaluation and feedback loops
  • Data flywheels: production data → fine-tuning → improved model
  • Model registry and artifact management
26.5 LLM Security Threats 🟡⚙️🔧
  • OWASP Top 10 for LLM Applications
  • Prompt injection: direct and indirect attacks
  • Jailbreaking techniques and defenses
  • Prompt injection defense implementation: input sanitization (strip special tokens, detect injection patterns); sandwich defense (user input between system instructions); delimiter hardening (XML tags, random delimiters); output scanning (detect leaked system prompts, PII in responses using Presidio/regex); LLM-as-judge for injection detection (separate classifier model); runtime PII redaction before user-facing output; API key management: secrets managers (Vault, AWS Secrets Manager), key rotation, per-user API key proxying
  • Data leakage: training data extraction, PII exposure
  • Supply chain risks: model poisoning, backdoors
  • 🔬 Formal verification of LLM behavior: applying formal methods to neural networks; certified robustness for NLP (provable bounds on adversarial perturbation sensitivity); abstract interpretation for transformers; verification of safety properties (can we prove a model never generates certain outputs?); current limitations: scalability to billion-parameter models; connection to constitutional AI and runtime guardrails
  • Lab: Red-team the project agent; implement prompt injection defenses and input/output guardrails
  • Types of hallucination: factual, faithfulness, instruction
  • Detection: self-consistency, citation verification, NLI
  • Mitigation: RAG, constrained generation, confidence calibration
  • When to say "I don't know": abstention and uncertainty
26.7 Bias, Fairness & Ethics 🟡⚙️
  • Sources of bias in LLMs: training data, RLHF, prompting
  • Measuring bias: benchmarks, disparate impact, representation
  • Responsible AI frameworks and documentation (model cards, datasheets)
  • Environmental impact of LLM training and inference
26.8 Regulation & Compliance 🟡⚙️
  • EU AI Act: risk classification and requirements
  • GDPR implications for LLM systems
  • US executive orders and state-level AI legislation
  • Industry-specific: healthcare (HIPAA), finance, education
  • AI governance: policies, auditing, and transparency
  • Enterprise LLM model inventory: cataloging all deployed LLMs, their use cases, risk levels, and owners
  • Model risk classification: materiality assessment: which LLM decisions require human oversight?
  • Regulatory model validation frameworks: SR 11-7 (banking), NIST AI RMF, ISO 42001: applied to LLM systems
  • Audit trails for LLM decisions: logging inputs, outputs, retrieval context, and model versions for compliance
  • Model lifecycle management: approval gates for deployment, periodic review cadence, decommissioning
  • Third-party model risk: governance for API-based models where you don't control the weights
  • Model license taxonomy: truly open (Apache 2.0, MIT), restricted open (Llama Community License, Gemma Terms), proprietary API
  • Commercial use restrictions: which models can you deploy in production? Fine-tuning and distillation clauses
  • IP ownership: who owns fine-tuned weights? Outputs generated by the model? Synthetic training data?
  • Training data copyright: NYT v. OpenAI implications, opt-out mechanisms (robots.txt, do-not-train headers)
  • LLM-powered anonymization: using LLMs for PII detection and masking in data pipelines
  • Differential privacy for synthetic data: formal privacy guarantees when generating training datasets
  • Privacy-preserving fine-tuning: federated learning approaches, on-device adaptation
26.11 Machine Unlearning 🔴⚙️
  • 🔬 Machine unlearning: methods to remove specific knowledge from trained models
  • Motivations: GDPR right-to-be-forgotten, copyright removal, safety (removing hazardous knowledge)
  • Gradient ascent unlearning (maximize loss on target data)
  • LOKA for continual unlearning without full retraining
  • Evaluation: how to verify knowledge was truly removed vs. merely suppressed
  • The fundamental tension: unlearning specific facts while preserving general capabilities
  • Current limitations: existing benchmarks may be inadequate (CMU 2025)

The business and organizational layer that turns LLM technology into business value. Covers strategy, product thinking, ROI measurement, vendor evaluation, and compute planning. Addresses critical gaps from the Head of AI and Head of Data Science perspectives.

  • Assessing organizational AI readiness: data maturity, engineering capability, cultural factors
  • Use case identification: mapping business processes to LLM capabilities (generation, extraction, classification, reasoning, conversation)
  • Prioritization frameworks: impact × feasibility matrix, time-to-value, risk-adjusted scoring
  • Building the business case: from proof-of-concept → pilot → production: stage gates and success criteria
  • Common failure modes: "solution looking for a problem," over-scoping, underestimating data needs
  • AI roadmap construction: sequencing LLM initiatives over 6-18 months
27.2 LLM Product Management 🟡⚙️
  • Translating business problems into LLM requirements: what does "make our customer support better" actually mean?
  • Defining success metrics beyond model accuracy: CSAT, resolution rate, deflection rate, time-to-resolution, user adoption
  • Managing the hallucination risk in product context: what is the acceptable error rate? What are the consequences of wrong answers?
  • User experience design for LLM products: setting expectations, showing confidence, graceful failure
  • Iterative delivery: ship a prompt-based MVP → add RAG → fine-tune → add agents: incremental value at each step
  • Stakeholder communication: explaining LLM capabilities and limitations to non-technical executives
  • Managing user trust: transparency about AI-generated content, appropriate disclaimers
  • LLM ROI framework: cost savings (automation) + revenue impact (new capabilities) + productivity gains (augmentation)
  • Measuring coding assistant ROI: developer velocity metrics, PR throughput, time-to-merge, code quality indicators
  • Measuring customer support automation ROI: ticket deflection rate, average handle time reduction, CSAT impact
  • Measuring knowledge worker productivity: time studies, task completion rates, output quality
  • Attribution challenges: isolating LLM impact from other factors, A/B testing LLM features
  • Common pitfalls: vanity metrics (tokens generated), counting cost savings without quality checks, ignoring maintenance costs
  • Lab: Build an ROI model for the project's conversational AI agent: estimate cost savings, productivity gains, and payback period
  • LLM provider evaluation: model quality, pricing (per-token, per-seat, committed), SLAs, data privacy guarantees, fine-tuning support
  • Vector database vendor evaluation: managed vs. self-hosted, scaling characteristics, hybrid search support, pricing models
  • Agent framework evaluation: maturity, community, production readiness, lock-in risk
  • Vendor platform solutions (Glean, Moveworks, Cohere Enterprise, AWS Bedrock Agents) vs. building in-house
  • Build vs. buy decision tree: control needs, customization depth, team capability, time-to-market, total cost
  • Procurement considerations: enterprise agreements, data processing agreements, exit clauses
  • Compute budgeting: modeling costs for training runs (GPU-hours × price) and inference (tokens/day × cost/token)
  • Cloud strategy: on-demand vs. reserved instances vs. spot for training; GPU selection (A100, H100, L40S) for different workloads
  • Self-hosted vs. API: breakeven analysis: at what volume does self-hosting become cheaper?
  • Inference infrastructure planning: estimating peak QPS, provisioning GPUs, auto-scaling strategies
  • Multi-cloud and hybrid: running training on one cloud, inference on another, RAG on-premises
  • Capacity planning: forecasting compute needs as usage grows: tokens/day projections, seasonal patterns
Capstone

Final Project: End-to-End Conversational AI Agent

Integrate everything from the course into a complete, deployable conversational AI system built on synthetic data: demonstrating mastery of the full LLM application stack.

C.1 Project Requirements 🎯
  • Synthetically generated training dataset (10K+ examples) with quality metrics
  • Fine-tuned model (QLoRA) with evaluation against baseline
  • Optional: knowledge distillation or model merging for optimized variant
  • RAG system with hybrid search, re-ranking, and text-to-SQL over a domain knowledge base
  • Agent with tool use (3+ tools), planning, reflection, memory, and self-correction
  • Deep research capability for multi-step information gathering
  • Production deployment with API, chat UI, and observability
  • Security hardening: prompt injection defenses, input/output guardrails
  • Evaluation suite with statistical rigor: automated tests, human evaluation, ablation study
  • Hybrid architecture: classical ML triage + LLM for complex cases, with cost-performance analysis
  • ROI analysis: business case with TCO, productivity gains, and payback period
  • LLM risk governance documentation: model card, audit trail, licensing compliance
C.2 Deliverables 🎯
  • GitHub repository with clean code, documentation, and CI/CD
  • Hugging Face Hub: fine-tuned model adapter and synthetic dataset
  • Technical report: architecture decisions, ablation study, evaluation results with confidence intervals
  • Interpretability analysis: attention visualization or feature analysis of key behaviors
  • Live demo: deployed application with monitoring dashboard
  • Presentation: 15-minute project walkthrough
Appendices
Appendix A

Mathematical Foundations

Reference appendix covering the essential mathematical background for understanding LLMs: linear algebra, calculus, probability, information theory, and optimization.

A.1 Linear Algebra Review 🟢📐
  • Vectors, matrices, dot products, matrix multiplication
  • Eigenvalues and eigenvectors
A.2 Calculus Essentials 🟢📐
  • Derivatives, chain rule, partial derivatives
  • Gradients and Jacobian matrices
A.3 Probability & Statistics 🟢📐
  • Bayes' theorem
  • Distributions: Gaussian, categorical, Bernoulli
  • Expectation and variance
A.4 Information Theory 🟡📐
  • Entropy, cross-entropy, KL divergence
  • Mutual information and perplexity
  • Derivations and intuition for each concept
A.5 Optimization Theory 🟡📐
  • Convexity and gradient descent convergence
  • Learning rate schedules and saddle points
Appendix B

Machine Learning Essentials

Core machine learning concepts that underpin LLM training and evaluation: learning paradigms, loss functions, training pipelines, evaluation metrics, and classical algorithms.

B.1 Learning Paradigms & Loss Functions 🟢📐
  • Supervised, unsupervised, and reinforcement learning taxonomy
  • Loss functions: MSE, cross-entropy, hinge loss
B.2 Training Pipeline & Evaluation 🟢📐
  • Train/val/test splits, overfitting, underfitting, regularization (L1, L2, dropout)
  • Evaluation metrics: accuracy, precision, recall, F1, ROC-AUC, confusion matrix
B.3 Reinforcement Learning Foundations 🟡📐
  • Agent, environment, state, action, reward, policy, value function
  • Policy gradient theorem and PPO intuition
B.4 Classical Algorithms & Feature Engineering 🟢📐
  • Classical algorithms overview: logistic regression, decision trees, random forests, XGBoost, k-means, PCA
  • Feature engineering and selection basics
Appendix C

Python for LLM Development

Python tooling and practices essential for LLM development: environment management, key libraries, async programming, type safety, and debugging.

C.1 Environment & Package Management 🟢⚙️
  • Virtual environments: venv, conda, uv
  • Package management: pip, requirements.txt, pyproject.toml
  • Jupyter notebooks and Google Colab workflow
C.2 Essential Libraries & Async Programming 🟢⚙️
  • Essential libraries: numpy, pandas, matplotlib, seaborn
  • Async programming: asyncio, aiohttp for parallel API calls
C.3 Type Safety, Data Handling & Debugging 🟡⚙️
  • Type hints, dataclasses, and Pydantic models
  • Working with JSON, YAML, and configuration files
  • Debugging and profiling tools
Appendix D

Environment Setup & Cloud Provisioning

Step-by-step guides for setting up local and cloud development environments for LLM work: GPU setup, model serving, API keys, cloud instances, and containerization.

D.1 Local Setup & Model Serving 🟢⚙️🔧
  • Local setup: Python 3.10+, CUDA toolkit, PyTorch with GPU support
  • Local model serving: Ollama installation and usage, llama.cpp setup
  • Hugging Face: CLI installation, token setup, model downloads, cache management
  • API key setup: OpenAI, Anthropic, Google AI Studio
D.2 Cloud GPU & Serverless Options 🟡⚙️
  • Cloud GPU instances: AWS (p4d/p5), GCP (A100/H100), Azure (ND series); spot vs. reserved pricing
  • Serverless GPU: Modal, RunPod, Lambda Labs, Google Colab Pro
D.3 Docker & Remote Access 🟡⚙️🔧
  • Docker basics: containerizing LLM applications, GPU passthrough with NVIDIA Container Toolkit
  • SSH tunneling to remote GPU machines
Appendix E

Git & Collaboration for ML Projects

Version control and collaboration practices tailored for machine learning projects: experiment tracking, data versioning, and notebook management.

E.1 Git for ML Experiments 🟢⚙️🔧
  • Git essentials for experiment tracking
  • Branching strategies for ML experiments
  • .gitignore patterns for ML projects (checkpoints, datasets, cache)
E.2 Data Version Control & Experiment Tracking 🟡⚙️
  • DVC (Data Version Control) for large files and datasets
  • Experiment tracking integration: W&B, MLflow
  • Notebook version control best practices (nbstripout, Jupytext)
Appendix F

Glossary of Terms

Comprehensive alphabetical glossary of 300+ technical terms used throughout the course. Each entry includes a concise definition and a reference to the module where it is first introduced.

F.1 Full Glossary 🟢📐
  • Key terms: attention, autoregressive, BERT, BPE, chain-of-thought, contrastive learning, cross-entropy, DPO, embedding, fine-tuning, GQA, hallucination, in-context learning, KV cache, LoRA, MoE, perplexity, PEFT, PPO, prompt engineering, quantization, RAG, RLHF, RLVR, RoPE, softmax, tokenizer, transformer, vector database, and 270+ more
Appendix G

Hardware & Compute Reference

Quick-reference tables for GPU specifications, VRAM requirements, training cost estimates, and guidance on when to use different compute tiers.

G.1 GPU Comparison & VRAM Requirements 🟡⚙️
  • GPU comparison table: A100 (80GB, 2TB/s, 312 TFLOPS FP16), H100 (80GB, 3.35TB/s, 989 TFLOPS FP16), H200 (141GB, 4.8TB/s), L40S (48GB, consumer-grade), RTX 4090 (24GB, hobbyist)
  • VRAM requirements by model size: 7B (14GB FP16, 4GB INT4), 13B (26GB FP16, 7GB INT4), 70B (140GB FP16, 35GB INT4)
G.2 Training & Inference Benchmarks 🟡⚙️
  • Training time estimates and cost benchmarks
  • Inference throughput benchmarks by model and hardware
  • When to use CPU, single GPU, multi-GPU, or multi-node
Appendix H

Model Card Quick Reference

One-page summaries for the 20 most-used models, covering architecture type, parameter count, context window, license, key strengths, and API access.

H.1 Proprietary Models 🟢📐
  • GPT-4o, Claude 3.5/4, Gemini 2.5
  • For each: architecture type, parameter count, context window, license, key strengths, API access
H.2 Open-Weight LLMs 🟢📐
  • Llama 3/4, Mistral/Mixtral, DeepSeek V3/R1, Phi-4, Qwen 2.5
  • For each: architecture type, parameter count, context window, license, key strengths, API access
H.3 Specialized & Encoder Models 🟢📐
  • BERT, RoBERTa, T5, Whisper, CLIP, Stable Diffusion, Sentence-BERT/E5
  • For each: architecture type, parameter count, context window, license, key strengths, API access
Appendix I

Prompt Template Library

Ready-to-use prompt templates organized by task type: classification, extraction, summarization, code generation, evaluation, synthetic data, and agent systems.

I.1 Classification & Extraction Prompts 🟢⚙️🔧
  • Classification prompts (sentiment, intent, topic)
  • Extraction prompts (NER, relation extraction, structured output)
I.2 Summarization & Code Generation Prompts 🟡⚙️🔧
  • Summarization prompts (abstractive, extractive, multi-document)
  • Code generation prompts (function generation, debugging, code review)
I.3 Evaluation, Synthetic Data & Agent Prompts 🟡⚙️🔧
  • Evaluation prompts (LLM-as-judge templates, pairwise comparison)
  • Synthetic data generation prompts (persona-driven, domain-specific)
  • Agent system prompts (ReAct, tool-use, planning)
Appendix J

Dataset & Benchmark Reference

Comprehensive reference for major LLM benchmarks and datasets, organized by category. For each: what it measures, size, known limitations, and contamination status.

J.1 Language Understanding & Reasoning 🟢📐
  • Language understanding: MMLU, HellaSwag, ARC, WinoGrande, TruthfulQA
  • Reasoning: GSM8K, MATH, BBH, ARC-Challenge
J.2 Code Benchmarks 🟢📐
  • Code: HumanEval, MBPP, SWE-bench, LiveCodeBench
J.3 Retrieval, Embeddings & RAG 🟡📐
  • Retrieval and embeddings: MTEB, BEIR, MS MARCO
  • RAG evaluation: RAGAS metrics, RGB benchmark
J.4 Chat, Instruction & Safety 🟡📐
  • Chat and instruction: AlpacaEval, MT-Bench, Arena-Hard, Chatbot Arena
  • Safety: ToxiGen, RealToxicityPrompts, HarmBench
Reference

Tools & Technologies Used

Key libraries, frameworks, and platforms used throughout the course.

Core ML & LLM

PyTorch Hugging Face Transformers TRL PEFT bitsandbytes Unsloth Axolotl MergeKit TransformerLens

Inference & Serving

vLLM TGI SGLang Ollama llama.cpp Triton

LLM APIs & SDKs

OpenAI SDK Anthropic SDK Google Generative AI LiteLLM AWS Bedrock Azure OpenAI Instructor DSPy

RAG & Vector Search

LangChain LlamaIndex ChromaDB Qdrant Pinecone FAISS pgvector Neo4j

Agents & Orchestration

LangGraph CrewAI AutoGen Smolagents PydanticAI MCP Protocol E2B

Data & Evaluation

Hugging Face Datasets Distilabel Argilla RAGAS DeepEval Weights & Biases Outlines

Observability & Deployment

LangSmith Langfuse Phoenix (Arize) FastAPI Gradio Streamlit Chainlit Docker NeMo Guardrails

NLP & Interpretability

spaCy NLTK Gensim Sentence-Transformers tiktoken TransformerLens nnsight