Building Conversational AI using LLM and Agents: Course Syllabus

Module 00

ML & PyTorch Foundations

▼

Prerequisite refresher covering core machine learning concepts and hands-on PyTorch programming. Ensures all students share a common foundation before diving into NLP and LLMs.

0.1 ML Basics: Features, Optimization & Generalization 🟢📐

Feature engineering and representation
Supervised learning: classification and regression fundamentals
Loss functions and optimization: gradient descent, SGD, mini-batch SGD
Overfitting, underfitting, and regularization (L1, L2, dropout)
Bias-variance tradeoff and generalization theory
Cross-validation and model selection strategies

0.2 Deep Learning Essentials 🟢📐

Neural network fundamentals: perceptrons, MLPs, activation functions
Backpropagation and the chain rule
Batch normalization, dropout, and weight initialization
Convolutional neural networks (CNNs) overview
Training best practices: learning rate scheduling, early stopping, gradient clipping

0.3 PyTorch Tutorial 🟢⚙️🔧

Comprehensive PyTorch introduction
Tensors: creation, indexing, broadcasting, device management (CPU/GPU)
Autograd: automatic differentiation, computational graphs, gradient accumulation
Building models with nn.Module: layers, parameters, forward pass
Data loading: Dataset, DataLoader, transforms, batching
Training loop pattern: forward, loss, backward, optimizer step
Saving and loading models: state_dict, checkpoints
Debugging: hooks, gradient inspection, profiling with torch.profiler
Lab: Build and train an image classifier in PyTorch from scratch; practice tensor operations, custom datasets, and the full training loop

0.4 Reinforcement Learning Foundations 🟢📐

The RL framework: agent, environment, state, action, reward, episode
Policy: mapping states to actions; deterministic vs. stochastic policies
Value functions: state-value V(s), action-value Q(s,a); the Bellman equation (intuition, not derivation)
Policy gradient theorem (intuition): adjusting the policy to increase the probability of actions that led to high rewards
PPO intuition: clipping the policy update to prevent destructive large changes; why this matters for LLM training
How RL connects to LLM training: the LLM is the policy, generating a token is an action, the reward model scores the output
This lesson provides the foundations for Module 16 (RLHF, DPO, RLVR)

Module 01

Foundations of NLP & Text Representation

▼

Build intuition for how machines understand text: from bag-of-words to dense vector spaces. Covers classical and neural word representations that underpin all modern LLM work.

1.1 Introduction to NLP & the LLM Revolution 🟢📐

History of NLP: rule-based → statistical → neural → LLM era
Comprehensive NLP task taxonomy:
- Text classification: sentiment analysis, intent detection, topic categorization, spam filtering
- Sequence labeling: NER, POS tagging, chunking
- Text generation: summarization (extractive vs. abstractive), machine translation, paraphrase generation
- Question answering: extractive QA, generative QA, open-domain QA
- Information extraction: relation extraction, event detection, slot filling
- Semantic tasks: textual entailment, semantic similarity, natural language inference
- Conversational AI: dialogue systems, task-oriented dialogue, open-domain chat
- How LLMs are changing each task: from specialized models to unified generative approaches
Why language is hard: ambiguity, context, compositionality
Course roadmap and project overview

1.2 Text Preprocessing & Classical Representations 🟢⚙️🔧

Text cleaning: Unicode normalization, regex, stop words, stemming, lemmatization
Bag-of-Words, TF-IDF, n-grams
Term vectors and TF-IDF in depth: term frequency saturation, inverse document frequency weighting, document length normalization, vector space model for retrieval
One-hot encoding and its limitations
Lab: Build a text preprocessing pipeline with spaCy and NLTK

1.3 Word Embeddings: Word2Vec, GloVe & FastText 🟢⚙️🔧

Distributional hypothesis: "you shall know a word by the company it keeps"
Word2Vec: CBOW and Skip-gram architectures, negative sampling
GloVe: global matrix factorization approach
FastText: subword-level embeddings
Visualizing embeddings: t-SNE, UMAP, analogy tasks
Lab: Train Word2Vec on a custom corpus using Gensim; explore word analogies

1.4 Contextual Embeddings: ELMo & the Path to Transformers 🟡📐

Limitations of static embeddings (polysemy)
ELMo: bi-directional LSTM-based contextualized representations
Transfer learning in NLP: why pre-train?
Setting the stage for BERT and GPT

Module 02

Tokenization is the critical first step of every LLM pipeline. Understand the algorithms behind BPE, WordPiece, and SentencePiece, and learn how tokenizer choice affects model behavior, cost, and multilingual capability.

2.1 Why Tokenization Matters 🟢📐

From characters to tokens: the vocabulary tradeoff
Impact on context window, cost, and model performance
Tokenization artifacts and edge cases (numbers, code, CJK, emoji)

2.2 Subword Tokenization Algorithms 🟡⚙️🔧

Byte Pair Encoding (BPE): algorithm, merge rules, vocab construction
BPE internals: merge table as priority queue: encoding uses greedy left-to-right merging; the merge tree data structure maps byte sequences to token IDs with O(n·log(n)) encoding complexity
WordPiece (BERT's tokenizer): MaxMatch algorithm, likelihood-based merging vs. frequency-based (BPE)
Unigram model (SentencePiece): probabilistic tokenization: Viterbi decoding finds most likely segmentation; EM training to prune vocabulary from large initial set
Byte-level BPE (GPT-2/GPT-4 style): base-256 vocabulary, no unknown tokens, universal UTF-8 coverage
Comparing tokenizers: tiktoken (Rust/Python, fast), Hugging Face tokenizers (Rust core), SentencePiece (C++)
Tokenizer-free / byte-level models: ByT5 (byte-to-byte), MegaByte (patch-based byte model), character-level approaches: tradeoffs: longer sequences but no vocabulary mismatch
Lab: Train a BPE tokenizer from scratch; visualize the merge tree; implement encoding step-by-step; compare token counts and fertility across models

2.3 Tokenization in Practice & Multilingual Considerations 🟡⚙️🔧

Special tokens: [CLS], [SEP], [PAD], <|endoftext|>, chat templates
Tokenizer configuration for chat models (chat_template, apply_chat_template)
Multilingual tokenization: fertility rates, script coverage, cross-lingual vocab sharing
Multimodal tokenization: how vision and audio tokens work
Estimating token counts for API cost optimization
Lab: Inspect and compare tokenization of the same text (English, Chinese, Arabic, code) across GPT-4, Claude, Llama 3, and Gemma tokenizers

Module 03

Sequence Models & the Attention Mechanism

▼

Trace the evolution from RNNs to the attention mechanism: the key breakthrough that enabled transformers. Build deep intuition for how attention works mathematically and conceptually.

3.1 Recurrent Neural Networks & Their Limitations 🟢📐

RNN fundamentals: hidden state, sequential processing
LSTM and GRU: gating mechanisms
Bidirectional RNNs
Vanishing/exploding gradients and the long-range dependency problem
Encoder-decoder architecture for seq2seq tasks

3.2 The Attention Mechanism 🟡⚙️🔧

Intuition: "where to look" when generating output
Bahdanau (additive) attention
Luong (multiplicative / dot-product) attention
Attention weights as soft alignment
Backpropagation through attention: gradient flow through softmax (Jacobian structure), why attention gradients are dense; gradient of scaled dot-product w.r.t. Q, K, V
Attention as differentiable dictionary lookup: soft retrieval from value memory indexed by key similarity
Lab: Implement Bahdanau attention from scratch in PyTorch; manually compute gradients and verify with autograd; visualize attention heatmaps

3.3 Scaled Dot-Product & Multi-Head Attention 🟡⚙️🔧

Query, Key, Value formulation: linear projections W_Q, W_K, W_V and their learned subspaces
Scaled dot-product attention: why scale by √d_k: variance analysis of dot products in high dimensions
Softmax temperature and attention entropy: sharp vs. diffuse attention distributions
Multi-head attention: parallel subspace projections and the concatenation/projection output
Self-attention vs. cross-attention: when Q and KV come from different sequences
Causal (masked) attention: lower-triangular mask for autoregressive models
Attention complexity: O(n²d) compute, O(n²) memory: understanding the quadratic bottleneck
Lab: Implement multi-head self-attention from scratch with explicit matrix operations; verify against PyTorch nn.MultiheadAttention; visualize attention weight distributions and entropy

Module 04

The Transformer Architecture

▼

Deep dive into the full transformer architecture: the foundation of every modern LLM. Understand every component, from positional encoding to layer normalization, and implement one from scratch.

4.1 Transformer Architecture Deep Dive 🟡📐

"Attention Is All You Need" paper walkthrough: original architecture and design rationale
Encoder and Decoder stacks: layer composition, information flow, residual stream hypothesis
Positional encoding internals: sinusoidal (frequency basis, rotation interpretation), learned embeddings, RoPE (rotation matrices, relative position via complex multiplication), ALiBi (linear bias slopes)
Feed-forward networks: expansion ratio (4x → 8/3x for SwiGLU), role as key-value memories (Geva et al.)
Activation functions: ReLU → GELU → SwiGLU: ablation evidence and why SwiGLU wins
Normalization: LayerNorm vs RMSNorm (computation, gradient flow); Pre-LN vs Post-LN (training stability analysis)
Weight initialization: Xavier/He schemes, scaled initialization for deep transformers, μP (maximal update parametrization)
Loss function: cross-entropy for next-token prediction, label smoothing, auxiliary losses
📐 Information theory for language modeling: entropy as the theoretical lower bound on compression; cross-entropy loss is an upper bound on true entropy; perplexity = 2^(cross-entropy) measures how "surprised" the model is; KL divergence measures distribution mismatch between model and data; mutual information quantifies how much context reduces uncertainty about the next token
Residual connections as gradient highways: why transformers train better than deep RNNs

4.2 Build a Transformer from Scratch 🟡⚙️🔧

Implement a complete decoder-only transformer in PyTorch (~300 lines)
Token embeddings + RoPE positional encoding
Multi-head causal self-attention layer
Feed-forward layer with SwiGLU
Training loop on a small text corpus
Lab: Train a BPE-level mini-GPT; generate text samples; profile memory and compute

4.3 Transformer Variants & Efficiency 🔴📐

Encoder-only (BERT), Decoder-only (GPT), Encoder-Decoder (T5, BART): architectural comparison and when to use each
Encoder-decoder deep dive: cross-attention mechanism (queries from decoder, keys/values from encoder); T5 text-to-text framework; BART denoising pre-training; seq2seq fine-tuning for summarization, translation, and structured generation; why encoder-decoder excels at conditional generation tasks vs. decoder-only
Efficient attention: Flash Attention, Multi-Query Attention (MQA), Grouped-Query Attention (GQA)
Multi-head Latent Attention (MLA) as a general efficient attention technique: projects keys and values into a low-rank latent space before caching; reduces KV cache by 10x+ compared to standard MHA; mathematically: K_cache = W_down @ K (compress), K_restored = W_up @ K_cache (decompress at attention time); comparison with GQA and MQA: MLA achieves better quality at similar cache sizes
Sparse attention: Longformer, BigBird patterns
Linear attention and state-space models (Mamba, RWKV, Jamba)
🔬 State Space Models in depth: the S4 lineage (S4, S5, S6/Mamba); continuous-time ODE formulation discretized into linear recurrences; the HiPPO framework for long-range dependency initialization; Mamba's selection mechanism (input-dependent state transitions, making the model data-dependent unlike fixed SSMs); Mamba-2 structured state space duality (connecting SSMs to attention); hybrid architectures: Jamba (Mamba + attention layers), Zamba; when SSMs match or beat transformers (long sequences, low latency) and when they fall short (in-context learning, complex retrieval)
Mixture of Experts (MoE) internals: expert FFN layers, gating network (top-k routing), auxiliary load-balancing loss, expert capacity factor; DeepSeek MoE: shared experts + routed experts, fine-grained expert segmentation
🔬 Computational complexity of attention: O(n²d) time and O(n²) memory for standard attention; theoretical lower bounds (Fine-Grained Complexity perspective, Strong Exponential Time Hypothesis implications); whether sub-quadratic attention can be provably equivalent to full attention; IO complexity analysis underlying FlashAttention
🔬 RWKV architecture internals: the WKV (Weighted Key-Value) mechanism replacing attention with linear-complexity recurrence; time-decay factors creating position-aware token mixing; RWKV-5/6 improvements (multi-headed WKV, better gating); transformer-quality training parallelism with RNN-efficiency inference; comparison with Mamba on standard benchmarks
🔬 Recent sparse attention advances: ring attention for distributed long-context across multiple GPUs (sequence parallelism); blockwise parallel decoding; learned sparse patterns vs. fixed patterns; theoretical framework for which sparse patterns preserve expressiveness vs. lose information
🔬 Gated Attention (NeurIPS 2025 Best Paper): applying a learnable sigmoid gate after scaled dot-product attention; enables non-linearity, sparsity, and attention-sink-free inference; deployed in Qwen3-Next; Gated DeltaNet combines gated linear attention with gated softmax attention for hybrid architectures
🔬 Attention architecture evolution: MHA (original) → GQA (shared KV heads for cache reduction) → MLA (low-rank KV projection) → Gated Attention (sigmoid gate for sparsity) → hybrid architectures combining softmax attention with linear attention (Gated DeltaNet, Jamba); each step trades expressiveness for efficiency

4.4 GPU Fundamentals & Systems for Transformers 🔴⚙️🔧

GPU architecture internals: streaming multiprocessors (SMs), warps (32 threads), thread blocks
Memory hierarchy: registers → SRAM (shared memory, ~20MB) → HBM (global, ~80GB) → host DRAM
Memory bandwidth vs. compute: arithmetic intensity and the roofline model
Why attention is memory-bound: the IO complexity analysis
FlashAttention internals: tiling the QK^T computation, online softmax algorithm, avoiding materialization of the n×n attention matrix
FlashAttention-2/3: warp-level optimizations, FP8 support
Kernel fusion: combining operations to reduce memory round-trips
Triton: writing custom GPU kernels in Python: matrix multiply, fused attention
Resource accounting: FLOPs per token = 6ND (forward+backward), memory = 2P (params) + optimizer states + activations + KV cache (KV cache: stored key/value tensors from previous tokens that avoid recomputation during generation; covered in depth in Module 8.2)
Lab: Write a simple Triton kernel for fused softmax; benchmark against PyTorch native; calculate FLOPs and memory for a 7B model training run

4.5 Transformer Expressiveness Theory 🔴📐

🔬 Universal approximation results for transformers
🔬 What fixed-depth transformers can compute (bounded-depth threshold circuits, TC^0)
🔬 Why chain-of-thought extends computational power (chain-of-thought: prompting the model to show intermediate reasoning steps before the final answer; covered in depth in Module 10.2) (transformers + CoT can simulate arbitrary Turing machines)
🔬 Depth vs. width tradeoffs for expressiveness
🔬 Implications: some problems provably require CoT (they exceed the computational class of single-pass transformers)

Module 05

Decoding Strategies & Text Generation

▼

Understand how LLMs generate text token-by-token. Master the algorithms that control quality, diversity, and speed of generation: from greedy search to speculative decoding. Identified as a gap vs. Stanford CS336 and CMU ANLP.

5.1 Deterministic Decoding Strategies 🟡⚙️🔧

Greedy decoding: simplest but suboptimal
Beam search: exploring multiple hypotheses
Length normalization and length penalty
Constrained beam search: forcing specific tokens/patterns
Lab: Implement greedy and beam search from scratch; compare output quality on summarization

5.2 Stochastic Sampling Methods 🟡⚙️🔧

Temperature scaling: sharpening and flattening distributions
Top-k sampling
Nucleus (top-p) sampling
Min-p sampling: adaptive threshold
Typical decoding and eta sampling
Repetition penalty, frequency penalty, presence penalty
Lab: Implement all sampling methods; generate text at various temperatures and visualize token probability distributions

5.3 Advanced Decoding & Structured Generation 🔴⚙️

🔬 Contrastive decoding: amateur vs. expert model
🔬 Classifier-free guidance for language models
Grammar-constrained decoding (Outlines, Guidance, LMQL)
JSON schema enforcement at the logit level
Watermarking generated text: detection and robustness
🔬 Minimum Bayes Risk (MBR) decoding: sample N candidates, select the one minimizing expected risk under a utility metric (e.g., LLM-judge score, ROUGE, BERTScore); outperforms greedy and best-of-N decoding (ICLR 2025); practical tradeoff: N samples × utility evaluation cost vs. quality gain

5.4 Diffusion-Based Language Models 🔴🔬

🔬 Discrete diffusion for text: MDLM, SEDD, LLaDA, Dream
The forward process adds noise to token embeddings; reverse process denoises to generate
Parallel token generation (all tokens simultaneously, not autoregressive)
🔬 Gemini Diffusion paradigm
Advantages: order-of-magnitude latency reduction for long outputs
Limitations: quality gap vs. autoregressive for complex reasoning
🔬 TraceRL (ICLR 2026): RL post-training for diffusion LLMs

Module 06

Pre-training, Scaling Laws & Data Curation

▼

Understand how LLMs are trained at scale: pre-training objectives, data curation pipelines, scaling laws, and the computational infrastructure behind modern foundation models. Expanded with deeper treatment of scaling laws and data curation per Stanford CS336.

6.1 The Landmark Models 🟢🔬

BERT and its variants (RoBERTa, DeBERTa, ALBERT)
GPT series: GPT-1 → GPT-2 → GPT-3 → InstructGPT → GPT-4
T5 and the text-to-text framework
Emergence: in-context learning, chain-of-thought reasoning

6.2 Pre-training Objectives & Paradigms 🟡🔬

Causal language modeling (CLM): next-token prediction (GPT family)
Masked language modeling (MLM): BERT, RoBERTa
Span corruption / denoising: T5, UL2
Prefix LM: PaLM, GLM
Fill-in-the-middle (FIM) for code models
🔬 Multi-token prediction: training models to predict multiple future tokens simultaneously (Meta, 2024); architecture: shared trunk with N independent prediction heads; benefits: improved sample efficiency, better representations of long-range dependencies, natural fit for speculative decoding; used in DeepSeek V3 training; challenges: increased memory during training, diminishing returns beyond 4 tokens

6.3 Scaling Laws & Compute-Optimal Training 🔴⚙️🔧

Kaplan scaling laws: loss as a function of N, D, C
Chinchilla laws: compute-optimal data/parameter ratios
Data-constrained scaling: what happens when you run out of data?
Over-training small models (Llama approach): trading compute for inference cost
Predicting loss from compute budget: practical use of scaling laws
Emergent capabilities and phase transitions
🔬 The emergent abilities debate: Schaeffer et al. (2023) argued emergent abilities are a "mirage" caused by nonlinear metric choices (switching from accuracy to log-likelihood makes transitions smooth); counterarguments: some capabilities genuinely appear discontinuously on continuous metrics; implications for AI safety: if capabilities are unpredictable, governance is harder; implications for scaling decisions: if capabilities are smooth, we can extrapolate
Lab: Fit scaling law curves to mini-model training runs; predict loss for a target model size

6.4 Data Curation at Scale 🔴⚙️🔧

Pre-training data sources: Common Crawl, The Pile, RedPajama, FineWeb, DCLM
Web crawling and text extraction pipelines
Deduplication: exact (hash), near-duplicate (MinHash/SimHash), fuzzy
Quality filtering: heuristic rules, perplexity scoring, classifier-based
Data mixing: domain proportions and their impact on capabilities
Toxicity and PII removal at scale
🔬 Data pruning: removing low-value training examples to reduce compute without quality loss; influence functions: tracing model predictions back to specific training examples (which training points most affect this output?); TRAK and datamodels for efficient attribution at scale; membership inference attacks as attribution tools; connection to copyright litigation (NYT v. OpenAI) and GDPR data subject requests; practical use: debugging model failures by identifying problematic training data
Lab: Build a mini data curation pipeline: crawl → extract → deduplicate → filter → quality-score using FineWeb tools

6.5 Optimizers & Training Dynamics 🔴🔬

Adam optimizer internals: first/second moment estimation, bias correction, memory cost (2× model params)
AdamW: decoupled weight decay: why it matters for transformers
Memory-efficient optimizers: Adafactor (factored second moments), 8-bit Adam, LION (sign-based)
Learning rate schedules: warmup necessity (preventing early divergence), cosine decay with restarts
Gradient accumulation: simulating large batch sizes: interaction with batch norm and LR
Training dynamics: loss landscape geometry, sharp vs. flat minima, grokking phenomenon
Training instabilities: loss spikes, NaN gradients: root causes and mitigations (z-loss, gradient clipping)

6.6 Distributed Training at Scale 🔴⚙️🔧

Collective communication primitives: all-reduce, all-gather, reduce-scatter: ring vs. tree topologies
Data parallelism (DDP): replicated model, gradient all-reduce, synchronized SGD
Fully Sharded Data Parallelism (FSDP): parameter sharding, forward/backward gather/scatter lifecycle
ZeRO optimization stages: Stage 1 (optimizer states) → Stage 2 (+gradients) → Stage 3 (+parameters)
Tensor parallelism: column/row splitting of linear layers, all-reduce placement
Pipeline parallelism: micro-batching, 1F1B schedule, pipeline bubbles
Mixed precision: FP16 (loss scaling needed), BF16 (range preserved), FP8 (Hopper GPUs)
FP8 training at scale: DeepSeek V3 demonstrated successful FP8 mixed-precision training at 671B parameters (first large-scale demonstration); E4M3 for forward pass activations, E5M2 for gradients; per-tensor dynamic scaling to prevent overflow; 2x memory reduction and higher throughput vs. BF16 with minimal quality loss
Gradient checkpointing: recomputing activations to trade compute for memory: optimal checkpoint placement
Data loading pipeline: tokenized data sharding, weighted sampling across domains, curriculum
Lab: Train a small model with FSDP across multiple GPUs; compare DDP vs. FSDP memory footprint; profile communication overhead

6.7 In-Context Learning Theory 🔴🔬

🔬 The mystery: how do transformers learn from examples in the prompt without gradient updates?
🔬 Transformers as implicit meta-learners: the Bayesian interpretation (Xie et al. 2022)
🔬 In-context learning as implicit gradient descent (Akyurek et al. 2023, Von Oswald et al. 2023)
🔬 Task vectors: how in-context examples shift internal representations toward task-relevant subspaces
🔬 Mesa-optimization: are transformers learning optimization algorithms internally?
Limitations: when in-context learning fails (distribution shift, complex reasoning, long contexts)
Connection to few-shot prompting practice: why example selection and ordering matter

Module 07

Modern LLM Landscape & Model Internals

▼

Survey the current state of LLMs, both closed and open-source, and understand the architectural innovations, reasoning capabilities, and multilingual dimensions of modern models.

7.1 Closed-Source Frontier Models 🟢🔬

OpenAI: GPT-4o, o1/o3: reasoning models and chain-of-thought
Anthropic: Claude 3.5 Sonnet, Claude 4 family: constitutional AI, long context
Google: Gemini 2.0 / 2.5: native multimodality, million-token context, "thinking" mode
xAI Grok, Cohere Command R+, Mistral Large: second-tier frontier models
Comparing capabilities, pricing tiers, rate limits, and context windows

7.2 Open-Source & Open-Weight Models 🟡⚙️🔧

Meta Llama 3 / 3.1 / 4: architecture, training, chat fine-tuning
Mistral, Mixtral (MoE), Mistral Large
Google Gemma 2 / 3
Qwen 2.5, DeepSeek-V3 / R1: MoE and reasoning
DeepSeek V3 architecture innovations: Multi-head Latent Attention (MLA) compresses KV cache by projecting keys/values into a low-rank latent space, reducing cache by 10x+; FP8 mixed-precision training at 671B parameters (first successful large-scale FP8 training); auxiliary-loss-free MoE load balancing using bias terms instead of loss penalties; multi-token prediction training objective (predict next N tokens simultaneously)
Microsoft Phi-3 / Phi-4: small but capable models via knowledge distillation
Recent: Llama 4 (MoE, native multimodal), Gemma 3 (vision), DeepSeek-R1 (open reasoning)
Specialized: CodeLlama, StarCoder2, Whisper, LLaVA
The Hugging Face ecosystem: Model Hub, Transformers, Datasets, Spaces
Lab: Download and run Llama 3 8B locally; compare output quality with a 70B model via API

7.3 Reasoning & Test-Time Compute 🔴⚙️🔧

Inference-time scaling: the paradigm shift from train-time to test-time compute
Chain-of-thought at scale: o1/o3, DeepSeek-R1 internals
Process reward models (PRMs) vs. outcome reward models (ORMs)
Best-of-N sampling with reward-guided selection
Monte Carlo Tree Search for language: LATS, AlphaProof approach
Compute-optimal inference: when to think longer vs. use a bigger model
Lab: Implement best-of-N with a reward model; compare accuracy vs. compute on math reasoning tasks

7.4 Multilingual & Cross-Cultural LLMs 🟡🔬

Multilingual pre-training: cross-lingual transfer, curse of multilinguality
Low-resource language challenges and solutions
Cultural bias in LLMs: Western-centric defaults, evaluation across cultures
Multilingual evaluation benchmarks and metrics
Adapting English-centric models to new languages (continued pre-training, vocabulary extension)

Module 08

Inference Optimization & Efficient Serving

▼

Master the techniques that make LLM inference fast and affordable: from quantization and KV cache optimization to speculative decoding and high-throughput serving. Identified as a gap vs. Stanford CS336 and CMU ANLP.

8.1 Model Quantization 🟡⚙️🔧

Quantization math: mapping float → int: absmax (symmetric: q = round(x / max|x| × 2^n-1)), zero-point (asymmetric: shift + scale), per-tensor vs. per-channel vs. per-group granularity
Data types: INT8, INT4, FP8 (E4M3, E5M2), NF4 (normal-float: quantile-based 4-bit, optimal for normally-distributed weights)
Calibration strategies: how to choose quantization parameters: min/max, percentile, MSE-minimizing, cross-entropy-minimizing
Post-training quantization: GPTQ (layer-wise Hessian-based optimal rounding), AWQ (activation-aware: protect salient weight channels), GGUF (llama.cpp format, mixed-precision per tensor)
Quantization-aware training: simulated quantization during forward pass, straight-through estimator for gradients
bitsandbytes: 4-bit and 8-bit loading with automatic mixed-precision; NF4 + double quantization for QLoRA
Quality degradation analysis: perplexity vs. bit width curves, task-specific sensitivity, outlier features
Lab: Quantize a 7B model to 4-bit with GPTQ and AWQ; compare perplexity, generation quality, inference speed, and memory at INT8/INT4/NF4

8.2 KV Cache & Memory Optimization 🔴⚙️🔧

The KV cache explained: storing key/value tensors from all previous tokens to avoid recomputation
KV cache data structure: tensor of shape [batch, num_heads, seq_len, head_dim] per layer: memory formula: 2 × layers × heads × seq_len × head_dim × dtype_size
Why inference is memory-bandwidth-bound: low arithmetic intensity during generation
PagedAttention internals: virtual memory analogy: block tables map logical KV positions to physical GPU memory blocks; eliminates fragmentation and enables memory sharing across sequences
KV cache compression: INT8/INT4 quantization of cached values, H2O eviction (Heavy-Hitter Oracle), sliding window attention, StreamingLLM (attention sinks)
MQA vs. GQA vs. MHA: sharing K,V heads reduces cache by N×; GQA as the modern compromise (Llama 2/3)
Prefix caching: RadixAttention tree for sharing cached prefixes across requests: data structure and lookup
Continuous batching: dynamically adding/removing sequences mid-batch: iteration-level vs. request-level scheduling
🔬 Test-Time Training (TTT): compressing long context into model weights via continued next-token-prediction at inference time; TTT layers replace attention with a learned update rule applied during inference; achieves 35x speedup over full attention at 2M context; blurs the line between training and inference
🔬 DeepSeek Sparse Attention (DSA): hierarchical two-stage sparse attention pipeline (Lightning indexer for coarse selection, then fine-grained token selection); reduces inference cost by approximately 70% for long contexts; introduced in DeepSeek V3.2
Lab: Calculate KV cache size for Llama 3 8B/70B at various context lengths; profile memory with vLLM; implement prefix caching and measure throughput gain

8.3 Speculative Decoding 🔴⚙️🔧

Speculative decoding principle: draft γ tokens with fast model, verify all γ in a single forward pass of target model: mathematically guaranteed to match target distribution
Acceptance/rejection: compare draft token probabilities p(x) with target q(x); accept with probability min(1, q(x)/p(x)); reject and resample from adjusted distribution
Draft model selection: separate small model, self-speculative (layer skipping), n-gram lookup, retrieval-based
EAGLE: feature-level autoregression: predicting hidden states, not tokens; tree-structured verification for parallel candidate evaluation
Medusa: multiple prediction heads on top of target model: each head predicts k-th future token
Token tree verification: batched verification of multiple candidate sequences in a single forward pass using tree attention masks
When speculative decoding helps: high draft acceptance rate (>70%), latency-sensitive single-request, target model is bandwidth-bound
Lab: Implement speculative decoding from scratch with rejection sampling; benchmark speedup with different draft models; measure acceptance rates

8.4 Serving Infrastructure 🟡⚙️🔧

vLLM: high-throughput serving with continuous batching
TGI (Text Generation Inference) by Hugging Face
SGLang: optimized runtime with RadixAttention
TensorRT-LLM: NVIDIA's inference engine with hardware-level GPU optimization; 30-50% higher throughput than vLLM at high concurrency
LMDeploy: inference engine with TurboMind backend; competitive quantization support
Ollama and llama.cpp for local inference
Triton Inference Server for production
Benchmarking: throughput (tokens/sec), latency (TTFT, TPS), concurrency
Lab: Deploy vLLM and TGI side-by-side; benchmark throughput and latency under load

Module 09

Working with LLM APIs

▼

Master the practical skills of calling, configuring, and optimizing LLM APIs from all major providers.

9.1 OpenAI API Deep Dive 🟢⚙️🔧

Chat Completions API: messages, roles (system/user/assistant), parameters
Temperature, top_p, max_tokens, frequency/presence penalty
Streaming responses with SSE
Function calling / tool use
Structured Outputs (JSON mode, response_format)
Batch API for cost reduction
Lab: Build a multi-turn chatbot with function calling using the OpenAI Python SDK

9.2 Anthropic, Google & Other Provider APIs 🟢⚙️🔧

Anthropic Messages API: system prompts, prompt caching, tool use, extended thinking
Google Gemini API: generateContent, grounding, code execution
AWS Bedrock: unified access to multiple model providers
Azure OpenAI: enterprise deployment patterns
API comparison: feature parity, pricing, rate limits
Lab: Implement the same task across OpenAI, Anthropic, and Gemini APIs; compare results and cost

9.3 API Abstraction Layers & Routing 🟡⚙️🔧

LiteLLM: unified interface for 100+ LLM providers
OpenAI-compatible APIs: standardization pattern
OpenRouter: model routing and fallback
Cost tracking, rate limiting, and retry strategies
Production LLM error handling patterns: circuit breaker pattern (failover when provider returns errors for extended periods); timeout management (separate TTFT timeout from total generation timeout); error taxonomy: 429 (rate limit, exponential backoff with jitter), context length exceeded (truncate and retry), content filter triggered (rephrase), malformed tool call JSON (retry with stricter schema); graceful degradation (cached responses, simpler model fallback, static FAQ when LLM unavailable)
Caching strategies: semantic caching, prompt caching
Semantic cache implementation: embed incoming query, similarity search against cached query-response pairs (cosine threshold typically 0.95+), return cached response if match found; cache invalidation strategies (TTL, source document change detection); tools: GPTCache, Redis with vector search
Token budget enforcement: per-user/organization token tracking, hard/soft spending limits, cost alerting on anomalous usage spikes, per-feature cost attribution dashboards
AI gateways for production: Portkey (routing, fallbacks, spend tracking, caching, guardrails across 1600+ LLMs), Helicone (open-source observability proxy with request logging and cost tracking)
Lab: Build a provider-agnostic LLM client with automatic fallback and cost tracking

Module 10

Prompt Engineering & Advanced Techniques

▼

Prompting is programming with natural language. Learn systematic techniques from basic few-shot to advanced reasoning chains, reflection patterns, and automated prompt optimization.

10.1 Fundamentals of Prompt Engineering 🟢⚙️🔧

Zero-shot, one-shot, and few-shot prompting
System prompts and role assignment
Instruction clarity: specificity, constraints, output format
Prompt templates and variable injection
Handling edge cases: refusals, hallucinations, verbosity
Lab: Iteratively refine prompts for a classification task; measure accuracy improvements

10.2 Advanced Reasoning Strategies 🟡⚙️🔧

Chain-of-Thought (CoT) prompting and its variants
Self-consistency: sampling multiple reasoning paths and majority voting
Tree-of-Thought (ToT): structured exploration with backtracking
Step-back prompting: abstraction before reasoning
Meta-prompting and prompt chaining
Lab: Implement CoT, self-consistency, and ToT for math reasoning; compare accuracy

10.3 Reflection & Self-Critique Patterns 🟡⚙️🔧

Reflection as a first-class design pattern (per Andrew Ng's framework) (see also Module 21.1 for reflection as an agentic architecture pattern)
Self-evaluation: having the LLM critique its own output
Iterative refinement loops: generate → critique → revise
Constitutional AI-style self-checks at prompt-time
Reflexion: memory-augmented self-reflection over multiple attempts
When reflection helps vs. when it's compute-wasteful
Lab: Build a reflection loop for code generation: generate → test → reflect on errors → fix; measure pass@1 improvement

10.4 Structured Output & Automated Prompt Optimization 🟡⚙️🔧

JSON mode and schema enforcement
Pydantic models for output validation (Instructor library)
Automatic prompt optimization: DSPy, OPRO
Prompt versioning, A/B testing, and regression testing
Lab: Use Instructor + Pydantic to extract structured data; then use DSPy to auto-optimize a multi-step prompt pipeline

Module 11

Hybrid ML+LLM Architectures & Decision Frameworks

▼

In production, LLMs rarely work alone. Learn when to use an LLM vs. classical ML, how to combine them in hybrid architectures, and how to make principled cost-performance tradeoffs. Addresses the #1 gap identified across all three executive perspectives.

11.1 When NOT to Use an LLM 🟡⚙️🔧

The LLM decision framework: accuracy vs. latency vs. cost vs. interpretability: when classical ML wins
Classification: TF-IDF + logistic regression at 0.001x cost vs. GPT-4: when each is appropriate
Named Entity Recognition: spaCy/CRF vs. LLM extraction: speed and accuracy tradeoffs (see Module 11.5 for full IE treatment)
Tabular prediction: XGBoost/LightGBM vs. LLM: structured data is still king for classical ML
Regex and rule-based extraction: when deterministic rules beat stochastic LLM outputs
Cost modeling: calculating per-query cost at scale for LLM vs. classical approaches ($0.001 vs. $0.00001)
Lab: Benchmark the same classification task with TF-IDF+LR, fine-tuned BERT, GPT-4 few-shot, and fine-tuned Llama: compare accuracy, latency, cost, and reliability

11.2 Hybrid ML+LLM Architectures 🔴⚙️🔧

Pattern: LLM as feature extractor: use LLM to generate embeddings or structured features, feed into XGBoost/neural net for final prediction
Pattern: Classical triage → LLM escalation: cheap model handles 80% of cases, LLM handles the complex 20%
Pattern: LLM-powered feature engineering: generate text descriptions of structured data, enrich sparse features with LLM reasoning
Pattern: Ensemble: classical model + LLM vote, confidence-weighted combination
Pattern: LLM → structured pipeline: LLM extracts entities/intent, downstream classical system executes (e.g., NLU → slot-filling → API call)
Pattern: Classical NLP pre-filter + LLM: regex/keyword filter reduces candidates, LLM does semantic analysis on survivors
Cascading model architectures: small model → medium model → large model with confidence-based routing
Lab: Build a customer support system where a classifier routes tickets, an LLM extracts structured info from complex ones, and a rules engine executes the resolution

11.3 LLMs for Time Series & Forecasting 🟡⚙️

🔬 LLM-native time series models: TimeGPT, Chronos (Amazon), Lag-Llama, Moirai: architectures and capabilities
🔬 Zero-shot forecasting: pre-trained time series foundation models vs. ARIMA/Prophet
LLM-powered anomaly explanation: detecting anomalies with classical methods, explaining them with LLMs
Multimodal time series: combining numerical data with text context (news, reports) for enriched forecasting
Limitations: when statistical models still dominate (short series, simple seasonality, high-frequency data)

11.4 Cost-Performance Optimization at Scale 🔴⚙️🔧

Total Cost of Ownership (TCO) modeling: API costs + infrastructure + engineering time + maintenance
LLM cost optimization patterns: prompt caching, semantic caching, model routing (small→large), batch processing
Latency budgets: decomposing end-to-end latency across retrieval, LLM inference, and post-processing
Quality-cost Pareto frontier: plotting accuracy vs. cost for different model configurations
Build vs. buy analysis: self-hosted open-source vs. API provider: breakeven calculations based on volume
Lab: Build a model router that sends simple queries to a small model and complex queries to GPT-4; measure cost savings vs. quality loss

11.5 Information Extraction with LLMs 🟡⚙️🔧

The IE task landscape: Named Entity Recognition (NER), relation extraction, event extraction, coreference resolution, slot filling
Classical IE pipeline: rule-based → CRF/BiLSTM-CRF → fine-tuned BERT for NER; spaCy, Flair, Stanza
LLM-based IE: zero-shot and few-shot extraction with structured output (JSON mode); prompt design for entity and relation extraction
Hybrid IE: classical NER for high-recall extraction, LLM for disambiguation, normalization, and complex relations
Structured output enforcement: Pydantic models, JSON schema constraints, Instructor library, BAML
Evaluation: entity-level F1 (strict vs. partial match), relation extraction metrics, error analysis patterns
Production IE patterns: batch extraction from document corpora, incremental knowledge base population, quality monitoring
Lab: Build an IE pipeline that extracts entities and relationships from the project dataset using both spaCy NER and LLM few-shot extraction; compare precision, recall, and cost

Module 12

Synthetic Data Generation & LLM Simulation

▼

Synthetic data is the backbone of this course's project. Learn to generate high-quality, diverse, and domain-specific datasets, and use LLMs as simulators for evaluation and testing.

12.1 Principles of Synthetic Data Generation 🟡⚙️

Why synthetic data: cost, privacy, coverage, scale
Types: instruction data, conversation data, preference pairs, domain data
Quality dimensions: diversity, accuracy, consistency, naturalness
Risks: model collapse, bias amplification, data contamination
🔬 LLM output homogeneity problem (NeurIPS 2025): studies across 70+ models reveal pronounced intra-model and inter-model homogenization of creative content; implications for synthetic data (model collapse risk when training on LLM-generated data); mitigation: diversity-promoting decoding, temperature tuning, persona-driven generation
Legal and ethical considerations

12.2 LLM-Powered Data Generation Pipelines 🟡⚙️🔧

Self-Instruct and Evol-Instruct (WizardLM) approaches
Generating instruction-response pairs with seed tasks
Multi-turn conversation synthesis
Persona-driven generation for diversity
Domain-specific data generation strategies
Using LLMs to generate preference/ranking data (for RLHF/DPO, covered in Module 16)
Lab: Build a pipeline to generate 10K synthetic customer support conversations using persona templates and quality filters

12.3 LLM-as-Simulator & Evaluation Generation 🟡⚙️🔧

Simulating users: generating realistic interaction patterns
Synthetic test set generation for RAG evaluation
Red-teaming data generation: adversarial prompt synthesis
Synthetic A/B test scenarios for LLM applications
LLM-based evaluation harness generation
Lab: Generate a synthetic evaluation suite for the project: test questions, expected answers, edge cases, and adversarial inputs

12.4 Quality Assurance & Data Curation 🟡⚙️🔧

Automated quality scoring with LLM-as-judge
Deduplication: exact, near-duplicate (MinHash), semantic
Filtering: length, language, toxicity, topic relevance
Argilla for data labeling and review
Distilabel for scalable synthetic data pipelines
Lab: Build a quality-scored synthetic data pipeline using Distilabel; curate a fine-tuning dataset

12.5 LLM-Assisted Labeling & Active Learning 🟡⚙️🔧

LLM pre-labeling: using LLMs to generate initial labels for human review: 5-10x annotation speedup
Confidence-based routing: LLM labels high-confidence samples automatically, humans label uncertain ones
Active learning with LLMs: selecting the most informative samples for human annotation using uncertainty sampling and diversity sampling
Annotation tools: Label Studio, Prodigy, Argilla: LLM integration patterns
Annotation guideline generation: using LLMs to draft and iterate on labeling instructions
Quality control: inter-annotator agreement (Cohen's κ), LLM-vs-human agreement tracking, label noise detection
Lab: Build an LLM-in-the-loop labeling pipeline: LLM pre-labels → confidence routing → human review in Argilla → fine-tuning dataset

12.6 Weak Supervision & Programmatic Labeling 🟡⚙️

Weak supervision fundamentals: labeling functions, noise-aware models, and the Snorkel paradigm
Writing labeling functions: heuristics, pattern matching, knowledge bases, pre-trained models as weak sources
Label aggregation: majority voting, generative label models, handling conflicts and abstentions
Combining weak supervision with LLM-generated labels for scalable annotation
When to use weak supervision vs. LLM labeling vs. human annotation: cost and quality tradeoffs

Project Milestone: Generate the synthetic conversational dataset (10K+ examples) that will be used throughout the rest of the course for fine-tuning, RAG, and agent building. Include multi-turn dialogues, tool-use examples, preference pairs, and evaluation test sets.

Module 13

Fine-Tuning Fundamentals

▼

Learn the complete workflow of fine-tuning LLMs: from data preparation and formatting to training, monitoring, and evaluating adapted models.

13.1 When and Why to Fine-Tune 🟡⚙️

Prompting vs. RAG vs. fine-tuning: decision framework
Use cases: style/tone, domain knowledge, output format, latency, cost
Full fine-tuning vs. parameter-efficient methods
Catastrophic forgetting and how to mitigate it
Continual pre-training vs. instruction fine-tuning

13.2 Data Preparation for Fine-Tuning 🟡⚙️🔧

Dataset formats: Alpaca, ShareGPT, ChatML, conversational
Chat templates and tokenizer configuration
Train/validation/test splits for LLMs
Data mixing and balancing strategies
Packing sequences for efficient training
Lab: Prepare the synthetic dataset from Module 12 into Hugging Face Datasets format with proper chat templates

13.3 Supervised Fine-Tuning (SFT) 🟡⚙️🔧

Full fine-tuning with Hugging Face Trainer / TRL
Hyperparameters: learning rate, batch size, warmup, weight decay, epochs
Learning rate schedulers: cosine, linear, constant with warmup
Gradient accumulation for large effective batch sizes
Monitoring with Weights & Biases, TensorBoard
Lab: Fine-tune a Llama 3 8B model on synthetic data using TRL's SFTTrainer; track metrics in W&B

13.4 Fine-Tuning via Provider APIs 🟡⚙️🔧

OpenAI fine-tuning API: data format, training, deployment
Google Vertex AI model tuning
Trade-offs: ease vs. control vs. cost
Lab: Fine-tune GPT-4o-mini via OpenAI API on synthetic data; compare with locally fine-tuned model

13.5 Fine-Tuning for Representation Learning 🟡⚙️

Why fine-tune for representations: domain shift, specialized similarity, clustering quality
Choosing the base model: encoder-only (BERT family) vs. decoder-only (LLM2Vec approach) for embeddings
When to fine-tune embeddings vs. use off-the-shelf: domain specificity thresholds
Full treatment of embedding training (losses, hard negatives, Sentence-Transformers API, labs) is in Module 18.1

13.6 Fine-Tuning for Classification & Sequence Tasks 🟡⚙️🔧

Adding classification heads to pre-trained models: linear probe vs. full fine-tuning
Single-label classification: sentiment, intent, topic; multi-label classification: tagging, multi-intent detection
Token classification: NER, POS tagging; adding per-token classification heads
Sequence-pair tasks: entailment, similarity, question-answer relevance
Practical considerations: class imbalance (weighted loss, oversampling), threshold tuning for multi-label, calibration
Hugging Face AutoModelForSequenceClassification, AutoModelForTokenClassification: practical API walkthrough
Lab: Fine-tune BERT for intent classification and a decoder model (Llama) for the same task; compare accuracy, latency, and cost

13.7 Adapting Models for Long Text 🔴⚙️🔧

The long context challenge: why models trained on 4K tokens struggle at 32K+
Context extension techniques: RoPE scaling (linear, NTK-aware, YaRN), position interpolation, dynamic NTK
Continued pre-training for long context: LongRoPE, LongLoRA approaches (LoRA is introduced in Module 14.1)
Chunking strategies for long documents: hierarchical processing, map-reduce summarization, sliding window with overlap
Lost-in-the-middle phenomenon: why models attend poorly to middle context; mitigation strategies (reordering, recursive summarization)
Practical tradeoffs: memory scaling (O(n^2) attention), inference latency at long contexts, quality degradation curves
Llama 4 Scout 10M token context window: architectural innovations enabling extreme context (iRoPE: interleaved RoPE with some layers using no positional encoding, enabling infinite context extrapolation); early-fusion multimodal approach processing images and text jointly from the first layer
Lab: Compare model performance on a QA task at 4K, 16K, and 64K context lengths; implement chunking and map-reduce as alternatives to long-context models

Module 14

Parameter-Efficient Fine-Tuning (PEFT)

▼

Train large models on consumer hardware by only updating a fraction of parameters. Master LoRA, QLoRA, and other PEFT methods that democratize fine-tuning.

14.1 LoRA & QLoRA 🟡⚙️🔧

Low-Rank Adaptation math: W' = W + BA where B ∈ ℝ^d×r, A ∈ ℝ^r×d: freezing W, training only B and A
Why low-rank works: weight update matrices during fine-tuning have low intrinsic rank (Aghajanyan et al.)
Rank (r): tradeoff between capacity and efficiency: typically r=8-64 vs. d=4096
Alpha (α) and scaling: α/r scaling factor: why it matters for learning rate transfer across ranks
Target modules: which linear layers to adapt (q_proj, v_proj, k_proj, o_proj, gate_proj, up_proj, down_proj)
QLoRA internals: NF4 data type (quantile-based 4-bit), double quantization (quantizing the quantization constants), paged optimizers for memory spikes
Merging: W_merged = W + (α/r) × BA: lossless for inference, no additional latency
Hugging Face PEFT library: config, model wrapping, saving/loading adapters
Lab: Fine-tune Llama 3 8B with QLoRA on a single GPU; inspect adapter weight matrices; merge and compare quality with full fine-tune

14.2 Advanced PEFT Methods 🔴⚙️

DoRA: Weight-Decomposed Low-Rank Adaptation
LoRA+: improved learning rate scheduling for LoRA
Prefix Tuning, P-Tuning: prepending trainable embeddings to hidden states
Prompt Tuning in depth: learning soft prompt tokens that are prepended to the input; comparison with discrete prompt search; scaling behavior showing prompt tuning matches fine-tuning as model size grows
Adapter layers (Houlsby, Pfeiffer)
IA3 (Infused Adapter by Inhibiting and Amplifying)
Multi-adapter serving: LoRAX, S-LoRA
Choosing the right PEFT method for your use case

14.3 Training Platforms & Tools 🟡⚙️🔧

Unsloth: 2x faster fine-tuning with memory optimization
Axolotl: configuration-driven fine-tuning
LLaMA-Factory: web UI for fine-tuning
torchtune: PyTorch-native fine-tuning library with memory-efficient recipes for LoRA/QLoRA on consumer GPUs (24GB VRAM)
TRL (Transformer Reinforcement Learning) library
Cloud training: Google Colab, Lambda Labs, RunPod, Modal
Lab: Use Unsloth to fine-tune Mistral 7B with QLoRA in under 30 minutes on a free Colab GPU

Module 15

Knowledge Distillation & Model Merging

▼

Create smaller, faster models that retain the capabilities of larger ones. Learn distillation techniques and model merging strategies that are widely used in the open-source LLM community. Identified as a gap: core technique behind Phi, Orca, distilled DeepSeek-R1.

15.1 Knowledge Distillation for LLMs 🔴⚙️🔧

Classical distillation: teacher-student framework, soft targets, temperature
Black-box distillation: distilling from API-only models via synthetic data
White-box distillation: logit matching, intermediate layer matching
Case studies: Orca (progressive learning from GPT-4), Phi (textbook-quality data), distilled DeepSeek-R1
Speculative knowledge distillation: training draft models for speculative decoding
Legal and licensing considerations of distillation
⚙️ Small-but-capable model research: the Phi series (Microsoft) demonstrating data quality over quantity; key innovations: synthetic data curriculum, targeted capability training, careful data mixing; Gemma 3 (Google), SmolLM (Hugging Face), Qwen2.5-Coder: similar principles at different scales; implications: 4B models rivaling 70B on specific tasks when trained with the right data; practical relevance: deployment on edge devices, mobile, and cost-constrained environments
Lab: Distill a 70B model's reasoning capabilities into a 7B model via synthetic data generation and SFT

15.2 Model Merging & Composition 🔴⚙️🔧

Model merging intuition: combining strengths of multiple fine-tunes
Merging methods: Linear, SLERP, TIES, DARE, Model Stock
Task arithmetic: adding and subtracting task vectors
Model soups: averaging multiple checkpoints
MergeKit: practical model merging toolkit
Evolutionary model merging: Sakana AI's approach
Lab: Merge two LoRA fine-tunes (one for code, one for chat) using SLERP and TIES; evaluate the combined model

15.3 Continual Learning & Domain Adaptation 🔴🔬

Continual pre-training on domain-specific corpora
Vocabulary extension for new domains/languages
Replay-based methods to prevent catastrophic forgetting
Elastic Weight Consolidation (EWC) and related techniques
Progressive training: curriculum and staged approaches

Project Milestone: Fine-tune a model on the synthetic dataset using QLoRA. Optionally distill or merge with a reasoning adapter. Upload the adapter to Hugging Face Hub.

Module 16

Alignment: RLHF, DPO & Preference Tuning

▼

Align LLMs with human preferences using reinforcement learning and direct optimization methods.

16.1 RLHF: Reinforcement Learning from Human Feedback 🔴🔬

📐 The three-stage alignment pipeline: SFT → Reward Model → PPO
Reward model architecture: same transformer backbone with scalar head; trained on preference pairs (chosen, rejected)
📐 Bradley-Terry model: P(y₁ ≻ y₂) = σ(r(y₁) - r(y₂)): converting preferences to reward signal
PPO for LLMs: policy = the LM, action = next token, reward = RM score; clipped objective to prevent large policy updates
KL divergence penalty: D_KL(π_θ || π_ref): preventing reward hacking and maintaining base model capabilities
Process Reward Models (PRMs): reward per reasoning step vs. Outcome Reward Models (ORMs): reward on final answer only
GRPO (Group Relative Policy Optimization): DeepSeek's approach: no separate reward model, group-relative advantages
RLHF infrastructure: separate processes for generation, reward scoring, and training: distributed architecture

16.2 DPO & Modern Preference Optimization 🔴⚙️🔧

DPO derivation: reparametrizing the RLHF objective to eliminate the reward model: loss = -log σ(β(log π(y_w)/π_ref(y_w) - log π(y_l)/π_ref(y_l)))
DPO internals: implicit reward model, reference model frozen, β controls deviation from reference policy
KTO (Kahneman-Tversky Optimization): works with binary feedback (good/bad) instead of preference pairs: loss-averse weighting
ORPO (monolithic preference tuning without reference model), SimPO (length-normalized rewards), IPO (identity preference optimization)
Creating preference datasets: chosen vs. rejected pairs
Using synthetic preferences from stronger models
Lab: Train a DPO adapter on synthetic preference data using TRL's DPOTrainer

16.3 Constitutional AI & Self-Alignment 🔴🔬

Anthropic's Constitutional AI (CAI) approach
RLAIF: AI feedback instead of human feedback
Self-play and iterative self-improvement
Alignment tax and capability-alignment tradeoffs
🔬 Shallow safety alignment (ICLR 2025 Outstanding Paper): safety training adapts only the first few output tokens of LLM responses; this explains why fine-tuning attacks, prefilling attacks, and adversarial suffix attacks succeed at bypassing safety; implications: need for deepened alignment across all generation steps, regularized fine-tuning objectives

16.4 RLVR: Reinforcement Learning with Verifiable Rewards 🔴⚙️🔧

The RLVR paradigm: training reasoning models using automatically verifiable rewards (math correctness, code execution, formal proofs) instead of human feedback
Why RLVR works without human annotators: verifiable reward signals provide exact supervision
GRPO (Group Relative Policy Optimization) as the core algorithm: relative advantage within sampled groups
⚙️ DeepSeek-R1 training pipeline: cold start SFT, then RLVR on math/code, then rejection sampling, then full SFT, then final RLVR
Extension beyond math/code: RLVR for chemistry, biology, structured reasoning tasks
RLVR extensions: AlphaProof for mathematical proof verification, DeepSeek-Prover-V2 for formal theorem proving, code execution feedback for programming tasks; emerging: RLVR for chemistry (molecular property verification), biology (protein structure validation), and multi-step tool use (action outcome verification)
⚙️ The open reasoning model ecosystem: QwQ, Sky-T1, open reproductions of R1 distillation
Theoretical analysis: RLVR implicitly incentivizes correct intermediate reasoning steps
Lab: Train a small model with RLVR on math problems using verifiable rewards; compare with DPO on the same task

Module 17

Interpretability & Mechanistic Understanding

▼

Peer inside the black box. Understand how and why LLMs produce their outputs using probing, attention analysis, and mechanistic interpretability techniques. Identified as a gap vs. Berkeley CS294-267 Understanding LLMs.

17.1 Attention Analysis & Probing 🟡⚙️🔧

Attention visualization: what do attention heads look at?
Attention patterns: induction heads, previous-token heads, positional heads
Probing classifiers: what information is encoded in hidden states?
Probing classifiers methodology: linear vs. nonlinear probes (linear probes test what is linearly accessible, nonlinear probes may learn the task themselves); control tasks and selectivity (Hewitt and Liang, 2019), ensuring probes measure representation quality, not probe capacity; the "probing is not understanding" critique; practical applications: probing for syntactic structure, world knowledge, factual associations in transformer layers
Logit lens and tuned lens: reading the residual stream
Lab: Visualize attention patterns in a GPT-2 model; use probing to detect syntactic information in hidden layers

17.2 Mechanistic Interpretability 🔴⚙️🔧

Circuits and features: the mechanistic interpretability framework: features as directions in activation space, circuits as computational subgraphs
Sparse autoencoders (SAEs) architecture: encoder W_enc maps activation → high-dimensional sparse code (e.g., 4096 → 65536); ReLU + L1 sparsity penalty forces monosemantic features; decoder W_dec reconstructs activation; trained on cached activations from a target layer
Superposition: why neurons are polysemantic: more features than dimensions; the toy model of superposition (Elhage et al.); feature splitting at scale
Activation patching and causal tracing
TransformerLens and nnsight tooling
Anthropic's interpretability research: scaling monosemanticity
Lab: Use TransformerLens to find and analyze a simple circuit (e.g., indirect object identification) in a small model

17.3 Practical Interpretability for Applications 🟡🔬

Feature attribution: which input tokens matter most?
Integrated Gradients and SHAP for LLMs
⚙️ Representation engineering: steering model behavior via activation vectors
⚙️ Concept erasure and model editing (ROME, MEMIT)
⚙️ Interpretability for debugging: understanding model failures

17.4 Explaining Transformers 🟡🔬

Explaining transformer predictions: attribution methods tailored for attention-based models
Attention rollout and attention flow: propagating attention through layers
Gradient-weighted attention: combining gradient signals with attention weights
Layer-wise relevance propagation (LRP) for transformers
Perturbation-based explanations: token removal, token substitution, and occlusion
Comparing explanation methods: faithfulness, plausibility, and consistency metrics

Module 18

Embeddings, Vector Databases & Semantic Search

▼

Master the retrieval infrastructure that powers RAG systems.

18.1 Text Embedding Models & Training 🟡⚙️🔧

From word embeddings to sentence embeddings: CLS token, mean pooling, [EOS] pooling
Training sentence embeddings end-to-end: Sentence-BERT (SBERT) architecture with siamese/triplet networks; SimCSE (unsupervised: dropout as augmentation, supervised: NLI pairs); contrastive loss, triplet loss with margin, and multiple negatives ranking loss
Contrastive learning for embeddings: InfoNCE loss, in-batch negatives, temperature parameter
Training pipeline: hard negative mining strategies (BM25 negatives, cross-encoder mined negatives, in-batch hard negatives); positive pair construction (anchor, positive, negative)
Multi-stage training: pre-training on weak pairs → fine-tuning on curated pairs (E5, GTE approach)
Sentence-BERT, E5, GTE, Nomic Embed: architecture comparison
Matryoshka embeddings: training with multiple dimensionality loss terms for flexible truncation
Late interaction models: ColBERT architecture: per-token embeddings with MaxSim scoring
API embeddings: OpenAI, Cohere Embed v3, Voyage AI, Jina: pricing and dimension choices
MTEB benchmark internals: task categories, score aggregation, choosing the right model
🔬 Embedding space geometry: curse of dimensionality (distances concentrate in high-d), anisotropy problem (embeddings cluster in narrow cone), isotropy regularization techniques
🔬 Similarity pitfalls: why cosine similarity can mislead: hubness problem (some vectors are near-neighbors of many others), importance of normalization
Fine-tuning embeddings: Sentence-Transformers library, domain-specific training data strategies
Lab: Fine-tune an embedding model on domain-specific data with contrastive loss; visualize embedding space isotropy before/after; compare recall@k

18.2 Vector Index Data Structures & Algorithms 🔴⚙️🔧

Exact nearest neighbor: brute-force O(nd): why it doesn't scale
Similarity metrics internals: cosine (normalized dot product), dot product (magnitude-sensitive), L2 distance: when to use which
HNSW internals: hierarchical navigable small world graph: multi-layer skip-list structure; greedy search from top layer; construction: insert node, connect to M nearest neighbors per layer; parameters: M (connections), efConstruction (build quality), efSearch (query quality); O(log n) search time
IVF internals: inverted file index: k-means clustering of vectors into nlist partitions; at query time, probe nprobe nearest centroids; tradeoff: more probes = higher recall, slower search
Product Quantization (PQ): split d-dimensional vector into m subvectors; quantize each to 256 centroids (1 byte); compress 768-dim float32 (3KB) to m bytes; asymmetric distance computation for query
Composite indexes: IVF-PQ (cluster then compress), HNSW+PQ (graph with compressed storage), IVF-HNSW (HNSW as coarse quantizer)
ScaNN: anisotropic vector quantization for inner product search
Index build time, memory footprint, and recall-latency tradeoffs
Lab: Build HNSW and IVF-PQ indexes in FAISS; benchmark recall@10 vs. query latency vs. memory; tune M, efSearch, nprobe parameters

18.3 Vector Database Systems 🟡⚙️🔧

Vector DB architecture: write-ahead log, segment-based storage, background index building
Managed: Pinecone (serverless architecture, pod-based scaling), Weaviate (module system, hybrid search built-in)
Self-hosted: Qdrant (Rust, gRPC, segment architecture), Milvus (distributed, segment-sealed architecture), ChromaDB (lightweight, SQLite backend)
Embedded: FAISS (C++ with Python bindings), LanceDB (columnar format, zero-copy)
pgvector: vector search inside PostgreSQL: IVFFlat and HNSW index types, when to use vs. dedicated vector DB
Metadata filtering: pre-filter vs. post-filter strategies, payload indexes in Qdrant
Hybrid search internals: combining BM25 keyword scores with vector similarity via reciprocal rank fusion (RRF) or linear combination
Lab: Index the synthetic dataset in Qdrant and pgvector; implement hybrid search; compare latency, recall, and operational complexity

18.4 Document Processing & Chunking 🟡⚙️🔧

Document loaders: PDF, HTML, Markdown, DOCX, code
Chunking strategies: fixed-size, recursive, semantic, document-structure-aware
Overlap, parent-child chunking, sentence-window approach
Unstructured.io, LlamaParse, Docling for document parsing
RAG data pipeline engineering: scheduled ETL with orchestration tools (Airflow, Prefect, Dagster); document versioning and staleness detection (content hashing, last-modified tracking); incremental indexing via change-data-capture patterns; data lineage tracking (which source documents contributed to each answer)
Embedding model version migration: parallel indexes during transition, lazy re-embedding, gradual cutover when switching embedding models
Lab: Build a document ingestion pipeline: parse → chunk → embed → index

Module 19

Retrieval-Augmented Generation (RAG)

▼

Build production-quality RAG systems: from naive implementations to advanced architectures with re-ranking, query transformation, knowledge graphs, and deep research agents.

19.1 RAG Architecture & Fundamentals 🟡⚙️🔧

RAG system architecture: ingestion pipeline (parse → chunk → embed → index) + query pipeline (embed query → retrieve → rerank → augment prompt → generate)
Data flow: document store ↔ vector index ↔ retriever ↔ prompt builder ↔ LLM ↔ output parser
Naive RAG: single-stage retrieval with top-k context injection
Context window management: token budgeting, context ordering (lost-in-the-middle phenomenon), citation injection
When RAG beats fine-tuning (and vice versa): decision framework based on knowledge type, update frequency, and latency
Indexing strategies: full re-index vs. incremental; versioning documents; handling deletions and updates
Lab: Build a full RAG pipeline from scratch (no framework): chunker → embedder → FAISS index → retriever → prompt template → LLM; then rebuild with LangChain and compare

19.2 Advanced RAG Techniques 🔴⚙️🔧

Query transformation: HyDE, multi-query, step-back prompting
BM25 internals: TF saturation (k1 parameter), IDF smoothing (log((N-df+0.5)/(df+0.5))), document length normalization (b parameter): why it remains a strong baseline
Re-ranking with cross-encoders: architecture (BERT takes [query, SEP, document] as single input → relevance score); why cross-attention between query and doc tokens is more powerful than bi-encoder dot product; tradeoff: O(n×d) cost per query-doc pair vs. O(1) for bi-encoder
Cohere Rerank, ColBERT reranking, BGE-reranker: API and open-source options
🔬 Contextual retrieval: prepending LLM-generated context to chunks before embedding (Anthropic approach)
🔬 Corrective RAG (CRAG): LLM self-evaluates retrieval quality → triggers web search fallback if low confidence
🔬 Self-RAG: model learns special tokens to decide when to retrieve, what to cite, and whether output is supported
Fusion retrieval: BM25 + dense vectors combined via Reciprocal Rank Fusion (RRF: score = Σ 1/(k + rank)) or linear interpolation
Multi-modal RAG: images, tables, and charts
LLM-ready web ingestion: Firecrawl, Crawl4AI for converting web pages to clean markdown for RAG pipelines
Lab: Upgrade the basic RAG with HyDE, re-ranking, and multi-query; measure improvements with RAGAS

19.3 RAG with Knowledge Graphs 🔴⚙️

Knowledge graphs: entities, relations, triples (subject, predicate, object); RDF, OWL, and property graph models; construction from unstructured text using NER and relation extraction
Graph embeddings: TransE, TransR, DistMult, ComplEx; representing entities and relations as vectors for link prediction and knowledge base completion
GraphRAG: combining knowledge graph traversal with LLM generation; structured queries over graph databases (Neo4j, Amazon Neptune) to augment LLM context; Microsoft GraphRAG architecture with community detection and hierarchical summarization
LLM-powered knowledge graph construction: entity extraction, relationship mapping, entity resolution, coreference, and relation canonicalization

19.4 Deep Research & Agentic RAG 🔴⚙️🔧

Deep research pattern: multi-step autonomous web research
Query decomposition → parallel search → synthesis → follow-up
OpenAI Deep Research, Perplexity, Google Deep Research paradigms
Iterative refinement: search → read → evaluate → search again
Source credibility assessment and citation verification
Combining web search, document retrieval, and database queries
Lab: Build a deep research agent that autonomously researches a topic across multiple sources, synthesizes findings, and produces a cited report

19.5 Structured Data & Text-to-SQL 🟡⚙️🔧

LLM on tabular data: serialization strategies (row-by-row, markdown tables, JSON); table understanding and reasoning; LLM-based feature engineering for structured data; comparison with XGBoost/LightGBM on tabular benchmarks
Text-to-SQL in depth: translating natural language to database queries; schema linking and column selection; multi-table joins and complex aggregation; error correction via execution feedback; schema representation and context injection strategies
Benchmarks: Spider, Bird, WikiSQL
Table understanding: reading CSVs, spreadsheets, and structured documents
Combining structured (SQL) and unstructured (vector) retrieval
Lab: Build a natural-language-to-SQL interface over a sample database; chain with RAG for hybrid answers

19.6 RAG Frameworks & Orchestration 🟡⚙️🔧

LangChain: chains, retrievers, memory, LCEL
LlamaIndex: index types, query engines, routers
Haystack by deepset
Lab: Implement the same RAG pipeline in LangChain and LlamaIndex; compare developer experience

Project Milestone: Build a production-grade RAG system with hybrid search, re-ranking, text-to-SQL, and citation tracking. Evaluate using RAGAS metrics.

Module 20

Building Conversational AI Systems

▼

Design and implement robust conversational AI: from simple chatbots to complex multi-turn dialogue systems with state management, memory, personas, and personality.

20.1 Dialogue System Architecture 🟡⚙️

Types: task-oriented, open-domain, hybrid
Dialogue state tracking and slot filling
Turn management and context handling
System prompts as behavioral specification

20.2 Personas, Companionship & Creative Writing 🟡⚙️

Persona design: personality, tone, brand voice, backstory
AI companionship: Character.AI patterns, emotional engagement
AI creative writing assistants: ideation, co-writing, style transfer
Consistency challenges: maintaining persona over long conversations
Ethical considerations of parasocial AI relationships

20.3 Memory & Context Management 🟡⚙️🔧

Short-term memory: conversation buffer, sliding window
Long-term memory: summarization, vector store, entity extraction
🔬 MemGPT / Letta architecture in depth: virtual context management with a hierarchical memory system; main context (working memory) vs. archival storage (long-term) vs. recall storage (conversation search); self-directed memory operations (push/pop/search); OS-inspired paging between memory tiers
Session persistence and user profiles
Lab: Build a chatbot with both short-term and long-term memory using a vector store

20.4 Multi-Turn Dialogue & Conversation Flows 🟡⚙️🔧

Handling clarifications, corrections, and topic switches
Guided conversations: form-filling, onboarding, intake flows
Fallback strategies and graceful degradation
Human handoff: when and how to escalate
Runtime context window overflow: when assembled prompt (system + history + retrieved docs + user query) exceeds context limit; priority-based content eviction (trim oldest conversation turns first, then reduce retrieved chunks, never trim system prompt); dynamic context budgeting: allocate percentages (system 10%, history 30%, retrieval 40%, generation 20%); truncation strategies: sentence-boundary truncation, summarize-then-truncate for conversation history
Lab: Build a customer support bot with guided flows, RAG-backed knowledge, and human handoff triggers

20.5 Voice & Multimodal Interfaces 🔴⚙️

Speech-to-text: Whisper, Deepgram, AssemblyAI
Text-to-speech: ElevenLabs, PlayHT, Cartesia
Real-time voice AI: LiveKit, Vapi, Pipecat
Vision in conversations: processing images and screenshots

Module 21

AI Agents: Tool Use, Planning & Reasoning

▼

Build autonomous AI agents that reason, plan, use tools, and take actions. Covers the four core agentic patterns: reflection, tool use, planning, and multi-agent collaboration.

21.1 Foundations of AI Agents 🟡⚙️

What is an agent: perception → reasoning → action loop: formal definition vs. practical usage
Agent vs. chain vs. workflow: definitions, tradeoffs, and decision criteria
The four agentic design patterns (Ng framework): Reflection, Tool Use, Planning, Multi-Agent
ReAct pattern internals: interleaved Thought/Action/Observation traces in the prompt: how reasoning tokens guide tool selection
Agent state machine: states (thinking, tool_calling, waiting_for_result, responding), transitions, termination conditions
🔬 Cognitive architectures: System 1 (fast, single-pass) vs. System 2 (deliberate, multi-step) agent designs
Agent memory data structures: conversation buffer (deque), episodic memory (vector store + metadata), working memory (structured state dict), semantic memory (knowledge graph)
Token budget management: context window allocation across system prompt, memory, retrieved docs, conversation history, and generation

21.2 Tool Use & Function Calling 🟡⚙️🔧

OpenAI function calling / tool use
Anthropic tool use (Claude)
Designing effective tool schemas
Tool result handling and multi-step tool use
MCP (Model Context Protocol): Anthropic's open standard for tool integration: servers, resources, prompts
A2A (Agent-to-Agent Protocol): Google's protocol for inter-agent communication
Building custom tools: APIs, databases, file systems, code execution
Browser automation agents: Browser Use (Python, 50K+ stars, turns any LLM into a browser agent), Stagehand (TypeScript SDK with act/extract/observe primitives)
LLM-ready web scraping: Firecrawl (API converting websites to clean markdown for LLM consumption), Crawl4AI (open-source alternative, 58K+ stars)
🔬 Native tool use training: how frontier models (GPT-4, Claude, Gemini) are trained with tool-calling in the training data, not just prompted; the Toolformer approach (self-supervised tool-use annotation); training data format: interleaved text and tool calls with execution results; reward shaping for tool selection accuracy and efficiency; fine-tuning for domain-specific tools; the gap between prompted tool use and natively trained tool use (reliability, latency, hallucinated calls)
🔬 Agentic training at scale: DeepSeek V3.2 trained on 85,000+ agentic tasks spanning web search, coding, file operations, and multi-step tool use; represents a shift from "tool-use as prompting" to "tool-use as a core training objective"; enables reliable multi-step tool chains without explicit orchestration
Lab: Build an agent with 5+ tools (web search, calculator, database, file I/O, API calls)

21.3 Planning & Agentic Reasoning 🔴⚙️🔧

Plan-and-execute: upfront planning with iterative execution
Agentic reflection loops: detect failure → diagnose → retry with different strategy
LATS (Language Agent Tree Search): Monte Carlo tree search for agents
LLM Compiler: parallel function calling
Human-in-the-loop: when to ask for help
Lab: Implement a plan-and-execute agent that breaks down complex tasks, executes steps, and self-corrects

21.4 Code Generation & Execution Agents 🔴⚙️🔧

Code interpreters: sandboxed execution (E2B, Modal)
Data analysis agents: natural language to pandas/SQL
Code generation and self-debugging patterns
Software engineering agents: Devin-style coding assistants
Security: sandboxing, permission models, resource limits
Lab: Build a data analysis agent that writes and executes Python code in a sandbox

Module 22

Multi-Agent Systems & Orchestration

▼

Scale from single agents to multi-agent architectures. Learn modern agent frameworks, orchestration patterns, and how to build complex systems where multiple agents collaborate.

22.1 Agent Frameworks 🟡⚙️🔧

LangGraph internals: directed graph of nodes (functions) and edges (routing); TypedDict state channels passed between nodes; conditional edges for branching; checkpoint serialization (SQLite/Postgres) for pause/resume; built-in persistence for conversation threads
CrewAI in depth: role-based multi-agent collaboration; agent definition (role, goal, backstory, tools); task objects with expected outputs; sequential and hierarchical process types; delegation and inter-agent communication patterns
AutoGen / AG2 in depth: conversational multi-agent patterns; AssistantAgent, UserProxyAgent, GroupChat, and GroupChatManager; code execution in Docker sandboxes; human-in-the-loop integration; conversation termination strategies; multi-agent debate and reflection patterns
OpenAI Agents SDK
Anthropic Claude Agent SDK
Smolagents (Hugging Face): lightweight agent framework
PydanticAI: type-safe agent development
Google ADK (Agent Development Kit): multi-agent orchestration
Lab: Build the same agent in LangGraph, CrewAI, and native SDK; compare patterns

22.2 Multi-Agent Architecture Patterns 🔴⚙️

Supervisor pattern: orchestrator delegates to specialists
Debate pattern: agents argue for better answers
Pipeline pattern: sequential processing stages
Hierarchical agents: manager → workers
Shared memory and message passing between agents
🔬 Conformity effects in multi-agent LLM systems (ICLR 2025): agents tend to converge on similar outputs (groupthink); factors: model homogeneity, communication structure, majority influence; mitigation: diverse model mixtures, structured debate protocols, devil's advocate roles, independent reasoning before consensus

22.3 Agentic Workflows & Pipelines 🔴⚙️🔧

Workflow engines: LangGraph state machines, Temporal for durable execution
Conditional branching, loops, and parallel execution
Error handling, retries, and compensation logic
Checkpointing and resumability
Streaming intermediate results to users
Lab: Build a multi-agent research system: Planner → Researcher → Writer → Reviewer with human-in-the-loop approval

Project Milestone: Build the full conversational AI agent combining fine-tuned model, RAG, tools, deep research, multi-step planning, reflection, and memory.

Module 23

Multimodal Generation

▼

Extend LLMs beyond text into image, audio, video, and 3D generation. Understand the architectures behind the most impactful generative AI systems.

23.1 Image Generation & Vision-Language Models 🟡⚙️🔧

Diffusion models: DDPM fundamentals, denoising process, latent diffusion (Stable Diffusion architecture)
Flow matching: rectified flows, Flux architecture: the post-diffusion paradigm
Stable Diffusion 3/XL, DALL-E 3, Midjourney, Imagen 3: architecture comparison
Image editing: inpainting, outpainting, ControlNet, IP-Adapter, reference-based generation
Vision Transformer (ViT): patch-based image tokenization, position embeddings for 2D, classification with [CLS] token; comparison with CNNs on data efficiency and scaling
CLIP: contrastive language-image pre-training; dual encoder architecture (image encoder + text encoder); InfoNCE contrastive loss over batch of image-text pairs; zero-shot image classification via text prompts; CLIP as a backbone for downstream vision tasks
BLIP / BLIP-2: bootstrapping language-image pre-training; image captioning, visual QA, and image-text retrieval; Q-Former architecture bridging frozen image encoder and frozen LLM; three-stage pre-training strategy
Vision-language understanding: GPT-4V, LLaVA, Qwen-VL, PaliGemma: visual encoder + LLM fusion
Gemini-style native multimodal: interleaved image-text generation
Lab: Build a product image generation pipeline with Stable Diffusion + ControlNet; integrate GPT-4V for quality assessment

23.2 Audio, Music & Video Generation 🔴🔬

Text-to-speech: VITS, Bark, F5-TTS: modern zero-shot voice synthesis
Voice cloning and voice design: speaker embeddings, voice conversion
Real-time conversational audio: GPT-4o native audio, Moshi (Kyutai)
Music generation: MusicLM, Suno, Udio, Stable Audio: commercial applications
Text-to-video: Sora (DiT architecture), Runway Gen-3, Kling 2, Veo 2: architectures and limitations
3D generation: text-to-3D, image-to-3D: emerging approaches
Multimodal composition pipelines: chaining text → image → video → audio

23.3 Document Understanding & OCR 🟡🔬

TrOCR: transformer-based optical character recognition; encoder-decoder architecture with ViT encoder and text decoder; pre-training on synthetic data, fine-tuning on handwritten and printed text
LayoutLM / LayoutLMv2 / LayoutLMv3: pre-training for document understanding; jointly modeling text, layout (2D position), and image; document classification, key-value extraction, and table detection; applications in invoice processing, form understanding, and receipt parsing
Document AI pipeline: OCR → layout analysis → entity extraction → structured output
Comparison of document understanding approaches: LayoutLM family vs. multimodal LLMs (GPT-4V) vs. specialized OCR pipelines

Module 24

LLM Applications: Vibe-Coding, Finance, Healthcare & Beyond

▼

Survey the most impactful real-world applications of LLMs across industries. For each domain, understand the architecture patterns, unique challenges, risks, and the current state of the art.

24.1 Vibe-Coding & AI-Assisted Software Engineering 🟡⚙️🔧

The "vibe-coding" paradigm: building software via natural language intent rather than manual code
Code completion engines: Copilot (Codex/GPT-4), Cursor (multi-file context), Cline, Windsurf: how they work under the hood
Fill-in-the-middle (FIM) architecture for inline code completion: prefix/suffix/middle prompting
Agentic coding: Claude Code, Devin, OpenHands, SWE-Agent: autonomous multi-file editing, test-driven development loops
Code generation from specs: natural language → working application (Bolt, v0, Lovable, Replit Agent)
SWE-bench: evaluating coding agents on real GitHub issues
Context engineering for code: repo maps, AST parsing, dependency graphs, file ranking
Risks: hallucinated APIs, security vulnerabilities in generated code, over-reliance, licensing concerns
Impact on software engineering: productivity data, skill shifts, junior vs. senior developer effects
Lab: Build a mini "vibe-coding" agent that takes a feature description, generates code, writes tests, runs them, and iterates until passing: using tool use and reflection

24.2 LLMs in Finance & Trading 🟡⚙️🔧

Financial NLP: sentiment analysis on earnings calls, SEC filings, news, social media
FinGPT, BloombergGPT: domain-specific financial models: training data and architecture
Automated report generation: earnings summaries, market research, risk assessments
Trading signal extraction: event detection, entity recognition in financial text
LLM-powered financial advisors: robo-advisory with conversational interface
Regulatory compliance: automated KYC/AML text analysis, regulatory change monitoring
Fraud detection: anomaly detection in transaction narratives
Risks: hallucinated financial data, market manipulation potential, regulatory concerns (SEC, FINRA)
Lab: Build a financial news sentiment analyzer with RAG over SEC filings; generate an automated earnings summary

24.3 Healthcare & Biomedical AI 🟡⚙️

Medical LLMs: Med-PaLM 2, BioMistral, Meditron: training on clinical corpora
Clinical NLP: ICD coding, clinical note summarization, patient intake automation
Medical Q&A and differential diagnosis assistance
Drug discovery: molecular generation, property prediction, literature mining
Protein and genomics: AlphaFold, ESM, DNA language models
Radiology and pathology: multimodal models for medical imaging
Mental health applications: therapy chatbots, crisis detection, ethical boundaries
Regulatory: HIPAA compliance, FDA software-as-medical-device (SaMD), CE marking
Safety-critical considerations: hallucination risk in medical advice, liability, clinician-in-the-loop requirements

24.4 LLM-Powered Recommendation & Search 🟡⚙️🔧

LLMs as recommendation engines: replacing/augmenting collaborative filtering with semantic understanding
Conversational recommendation: dialogue-driven product/content discovery
LLM-powered search: from keyword matching to semantic understanding (Perplexity, Google AI Overviews, SearchGPT)
User preference modeling: extracting interests from natural language interactions
Cold start solution: LLMs for zero-shot recommendation via item description understanding
E-commerce: product description generation, review summarization, personalized shopping assistants
Content recommendation: news, video, music: LLM-based content understanding and matching
Evaluation: beyond click-through: measuring recommendation quality with LLM judges
Lab: Build a conversational movie recommender using LLM + embedding-based retrieval; compare with traditional collaborative filtering

24.5 Cybersecurity & LLMs 🟡⚙️🔧

Defensive applications: threat intelligence summarization, log analysis, anomaly explanation
Vulnerability detection: LLM-powered static analysis, code audit automation
Phishing and social engineering: LLM-generated attacks and LLM-based detection
Malware analysis: binary reverse engineering assistance, decompiled code explanation
Security Operations Center (SOC) automation: alert triage, incident summarization, playbook generation
CTF and penetration testing: LLM agents for automated security testing (authorized contexts only)
Adversarial uses: deepfake text, automated disinformation, voice cloning for fraud
Defense: AI-generated content detection, watermarking, provenance tracking
Lab: Build a log analysis agent that ingests security logs, detects anomalies, explains findings, and suggests remediation

24.6 Education, Legal & Creative Industries 🟡⚙️

Education: AI tutoring (Khanmigo, Duolingo Max), personalized learning paths, automated grading, Socratic dialogue
Legal: contract analysis, case law research, legal document drafting, e-discovery automation
Creative writing & content: AI co-writing tools, screenplay generation, marketing copy, localization
Customer support: automated ticket resolution, sentiment-aware routing, knowledge base generation
Enterprise search & knowledge management: Glean, internal chatbots over corporate documents
Gaming: NPC dialogue generation, dynamic storylines, procedural quest design
Real estate, HR, insurance: industry-specific applications and their LLM architectures

24.7 Robotics, Embodied AI & Scientific Discovery 🔴⚙️

LLMs as robot planners: translating natural language goals into action sequences
SayCan, RT-2, PaLM-E: grounding language in physical actions and observations
Web automation agents: browser control, form filling, UI testing (WebArena, Anthropic computer use)
OS-level agents: desktop automation, multi-application workflows
AI for mathematics: formal reasoning, theorem proving (Lean, AlphaProof, DeepSeek-Prover)
Scientific literature: automated meta-analysis, hypothesis generation, experiment design
Materials science, chemistry: molecular property prediction, retrosynthesis planning
Domain adaptation strategies: when to fine-tune vs. RAG vs. prompt for each vertical

Module 25

Evaluation, Experiment Design & Observability

▼

You can't improve what you can't measure. Learn systematic approaches to evaluating LLM outputs, designing rigorous experiments, testing agent behavior, and monitoring production systems.

25.1 LLM Evaluation Fundamentals 🟡⚙️🔧

Information-theoretic foundations: cross-entropy loss = -E[log p(x)]; perplexity = 2^H(p) = exp(loss): why perplexity is the standard LLM metric; bits-per-byte (BPB) for tokenizer-agnostic comparison
Classical NLP metrics: BLEU (n-gram precision + brevity penalty), ROUGE (recall-oriented), METEOR (alignment-based), BERTScore (contextual embedding similarity)
LLM-as-Judge: using models to evaluate models: pairwise comparison, pointwise scoring, reference-free grading; position bias and self-preference bias
Human evaluation: inter-annotator agreement (Cohen's κ, Fleiss' κ), ranking (ELO/Bradley-Terry from pairwise comparisons, as in Chatbot Arena), Likert scales
Task-specific metrics: accuracy, F1, pass@k for code (unbiased estimator from n samples)
Benchmarks: MMLU, HumanEval, MT-Bench, AlpacaEval, Chatbot Arena (crowdsourced ELO), GPQA, MATH, ARC: what each measures and their limitations
Lab: Evaluate the fine-tuned model on a custom benchmark; compare with base model and API models

25.2 Experimental Design & Statistical Rigor 🔴⚙️🔧

Statistical significance testing for LLM comparisons: bootstrap, paired tests
Confidence intervals and effect sizes
Controlling for randomness: seed management, temperature=0 vs. sampling
Ablation study design: isolating the impact of each component
Common pitfalls: data contamination, benchmark gaming, cherry-picking
🔬 Benchmark contamination detection: methods for identifying when test data leaked into training; n-gram overlap analysis between training corpus and benchmark; membership inference (model confidence on seen vs. unseen examples); canary string insertion (embed unique strings in data, check if model memorizes them); perturbation-based detection (rephrase questions, check if accuracy drops); the scale of the problem: many popular benchmarks are partially contaminated in frontier models
Reproducibility: documenting hyperparameters, data versions, compute
Lab: Design and execute a rigorous ablation study comparing RAG strategies with proper statistical analysis

25.3 RAG & Agent Evaluation 🟡⚙️🔧

RAG metrics: RAGAS (faithfulness, answer relevancy, context precision/recall)
Agent evaluation: task completion, tool accuracy, efficiency
Trajectory evaluation: evaluating the path, not just the outcome
Evaluation frameworks: DeepEval, Ragas, Phoenix
Lab: Run RAGAS evaluation on the project RAG system; evaluate agent trajectories

25.4 Testing LLM Applications 🟡⚙️

Unit testing with mocked LLM responses
Integration testing with real models
Regression testing: detecting quality degradation
Red teaming and adversarial testing
Prompt injection testing
CI/CD integration for LLM evaluations
Testing non-deterministic LLM outputs: assertion-based testing (check for required elements, not exact match); embedding-similarity thresholds for output validation; property-based testing (output always valid JSON, always contains required fields); LLM-judge-in-CI (automated quality gates)
Golden-file/snapshot testing with drift alerting; contract testing for tool call schemas; load testing LLM endpoints (Locust, k6, GuideLLM); promptfoo for prompt regression testing in CI/CD
CI/CD pipeline design: run evals on PR, compare to baseline, gate deployment on quality thresholds

25.5 Observability & Tracing 🟡⚙️🔧

LLM tracing: capturing the full chain of calls
LangSmith: tracing, evaluation, and prompt management
Langfuse: open-source LLM observability
Phoenix by Arize: traces, evals, and debugging
LangWatch (unified observability + evaluations + prompt optimization), TruLens (RAG-focused evaluation: faithfulness, relevance, groundedness feedback functions)
Logging: prompt/completion pairs, latency, token usage, costs
Alerting on quality degradation and anomalies
Lab: Instrument the project agent with Langfuse tracing; build a monitoring dashboard

25.6 LLM-Specific Monitoring & Drift Detection 🔴⚙️🔧

Prompt drift: how evolving user behavior degrades prompt effectiveness over time
Provider version drift: detecting quality changes when OpenAI/Anthropic silently update models
Embedding drift in RAG: documents change, embeddings become stale: re-indexing strategies
Output quality monitoring: automated LLM-judge scoring on production traffic samples, statistical process control charts
Data quality monitoring for LLM pipelines: detecting stale/corrupted documents in knowledge bases, schema validation for structured LLM outputs (Great Expectations, Soda)
Retraining and re-tuning triggers: when production data signals fine-tuning refresh
Lab: Build a monitoring pipeline that detects embedding drift and output quality degradation; set up automated alerts and re-indexing triggers

25.7 LLM Experiment Reproducibility 🟡⚙️

The reproducibility challenge: stochastic LLM outputs, provider API changes, non-deterministic retrieval
Versioning the full stack: prompt templates + retrieval config + model version + system prompt: as a single reproducible artifact
Seed management: temperature=0 vs. sampling, provider-specific determinism options
Configuration management: Hydra, OmegaConf, or YAML configs for LLM pipeline parameters
Dataset versioning: DVC for tracking training data, evaluation sets, and RAG corpora
LLMOps within broader MLOps: unified platforms (MLflow, W&B) that track both classical ML experiments and LLM pipeline runs
Environment reproducibility: Docker for LLM serving, pinning library versions, model snapshot management

Module 26

Production Deployment, Safety & Ethics

▼

Take LLM applications from notebook to production. Cover deployment, scaling, security hardening, and the ethical and regulatory frameworks for responsible AI systems.

26.1 Application Architecture & Deployment 🟡⚙️🔧

Backend frameworks: FastAPI, LitServe for LLM APIs
Streaming responses: SSE, WebSockets
Containerization: Docker, Docker Compose
Cloud deployment: AWS (Bedrock, SageMaker), GCP (Vertex AI), Azure
Serverless: Modal, Replicate, Hugging Face Inference Endpoints
Lab: Deploy the project agent as a FastAPI service with Docker; add streaming and health checks

26.2 Frontend & User Interfaces 🟡⚙️🔧

Gradio: rapid prototyping for AI demos
Streamlit: interactive dashboards with LLM integration
Chainlit: production chat interfaces
Open WebUI: self-hosted ChatGPT-like interface
Vercel AI SDK for Next.js applications
Lab: Build a polished chat interface for the project agent using Chainlit

26.3 Scaling, Performance & Production Guardrails 🔴⚙️

Production latency optimization: streaming responses, request batching, queue management (for model-level optimization see Module 8; for cost-performance tradeoffs see Module 11.4)
Rate limiting, queuing, and backpressure patterns
Auto-scaling strategies for LLM workloads: GPU provisioning, serverless inference
Guardrails: NeMo Guardrails, Guardrails AI, Lakera: input/output filtering in production
Open-source safety classifiers: Llama Guard 3/4 (Meta's content safety model for input/output moderation), Prompt Guard (dedicated prompt injection detector), ShieldGemma (Google's safety classifier)

26.4 LLMOps & Continuous Improvement 🔴⚙️

Prompt versioning and management
A/B testing LLM configurations
Online evaluation and feedback loops
Data flywheels: production data → fine-tuning → improved model
Model registry and artifact management

26.5 LLM Security Threats 🟡⚙️🔧

OWASP Top 10 for LLM Applications
Prompt injection: direct and indirect attacks
Jailbreaking techniques and defenses
Prompt injection defense implementation: input sanitization (strip special tokens, detect injection patterns); sandwich defense (user input between system instructions); delimiter hardening (XML tags, random delimiters); output scanning (detect leaked system prompts, PII in responses using Presidio/regex); LLM-as-judge for injection detection (separate classifier model); runtime PII redaction before user-facing output; API key management: secrets managers (Vault, AWS Secrets Manager), key rotation, per-user API key proxying
Data leakage: training data extraction, PII exposure
Supply chain risks: model poisoning, backdoors
🔬 Formal verification of LLM behavior: applying formal methods to neural networks; certified robustness for NLP (provable bounds on adversarial perturbation sensitivity); abstract interpretation for transformers; verification of safety properties (can we prove a model never generates certain outputs?); current limitations: scalability to billion-parameter models; connection to constitutional AI and runtime guardrails
Lab: Red-team the project agent; implement prompt injection defenses and input/output guardrails

26.6 Hallucination & Reliability 🟡⚙️

Types of hallucination: factual, faithfulness, instruction
Detection: self-consistency, citation verification, NLI
Mitigation: RAG, constrained generation, confidence calibration
When to say "I don't know": abstention and uncertainty

26.7 Bias, Fairness & Ethics 🟡⚙️

Sources of bias in LLMs: training data, RLHF, prompting
Measuring bias: benchmarks, disparate impact, representation
Responsible AI frameworks and documentation (model cards, datasheets)
Environmental impact of LLM training and inference

26.8 Regulation & Compliance 🟡⚙️

EU AI Act: risk classification and requirements
GDPR implications for LLM systems
US executive orders and state-level AI legislation
Industry-specific: healthcare (HIPAA), finance, education
AI governance: policies, auditing, and transparency

26.9 LLM Risk Governance & Audit 🔴⚙️

Enterprise LLM model inventory: cataloging all deployed LLMs, their use cases, risk levels, and owners
Model risk classification: materiality assessment: which LLM decisions require human oversight?
Regulatory model validation frameworks: SR 11-7 (banking), NIST AI RMF, ISO 42001: applied to LLM systems
Audit trails for LLM decisions: logging inputs, outputs, retrieval context, and model versions for compliance
Model lifecycle management: approval gates for deployment, periodic review cadence, decommissioning
Third-party model risk: governance for API-based models where you don't control the weights

26.10 LLM Licensing, IP & Privacy 🟡⚙️

Model license taxonomy: truly open (Apache 2.0, MIT), restricted open (Llama Community License, Gemma Terms), proprietary API
Commercial use restrictions: which models can you deploy in production? Fine-tuning and distillation clauses
IP ownership: who owns fine-tuned weights? Outputs generated by the model? Synthetic training data?
Training data copyright: NYT v. OpenAI implications, opt-out mechanisms (robots.txt, do-not-train headers)
LLM-powered anonymization: using LLMs for PII detection and masking in data pipelines
Differential privacy for synthetic data: formal privacy guarantees when generating training datasets
Privacy-preserving fine-tuning: federated learning approaches, on-device adaptation

26.11 Machine Unlearning 🔴⚙️

🔬 Machine unlearning: methods to remove specific knowledge from trained models
Motivations: GDPR right-to-be-forgotten, copyright removal, safety (removing hazardous knowledge)
Gradient ascent unlearning (maximize loss on target data)
LOKA for continual unlearning without full retraining
Evaluation: how to verify knowledge was truly removed vs. merely suppressed
The fundamental tension: unlearning specific facts while preserving general capabilities
Current limitations: existing benchmarks may be inadequate (CMU 2025)

Module 27

LLM Strategy, Product Management & ROI

▼

The business and organizational layer that turns LLM technology into business value. Covers strategy, product thinking, ROI measurement, vendor evaluation, and compute planning. Addresses critical gaps from the Head of AI and Head of Data Science perspectives.

27.1 LLM Strategy & Use Case Prioritization 🟡⚙️

Assessing organizational AI readiness: data maturity, engineering capability, cultural factors
Use case identification: mapping business processes to LLM capabilities (generation, extraction, classification, reasoning, conversation)
Prioritization frameworks: impact × feasibility matrix, time-to-value, risk-adjusted scoring
Building the business case: from proof-of-concept → pilot → production: stage gates and success criteria
Common failure modes: "solution looking for a problem," over-scoping, underestimating data needs
AI roadmap construction: sequencing LLM initiatives over 6-18 months

27.2 LLM Product Management 🟡⚙️

Translating business problems into LLM requirements: what does "make our customer support better" actually mean?
Defining success metrics beyond model accuracy: CSAT, resolution rate, deflection rate, time-to-resolution, user adoption
Managing the hallucination risk in product context: what is the acceptable error rate? What are the consequences of wrong answers?
User experience design for LLM products: setting expectations, showing confidence, graceful failure
Iterative delivery: ship a prompt-based MVP → add RAG → fine-tune → add agents: incremental value at each step
Stakeholder communication: explaining LLM capabilities and limitations to non-technical executives
Managing user trust: transparency about AI-generated content, appropriate disclaimers

27.3 ROI Measurement & Value Attribution 🔴⚙️🔧

LLM ROI framework: cost savings (automation) + revenue impact (new capabilities) + productivity gains (augmentation)
Measuring coding assistant ROI: developer velocity metrics, PR throughput, time-to-merge, code quality indicators
Measuring customer support automation ROI: ticket deflection rate, average handle time reduction, CSAT impact
Measuring knowledge worker productivity: time studies, task completion rates, output quality
Attribution challenges: isolating LLM impact from other factors, A/B testing LLM features
Common pitfalls: vanity metrics (tokens generated), counting cost savings without quality checks, ignoring maintenance costs
Lab: Build an ROI model for the project's conversational AI agent: estimate cost savings, productivity gains, and payback period

27.4 LLM Vendor Evaluation & Build vs. Buy 🟡⚙️

LLM provider evaluation: model quality, pricing (per-token, per-seat, committed), SLAs, data privacy guarantees, fine-tuning support
Vector database vendor evaluation: managed vs. self-hosted, scaling characteristics, hybrid search support, pricing models
Agent framework evaluation: maturity, community, production readiness, lock-in risk
Vendor platform solutions (Glean, Moveworks, Cohere Enterprise, AWS Bedrock Agents) vs. building in-house
Build vs. buy decision tree: control needs, customization depth, team capability, time-to-market, total cost
Procurement considerations: enterprise agreements, data processing agreements, exit clauses

27.5 LLM Compute Planning & Infrastructure 🔴⚙️

Compute budgeting: modeling costs for training runs (GPU-hours × price) and inference (tokens/day × cost/token)
Cloud strategy: on-demand vs. reserved instances vs. spot for training; GPU selection (A100, H100, L40S) for different workloads
Self-hosted vs. API: breakeven analysis: at what volume does self-hosting become cheaper?
Inference infrastructure planning: estimating peak QPS, provisioning GPUs, auto-scaling strategies
Multi-cloud and hybrid: running training on one cloud, inference on another, RAG on-premises
Capacity planning: forecasting compute needs as usage grows: tokens/day projections, seasonal patterns

Capstone

Final Project: End-to-End Conversational AI Agent

▼

Integrate everything from the course into a complete, deployable conversational AI system built on synthetic data: demonstrating mastery of the full LLM application stack.

C.1 Project Requirements 🎯

Synthetically generated training dataset (10K+ examples) with quality metrics
Fine-tuned model (QLoRA) with evaluation against baseline
Optional: knowledge distillation or model merging for optimized variant
RAG system with hybrid search, re-ranking, and text-to-SQL over a domain knowledge base
Agent with tool use (3+ tools), planning, reflection, memory, and self-correction
Deep research capability for multi-step information gathering
Production deployment with API, chat UI, and observability
Security hardening: prompt injection defenses, input/output guardrails
Evaluation suite with statistical rigor: automated tests, human evaluation, ablation study
Hybrid architecture: classical ML triage + LLM for complex cases, with cost-performance analysis
ROI analysis: business case with TCO, productivity gains, and payback period
LLM risk governance documentation: model card, audit trail, licensing compliance

C.2 Deliverables 🎯

GitHub repository with clean code, documentation, and CI/CD
Hugging Face Hub: fine-tuned model adapter and synthetic dataset
Technical report: architecture decisions, ablation study, evaluation results with confidence intervals
Interpretability analysis: attention visualization or feature analysis of key behaviors
Live demo: deployed application with monitoring dashboard
Presentation: 15-minute project walkthrough

Appendix A

Mathematical Foundations

▼

Reference appendix covering the essential mathematical background for understanding LLMs: linear algebra, calculus, probability, information theory, and optimization.

A.1 Linear Algebra Review 🟢📐

Vectors, matrices, dot products, matrix multiplication
Eigenvalues and eigenvectors

A.2 Calculus Essentials 🟢📐

Derivatives, chain rule, partial derivatives
Gradients and Jacobian matrices

A.3 Probability & Statistics 🟢📐

Bayes' theorem
Distributions: Gaussian, categorical, Bernoulli
Expectation and variance

A.4 Information Theory 🟡📐

Entropy, cross-entropy, KL divergence
Mutual information and perplexity
Derivations and intuition for each concept

A.5 Optimization Theory 🟡📐

Convexity and gradient descent convergence
Learning rate schedules and saddle points

Appendix B

Machine Learning Essentials

▼

Core machine learning concepts that underpin LLM training and evaluation: learning paradigms, loss functions, training pipelines, evaluation metrics, and classical algorithms.

B.1 Learning Paradigms & Loss Functions 🟢📐

Supervised, unsupervised, and reinforcement learning taxonomy
Loss functions: MSE, cross-entropy, hinge loss

B.2 Training Pipeline & Evaluation 🟢📐

Train/val/test splits, overfitting, underfitting, regularization (L1, L2, dropout)
Evaluation metrics: accuracy, precision, recall, F1, ROC-AUC, confusion matrix

B.3 Reinforcement Learning Foundations 🟡📐

Agent, environment, state, action, reward, policy, value function
Policy gradient theorem and PPO intuition

B.4 Classical Algorithms & Feature Engineering 🟢📐

Classical algorithms overview: logistic regression, decision trees, random forests, XGBoost, k-means, PCA
Feature engineering and selection basics

Appendix C

Python for LLM Development

▼

Python tooling and practices essential for LLM development: environment management, key libraries, async programming, type safety, and debugging.

C.1 Environment & Package Management 🟢⚙️

Virtual environments: venv, conda, uv
Package management: pip, requirements.txt, pyproject.toml
Jupyter notebooks and Google Colab workflow

C.2 Essential Libraries & Async Programming 🟢⚙️

Essential libraries: numpy, pandas, matplotlib, seaborn
Async programming: asyncio, aiohttp for parallel API calls

C.3 Type Safety, Data Handling & Debugging 🟡⚙️

Type hints, dataclasses, and Pydantic models
Working with JSON, YAML, and configuration files
Debugging and profiling tools

Appendix D

Environment Setup & Cloud Provisioning

▼

Step-by-step guides for setting up local and cloud development environments for LLM work: GPU setup, model serving, API keys, cloud instances, and containerization.

D.1 Local Setup & Model Serving 🟢⚙️🔧

Local setup: Python 3.10+, CUDA toolkit, PyTorch with GPU support
Local model serving: Ollama installation and usage, llama.cpp setup
Hugging Face: CLI installation, token setup, model downloads, cache management
API key setup: OpenAI, Anthropic, Google AI Studio

D.2 Cloud GPU & Serverless Options 🟡⚙️

Cloud GPU instances: AWS (p4d/p5), GCP (A100/H100), Azure (ND series); spot vs. reserved pricing
Serverless GPU: Modal, RunPod, Lambda Labs, Google Colab Pro

D.3 Docker & Remote Access 🟡⚙️🔧

Docker basics: containerizing LLM applications, GPU passthrough with NVIDIA Container Toolkit
SSH tunneling to remote GPU machines

Appendix E

Git & Collaboration for ML Projects

▼

Version control and collaboration practices tailored for machine learning projects: experiment tracking, data versioning, and notebook management.

E.1 Git for ML Experiments 🟢⚙️🔧

Git essentials for experiment tracking
Branching strategies for ML experiments
.gitignore patterns for ML projects (checkpoints, datasets, cache)

E.2 Data Version Control & Experiment Tracking 🟡⚙️

DVC (Data Version Control) for large files and datasets
Experiment tracking integration: W&B, MLflow
Notebook version control best practices (nbstripout, Jupytext)

Appendix F

Glossary of Terms

▼

Comprehensive alphabetical glossary of 300+ technical terms used throughout the course. Each entry includes a concise definition and a reference to the module where it is first introduced.

F.1 Full Glossary 🟢📐

Key terms: attention, autoregressive, BERT, BPE, chain-of-thought, contrastive learning, cross-entropy, DPO, embedding, fine-tuning, GQA, hallucination, in-context learning, KV cache, LoRA, MoE, perplexity, PEFT, PPO, prompt engineering, quantization, RAG, RLHF, RLVR, RoPE, softmax, tokenizer, transformer, vector database, and 270+ more

Appendix G

Hardware & Compute Reference

▼

Quick-reference tables for GPU specifications, VRAM requirements, training cost estimates, and guidance on when to use different compute tiers.

G.1 GPU Comparison & VRAM Requirements 🟡⚙️

GPU comparison table: A100 (80GB, 2TB/s, 312 TFLOPS FP16), H100 (80GB, 3.35TB/s, 989 TFLOPS FP16), H200 (141GB, 4.8TB/s), L40S (48GB, consumer-grade), RTX 4090 (24GB, hobbyist)
VRAM requirements by model size: 7B (14GB FP16, 4GB INT4), 13B (26GB FP16, 7GB INT4), 70B (140GB FP16, 35GB INT4)

G.2 Training & Inference Benchmarks 🟡⚙️

Training time estimates and cost benchmarks
Inference throughput benchmarks by model and hardware
When to use CPU, single GPU, multi-GPU, or multi-node

Appendix H

Model Card Quick Reference

▼

One-page summaries for the 20 most-used models, covering architecture type, parameter count, context window, license, key strengths, and API access.

H.1 Proprietary Models 🟢📐

GPT-4o, Claude 3.5/4, Gemini 2.5
For each: architecture type, parameter count, context window, license, key strengths, API access

H.2 Open-Weight LLMs 🟢📐

Llama 3/4, Mistral/Mixtral, DeepSeek V3/R1, Phi-4, Qwen 2.5
For each: architecture type, parameter count, context window, license, key strengths, API access

H.3 Specialized & Encoder Models 🟢📐

BERT, RoBERTa, T5, Whisper, CLIP, Stable Diffusion, Sentence-BERT/E5
For each: architecture type, parameter count, context window, license, key strengths, API access

Appendix I

Prompt Template Library

▼

Ready-to-use prompt templates organized by task type: classification, extraction, summarization, code generation, evaluation, synthetic data, and agent systems.

I.1 Classification & Extraction Prompts 🟢⚙️🔧

Classification prompts (sentiment, intent, topic)
Extraction prompts (NER, relation extraction, structured output)

I.2 Summarization & Code Generation Prompts 🟡⚙️🔧

Summarization prompts (abstractive, extractive, multi-document)
Code generation prompts (function generation, debugging, code review)

I.3 Evaluation, Synthetic Data & Agent Prompts 🟡⚙️🔧

Evaluation prompts (LLM-as-judge templates, pairwise comparison)
Synthetic data generation prompts (persona-driven, domain-specific)
Agent system prompts (ReAct, tool-use, planning)

Appendix J

Dataset & Benchmark Reference

▼

Comprehensive reference for major LLM benchmarks and datasets, organized by category. For each: what it measures, size, known limitations, and contamination status.

J.1 Language Understanding & Reasoning 🟢📐

Language understanding: MMLU, HellaSwag, ARC, WinoGrande, TruthfulQA
Reasoning: GSM8K, MATH, BBH, ARC-Challenge

J.2 Code Benchmarks 🟢📐

Code: HumanEval, MBPP, SWE-bench, LiveCodeBench

J.3 Retrieval, Embeddings & RAG 🟡📐

Retrieval and embeddings: MTEB, BEIR, MS MARCO
RAG evaluation: RAGAS metrics, RGB benchmark

J.4 Chat, Instruction & Safety 🟡📐

Chat and instruction: AlpacaEval, MT-Bench, Arena-Hard, Chatbot Arena
Safety: ToxiGen, RealToxicityPrompts, HarmBench

Reference

Tools & Technologies Used

▼

Key libraries, frameworks, and platforms used throughout the course.

Core ML & LLM

PyTorch Hugging Face Transformers TRL PEFT bitsandbytes Unsloth Axolotl MergeKit TransformerLens

Inference & Serving

vLLM TGI SGLang Ollama llama.cpp Triton

LLM APIs & SDKs

OpenAI SDK Anthropic SDK Google Generative AI LiteLLM AWS Bedrock Azure OpenAI Instructor DSPy

RAG & Vector Search

LangChain LlamaIndex ChromaDB Qdrant Pinecone FAISS pgvector Neo4j

Agents & Orchestration

LangGraph CrewAI AutoGen Smolagents PydanticAI MCP Protocol E2B

Data & Evaluation

Hugging Face Datasets Distilabel Argilla RAGAS DeepEval Weights & Biases Outlines

Observability & Deployment

LangSmith Langfuse Phoenix (Arize) FastAPI Gradio Streamlit Chainlit Docker NeMo Guardrails

NLP & Interpretability

spaCy NLTK Gensim Sentence-Transformers tiktoken TransformerLens nnsight

Prerequisites

Learning Outcomes

Table of Contents