Part VI: Agents & Applications
Building LLM applications is only half the challenge; knowing whether they actually work is the other half. Unlike traditional software where correctness is binary, LLM outputs are probabilistic, subjective, and context-dependent. A model that performs brilliantly on one prompt may fail catastrophically on a slight rephrasing. This fundamental uncertainty makes rigorous evaluation, principled experiment design, and continuous observability essential for every LLM project.
This module covers the complete evaluation and monitoring lifecycle. It begins with core evaluation metrics (perplexity, BLEU, ROUGE, BERTScore, LLM-as-Judge) and standard benchmarks (MMLU, HumanEval, MT-Bench, Chatbot Arena). It then addresses experimental design with statistical rigor, including bootstrap confidence intervals, paired tests, and ablation studies. Specialized evaluation for RAG and agent systems follows, covering RAGAS metrics, trajectory evaluation, and frameworks like DeepEval and Phoenix.
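To make the statistical-rigor piece concrete, here is a minimal sketch of a percentile bootstrap confidence interval over per-example eval scores, using only the standard library. The function name `bootstrap_ci` and the toy pass/fail scores are illustrative, not from a specific framework.

```python
import random

def bootstrap_ci(scores, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of per-example scores."""
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(scores)
    # Resample with replacement, recompute the mean each time, then sort.
    means = sorted(
        sum(rng.choices(scores, k=n)) / n for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

# Toy per-example pass/fail results from one eval run (illustrative).
scores = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1]
low, high = bootstrap_ci(scores)  # 95% CI around the 0.7 mean pass rate
```

With only ten examples the interval is wide, which is exactly the point: a single accuracy number without an interval overstates how much one eval run tells you.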
The module also covers testing strategies for LLM applications: unit tests, red teaming, prompt injection testing, and CI/CD integration. It then turns to production observability with tracing tools (LangSmith, Langfuse, Phoenix) and monitoring for prompt drift, provider version drift, and embedding drift. Finally, it addresses reproducibility practices: prompt versioning, config management with Hydra, experiment tracking with MLflow and Weights & Biases, and containerized environments with Docker.
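As a taste of the testing-strategy material, here is a minimal sketch of pytest-style unit tests for an LLM application. The `generate` function is a hypothetical stand-in for your model call; in CI it would typically be replaced by a recorded or stubbed response so tests stay deterministic and cheap.

```python
def generate(prompt: str) -> str:
    """Stubbed model call for illustration; swap in a real client
    (or a recorded response fixture) in integration tests."""
    return "Paris is the capital of France."

def test_answer_contains_expected_fact():
    # Assert on a robust property (a key substring), not an exact match,
    # since LLM phrasing varies between runs and model versions.
    answer = generate("What is the capital of France?")
    assert "paris" in answer.lower()

def test_answer_is_concise():
    # Guard against regressions where the model becomes verbose.
    answer = generate("What is the capital of France?")
    assert len(answer.split()) < 50

test_answer_contains_expected_fact()
test_answer_is_concise()
```

The pattern to note is asserting on properties of the output (contains a fact, stays under a length budget) rather than exact strings, which is what makes such tests stable enough to run in CI.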