Part IV: Training & Adapting LLMs
Fine-tuning adapts an existing model to new tasks, but it is not the only way to create specialized models. Knowledge distillation transfers capabilities from a large "teacher" model into a smaller, faster "student" model, enabling deployment at a fraction of the cost. Model merging combines multiple fine-tuned models into a single model that inherits capabilities from all of them, without any additional training.
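The two ideas can be sketched in a few lines. The following is a minimal, framework-free illustration (not any particular library's API): models are represented as hypothetical flat dicts of parameter lists, merging is plain linear weight averaging, and the distillation objective is the KL divergence between temperature-softened teacher and student distributions.

```python
import math

def merge_models(models, weights=None):
    """Linear weight averaging: combine fine-tuned models with no extra training.

    `models` is a list of dicts mapping parameter names to lists of floats
    (a toy stand-in for real weight tensors).
    """
    if weights is None:
        weights = [1.0 / len(models)] * len(models)  # uniform average by default
    merged = {}
    for name in models[0]:
        merged[name] = [
            sum(w * m[name][i] for m, w in zip(models, weights))
            for i in range(len(models[0][name]))
        ]
    return merged

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; higher temperature yields softer targets."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) over softened distributions: the classic
    soft-target distillation objective the student is trained to minimize."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Averaging two "models" elementwise:
merged = merge_models([{"w": [0.0, 2.0]}, {"w": [2.0, 4.0]}])
# merged["w"] is [1.0, 3.0]

# The loss is zero when the student matches the teacher, positive otherwise:
same = distillation_loss([1.0, 2.0], [1.0, 2.0])
diff = distillation_loss([3.0, 0.0], [0.0, 3.0])
```

Real merges operate on matching tensors of two fine-tunes of the same base model (and more sophisticated schemes like task arithmetic or TIES go beyond plain averaging), but the core operation is exactly this elementwise combination.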
These techniques have produced some of the most impressive results in the open-source LLM ecosystem. Microsoft's Phi models used synthetic training data distilled from GPT-4 to create small models that punch far above their weight. Community model merges on the Open LLM Leaderboard routinely outperform their constituent models. DeepSeek used distillation to create efficient reasoning models from their larger R1 teacher.
This module also covers continual learning: how to adapt models to new domains over time without catastrophically forgetting their general capabilities. By the end, you will understand the complete toolkit for creating, combining, and evolving specialized LLMs for production deployment.