Module 15 · Section 15.2

Model Merging & Composition

Combining specialized models into a single model that inherits multiple capabilities, with zero additional training
★ Big Picture

Model merging creates multi-skilled models by combining weights from specialized fine-tunes, requiring no GPU training at all. If you have a model fine-tuned for code and another fine-tuned for medical text, merging can produce a single model that handles both domains. This sounds like magic, and the theoretical understanding is still developing, but the empirical results are striking. Merged models regularly top the Open LLM Leaderboard, and the technique has become a core tool in the open-source model ecosystem. This section covers the key merging algorithms (Linear, SLERP, TIES, DARE), the theoretical framework of task arithmetic, and practical workflows using MergeKit.

1. Why Model Merging Works

Model merging exploits a remarkable property of neural network loss landscapes: models fine-tuned from the same base occupy a connected region of parameter space where linear interpolations between them tend to perform well. When two models are fine-tuned from the same pretrained checkpoint, their weight differences from the base represent "task vectors" in parameter space. These task vectors can be combined arithmetically because the underlying loss landscape in the neighborhood of the pretrained model is approximately convex.

The key constraint is that all models being merged must share the same architecture and originate from the same base pretrained checkpoint. You cannot merge a Llama model with a Mistral model, or even two Llama models fine-tuned from different pretrained versions. The common ancestry is what ensures the weight spaces are compatible.

Figure 1: Task vectors represent the weight changes from fine-tuning. Adding task vectors combines capabilities.

2. Merging Methods

2.1 Linear (Weighted Average)

The simplest merging method computes a weighted average of model weights. Given models A and B with weights W_A and W_B, the merged model has weights:

W_merged = α · W_A + (1 - α) · W_B

The mixing coefficient α controls the balance between models. Setting α = 0.5 gives equal weight to both. Linear merging is fast and simple, but it can produce mediocre results when the models have very different weight distributions because opposing parameter changes can cancel each other out.
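As a minimal sketch, linear merging over state dicts looks like this (toy tensors stand in for real model weights):

```python
import torch

def linear_merge(state_dict_a, state_dict_b, alpha=0.5):
    """Weighted average of two state dicts: alpha * A + (1 - alpha) * B."""
    return {
        name: alpha * state_dict_a[name].float()
        + (1.0 - alpha) * state_dict_b[name].float()
        for name in state_dict_a
    }

# Toy example with two single-tensor "models"
a = {"w": torch.tensor([1.0, 2.0])}
b = {"w": torch.tensor([3.0, 6.0])}
merged = linear_merge(a, b, alpha=0.5)
# merged["w"] is [2.0, 4.0]
```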

2.2 SLERP (Spherical Linear Interpolation)

SLERP treats weight vectors as points on a high-dimensional sphere and interpolates along the geodesic (shortest path on the sphere surface) rather than in a straight line through the interior. This preserves the magnitude of weight vectors better than linear interpolation, which tends to shrink weights toward zero when the models diverge.

SLERP is applied layer by layer (or parameter by parameter) and typically produces better results than linear averaging. It is the recommended default for merging two models.

◆ Key Insight

SLERP can only merge exactly two models at a time. For merging three or more models, you must either chain multiple SLERP operations (merge A+B, then merge result+C) or use a method that natively supports multiple inputs like Linear averaging or TIES. The order of chained SLERP merges can affect the result, so experiment with different orderings.
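To make the geometry concrete, here is a minimal SLERP sketch for a pair of weight tensors, treating each tensor as one flattened vector (a production implementation such as MergeKit's handles more edge cases):

```python
import torch

def slerp(t, v0, v1, eps=1e-8):
    """Spherical linear interpolation between two weight tensors."""
    a, b = v0.flatten().float(), v1.flatten().float()
    # Angle between the two vectors on the hypersphere
    cos_omega = torch.dot(a, b) / (a.norm() * b.norm() + eps)
    omega = torch.acos(cos_omega.clamp(-1.0, 1.0))
    if omega.abs() < 1e-4:
        # Nearly parallel vectors: fall back to linear interpolation
        out = (1 - t) * a + t * b
    else:
        sin_omega = torch.sin(omega)
        out = (torch.sin((1 - t) * omega) / sin_omega) * a \
            + (torch.sin(t * omega) / sin_omega) * b
    return out.reshape(v0.shape)
```

Note how the midpoint of two orthogonal unit vectors keeps norm 1 under SLERP, whereas linear averaging would shrink it to about 0.71.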

2.3 TIES (Trim, Elect Sign, and Merge)

TIES-Merging (2023) addresses the interference problem that plagues simple averaging. When two fine-tuned models modify the same parameter in opposite directions, averaging cancels out both changes. TIES handles this through three steps:

  1. TRIM: Remove small-magnitude changes (below a threshold) that are likely noise rather than meaningful task knowledge.
  2. Elect Sign: For each parameter, elect an aggregate sign of change across models (in the original method, the sign with the greater total magnitude wins). Parameters where models disagree on direction are resolved by this vote.
  3. Merge: Average only the values that agree with the elected sign, zeroing out conflicting contributions.
Figure 2: TIES reduces interference by trimming noise, resolving sign conflicts, and merging only aligned contributions.
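The three steps map onto a few tensor operations. A toy per-tensor sketch, simplified relative to the paper (trimming is done within each tensor, and the sign election sums the trimmed deltas):

```python
import torch

def ties_merge(task_vectors, density=0.5):
    """Toy TIES merge over per-parameter delta tensors (one per model)."""
    trimmed = []
    for tv in task_vectors:
        # 1. TRIM: keep only the top-`density` fraction by magnitude
        k = max(1, int(density * tv.numel()))
        threshold = tv.abs().flatten().topk(k).values.min()
        trimmed.append(torch.where(tv.abs() >= threshold, tv,
                                   torch.zeros_like(tv)))
    stacked = torch.stack(trimmed)
    # 2. ELECT SIGN: dominant sign per parameter (magnitude-weighted sum)
    elected = torch.sign(stacked.sum(dim=0))
    # 3. MERGE: average only the values matching the elected sign
    agree = (torch.sign(stacked) == elected) & (stacked != 0)
    counts = agree.sum(dim=0).clamp(min=1)
    return (stacked * agree).sum(dim=0) / counts
```

For the deltas [+0.3, +0.5] and [-0.1, +0.4], the elected signs are both positive, so -0.1 is zeroed out and the result is [0.3, 0.45] rather than the plain average [0.1, 0.45].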

2.4 DARE (Drop And REscale)

DARE (2024) takes a different approach to reducing interference: it randomly drops a large fraction (typically 90-99%) of the delta parameters before merging, then rescales the remaining parameters to compensate. The intuition is that fine-tuning changes are highly redundant, and a small random subset of changes captures the essential task knowledge. By keeping only a sparse set of changes from each model, the probability of destructive interference is dramatically reduced.

DARE can be combined with other merging methods (DARE+TIES is a popular combination) and tends to work especially well when merging many models or when the models have been fine-tuned for very different tasks.
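The drop-and-rescale step itself is tiny. A sketch for a single delta tensor (RNG handling simplified relative to the paper):

```python
import torch

def dare(task_vector, drop_rate=0.9):
    """DARE: randomly zero out delta entries, rescale survivors by 1/(1-p).

    The rescaling keeps the expected value of each delta unchanged, so the
    sparse vector is an unbiased stand-in for the dense one.
    """
    mask = torch.rand(task_vector.shape) >= drop_rate
    return task_vector * mask / (1.0 - drop_rate)
```

With a drop rate of 0.9, roughly 10% of entries survive, each scaled up by 10×.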

2.5 Model Stock

Model Stock (2024) exploits the geometry of fine-tuned models in weight space. Instead of merging all models uniformly, it uses the angles and distances between the fine-tuned models and the base to compute interpolation weights that move the merge toward the center of the fine-tuned weight distribution while staying anchored to the base. This produces more robust merges, particularly when some of the input models are lower quality or overfitted.

3. Merging Method Comparison

| Method      | Models | Handles Interference  | Complexity | Best For                       |
|-------------|--------|-----------------------|------------|--------------------------------|
| Linear      | 2+     | No                    | Very Low   | Quick baseline, similar models |
| SLERP       | 2 only | Partially             | Low        | Default for two-model merges   |
| TIES        | 2+     | Yes (sign election)   | Medium     | Merging 3+ diverse models      |
| DARE        | 2+     | Yes (sparsification)  | Medium     | Many models, high diversity    |
| DARE+TIES   | 2+     | Yes (both)            | Medium     | Best overall quality           |
| Model Stock | 2+     | Yes (selection)       | High       | Quality-sensitive merges       |

4. Task Arithmetic

Task arithmetic provides the theoretical framework for understanding model merging. A "task vector" is defined as the element-wise difference between a fine-tuned model and its pretrained base: τ = W_ft - W_base. Task arithmetic shows that these vectors can be manipulated algebraically:

import torch
from transformers import AutoModelForCausalLM

def compute_task_vector(base_model_id, finetuned_model_id):
    """Compute task vector: delta = finetuned - base."""
    base = AutoModelForCausalLM.from_pretrained(base_model_id)
    finetuned = AutoModelForCausalLM.from_pretrained(finetuned_model_id)

    # Cache the state dicts once; .state_dict() rebuilds the dict on each call
    base_sd = base.state_dict()
    ft_sd = finetuned.state_dict()

    return {
        name: ft_sd[name].float() - base_sd[name].float()
        for name in base_sd
    }

def apply_task_vectors(base_model_id, task_vectors, scaling_factors):
    """Apply scaled task vectors to base model."""
    model = AutoModelForCausalLM.from_pretrained(base_model_id)
    state_dict = model.state_dict()

    for tv, scale in zip(task_vectors, scaling_factors):
        for name in state_dict:
            state_dict[name] = state_dict[name].float() + scale * tv[name]

    model.load_state_dict(state_dict)
    return model

# Example: combine code and medical capabilities
code_tv = compute_task_vector("base-model", "code-finetuned")
medical_tv = compute_task_vector("base-model", "medical-finetuned")

merged = apply_task_vectors(
    "base-model",
    [code_tv, medical_tv],
    [0.7, 0.5],   # Scaling factors per task
)
ⓘ Note

Scaling factors for task vectors typically range from 0.3 to 1.0. Values above 1.0 amplify the fine-tuning effect but risk instability. When merging multiple task vectors, reduce the individual scaling factors to prevent the combined effect from being too strong. A good starting point is 1.0 / (number of task vectors) for each, then tune upward.

5. Model Soups

Model soups (Wortsman et al., 2022) are a specialized form of model merging where you average multiple checkpoints from the same training run or from runs with different hyperparameters. The key insight is that checkpoints along a training trajectory, or from runs that differ only in learning rate or data ordering, lie in a connected low-loss basin. Averaging these checkpoints produces a model that is more robust and generalizes better than any individual checkpoint.

from transformers import AutoModelForCausalLM

def create_model_soup(checkpoint_dirs: list[str], model_class):
    """Average weights from multiple training checkpoints."""

    # Load all checkpoint state dicts
    state_dicts = []
    for ckpt_dir in checkpoint_dirs:
        model = model_class.from_pretrained(ckpt_dir)
        state_dicts.append(model.state_dict())
        del model  # Free memory

    n = len(state_dicts)
    print(f"Creating soup from {n} checkpoints")

    # Uniform average
    soup_state = {}
    for key in state_dicts[0]:
        soup_state[key] = sum(sd[key].float() for sd in state_dicts) / n

    # Load into fresh model
    model = model_class.from_pretrained(checkpoint_dirs[0])
    model.load_state_dict(soup_state)
    return model

# Example: soup from checkpoints at different training stages
soup_model = create_model_soup(
    ["./checkpoints/epoch-2", "./checkpoints/epoch-3",
     "./checkpoints/epoch-4", "./checkpoints/epoch-5"],
    AutoModelForCausalLM,
)
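The paper's stronger variant, the greedy soup, adds checkpoints to the running average only when they improve a held-out metric. A framework-agnostic sketch (plain dicts of floats stand in for state dicts, and `evaluate` is a placeholder scoring callback you would supply, higher being better):

```python
def average(state_dicts):
    """Uniform average of a list of state dicts."""
    n = len(state_dicts)
    return {k: sum(sd[k] for sd in state_dicts) / n for k in state_dicts[0]}

def greedy_soup(candidates, evaluate):
    """Greedy soup: try candidates best-first, keep each only if it helps."""
    # Rank candidates by their individual scores
    ranked = sorted(candidates, key=evaluate, reverse=True)
    soup, best = [ranked[0]], evaluate(ranked[0])
    for sd in ranked[1:]:
        trial = average(soup + [sd])
        score = evaluate(trial)
        if score >= best:          # Keep the checkpoint only if it helps
            soup, best = soup + [sd], score
    return average(soup)
```

Because each candidate is admitted only when the averaged model scores at least as well, a badly overfitted checkpoint is simply skipped rather than dragging the soup down.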

6. MergeKit: The Standard Tool

MergeKit is the most widely used tool for model merging, supporting all major algorithms and providing a YAML-based configuration interface. It handles the complex details of loading models, computing task vectors, applying merge operations, and saving results in standard Hugging Face format.

# Install: pip install mergekit

# SLERP merge configuration (merge_slerp.yml)
slices:
  - sources:
      - model: models/code-llama-7b
        layer_range: [0, 32]
      - model: models/medical-llama-7b
        layer_range: [0, 32]

merge_method: slerp
base_model: models/code-llama-7b
parameters:
  t:
    - filter: self_attn
      value: [0, 0.5, 0.3, 0.7, 1]  # Gradient across layers
    - filter: mlp
      value: [1, 0.5, 0.7, 0.3, 0]
    - value: 0.5                     # Default for other params
dtype: bfloat16
# Run the merge:
#   $ mergekit-yaml merge_slerp.yml ./merged-output --cuda
#
# TIES merge with 3 models:
#   $ mergekit-yaml merge_ties.yml ./ties-output --cuda
# TIES merge configuration (merge_ties.yml)
models:
  - model: models/code-llama-7b
    parameters:
      density: 0.5    # Keep top 50% of changes
      weight: 1.0
  - model: models/medical-llama-7b
    parameters:
      density: 0.5
      weight: 0.8
  - model: models/math-llama-7b
    parameters:
      density: 0.5
      weight: 0.6

merge_method: ties
base_model: meta-llama/Llama-3-8B
parameters:
  normalize: true
dtype: bfloat16
⚠ Warning

Model merging requires enough system memory (RAM, not GPU VRAM) to hold all models simultaneously. Merging three 7B models in BF16 requires roughly 42 GB of RAM. Use the --lazy-unpickle flag in MergeKit to reduce memory usage by loading models incrementally. For very large models, consider merging on a cloud instance with sufficient RAM rather than on a local machine.

7. Evolutionary Model Merging

Evolutionary model merging (Sakana AI, 2024) automates the search for optimal merge configurations using evolutionary algorithms. Instead of manually selecting merge methods, weights, and layer-specific parameters, an evolutionary optimizer explores the space of possible merges and evaluates each candidate against a benchmark suite. This approach has produced merged models that significantly outperform manually configured merges.

The search space includes merge method selection, per-layer interpolation weights, density parameters (for TIES/DARE), and even layer permutations. The evolutionary algorithm (typically CMA-ES or NSGA-II) optimizes these parameters against multiple objectives: performance on target benchmarks, retention of base model capabilities, and overall coherence.
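A full pipeline is beyond a short example, but the core loop can be sketched as a toy hill-climbing search over per-model merge weights. Here `fitness` is a placeholder standing in for "merge with these weights, then run the benchmark suite"; real systems use CMA-ES or NSGA-II rather than this naive mutation scheme:

```python
import random

def evolve_merge_weights(fitness, n_models, pop_size=16, generations=20,
                         seed=0):
    """Minimal evolutionary search over merge weights in [0, 1]."""
    rng = random.Random(seed)
    best = [rng.random() for _ in range(n_models)]   # Random initial config
    best_fit = fitness(best)
    for _ in range(generations):
        for _ in range(pop_size):
            # Mutate the incumbent: small gaussian perturbation, clamped
            cand = [min(1.0, max(0.0, w + rng.gauss(0, 0.1))) for w in best]
            f = fitness(cand)
            if f > best_fit:                         # Select the improvement
                best, best_fit = cand, f
    return best, best_fit
```

Swapping `fitness` for an actual merge-and-evaluate routine is the expensive part: every candidate requires building and benchmarking a merged model, which is why these searches are usually run with aggressive caching and small evaluation sets.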

Figure 3: Evolutionary optimization searches the space of merge configurations, evaluating each candidate on benchmarks.

Section 15.2 Quiz

1. What is the fundamental requirement for two models to be mergeable, and why?

Show Answer
Both models must share the same architecture and originate from the same pretrained base checkpoint. This requirement exists because the weight spaces must be compatible. Models fine-tuned from the same base occupy a connected region of parameter space where interpolation between them traverses low-loss areas. Models from different bases have incompatible weight spaces where interpolation produces garbage.

2. Why does SLERP generally outperform linear averaging for model merging?

Show Answer
Linear averaging interpolates through the interior of the weight space, which tends to shrink weight magnitudes when the two models diverge. SLERP interpolates along the surface of a hypersphere, preserving the magnitude of weight vectors while smoothly transitioning direction. This magnitude preservation maintains the scaling properties that the model learned during training, resulting in better performance.

3. Explain the three steps of TIES-Merging and why each is necessary.

Show Answer
TRIM removes small-magnitude parameter changes that are likely training noise, reducing the chance of noise accumulation across models. ELECT SIGN resolves direction conflicts by electing the dominant sign (by total magnitude across models) when models modify the same parameter in opposite directions. MERGE averages only the values that agree with the elected sign, zeroing out conflicting contributions. Together, these steps prevent the destructive interference where opposing changes cancel each other out during averaging.

4. What is a task vector, and how does task arithmetic enable model composition?

Show Answer
A task vector is the element-wise difference between a fine-tuned model's weights and its pretrained base weights: τ = W_ft - W_base. It represents "what the model learned" during fine-tuning. Task arithmetic shows that these vectors can be added (to combine capabilities), subtracted (to remove capabilities), and scaled (to control strength). Adding multiple task vectors to a base model creates a single model that inherits capabilities from all source fine-tunes.

5. When would you use model soups versus multi-model merging (TIES/DARE)?

Show Answer
Model soups average checkpoints from the same training run or from runs that differ only in hyperparameters (learning rate, data ordering). The goal is improved robustness and generalization for a single task. Multi-model merging (TIES/DARE) combines models fine-tuned for different tasks to create a multi-skill model. Use soups when you want a more robust version of your existing model; use TIES/DARE when you want to combine capabilities from models trained on different domains or tasks.

Key Takeaways