Module 07 · Section 7.2

Open-Source & Open-Weight Models

Llama, Mistral, DeepSeek V3, Qwen, Phi, and the architectural innovations driving open AI

Open weights are like a recipe without the grocery list: you can see every layer, read every parameter, and still spend six months figuring out how they got it to cook.

A Grateful Tinkerer
★ Big Picture

The open-weight revolution. While closed-source models define the frontier, open-weight models have transformed the LLM ecosystem by making powerful models available for download, inspection, fine-tuning, and local deployment. Meta's Llama family ignited this movement, and organizations like DeepSeek, Mistral, Alibaba (Qwen), and Microsoft (Phi) have pushed it further with architectural innovations that rival or surpass closed-source alternatives. This section surveys the major open model families and takes a deep dive into the architectural innovations of DeepSeek V3, which introduced several techniques that fundamentally improve the efficiency of large-scale language modeling.

⚙ Prerequisites

This section builds on the landmark models from Section 6.1 and the attention mechanism from Module 04. Understanding of MoE, GQA, and MLA is developed here; the memory implications of GQA are explored further in Section 8.2.

1. Open-Source vs. Open-Weight: A Distinction That Matters

Before surveying specific models, we must clarify terminology. Most models commonly called "open-source" are more precisely open-weight: the trained model parameters are publicly released, but the training code, training data, and data processing pipelines may remain proprietary. True open-source would include all of these components.

2. Meta Llama: The Catalyst

Llama 3 and 3.1

Meta's Llama 3 family, released in 2024, established the gold standard for open-weight models. The release included 8B, 70B, and 405B parameter variants, all trained on over 15 trillion tokens of multilingual data. Llama 3's architecture builds on the standard dense Transformer with several refinements: grouped-query attention (GQA) at every model size, a larger 128K-entry tokenizer vocabulary, and RoPE positional embeddings with an increased base frequency.

Llama 3.1 extended the context window to 128K tokens using a progressive training strategy: the model was initially trained at 8K context, then extended to 128K through continued pre-training with gradually increasing sequence lengths.

Llama 4: The MoE Leap

Llama 4 marked Meta's transition to Mixture of Experts architectures. The Llama 4 Scout variant uses 16 experts with 17B active parameters out of 109B total, while Llama 4 Maverick scales to 128 experts with 17B active out of 400B total. Both models are natively multimodal, processing text and images within a unified architecture. This shift to MoE allows Meta to scale total model capacity while keeping inference costs proportional to the active parameter count.

3. Mistral and Mixtral

Mistral AI has pursued a strategy of releasing both small, highly efficient models and larger MoE models. Mistral 7B introduced sliding-window attention and GQA in a compact dense model that outperformed larger contemporaries. Mixtral 8x7B, a sparse MoE with 8 experts per layer and top-2 routing, has roughly 46.7B total parameters with about 12.9B active per token; Mixtral 8x22B scales the same recipe to roughly 141B total parameters.

⚡ Key Insight

MoE decouples knowledge from compute. DeepSeek V3 stores 671B parameters of knowledge but activates only 37B parameters per token. This means it has the capacity of a 671B-parameter dense model while running at roughly the speed and cost of a 37B-parameter model. MoE lets you scale what the model knows without proportionally scaling what it costs to run.
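A quick back-of-envelope check on these numbers. The parameter counts come from the text; the "roughly 2 FLOPs per active parameter per token" rule of thumb for a forward pass is an assumption used only for illustration:

```python
# Per-token compute for a dense model vs. an MoE of equal total capacity,
# using the rough "2 FLOPs per active parameter per token" approximation
def flops_per_token(active_params_billions):
    return 2 * active_params_billions * 1e9

dense = flops_per_token(671)   # hypothetical 671B dense model
moe = flops_per_token(37)      # DeepSeek V3: 37B active per token
print(f"Dense/MoE compute ratio: {dense / moe:.1f}x")  # ~18.1x
```

The ratio is just 671/37: the MoE buys an order-of-magnitude cheaper forward pass at the same total capacity.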

4. DeepSeek V3: Architecture Deep Dive

DeepSeek V3 is arguably the most architecturally innovative open-weight model released to date. At 671B total parameters with 37B active per token, it matches or exceeds many closed-source frontier models while introducing four key innovations: Multi-head Latent Attention (MLA), FP8 mixed-precision training, auxiliary-loss-free MoE load balancing, and multi-token prediction.

[Figure: four panels summarizing DeepSeek V3's innovations: Multi-head Latent Attention (compress K, V into a low-rank latent c_kv, ~93% KV cache reduction), FP8 mixed-precision training (E4M3/E5M2 formats with per-block 1x128 scaling, ~40% training memory reduction, no quality loss vs. BF16), auxiliary-loss-free MoE load balancing (per-expert bias terms adjusted dynamically), and multi-token prediction (extra heads predicting positions t+2 through t+N).]
Figure 7.2: The four key architectural innovations in DeepSeek V3.

4.1 Multi-head Latent Attention (MLA)

The KV cache is the dominant memory bottleneck during autoregressive inference. In standard multi-head attention (MHA), we cache the full key and value tensors for every attention head at every layer. For a model with L layers, H heads, sequence length S, and head dimension D, the cache requires 2 × L × H × S × D elements.
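Plugging illustrative numbers into this formula makes the bottleneck tangible. The 80-layer, 64-head, 128-dim configuration below is a hypothetical 70B-class model, not any specific release:

```python
# KV cache bytes for standard MHA:
# 2 (K and V) * layers * heads * seq_len * head_dim * bytes per element
def kv_cache_bytes(n_layers, n_heads, seq_len, head_dim, bytes_per_elem=2):
    return 2 * n_layers * n_heads * seq_len * head_dim * bytes_per_elem

size = kv_cache_bytes(n_layers=80, n_heads=64, seq_len=32_768, head_dim=128)
print(f"{size / 2**30:.0f} GiB per sequence at 32K context")  # 80 GiB in FP16
```

A single 32K-context sequence consumes 80 GiB of cache in this configuration, which is why reducing the per-token cache footprint matters so much.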

MLA addresses this by introducing a low-rank latent bottleneck. Instead of caching the full key and value tensors, MLA compresses them into a much smaller latent vector:

c_kv = W_dkv · x     (compression: d_c ≪ n_h × d_h)

At decode time, the full keys and values are reconstructed from this compressed representation:

K = W_uk · c_kv,    V = W_uv · c_kv

The key insight is that we only need to cache the compact latent ckv (typically 512 dimensions) rather than the full K and V tensors (which might total 16,384 dimensions across all heads). This achieves roughly a 93% reduction in KV cache size, dramatically increasing the batch sizes and sequence lengths that can fit in GPU memory.

⚡ Key Insight

Making the numbers concrete: standard multi-head attention with 128 heads and d_head = 128 caches 128 × 128 = 16,384 key values per layer per token, and as many again for the values. MLA compresses all of this into a single 512-dimensional latent: 512 / 16,384 ≈ 3.1% of the key storage alone. In practice MLA also caches a small decoupled positional (RoPE) key per token, so the overall saving works out to roughly the 93% reduction quoted earlier.

# Simplified MLA implementation concept
import torch
import torch.nn as nn

class MultiHeadLatentAttention(nn.Module):
    """Simplified illustration of MLA from DeepSeek V3."""

    def __init__(self, d_model, n_heads, d_head, d_latent):
        super().__init__()
        self.n_heads = n_heads
        self.d_head = d_head
        self.d_latent = d_latent

        # Query projection (standard)
        self.W_q = nn.Linear(d_model, n_heads * d_head)

        # KV compression: project to low-rank latent
        self.W_dkv = nn.Linear(d_model, d_latent)  # compress

        # KV decompression: reconstruct K, V from latent
        self.W_uk = nn.Linear(d_latent, n_heads * d_head)  # K upsample
        self.W_uv = nn.Linear(d_latent, n_heads * d_head)  # V upsample

        self.W_o = nn.Linear(n_heads * d_head, d_model)

    def forward(self, x, cached_latents=None):
        B, S, D = x.shape

        # Queries: standard projection, shaped (B, n_heads, S, d_head)
        Q = self.W_q(x).view(B, S, self.n_heads, self.d_head).transpose(1, 2)

        # Compress KV to latent space
        c_kv = self.W_dkv(x)  # (B, S, d_latent)

        # Cache only the latent, not full K, V!
        # Memory: S * d_latent  vs  S * 2 * n_heads * d_head
        if cached_latents is not None:
            c_kv = torch.cat([cached_latents, c_kv], dim=1)

        # Decompress to full K, V for attention computation
        K = self.W_uk(c_kv).view(B, -1, self.n_heads, self.d_head).transpose(1, 2)
        V = self.W_uv(c_kv).view(B, -1, self.n_heads, self.d_head).transpose(1, 2)

        # Standard scaled dot-product attention over the reconstructed K, V
        # (causal masking omitted for brevity)
        attn = torch.nn.functional.scaled_dot_product_attention(Q, K, V)
        output = self.W_o(attn.transpose(1, 2).reshape(B, S, -1))

        return output, c_kv  # return latent for caching
⚡ Key Insight

MLA vs. GQA vs. MQA: Grouped Query Attention (GQA, used by Llama 3) reduces cache by sharing KV heads across groups. Multi-Query Attention (MQA) shares a single KV head across all queries. MLA takes a fundamentally different approach: rather than reducing the number of KV heads, it compresses the entire KV representation into a learned low-rank latent space. This achieves greater compression (93% vs. GQA's typical 75%) while preserving more expressive power, since the decompression matrices can reconstruct richer per-head representations than simple head-sharing allows.

ⓘ Note

Looking ahead: We return to the memory optimization implications of GQA and MLA in Section 8.2, where we quantify the KV cache savings these architectures provide during inference.

4.2 FP8 Mixed-Precision Training

DeepSeek V3 was the first model to successfully train at 671B parameters using FP8 (8-bit floating point) precision. Previous large-scale training runs used BF16 or FP16 (16-bit) formats, which double the memory requirements for storing model weights, activations, and gradients.

FP8 comes in two variants: E4M3 (4 exponent bits, 3 mantissa bits; finer precision but a narrower range, with a maximum representable value of 448) and E5M2 (5 exponent bits, 2 mantissa bits; wider dynamic range at lower precision, typically used for gradients).

The challenge with FP8 training is that the reduced precision can cause training instability, especially in operations with large dynamic ranges. DeepSeek solved this through fine-grained quantization: instead of applying a single scaling factor per tensor, they apply per-block scaling factors with a granularity of 1x128 tiles. Each small block of 128 elements gets its own scale factor, allowing different parts of a tensor to use different dynamic ranges.

# Conceptual illustration of fine-grained FP8 quantization
import torch

def quantize_fp8_fine_grained(tensor, block_size=128):
    """
    Fine-grained FP8 quantization as used in DeepSeek V3.
    Each block of 128 elements gets its own scale factor.
    """
    # Reshape into blocks
    original_shape = tensor.shape
    flat = tensor.reshape(-1)
    n_blocks = (flat.numel() + block_size - 1) // block_size

    # Pad if needed
    padded = torch.zeros(n_blocks * block_size, dtype=tensor.dtype, device=tensor.device)
    padded[:flat.numel()] = flat

    blocks = padded.reshape(n_blocks, block_size)

    # Per-block scaling: find max absolute value per block
    max_vals = blocks.abs().max(dim=1, keepdim=True).values
    max_vals = max_vals.clamp(min=1e-12)

    # E4M3 max representable value is 448
    fp8_max = 448.0
    scales = max_vals / fp8_max

    # Quantize each block with its own scale
    quantized = (blocks / scales).clamp(-fp8_max, fp8_max)

    # In practice, the scaled values are cast to an FP8 dtype,
    # with the scale factors kept in higher precision
    return quantized, scales, original_shape

The result: DeepSeek V3 used approximately 40% less GPU memory during training compared to a BF16 baseline, with no measurable degradation in final model quality. This efficiency gain was essential for making the 671B parameter training run feasible on their hardware budget.
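To see why the per-block scales matter, here is a toy comparison. This illustrates the principle only, not DeepSeek's actual kernels: limited low-precision resolution is simulated with uniform rounding to 448 levels rather than true FP8 arithmetic.

```python
import torch

def quantize_dequantize(x, scales, levels=448):
    # Scale into [-levels, levels], round, then scale back
    return torch.round(x / scales * levels) * scales / levels

x = torch.randn(4, 128) * 0.01   # mostly small values...
x[0, 0] = 100.0                  # ...plus one large outlier

# Per-tensor: a single scale; the outlier dictates the rounding step for everything
per_tensor = quantize_dequantize(x, x.abs().max())

# Per-block: one scale per 128-element row, so only the outlier's block suffers
per_block = quantize_dequantize(x, x.abs().max(dim=1, keepdim=True).values)

print((x - per_tensor).abs().mean(), (x - per_block).abs().mean())
```

With a single scale, the outlier forces a rounding step so coarse that the small values are flattened to zero; per-block scales confine that damage to the outlier's own block.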

4.3 Auxiliary-Loss-Free MoE Load Balancing

In Mixture of Experts models, a gating network routes each token to a subset of experts. A persistent problem is load imbalance: without intervention, the gating network tends to concentrate tokens on a few "popular" experts while leaving others underutilized. This wastes compute capacity and degrades model quality.

The standard solution is an auxiliary loss that penalizes imbalanced routing. This loss term is added to the main language modeling loss and encourages uniform expert utilization. However, the auxiliary loss introduces a tension: optimizing for balanced routing can conflict with optimizing for language modeling quality. The auxiliary loss coefficient must be carefully tuned, and even with tuning, it subtly degrades the primary training objective.
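For concreteness, the standard auxiliary loss looks roughly like this. This is a sketch in the style of the Switch Transformer's load-balancing loss; the exact form varies across MoE models and is an assumption here:

```python
import torch

def load_balancing_aux_loss(router_probs, expert_indices, n_experts):
    """n_experts * sum_e (fraction of tokens routed to e) * (mean router prob for e).
    Takes its minimum value, 1.0, when routing is perfectly uniform."""
    f = torch.bincount(expert_indices, minlength=n_experts).float() / expert_indices.numel()
    p = router_probs.mean(dim=0)
    return n_experts * (f * p).sum()

# Perfectly balanced routing over 4 experts hits the minimum
probs = torch.full((8, 4), 0.25)           # uniform router probabilities
idx = torch.tensor([0, 1, 2, 3, 0, 1, 2, 3])  # each expert gets 2 of 8 tokens
print(load_balancing_aux_loss(probs, idx, 4).item())  # 1.0
```

Because this term is added to the language modeling loss, its gradient tugs the router toward balance even when imbalanced routing would model the text better, which is exactly the tension described above.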

DeepSeek V3 eliminates this conflict with a novel approach: dynamically adjusted bias terms added to the gating scores:

gate(x) = softmax(W_g · x + b)

The bias terms b are not learned through gradient descent. Instead, they are adjusted dynamically based on observed load statistics: if an expert is overloaded, its bias is decreased; if underloaded, its bias is increased. This adjustment happens outside the gradient computation, meaning the language modeling loss is never contaminated by a balancing objective.

# Auxiliary-loss-free MoE load balancing concept
import torch
import torch.nn as nn

class FeedForward(nn.Module):
    """Simple expert FFN."""
    def __init__(self, d_model, d_hidden=None):
        super().__init__()
        d_hidden = d_hidden or 4 * d_model
        self.net = nn.Sequential(
            nn.Linear(d_model, d_hidden),
            nn.GELU(),
            nn.Linear(d_hidden, d_model),
        )

    def forward(self, x):
        return self.net(x)

class AuxLossFreeMoE(nn.Module):
    def __init__(self, d_model, n_experts, top_k=2):
        super().__init__()
        self.n_experts = n_experts
        self.top_k = top_k
        self.gate = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList([
            FeedForward(d_model) for _ in range(n_experts)
        ])
        # Dynamic bias terms (NOT trained by gradient descent)
        self.register_buffer(
            'expert_bias',
            torch.zeros(n_experts)
        )
        self.bias_update_rate = 0.001

    def forward(self, x):  # x: (n_tokens, d_model)
        # Compute gating scores with dynamic bias
        logits = self.gate(x) + self.expert_bias  # bias added here
        scores = torch.softmax(logits, dim=-1)

        # Select top-k experts per token
        top_scores, top_indices = scores.topk(self.top_k, dim=-1)
        top_scores = top_scores / top_scores.sum(dim=-1, keepdim=True)

        # Route tokens to experts and compute outputs
        output = self._route_and_compute(x, top_scores, top_indices)

        # Update bias based on load (outside gradient computation)
        with torch.no_grad():
            load = self._compute_load(top_indices)
            target_load = x.shape[0] * self.top_k / self.n_experts
            # Decrease bias for overloaded, increase for underloaded
            self.expert_bias -= self.bias_update_rate * (load - target_load)

        return output

    def _route_and_compute(self, x, top_scores, top_indices):
        output = torch.zeros_like(x)
        for e in range(self.n_experts):
            # Tokens whose top-k selection includes expert e, weighted by gate score
            mask = top_indices == e              # (n_tokens, top_k)
            token_mask = mask.any(dim=-1)
            if token_mask.any():
                weights = (top_scores * mask).sum(dim=-1, keepdim=True)
                output[token_mask] += weights[token_mask] * self.experts[e](x[token_mask])
        return output

    def _compute_load(self, top_indices):
        # Number of token slots routed to each expert this batch
        return torch.bincount(
            top_indices.reshape(-1), minlength=self.n_experts
        ).float()

4.4 Multi-Token Prediction (MTP)

Standard language model training uses a next-token prediction objective: given the context, predict the immediately next token. DeepSeek V3 augments this with multi-token prediction, where additional lightweight prediction heads simultaneously predict tokens at positions t+2, t+3, and so on.

The benefit is twofold. First, the multi-token objective provides richer training signal, since the hidden representations must encode information about multiple future tokens rather than just one. This produces more informative internal representations. Second, the additional prediction heads can be repurposed at inference time for speculative decoding, where the draft predictions from these heads are verified in parallel, potentially doubling generation speed.
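The objective can be sketched with a few lines of PyTorch. This is a simplification for illustration: DeepSeek V3's actual MTP modules are sequential and share more structure with the trunk, whereas here each future offset simply gets its own linear head:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiTokenPredictionHeads(nn.Module):
    """One lightweight output head per future offset, sharing the trunk's hidden states."""
    def __init__(self, d_model, vocab_size, n_future=3):
        super().__init__()
        # Head i predicts the token at position t + 1 + i
        self.heads = nn.ModuleList(
            nn.Linear(d_model, vocab_size) for _ in range(n_future)
        )

    def forward(self, hidden):  # hidden: (B, S, d_model)
        return [head(hidden) for head in self.heads]

def mtp_loss(logits_per_head, tokens):
    """Average cross-entropy, shifting targets one extra position per head."""
    losses = []
    for i, logits in enumerate(logits_per_head):
        shift = i + 1
        pred = logits[:, :-shift].reshape(-1, logits.size(-1))
        target = tokens[:, shift:].reshape(-1)
        losses.append(F.cross_entropy(pred, target))
    return torch.stack(losses).mean()
```

During training the hidden states come from the shared trunk, so every position receives gradient from several future tokens; at inference the extra heads can serve as cheap draft predictors for speculative decoding.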

5. Qwen 2.5: Alibaba's Contender

Alibaba's Qwen (Tongyi Qianwen) 2.5 series offers a comprehensive family spanning 0.5B to 72B parameters. The Qwen family is particularly notable for its broad multilingual coverage, specialized variants such as Qwen2.5-Coder and Qwen2.5-Math, long-context support up to 128K tokens on larger variants, and Apache 2.0 licensing for most model sizes.

6. Microsoft Phi: Small but Capable

Microsoft's Phi series challenges the assumption that bigger is always better. The Phi models use knowledge distillation and curated high-quality training data to achieve performance that punches far above their parameter count:

| Model | Parameters | Key Innovation | Performance Note |
|---|---|---|---|
| Phi-3 Mini | 3.8B | Curated "textbook quality" data | Matches Llama 3 8B on some benchmarks |
| Phi-3 Small | 7B | Data quality + distillation | Competitive with Mixtral 8x7B |
| Phi-3 Medium | 14B | Balanced size/quality | Approaches GPT-4o mini capability |
| Phi-4 | 14B | Synthetic data from GPT-4 | Strong reasoning, code, math |

The Phi approach demonstrates that training data quality can partially compensate for model size. By training on carefully curated, information-dense data (including synthetic data generated by larger models), Phi models achieve a higher "knowledge per parameter" ratio than models trained on raw web crawls.

7. Google Gemma: Open Models from DeepMind

Google's Gemma family brings DeepMind's research into the open-weight ecosystem. Gemma 2 (2024) was released at 2B, 9B, and 27B parameter sizes, trained using techniques from the larger Gemini models. Gemma 3 (2025) expanded to multimodal capabilities, accepting both text and image inputs.

Key characteristics of the Gemma family include knowledge distillation from larger teacher models (used for the smaller Gemma 2 variants), interleaved local sliding-window and global attention layers, logit soft-capping for training stability in Gemma 2, and open weights released under Google's Gemma license terms.

8. Specialized Open Models

Code Models

CodeLlama and StarCoder2 are fine-tuned specifically for code generation. CodeLlama extends Llama with additional training on code-heavy data and supports infilling (generating code to fill a gap between existing code). StarCoder2, from the BigCode project, was trained on The Stack v2 with over 600 programming languages.

Vision-Language Models

LLaVA (Large Language and Vision Assistant) demonstrates the visual instruction tuning approach: connect a pre-trained vision encoder (CLIP) to a language model through a projection layer, then fine-tune on visual question-answering data. This modular approach has spawned many variants and remains a popular architecture for open multimodal models.
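The projection layer at the heart of this design is small. A minimal sketch, with illustrative dimensions (LLaVA-1.5 uses a two-layer MLP projector; the original LLaVA used a single linear layer):

```python
import torch
import torch.nn as nn

class VisionProjector(nn.Module):
    """Map frozen vision-encoder patch features into the language model's embedding space."""
    def __init__(self, d_vision=1024, d_lm=4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(d_vision, d_lm),
            nn.GELU(),
            nn.Linear(d_lm, d_lm),
        )

    def forward(self, patch_features):  # (B, n_patches, d_vision)
        # Output is prepended to the text token embeddings as "visual tokens"
        return self.proj(patch_features)
```

In the first alignment stage only this projector is trained, with both the CLIP encoder and the language model frozen; later stages unfreeze the LM for visual instruction tuning.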

Speech Models

Whisper from OpenAI (released with open weights) provides robust speech recognition across 99 languages. Its encoder-decoder architecture processes mel spectrograms and generates text, with optional timestamp prediction for alignment.

9. The Hugging Face Ecosystem

No discussion of open models is complete without the Hugging Face ecosystem, which provides the infrastructure for discovering, downloading, and deploying models:

# Loading and running an open-weight model with Hugging Face
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Download and load Llama 3 8B (requires access approval)
model_name = "meta-llama/Meta-Llama-3-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,  # Half precision for memory
    device_map="auto",            # Automatic GPU placement
)

# Format a chat message
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain MoE in 3 sentences."}
]
input_ids = tokenizer.apply_chat_template(
    messages, return_tensors="pt"
).to(model.device)

# Generate a response
output = model.generate(
    input_ids,
    max_new_tokens=256,
    temperature=0.7,
    do_sample=True,
)
response = tokenizer.decode(
    output[0][input_ids.shape[1]:],
    skip_special_tokens=True
)
print(response)

The Hugging Face ecosystem includes the Hub (hosting for models, datasets, and demos), the Transformers library for model loading and inference, the Datasets and Tokenizers libraries for data processing, PEFT for parameter-efficient fine-tuning, and Text Generation Inference for production serving.

10. Lab: Running Models Locally

For local inference on consumer hardware, llama.cpp and its wrapper Ollama provide optimized C++ inference with quantized models:

# Using Ollama to run models locally
# Install: https://ollama.ai

# Pull and run Llama 3 8B (quantized to 4-bit, ~4.7GB)
# Terminal command:
# ollama pull llama3
# ollama run llama3

# Programmatic access via Python
import requests

def query_ollama(prompt, model="llama3"):
    response = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": model,
            "prompt": prompt,
            "stream": False,
        }
    )
    return response.json()["response"]

# Query the quantized 8B model running locally
local_response = query_ollama(
    "What are the advantages of Mixture of Experts models?"
)
print("Local 8B response:")
print(local_response)
Local 8B response:
Mixture of Experts (MoE) models offer several key advantages:
1. Computational efficiency: Only a subset of parameters (experts) are activated per token, so inference cost scales with active parameters rather than total parameters.
2. Scaling capacity: Total knowledge capacity can grow by adding more experts without proportionally increasing compute.
3. Specialization: Different experts can learn to handle different types of inputs, improving overall model quality.

Section 7.2 Quiz

1. What is the key difference between MLA (Multi-head Latent Attention) and GQA (Grouped Query Attention) in how they reduce KV cache size?
Reveal Answer
GQA reduces cache by sharing key-value heads across multiple query heads, so fewer distinct KV tensors need to be stored. MLA takes a fundamentally different approach: it compresses the entire KV representation into a learned low-rank latent vector, caching only this compact representation and reconstructing the full K and V at decode time. MLA achieves greater compression (approximately 93% vs. GQA's typical 75%) while preserving more expressive power.
2. Why does DeepSeek V3 use per-block scaling with 1x128 tile granularity for FP8 training instead of a single scale per tensor?
Reveal Answer
Different regions of a tensor can have very different value distributions and dynamic ranges. A single scale per tensor forces a compromise that may clip large outliers or lose precision on small values. Per-block scaling with 1x128 granularity allows each small block of 128 elements to have its own scale factor, accommodating local variations in dynamic range. This fine-grained approach enables FP8 precision to match BF16 quality at 671B parameters.
3. How does DeepSeek V3's auxiliary-loss-free MoE solve the tension between load balancing and language modeling quality?
Reveal Answer
Standard MoE models add an auxiliary loss term to the training objective that penalizes unbalanced expert utilization. This creates a tension because optimizing for balance can conflict with optimizing for language quality. DeepSeek V3 replaces the auxiliary loss with learnable bias terms added to gating scores. These biases are adjusted dynamically based on observed load statistics (outside gradient descent), so the language modeling loss is never contaminated by a balancing objective.
4. In Mixtral 8x7B, how many total parameters does the model have, and how many are active per token?
Reveal Answer
Mixtral 8x7B has approximately 46.7B total parameters. Each token is routed to 2 of the 8 experts, so approximately 12.9B parameters are active per token. This decoupling of total capacity from per-token compute is the fundamental economic advantage of MoE architectures.
5. What is the dual benefit of DeepSeek V3's multi-token prediction objective?
Reveal Answer
First, the multi-token objective provides richer training signal because hidden representations must encode information about multiple future tokens, producing more informative internal representations. Second, the additional prediction heads can be repurposed at inference time for speculative decoding, where draft predictions are verified in parallel to potentially double generation speed.

Key Takeaways