Section 7.2: Open-Source & Open-Weight Models

Open weights are like a recipe without the grocery list: you can see every layer, read every parameter, and still spend six months figuring out how they got it to cook.

A Grateful Tinkerer

★ Big Picture

The open-weight revolution. While closed-source models define the frontier, open-weight models have transformed the LLM ecosystem by making powerful models available for download, inspection, fine-tuning, and local deployment. Meta's Llama family ignited this movement, and organizations like DeepSeek, Mistral, Alibaba (Qwen), and Microsoft (Phi) have pushed it further with architectural innovations that rival or surpass closed-source alternatives. This section surveys the major open model families and takes a deep dive into the architectural innovations of DeepSeek V3, which introduced several techniques that fundamentally improve the efficiency of large-scale language modeling.

⚙ Prerequisites

This section builds on the landmark models from Section 6.1 and the attention mechanism from Module 04. Understanding of MoE, GQA, and MLA is developed here; the memory implications of GQA are explored further in Section 8.2.

1. Open-Source vs. Open-Weight: A Distinction That Matters

Before surveying specific models, we must clarify terminology. Most models commonly called "open-source" are more precisely open-weight: the trained model parameters are publicly released, but the training code, training data, and data processing pipelines may remain proprietary. True open-source would include all of these components.

Open-weight: Model weights available for download; may include restrictive licenses (Llama 3's community license, for example, restricts use by companies with over 700M monthly active users)
Open-source: Weights, training code, data, and evaluation code all available under permissive licenses (examples include OLMo from AI2, Pythia from EleutherAI)
Open ecosystem: The broader infrastructure of tools, libraries, and platforms (Hugging Face, vLLM, llama.cpp) that enables practical use of open models

2. Meta Llama: The Catalyst

Llama 3 and 3.1

Meta's Llama 3 family, released in 2024, established the gold standard for open-weight models. The release included 8B, 70B, and 405B parameter variants, all trained on over 15 trillion tokens of multilingual data. Llama 3's architecture builds on the standard dense Transformer with several refinements:

Grouped Query Attention (GQA): Reduces KV cache memory by sharing key-value heads across multiple query heads (8 KV heads for 70B, compared to the full 64 attention heads)
SwiGLU activation: Replaces ReLU in the feed-forward network with Swish-gated linear units, improving training stability and quality
RoPE (Rotary Position Embeddings): Enables efficient extrapolation to longer sequences than seen during training
128K vocabulary: A significantly larger tokenizer than GPT-2's 50K, improving tokenization efficiency especially for non-English languages and code

Llama 3.1 extended the context window to 128K tokens using a progressive training strategy: the model was initially trained at 8K context, then extended to 128K through continued pre-training with gradually increasing sequence lengths.

Llama 4: The MoE Leap

Llama 4 marked Meta's transition to Mixture of Experts architectures. The Llama 4 Scout variant uses 16 experts with 17B active parameters out of 109B total, while Llama 4 Maverick scales to 128 experts with 17B active out of 400B total. Both models are natively multimodal, processing text and images within a unified architecture. This shift to MoE allows Meta to scale total model capacity while keeping inference costs proportional to the active parameter count.

3. Mistral and Mixtral

Mistral AI has pursued a strategy of releasing both small, highly efficient models and larger MoE models:

Mistral 7B: Introduced sliding window attention (4096 token window) for efficient long-context handling, outperforming Llama 2 13B on most benchmarks despite being nearly half the size
Mixtral 8x7B: A sparse MoE model with 8 expert feed-forward networks per layer, routing each token to 2 of the 8 experts. Total parameters: 46.7B; active per token: approximately 12.9B. This architecture achieves Llama 2 70B quality at a fraction of the inference cost.
Mixtral 8x22B: Scales the MoE pattern to 22B-parameter experts, reaching 176B total parameters with approximately 39B active per token

⚡ Key Insight

Key Insight: MoE decouples knowledge from compute. DeepSeek V3 stores 671B parameters of knowledge but activates only 37B parameters per token. This means it has the capacity of a 671B-parameter dense model while running at roughly the speed and cost of a 37B-parameter model. MoE lets you scale what the model knows without proportionally scaling what it costs to run.

4. DeepSeek V3: Architecture Deep Dive

DeepSeek V3 is arguably the most architecturally innovative open-weight model released to date. At 671B total parameters with 37B active per token, it matches or exceeds many closed-source frontier models while introducing four key innovations: Multi-head Latent Attention (MLA), FP8 mixed-precision training, auxiliary-loss-free MoE load balancing, and multi-token prediction.

Figure 7.2: The four key architectural innovations in DeepSeek V3.

4.1 Multi-head Latent Attention (MLA)

The KV cache is the dominant memory bottleneck during autoregressive inference. In standard multi-head attention (MHA), we cache the full key and value tensors for every attention head at every layer. For a model with L layers, H heads, sequence length S, and head dimension D, the cache requires 2 × L × H × S × D elements.

MLA addresses this by introducing a low-rank latent bottleneck. Instead of caching the full key and value tensors, MLA compresses them into a much smaller latent vector:

c kv = W dkv \cdot x (compression: d c << n h \times d h)

At decode time, the full keys and values are reconstructed from this compressed representation:

K = W uk \cdot c kv, V = W uv \cdot c kv

The key insight is that we only need to cache the compact latent c_kv (typically 512 dimensions) rather than the full K and V tensors (which might total 16,384 dimensions across all heads). This achieves roughly a 93% reduction in KV cache size, dramatically increasing the batch sizes and sequence lengths that can fit in GPU memory.

⚡ Key Insight

Making the numbers concrete: Standard multi-head attention with 128 heads and d_head = 128 caches 128 x 128 = 16,384 values per layer per token. MLA compresses all of this into a single 512-dimensional latent vector. Compression ratio: 512 / 16,384 = 3.1%, meaning a 97% reduction in per-token KV cache storage.

# Simplified MLA implementation concept
import torch
import torch.nn as nn

class MultiHeadLatentAttention(nn.Module):
    """Simplified illustration of MLA from DeepSeek V3."""

    def __init__(self, d_model, n_heads, d_head, d_latent):
        super().__init__()
        self.n_heads = n_heads
        self.d_head = d_head
        self.d_latent = d_latent

        # Query projection (standard)
        self.W_q = nn.Linear(d_model, n_heads * d_head)

        # KV compression: project to low-rank latent
        self.W_dkv = nn.Linear(d_model, d_latent)  # compress

        # KV decompression: reconstruct K, V from latent
        self.W_uk = nn.Linear(d_latent, n_heads * d_head)  # K upsample
        self.W_uv = nn.Linear(d_latent, n_heads * d_head)  # V upsample

        self.W_o = nn.Linear(n_heads * d_head, d_model)

    def forward(self, x, cached_latents=None):
        B, S, D = x.shape

        # Queries: standard projection
        Q = self.W_q(x).view(B, S, self.n_heads, self.d_head)

        # Compress KV to latent space
        c_kv = self.W_dkv(x)  # (B, S, d_latent)

        # Cache only the latent, not full K, V!
        # Memory: S * d_latent  vs  S * 2 * n_heads * d_head
        if cached_latents is not None:
            c_kv = torch.cat([cached_latents, c_kv], dim=1)

        # Decompress to full K, V for attention computation
        K = self.W_uk(c_kv).view(B, -1, self.n_heads, self.d_head)
        V = self.W_uv(c_kv).view(B, -1, self.n_heads, self.d_head)

        # Standard scaled dot-product attention
        # ... (attention computation as usual)

        return output, c_kv  # return latent for caching

⚙ Key Insight

MLA vs. GQA vs. MQA: Grouped Query Attention (GQA, used by Llama 3) reduces cache by sharing KV heads across groups. Multi-Query Attention (MQA) shares a single KV head across all queries. MLA takes a fundamentally different approach: rather than reducing the number of KV heads, it compresses the entire KV representation into a learned low-rank latent space. This achieves greater compression (93% vs. GQA's typical 75%) while preserving more expressive power, since the decompression matrices can reconstruct richer per-head representations than simple head-sharing allows.

ⓘ Note

Looking ahead: We return to the memory optimization implications of GQA and MLA in Section 8.2, where we quantify the KV cache savings these architectures provide during inference.

4.2 FP8 Mixed-Precision Training

DeepSeek V3 was the first model to successfully train at 671B parameters using FP8 (8-bit floating point) precision. Previous large-scale training runs used BF16 or FP16 (16-bit) formats, which double the memory requirements for storing model weights, activations, and gradients.

FP8 comes in two variants:

E4M3: 4 exponent bits, 3 mantissa bits. Range: ±448, precision: ~0.1%. Used for forward pass computations.
E5M2: 5 exponent bits, 2 mantissa bits. Wider range (±57344) but lower precision. Used for gradient computation where dynamic range matters more.

The challenge with FP8 training is that the reduced precision can cause training instability, especially in operations with large dynamic ranges. DeepSeek solved this through fine-grained quantization: instead of applying a single scaling factor per tensor, they apply per-block scaling factors with a granularity of 1x128 tiles. Each small block of 128 elements gets its own scale factor, allowing different parts of a tensor to use different dynamic ranges.

# Conceptual illustration of fine-grained FP8 quantization
import torch

def quantize_fp8_fine_grained(tensor, block_size=128):
    """
    Fine-grained FP8 quantization as used in DeepSeek V3.
    Each block of 128 elements gets its own scale factor.
    """
    # Reshape into blocks
    original_shape = tensor.shape
    flat = tensor.reshape(-1)
    n_blocks = (flat.numel() + block_size - 1) // block_size

    # Pad if needed
    padded = torch.zeros(n_blocks * block_size, device=tensor.device)
    padded[:flat.numel()] = flat

    blocks = padded.reshape(n_blocks, block_size)

    # Per-block scaling: find max absolute value per block
    max_vals = blocks.abs().max(dim=1, keepdim=True).values
    max_vals = max_vals.clamp(min=1e-12)

    # E4M3 max representable value is 448
    fp8_max = 448.0
    scales = max_vals / fp8_max

    # Quantize each block with its own scale
    quantized = (blocks / scales).clamp(-fp8_max, fp8_max)

    # In practice, this is stored as int8 with scale factors
    return quantized, scales, original_shape

The result: DeepSeek V3 used approximately 40% less GPU memory during training compared to a BF16 baseline, with no measurable degradation in final model quality. This efficiency gain was essential for making the 671B parameter training run feasible on their hardware budget.

4.3 Auxiliary-Loss-Free MoE Load Balancing

In Mixture of Experts models, a gating network routes each token to a subset of experts. A persistent problem is load imbalance: without intervention, the gating network tends to concentrate tokens on a few "popular" experts while leaving others underutilized. This wastes compute capacity and degrades model quality.

The standard solution is an auxiliary loss that penalizes imbalanced routing. This loss term is added to the main language modeling loss and encourages uniform expert utilization. However, the auxiliary loss introduces a tension: optimizing for balanced routing can conflict with optimizing for language modeling quality. The auxiliary loss coefficient must be carefully tuned, and even with tuning, it subtly degrades the primary training objective.

DeepSeek V3 eliminates this conflict with a novel approach: learnable bias terms added to the gating scores:

gate(x) = softmax(W g \cdot x + b)

The bias terms b are not learned through gradient descent. Instead, they are adjusted dynamically based on observed load statistics: if an expert is overloaded, its bias is decreased; if underloaded, its bias is increased. This adjustment happens outside the gradient computation, meaning the language modeling loss is never contaminated by a balancing objective.

# Auxiliary-loss-free MoE load balancing concept
class AuxLossFreeMoE(nn.Module):
    def __init__(self, d_model, n_experts, top_k=2):
        super().__init__()
        self.n_experts = n_experts
        self.top_k = top_k
        self.gate = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList([
            FeedForward(d_model) for _ in range(n_experts)
        ])
        # Dynamic bias terms (NOT trained by gradient descent)
        self.register_buffer(
            'expert_bias',
            torch.zeros(n_experts)
        )
        self.bias_update_rate = 0.001

    def forward(self, x):
        # Compute gating scores with dynamic bias
        logits = self.gate(x) + self.expert_bias  # bias added here
        scores = torch.softmax(logits, dim=-1)

        # Select top-k experts per token
        top_scores, top_indices = scores.topk(self.top_k, dim=-1)
        top_scores = top_scores / top_scores.sum(dim=-1, keepdim=True)

        # Route tokens to experts and compute outputs
        output = self._route_and_compute(x, top_scores, top_indices)

        # Update bias based on load (outside gradient computation)
        with torch.no_grad():
            load = self._compute_load(top_indices)
            target_load = x.shape[0] * self.top_k / self.n_experts
            # Decrease bias for overloaded, increase for underloaded
            self.expert_bias -= self.bias_update_rate * (load - target_load)

        return output

4.4 Multi-Token Prediction (MTP)

Standard language model training uses a next-token prediction objective: given the context, predict the immediately next token. DeepSeek V3 augments this with multi-token prediction, where additional lightweight prediction heads simultaneously predict tokens at positions t+2, t+3, and so on.

The benefit is twofold. First, the multi-token objective provides richer training signal, since the hidden representations must encode information about multiple future tokens rather than just one. This produces more informative internal representations. Second, the additional prediction heads can be repurposed at inference time for speculative decoding, where the draft predictions from these heads are verified in parallel, potentially doubling generation speed.

5. Qwen 2.5: Alibaba's Contender

Alibaba's Qwen (Tongyi Qianwen) 2.5 series offers a comprehensive family spanning 0.5B to 72B parameters. The Qwen family is particularly notable for:

Strong multilingual performance: Competitive with Llama 3 on English while significantly outperforming it on Chinese, Japanese, Korean, and other Asian languages
Extended context: Qwen 2.5 supports up to 128K tokens with YaRN (Yet another RoPE extensioN) for efficient position interpolation
Specialized variants: Qwen-Coder (code generation), Qwen-Math (mathematical reasoning), and Qwen-VL (vision-language multimodal)
Permissive licensing: Apache 2.0 for most model sizes, enabling unrestricted commercial use

6. Microsoft Phi: Small but Capable

Microsoft's Phi series challenges the assumption that bigger is always better. The Phi models use knowledge distillation and curated high-quality training data to achieve performance that punches far above their parameter count:

Model	Parameters	Key Innovation	Performance Note
Phi-3 Mini	3.8B	Curated "textbook quality" data	Matches Llama 3 8B on some benchmarks
Phi-3 Small	7B	Data quality + distillation	Competitive with Mixtral 8x7B
Phi-3 Medium	14B	Balanced size/quality	Approaches GPT-4o mini capability
Phi-4	14B	Synthetic data from GPT-4	Strong reasoning, code, math

The Phi approach demonstrates that training data quality can partially compensate for model size. By training on carefully curated, information-dense data (including synthetic data generated by larger models), Phi models achieve a higher "knowledge per parameter" ratio than models trained on raw web crawls.

7. Google Gemma: Open Models from DeepMind

Google's Gemma family brings DeepMind's research into the open-weight ecosystem. Gemma 2 (2024) was released at 2B, 9B, and 27B parameter sizes, trained using techniques from the larger Gemini models. Gemma 3 (2025) expanded to multimodal capabilities, accepting both text and image inputs.

Key characteristics of the Gemma family:

Architecture: Decoder-only transformer with GQA, RoPE positional encoding, and GeGLU activation
Knowledge distillation: Smaller Gemma models benefit from distillation from larger Gemini models
Licensing: Gemma uses a permissive license (Gemma Terms of Use) that allows commercial use without usage-based restrictions, distinguishing it from Llama's 700M MAU threshold
Competitive positioning: Gemma 2 27B competes directly with Llama 3 8B and Mistral 7B, consistently ranking well on the Open LLM Leaderboard for its size class

8. Specialized Open Models

Code Models

CodeLlama and StarCoder2 are fine-tuned specifically for code generation. CodeLlama extends Llama with additional training on code-heavy data and supports infilling (generating code to fill a gap between existing code). StarCoder2, from the BigCode project, was trained on The Stack v2 with over 600 programming languages.

Vision-Language Models

LLaVA (Large Language and Vision Assistant) demonstrates the visual instruction tuning approach: connect a pre-trained vision encoder (CLIP) to a language model through a projection layer, then fine-tune on visual question-answering data. This modular approach has spawned many variants and remains a popular architecture for open multimodal models.

Speech Models

Whisper from OpenAI (released with open weights) provides robust speech recognition across 99 languages. Its encoder-decoder architecture processes mel spectrograms and generates text, with optional timestamp prediction for alignment.

8. The Hugging Face Ecosystem

No discussion of open models is complete without the Hugging Face ecosystem, which provides the infrastructure for discovering, downloading, and deploying models:

# Loading and running an open-weight model with Hugging Face
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Download and load Llama 3 8B (requires access approval)
model_name = "meta-llama/Meta-Llama-3-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,  # Half precision for memory
    device_map="auto",            # Automatic GPU placement
)

# Format a chat message
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain MoE in 3 sentences."}
]
input_ids = tokenizer.apply_chat_template(
    messages, return_tensors="pt"
).to(model.device)

# Generate a response
output = model.generate(
    input_ids,
    max_new_tokens=256,
    temperature=0.7,
    do_sample=True,
)
response = tokenizer.decode(
    output[0][input_ids.shape[1]:],
    skip_special_tokens=True
)
print(response)

The transformer architecture was introduced in the 2017 paper "Attention Is All You Need" by Vaswani et al. It replaced recurrent layers with self-attention mechanisms, enabling parallel processing of input sequences and dramatically improving training efficiency for sequence-to-sequence tasks.

The Hugging Face ecosystem includes:

Model Hub: Over 500,000 models with standardized APIs, model cards, and community discussions
Transformers library: Unified Python API for loading and running models from any major architecture
Datasets library: Standardized access to training and evaluation datasets
Spaces: Hosted applications for interactive model demos using Gradio or Streamlit
PEFT: Parameter-Efficient Fine-Tuning methods (LoRA, QLoRA) for adapting large models on consumer hardware

9. Lab: Running Models Locally

For local inference on consumer hardware, llama.cpp and its wrapper Ollama provide optimized C++ inference with quantized models:

# Using Ollama to run models locally
# Install: https://ollama.ai

# Pull and run Llama 3 8B (quantized to 4-bit, ~4.7GB)
# Terminal command:
# ollama pull llama3
# ollama run llama3

# Programmatic access via Python
import requests

def query_ollama(prompt, model="llama3"):
    response = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": model,
            "prompt": prompt,
            "stream": False,
        }
    )
    return response.json()["response"]

# Compare 8B local vs 70B via API
local_response = query_ollama(
    "What are the advantages of Mixture of Experts models?"
)
print("Local 8B response:")
print(local_response)

Local 8B response: Mixture of Experts (MoE) models offer several key advantages: 1. Computational efficiency: Only a subset of parameters (experts) are activated per token, so inference cost scales with active parameters rather than total parameters. 2. Scaling capacity: Total knowledge capacity can grow by adding more experts without proportionally increasing compute. 3. Specialization: Different experts can learn to handle different types of inputs, improving overall model quality.

Section 7.2 Quiz

1. What is the key difference between MLA (Multi-head Latent Attention) and GQA (Grouped Query Attention) in how they reduce KV cache size?

Reveal Answer

GQA reduces cache by sharing key-value heads across multiple query heads, so fewer distinct KV tensors need to be stored. MLA takes a fundamentally different approach: it compresses the entire KV representation into a learned low-rank latent vector, caching only this compact representation and reconstructing the full K and V at decode time. MLA achieves greater compression (approximately 93% vs. GQA's typical 75%) while preserving more expressive power.

2. Why does DeepSeek V3 use per-block scaling with 1x128 tile granularity for FP8 training instead of a single scale per tensor?

Reveal Answer

Different regions of a tensor can have very different value distributions and dynamic ranges. A single scale per tensor forces a compromise that may clip large outliers or lose precision on small values. Per-block scaling with 1x128 granularity allows each small block of 128 elements to have its own scale factor, accommodating local variations in dynamic range. This fine-grained approach enables FP8 precision to match BF16 quality at 671B parameters.

3. How does DeepSeek V3's auxiliary-loss-free MoE solve the tension between load balancing and language modeling quality?

Reveal Answer

Standard MoE models add an auxiliary loss term to the training objective that penalizes unbalanced expert utilization. This creates a tension because optimizing for balance can conflict with optimizing for language quality. DeepSeek V3 replaces the auxiliary loss with learnable bias terms added to gating scores. These biases are adjusted dynamically based on observed load statistics (outside gradient descent), so the language modeling loss is never contaminated by a balancing objective.

4. In Mixtral 8x7B, how many total parameters does the model have, and how many are active per token?

Reveal Answer

Mixtral 8x7B has approximately 46.7B total parameters. Each token is routed to 2 of the 8 experts, so approximately 12.9B parameters are active per token. This decoupling of total capacity from per-token compute is the fundamental economic advantage of MoE architectures.

5. What is the dual benefit of DeepSeek V3's multi-token prediction objective?

Reveal Answer

First, the multi-token objective provides richer training signal because hidden representations must encode information about multiple future tokens, producing more informative internal representations. Second, the additional prediction heads can be repurposed at inference time for speculative decoding, where draft predictions are verified in parallel to potentially double generation speed.

Key Takeaways

Open-weight models now approach frontier closed-source capability on many tasks. The gap has closed dramatically since 2023, driven by innovations in architecture, training data, and efficiency.
Mixture of Experts is the dominant scaling strategy for both open and closed models, decoupling total knowledge capacity from per-token inference cost.
DeepSeek V3's four innovations represent the state of the art in efficient large-scale training: MLA for KV cache compression (93% reduction), FP8 for memory-efficient training, auxiliary-loss-free MoE for clean optimization, and multi-token prediction for richer representations.
Data quality can partially compensate for model size, as demonstrated by the Phi series, which achieves strong performance at 3.8B-14B parameters through curated and synthetic training data.
The Hugging Face ecosystem provides the essential infrastructure (Model Hub, Transformers, Datasets, Spaces) that makes open-weight models practical for production use.
Local inference is now practical through quantization and optimized runtimes (llama.cpp, Ollama), enabling 8B-parameter models to run on consumer laptops.