Open weights are like a recipe without the grocery list: you can see every layer, read every parameter, and still spend six months figuring out how they got it to cook.
— A Grateful Tinkerer

The open-weight revolution. While closed-source models define the frontier, open-weight models have transformed the LLM ecosystem by making powerful models available for download, inspection, fine-tuning, and local deployment. Meta's Llama family ignited this movement, and organizations like DeepSeek, Mistral, Alibaba (Qwen), and Microsoft (Phi) have pushed it further with architectural innovations that rival or surpass closed-source alternatives. This section surveys the major open model families and takes a deep dive into the architectural innovations of DeepSeek V3, which introduced several techniques that fundamentally improve the efficiency of large-scale language modeling.
This section builds on the landmark models from Section 6.1 and the attention mechanism from Module 04. Understanding of MoE, GQA, and MLA is developed here; the memory implications of GQA are explored further in Section 8.2.
1. Open-Source vs. Open-Weight: A Distinction That Matters
Before surveying specific models, we must clarify terminology. Most models commonly called "open-source" are more precisely open-weight: the trained model parameters are publicly released, but the training code, training data, and data processing pipelines may remain proprietary. True open-source would include all of these components.
- Open-weight: Model weights available for download; may include restrictive licenses (Llama 3's community license, for example, restricts use by companies with over 700M monthly active users)
- Open-source: Weights, training code, data, and evaluation code all available under permissive licenses (examples include OLMo from AI2, Pythia from EleutherAI)
- Open ecosystem: The broader infrastructure of tools, libraries, and platforms (Hugging Face, vLLM, llama.cpp) that enables practical use of open models
2. Meta Llama: The Catalyst
Llama 3 and 3.1
Meta's Llama 3 family, released in 2024, established the gold standard for open-weight models. The release included 8B, 70B, and 405B parameter variants, all trained on over 15 trillion tokens of multilingual data. Llama 3's architecture builds on the standard dense Transformer with several refinements:
- Grouped Query Attention (GQA): Reduces KV cache memory by sharing key-value heads across multiple query heads (8 KV heads for 70B, compared to the full 64 attention heads)
- SwiGLU activation: Replaces ReLU in the feed-forward network with Swish-gated linear units, improving training stability and quality
- RoPE (Rotary Position Embeddings): Enables efficient extrapolation to longer sequences than seen during training
- 128K vocabulary: A significantly larger tokenizer than GPT-2's 50K, improving tokenization efficiency especially for non-English languages and code
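The KV-head sharing behind GQA can be sketched in a few lines. This is an illustrative toy, not Meta's implementation; the sizes (32 query heads, 8 KV heads, head dimension 128) mirror Llama 3 8B's configuration.

```python
# Toy GQA: 32 query heads attend over only 8 cached KV heads.
import torch

n_q_heads, n_kv_heads, d_head, seq = 32, 8, 128, 16
group_size = n_q_heads // n_kv_heads  # 4 query heads share each KV head

Q = torch.randn(1, n_q_heads, seq, d_head)
K = torch.randn(1, n_kv_heads, seq, d_head)  # only 8 heads are cached
V = torch.randn(1, n_kv_heads, seq, d_head)

# Expand each KV head to serve its group of query heads
K_shared = K.repeat_interleave(group_size, dim=1)  # (1, 32, seq, d_head)
V_shared = V.repeat_interleave(group_size, dim=1)

attn = torch.softmax(Q @ K_shared.transpose(-1, -2) / d_head**0.5, dim=-1)
out = attn @ V_shared  # (1, 32, seq, d_head), same shape as full MHA

# The KV cache holds n_kv_heads / n_q_heads = 8/32 = 25% of the MHA size
```

The attention output has the same shape as full multi-head attention; only the cached tensors shrink.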
Llama 3.1 extended the context window to 128K tokens using a progressive training strategy: the model was initially trained at 8K context, then extended to 128K through continued pre-training with gradually increasing sequence lengths.
Llama 4: The MoE Leap
Llama 4 marked Meta's transition to Mixture of Experts architectures. The Llama 4 Scout variant uses 16 experts with 17B active parameters out of 109B total, while Llama 4 Maverick scales to 128 experts with 17B active out of 400B total. Both models are natively multimodal, processing text and images within a unified architecture. This shift to MoE allows Meta to scale total model capacity while keeping inference costs proportional to the active parameter count.
3. Mistral and Mixtral
Mistral AI has pursued a strategy of releasing both small, highly efficient models and larger MoE models:
- Mistral 7B: Introduced sliding window attention (4096 token window) for efficient long-context handling, outperforming Llama 2 13B on most benchmarks despite being nearly half the size
- Mixtral 8x7B: A sparse MoE model with 8 expert feed-forward networks per layer, routing each token to 2 of the 8 experts. Total parameters: 46.7B; active per token: approximately 12.9B. This architecture achieves Llama 2 70B quality at a fraction of the inference cost.
- Mixtral 8x22B: Scales the MoE pattern to 22B-parameter experts, reaching 176B total parameters with approximately 39B active per token
Key Insight: MoE decouples knowledge from compute. DeepSeek V3 stores 671B parameters of knowledge but activates only 37B parameters per token. This means it has the capacity of a 671B-parameter dense model while running at roughly the speed and cost of a 37B-parameter model. MoE lets you scale what the model knows without proportionally scaling what it costs to run.
4. DeepSeek V3: Architecture Deep Dive
DeepSeek V3 is arguably the most architecturally innovative open-weight model released to date. At 671B total parameters with 37B active per token, it matches or exceeds many closed-source frontier models while introducing four key innovations: Multi-head Latent Attention (MLA), FP8 mixed-precision training, auxiliary-loss-free MoE load balancing, and multi-token prediction.
4.1 Multi-head Latent Attention (MLA)
The KV cache is the dominant memory bottleneck during autoregressive inference. In standard multi-head attention (MHA), we cache the full key and value tensors for every attention head at every layer. For a model with L layers, H heads, sequence length S, and head dimension D, the cache requires 2 × L × H × S × D elements.
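Plugging concrete numbers into the 2 × L × H × S × D formula shows why this matters. The configuration below is Llama 3 70B-like (80 layers, head dimension 128, 64 heads without GQA vs. 8 KV heads with it) and is meant as an illustration, not an exact memory profile.

```python
# KV cache size from 2 * L * H * S * D, in bytes at 16-bit precision
def kv_cache_bytes(n_layers, n_kv_heads, seq_len, d_head, bytes_per_elem=2):
    # Factor of 2 covers both keys and values
    return 2 * n_layers * n_kv_heads * seq_len * d_head * bytes_per_elem

# Full MHA (64 KV heads) vs GQA (8 KV heads) at 8K context:
mha = kv_cache_bytes(80, 64, 8192, 128)
gqa = kv_cache_bytes(80, 8, 8192, 128)
print(f"MHA: {mha / 2**30:.1f} GiB, GQA: {gqa / 2**30:.1f} GiB per sequence")
# MHA: 20.0 GiB, GQA: 2.5 GiB per sequence
```

A single 8K-context sequence under full MHA would consume 20 GiB of cache; GQA cuts that by 8x, and MLA goes further still.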
MLA addresses this by introducing a low-rank latent bottleneck. Instead of caching the full key and value tensors, MLA compresses them into a much smaller latent vector:

c_kv = W_dkv · x, where c_kv has dimension d_latent, far smaller than 2 × H × D

At decode time, the full keys and values are reconstructed from this compressed representation:

K = W_uk · c_kv, V = W_uv · c_kv
The key insight is that we only need to cache the compact latent c_kv (512 dimensions in DeepSeek V3) rather than the full K and V tensors (32,768 values per token across all heads). This achieves roughly a 93% reduction in KV cache size, dramatically increasing the batch sizes and sequence lengths that can fit in GPU memory.

Making the numbers concrete: standard multi-head attention with 128 heads and d_head = 128 caches 2 × 128 × 128 = 32,768 key and value elements per layer per token. MLA caches a single 512-dimensional latent instead: 512 / 32,768 ≈ 1.6% of the original size in this simplified view. (The production model also caches a small decoupled RoPE key per token, and the commonly reported figure is a roughly 93% reduction.)
```python
# Simplified MLA implementation concept
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadLatentAttention(nn.Module):
    """Simplified illustration of MLA from DeepSeek V3."""

    def __init__(self, d_model, n_heads, d_head, d_latent):
        super().__init__()
        self.n_heads = n_heads
        self.d_head = d_head
        self.d_latent = d_latent
        # Query projection (standard)
        self.W_q = nn.Linear(d_model, n_heads * d_head)
        # KV compression: project to low-rank latent
        self.W_dkv = nn.Linear(d_model, d_latent)  # compress
        # KV decompression: reconstruct K, V from latent
        self.W_uk = nn.Linear(d_latent, n_heads * d_head)  # K up-projection
        self.W_uv = nn.Linear(d_latent, n_heads * d_head)  # V up-projection
        self.W_o = nn.Linear(n_heads * d_head, d_model)

    def forward(self, x, cached_latents=None):
        B, S, _ = x.shape
        # Queries: standard projection
        Q = self.W_q(x).view(B, S, self.n_heads, self.d_head)
        # Compress KV to latent space
        c_kv = self.W_dkv(x)  # (B, S, d_latent)
        # Cache only the latent, not full K, V!
        # Memory per token: d_latent vs 2 * n_heads * d_head
        if cached_latents is not None:
            c_kv = torch.cat([cached_latents, c_kv], dim=1)
        # Decompress to full K, V for attention computation
        K = self.W_uk(c_kv).view(B, -1, self.n_heads, self.d_head)
        V = self.W_uv(c_kv).view(B, -1, self.n_heads, self.d_head)
        # Standard scaled dot-product attention
        # (causal masking omitted for brevity)
        attn = F.scaled_dot_product_attention(
            Q.transpose(1, 2), K.transpose(1, 2), V.transpose(1, 2)
        )
        out = attn.transpose(1, 2).reshape(B, S, self.n_heads * self.d_head)
        return self.W_o(out), c_kv  # return latent for caching
```
MLA vs. GQA vs. MQA: Grouped Query Attention (GQA, used by Llama 3) reduces cache by sharing KV heads across groups. Multi-Query Attention (MQA) shares a single KV head across all queries. MLA takes a fundamentally different approach: rather than reducing the number of KV heads, it compresses the entire KV representation into a learned low-rank latent space. This achieves greater compression (93% vs. GQA's typical 75%) while preserving more expressive power, since the decompression matrices can reconstruct richer per-head representations than simple head-sharing allows.
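The comparison becomes concrete with a 32-head configuration (head dimension 128, 8 KV heads for GQA, a 512-dimensional MLA latent; numbers are illustrative, ignoring MLA's extra RoPE key):

```python
# Per-token cached elements for one layer under each attention scheme
H, d = 32, 128
mha = 2 * H * d   # full K and V for every head
gqa = 2 * 8 * d   # 8 shared KV heads
mqa = 2 * 1 * d   # a single shared KV head
mla = 512         # only the compressed latent

for name, n in [("GQA", gqa), ("MQA", mqa), ("MLA", mla)]:
    print(f"{name}: {100 * (1 - n / mha):.1f}% smaller than MHA")
# GQA: 75.0% smaller than MHA
# MQA: 96.9% smaller than MHA
# MLA: 93.8% smaller than MHA
```

MQA compresses slightly harder than MLA in this setup, but at the cost of forcing every query head to read identical keys and values; MLA's decompression matrices recover distinct per-head representations from the shared latent.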
Looking ahead: We return to the memory optimization implications of GQA and MLA in Section 8.2, where we quantify the KV cache savings these architectures provide during inference.
4.2 FP8 Mixed-Precision Training
DeepSeek V3 was the first model to successfully train at 671B parameters using FP8 (8-bit floating point) precision. Previous large-scale training runs used BF16 or FP16 (16-bit) formats, which double the memory requirements for storing model weights, activations, and gradients.
FP8 comes in two variants:
- E4M3: 4 exponent bits, 3 mantissa bits. Range: ±448, with a relative step size of roughly 6-12%. Used for forward-pass computations.
- E5M2: 5 exponent bits, 2 mantissa bits. Wider range (±57,344) but coarser precision. Used for gradient computation, where dynamic range matters more.
The challenge with FP8 training is that the reduced precision can cause training instability, especially in operations with large dynamic ranges. DeepSeek solved this through fine-grained quantization: instead of applying a single scaling factor per tensor, they apply per-block scaling factors with a granularity of 1x128 tiles. Each small block of 128 elements gets its own scale factor, allowing different parts of a tensor to use different dynamic ranges.
```python
# Conceptual illustration of fine-grained FP8 quantization
import torch

def quantize_fp8_fine_grained(tensor, block_size=128):
    """
    Fine-grained FP8 quantization as used in DeepSeek V3.
    Each block of 128 elements gets its own scale factor.
    """
    original_shape = tensor.shape
    flat = tensor.reshape(-1)
    n_blocks = (flat.numel() + block_size - 1) // block_size
    # Pad to a whole number of blocks if needed
    padded = torch.zeros(
        n_blocks * block_size, device=tensor.device, dtype=tensor.dtype
    )
    padded[:flat.numel()] = flat
    blocks = padded.reshape(n_blocks, block_size)
    # Per-block scaling: find the max absolute value in each block
    max_vals = blocks.abs().max(dim=1, keepdim=True).values
    max_vals = max_vals.clamp(min=1e-12)
    # E4M3's maximum representable value is 448
    fp8_max = 448.0
    scales = max_vals / fp8_max
    # Quantize each block with its own scale
    quantized = (blocks / scales).clamp(-fp8_max, fp8_max)
    # In practice, `quantized` is then cast to an FP8 dtype
    # (e.g. torch.float8_e4m3fn) and stored alongside `scales`
    return quantized, scales, original_shape
```
The result: DeepSeek V3 used approximately 40% less GPU memory during training compared to a BF16 baseline, with no measurable degradation in final model quality. This efficiency gain was essential for making the 671B parameter training run feasible on their hardware budget.
4.3 Auxiliary-Loss-Free MoE Load Balancing
In Mixture of Experts models, a gating network routes each token to a subset of experts. A persistent problem is load imbalance: without intervention, the gating network tends to concentrate tokens on a few "popular" experts while leaving others underutilized. This wastes compute capacity and degrades model quality.
The standard solution is an auxiliary loss that penalizes imbalanced routing. This loss term is added to the main language modeling loss and encourages uniform expert utilization. However, the auxiliary loss introduces a tension: optimizing for balanced routing can conflict with optimizing for language modeling quality. The auxiliary loss coefficient must be carefully tuned, and even with tuning, it subtly degrades the primary training objective.
DeepSeek V3 eliminates this conflict with a novel approach: learnable bias terms added to the gating scores used for expert selection:

g_i = s_i + b_i

where s_i is the gate's affinity score for expert i and b_i is a per-expert bias. Crucially, the bias influences only which experts are selected; the mixing weights applied to expert outputs still come from the unbiased scores.
The bias terms b are not learned through gradient descent. Instead, they are adjusted dynamically based on observed load statistics: if an expert is overloaded, its bias is decreased; if underloaded, its bias is increased. This adjustment happens outside the gradient computation, meaning the language modeling loss is never contaminated by a balancing objective.
```python
# Auxiliary-loss-free MoE load balancing concept
import torch
import torch.nn as nn

class FeedForward(nn.Module):
    """Minimal expert MLP, defined here so the example is self-contained."""
    def __init__(self, d_model):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x):
        return self.net(x)

class AuxLossFreeMoE(nn.Module):
    def __init__(self, d_model, n_experts, top_k=2):
        super().__init__()
        self.n_experts = n_experts
        self.top_k = top_k
        self.gate = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            FeedForward(d_model) for _ in range(n_experts)
        )
        # Dynamic bias terms (NOT trained by gradient descent)
        self.register_buffer('expert_bias', torch.zeros(n_experts))
        self.bias_update_rate = 0.001

    def forward(self, x):  # x: (n_tokens, d_model)
        scores = torch.softmax(self.gate(x), dim=-1)
        # The bias decides WHICH experts are selected...
        _, top_indices = (scores + self.expert_bias).topk(self.top_k, dim=-1)
        # ...but the mixing weights use the unbiased scores, so the
        # bias never directly alters the model's output
        top_scores = scores.gather(-1, top_indices)
        top_scores = top_scores / top_scores.sum(dim=-1, keepdim=True)
        output = self._route_and_compute(x, top_scores, top_indices)
        # Update bias based on load (outside gradient computation)
        with torch.no_grad():
            load = torch.bincount(
                top_indices.reshape(-1), minlength=self.n_experts
            ).float()
            target_load = x.shape[0] * self.top_k / self.n_experts
            # Decrease bias for overloaded experts, increase for underloaded
            self.expert_bias -= self.bias_update_rate * (load - target_load)
        return output

    def _route_and_compute(self, x, top_scores, top_indices):
        output = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            token_idx, slot = (top_indices == e).nonzero(as_tuple=True)
            if token_idx.numel():
                weight = top_scores[token_idx, slot].unsqueeze(-1)
                output[token_idx] += weight * expert(x[token_idx])
        return output
```
4.4 Multi-Token Prediction (MTP)
Standard language model training uses a next-token prediction objective: given the context, predict the immediately next token. DeepSeek V3 augments this with multi-token prediction, where additional lightweight prediction heads simultaneously predict tokens at positions t+2, t+3, and so on.
The benefit is twofold. First, the multi-token objective provides richer training signal, since the hidden representations must encode information about multiple future tokens rather than just one. This produces more informative internal representations. Second, the additional prediction heads can be repurposed at inference time for speculative decoding, where the draft predictions from these heads are verified in parallel, potentially doubling generation speed.
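A minimal sketch of the objective, not DeepSeek V3's actual MTP module (which uses transformer blocks rather than this stand-in trunk): a shared representation feeds the main next-token head plus a lightweight auxiliary head that predicts the token two positions ahead, and the two losses are combined.

```python
# Toy multi-token prediction: one trunk, two prediction depths
import torch
import torch.nn as nn
import torch.nn.functional as F

d_model, vocab = 64, 100
trunk = nn.GRU(d_model, d_model, batch_first=True)  # stand-in for the LM trunk
head_next = nn.Linear(d_model, vocab)               # predicts token t+1
head_skip = nn.Linear(d_model, vocab)               # extra head: token t+2

x = torch.randn(2, 10, d_model)             # (batch, seq, d_model)
targets = torch.randint(0, vocab, (2, 10))  # token ids

h, _ = trunk(x)
# Main loss: hidden state at position t predicts token t+1
loss_next = F.cross_entropy(
    head_next(h[:, :-1]).reshape(-1, vocab), targets[:, 1:].reshape(-1)
)
# Auxiliary loss: the same hidden state also predicts token t+2,
# forcing it to encode information about more of the future
loss_skip = F.cross_entropy(
    head_skip(h[:, :-2]).reshape(-1, vocab), targets[:, 2:].reshape(-1)
)
loss = loss_next + 0.3 * loss_skip  # 0.3 is an arbitrary illustration weight
```

At inference time, a head like `head_skip` can serve as the draft model for speculative decoding: its t+2 guess is checked against the main head's output on the next step.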
5. Qwen 2.5: Alibaba's Contender
Alibaba's Qwen (Tongyi Qianwen) 2.5 series offers a comprehensive family spanning 0.5B to 72B parameters. The Qwen family is particularly notable for:
- Strong multilingual performance: Competitive with Llama 3 on English while significantly outperforming it on Chinese, Japanese, Korean, and other Asian languages
- Extended context: Qwen 2.5 supports up to 128K tokens with YaRN (Yet another RoPE extensioN) for efficient position interpolation
- Specialized variants: Qwen-Coder (code generation), Qwen-Math (mathematical reasoning), and Qwen-VL (vision-language multimodal)
- Permissive licensing: Apache 2.0 for most model sizes, enabling unrestricted commercial use
6. Microsoft Phi: Small but Capable
Microsoft's Phi series challenges the assumption that bigger is always better. The Phi models use knowledge distillation and curated high-quality training data to achieve performance that punches far above their parameter count:
| Model | Parameters | Key Innovation | Performance Note |
|---|---|---|---|
| Phi-3 Mini | 3.8B | Curated "textbook quality" data | Matches Llama 3 8B on some benchmarks |
| Phi-3 Small | 7B | Data quality + distillation | Competitive with Mixtral 8x7B |
| Phi-3 Medium | 14B | Balanced size/quality | Approaches GPT-4o mini capability |
| Phi-4 | 14B | Synthetic data from GPT-4 | Strong reasoning, code, math |
The Phi approach demonstrates that training data quality can partially compensate for model size. By training on carefully curated, information-dense data (including synthetic data generated by larger models), Phi models achieve a higher "knowledge per parameter" ratio than models trained on raw web crawls.
7. Google Gemma: Open Models from DeepMind
Google's Gemma family brings DeepMind's research into the open-weight ecosystem. Gemma 2 (2024) was released at 2B, 9B, and 27B parameter sizes, trained using techniques from the larger Gemini models. Gemma 3 (2025) expanded to multimodal capabilities, accepting both text and image inputs.
Key characteristics of the Gemma family:
- Architecture: Decoder-only transformer with GQA, RoPE positional encoding, and GeGLU activation
- Knowledge distillation: Smaller Gemma models benefit from distillation from larger Gemini models
- Licensing: Gemma uses a permissive license (Gemma Terms of Use) that allows commercial use without usage-based restrictions, distinguishing it from Llama's 700M MAU threshold
- Competitive positioning: Gemma 2 9B competes directly with Llama 3 8B and Mistral 7B, while the 27B model challenges models well above its size class, consistently ranking well on the Open LLM Leaderboard
8. Specialized Open Models
Code Models
CodeLlama and StarCoder2 are fine-tuned specifically for code generation. CodeLlama extends Llama with additional training on code-heavy data and supports infilling (generating code to fill a gap between existing code). StarCoder2, from the BigCode project, was trained on The Stack v2 with over 600 programming languages.
Vision-Language Models
LLaVA (Large Language and Vision Assistant) demonstrates the visual instruction tuning approach: connect a pre-trained vision encoder (CLIP) to a language model through a projection layer, then fine-tune on visual question-answering data. This modular approach has spawned many variants and remains a popular architecture for open multimodal models.
Speech Models
Whisper from OpenAI (released with open weights) provides robust speech recognition across 99 languages. Its encoder-decoder architecture processes mel spectrograms and generates text, with optional timestamp prediction for alignment.
9. The Hugging Face Ecosystem
No discussion of open models is complete without the Hugging Face ecosystem, which provides the infrastructure for discovering, downloading, and deploying models:
```python
# Loading and running an open-weight model with Hugging Face
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Download and load Llama 3 8B (requires access approval)
model_name = "meta-llama/Meta-Llama-3-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,  # half precision for memory
    device_map="auto",           # automatic GPU placement
)

# Format a chat message
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain MoE in 3 sentences."},
]
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,  # append the assistant turn header
    return_tensors="pt",
).to(model.device)

# Generate a response
output = model.generate(
    input_ids,
    max_new_tokens=256,
    temperature=0.7,
    do_sample=True,
)
response = tokenizer.decode(
    output[0][input_ids.shape[1]:],  # decode only the new tokens
    skip_special_tokens=True,
)
print(response)
```
The Hugging Face ecosystem includes:
- Model Hub: Over 500,000 models with standardized APIs, model cards, and community discussions
- Transformers library: Unified Python API for loading and running models from any major architecture
- Datasets library: Standardized access to training and evaluation datasets
- Spaces: Hosted applications for interactive model demos using Gradio or Streamlit
- PEFT: Parameter-Efficient Fine-Tuning methods (LoRA, QLoRA) for adapting large models on consumer hardware
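The idea behind LoRA, the method at the heart of PEFT, fits in a short sketch (a generic illustration, not PEFT's actual API): freeze the pre-trained weight W and learn only a rank-r update BA, so the number of trainable parameters drops from d_in × d_out to r × (d_in + d_out).

```python
# Minimal LoRA-style adapter around a frozen linear layer
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # freeze the pre-trained weight
        # Low-rank factors: B starts at zero so training begins
        # from the unmodified base layer
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(4096, 4096), r=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable: {trainable:,} of {total:,}")
# trainable: 65,536 of 16,846,848
```

Training under 0.4% of the layer's parameters is what makes fine-tuning 7B-70B models feasible on consumer GPUs, especially when combined with 4-bit base weights as in QLoRA.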
10. Lab: Running Models Locally
For local inference on consumer hardware, llama.cpp and its wrapper Ollama provide optimized C++ inference with quantized models:
```python
# Using Ollama to run models locally
# Install: https://ollama.ai
#
# Pull and run Llama 3 8B (quantized to 4-bit, ~4.7GB) from a terminal:
#   ollama pull llama3
#   ollama run llama3

# Programmatic access via Python
import requests

def query_ollama(prompt, model="llama3"):
    response = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": model,
            "prompt": prompt,
            "stream": False,
        },
    )
    response.raise_for_status()
    return response.json()["response"]

# Query the local quantized 8B model
local_response = query_ollama(
    "What are the advantages of Mixture of Experts models?"
)
print("Local 8B response:")
print(local_response)
```
Section 7.2 Quiz
Key Takeaways
- Open-weight models now approach frontier closed-source capability on many tasks. The gap has closed dramatically since 2023, driven by innovations in architecture, training data, and efficiency.
- Mixture of Experts is the dominant scaling strategy for both open and closed models, decoupling total knowledge capacity from per-token inference cost.
- DeepSeek V3's four innovations represent the state of the art in efficient large-scale training: MLA for KV cache compression (93% reduction), FP8 for memory-efficient training, auxiliary-loss-free MoE for clean optimization, and multi-token prediction for richer representations.
- Data quality can partially compensate for model size, as demonstrated by the Phi series, which achieves strong performance at 3.8B-14B parameters through curated and synthetic training data.
- The Hugging Face ecosystem provides the essential infrastructure (Model Hub, Transformers, Datasets, Spaces) that makes open-weight models practical for production use.
- Local inference is now practical through quantization and optimized runtimes (llama.cpp, Ollama), enabling 8B-parameter models to run on consumer laptops.