This section is a coding lab. By the end you will have a working character-level language model built on a decoder-only Transformer. Every line of code is explained. We encourage you to type the code yourself rather than copy-pasting; the act of typing builds muscle memory for these patterns.
1. What We Are Building
We will implement a decoder-only Transformer (the GPT architecture) that performs character-level language modeling. Given a sequence of characters, the model predicts the next character at every position. We choose character-level modeling because it eliminates the need for a tokenizer, letting us focus entirely on the architecture.
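As a quick illustration of why character-level modeling needs no tokenizer, here is a standalone sketch on a toy string (the dataset class in Section 3 does exactly this at scale): the vocabulary is just the set of unique characters, and encoding/decoding are dictionary lookups.

```python
# Minimal character-level "tokenizer": the vocabulary is the set of
# unique characters, so no external tokenizer is needed.
text = "hello world"
chars = sorted(set(text))                      # unique characters, sorted for determinism
stoi = {ch: i for i, ch in enumerate(chars)}   # char -> integer id
itos = {i: ch for ch, i in stoi.items()}       # integer id -> char

ids = [stoi[c] for c in text]                  # encode
decoded = ''.join(itos[i] for i in ids)        # decode round-trips exactly
```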
Our model will have these hyperparameters:
| Hyperparameter | Value | Notes |
|---|---|---|
| d_model | 128 | Embedding and residual stream dimension |
| n_heads | 4 | Number of attention heads (d_k = 32) |
| n_layers | 4 | Number of Transformer blocks |
| d_ff | 512 | Feed-forward inner dimension (4 × d_model) |
| block_size | 128 | Maximum context length |
| vocab_size | ~65 | Unique characters in the dataset |
| dropout | 0.1 | Dropout rate |
This is a small model (~0.8M parameters) that trains in a few minutes on a single GPU (or even on CPU for a few epochs). The architecture is identical to GPT; only the scale differs.
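The parameter count can be estimated directly from the table. Here is a back-of-the-envelope sketch (assuming `bias=False` in Linear layers, tied embedding/output weights, and standard LayerNorms with weight and bias, as in the code below; the exact total depends on these choices):

```python
d_model, n_heads, n_layers, d_ff = 128, 4, 4, 512
block_size, vocab_size = 128, 65

attn = d_model * 3 * d_model + d_model * d_model  # qkv_proj + out_proj (no bias)
ffn = d_model * d_ff + d_ff * d_model             # fc1 + fc2 (no bias)
lns = 2 * 2 * d_model                             # two LayerNorms, weight + bias each
per_block = attn + ffn + lns

emb = vocab_size * d_model   # token embedding (tied with lm_head, counted once)
pos = block_size * d_model   # positional embedding
final_ln = 2 * d_model

total = n_layers * per_block + emb + pos + final_ln
print(f"{total:,}")  # on the order of 0.8M parameters
```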
2. The Complete Implementation
Below is the full model in a single file. We break it into logical pieces and explain each one. The complete code (all pieces assembled) is approximately 300 lines including comments.
2.1 Imports and Configuration
"""
mini_transformer.py
A minimal decoder-only Transformer for character-level language modeling.
~300 lines of annotated PyTorch.
"""
import math
import torch
import torch.nn as nn
import torch.nn.functional as F
from dataclasses import dataclass
@dataclass
class TransformerConfig:
"""All hyperparameters in one place."""
vocab_size: int = 65 # number of unique characters
block_size: int = 128 # maximum context length
n_layers: int = 4 # number of Transformer blocks
n_heads: int = 4 # number of attention heads
d_model: int = 128 # embedding / residual stream dimension
d_ff: int = 512 # feed-forward inner dimension
dropout: float = 0.1 # dropout probability
bias: bool = False # use bias in Linear layers?
We use a dataclass so that every hyperparameter is explicit, documented, and easy to
modify. Setting bias=False follows the LLaMA convention and marginally reduces
parameter count.
2.2 Causal Self-Attention
```python
class CausalSelfAttention(nn.Module):
    """Multi-head causal (masked) self-attention."""

    def __init__(self, config: TransformerConfig):
        super().__init__()
        assert config.d_model % config.n_heads == 0
        # Key, Query, Value projections combined into one matrix
        self.qkv_proj = nn.Linear(config.d_model, 3 * config.d_model, bias=config.bias)
        # Output projection
        self.out_proj = nn.Linear(config.d_model, config.d_model, bias=config.bias)
        self.attn_dropout = nn.Dropout(config.dropout)
        self.resid_dropout = nn.Dropout(config.dropout)
        self.n_heads = config.n_heads
        self.d_model = config.d_model
        self.d_k = config.d_model // config.n_heads
        # Causal mask: lower-triangular matrix (1 = may attend, 0 = masked)
        # Register as buffer so it moves to GPU with the model
        mask = torch.tril(torch.ones(config.block_size, config.block_size))
        self.register_buffer("mask", mask.view(1, 1, config.block_size, config.block_size))

    def forward(self, x):
        B, T, C = x.shape  # batch, sequence length, d_model
        # Compute Q, K, V in one matrix multiply, then split
        qkv = self.qkv_proj(x)
        q, k, v = qkv.split(self.d_model, dim=2)
        # Reshape for multi-head: (B, T, C) -> (B, n_heads, T, d_k)
        q = q.view(B, T, self.n_heads, self.d_k).transpose(1, 2)
        k = k.view(B, T, self.n_heads, self.d_k).transpose(1, 2)
        v = v.view(B, T, self.n_heads, self.d_k).transpose(1, 2)
        # Scaled dot-product attention
        # (B, n_heads, T, d_k) @ (B, n_heads, d_k, T) -> (B, n_heads, T, T)
        scores = (q @ k.transpose(-2, -1)) * (self.d_k ** -0.5)
        # Apply causal mask: positions beyond current token get -inf
        scores = scores.masked_fill(self.mask[:, :, :T, :T] == 0, float('-inf'))
        attn_weights = F.softmax(scores, dim=-1)
        attn_weights = self.attn_dropout(attn_weights)
        # Weighted sum of values
        # (B, n_heads, T, T) @ (B, n_heads, T, d_k) -> (B, n_heads, T, d_k)
        out = attn_weights @ v
        # Concatenate heads: (B, n_heads, T, d_k) -> (B, T, C)
        out = out.transpose(1, 2).contiguous().view(B, T, C)
        # Final linear projection + dropout
        return self.resid_dropout(self.out_proj(out))
```
We compute Q, K, and V with a single linear layer (qkv_proj) of size
d_model → 3 * d_model and then split the output into three equal parts.
This is mathematically identical to three separate linear layers but is more efficient because
it performs one large matrix multiply instead of three smaller ones. The GPU utilizes its
parallelism more effectively with larger matrices.
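This equivalence is easy to verify numerically. The following standalone sketch (not part of the lab file) shows that splitting the fused projection's output matches applying the three weight slices separately:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model = 8
x = torch.randn(2, 5, d_model)  # (batch, seq, d_model)

# One fused projection producing Q, K, V together
qkv_proj = nn.Linear(d_model, 3 * d_model, bias=False)
q, k, v = qkv_proj(x).split(d_model, dim=2)

# Three "separate" projections, built from slices of the fused weight.
# nn.Linear computes x @ W.T, so slicing W along dim 0 slices the output.
wq, wk, wv = qkv_proj.weight.split(d_model, dim=0)
q2, k2, v2 = x @ wq.T, x @ wk.T, x @ wv.T

# The outputs match up to floating-point tolerance
assert torch.allclose(q, q2, atol=1e-6)
assert torch.allclose(k, k2, atol=1e-6)
assert torch.allclose(v, v2, atol=1e-6)
```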
2.3 Feed-Forward Network
```python
class FeedForward(nn.Module):
    """Position-wise feed-forward network with ReLU activation."""

    def __init__(self, config: TransformerConfig):
        super().__init__()
        self.fc1 = nn.Linear(config.d_model, config.d_ff, bias=config.bias)
        self.fc2 = nn.Linear(config.d_ff, config.d_model, bias=config.bias)
        self.dropout = nn.Dropout(config.dropout)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = self.dropout(x)
        x = self.fc2(x)
        x = self.dropout(x)
        return x
```
This is the simplest version. For a more advanced variant, you can swap in SwiGLU:
```python
class SwiGLUFeedForward(nn.Module):
    """SwiGLU feed-forward (used in LLaMA, PaLM)."""

    def __init__(self, config: TransformerConfig):
        super().__init__()
        # SwiGLU uses 3 weight matrices instead of 2.
        # To keep param count comparable, the hidden dim is often 2/3 of d_ff.
        hidden = int(2 * config.d_ff / 3)
        self.w1 = nn.Linear(config.d_model, hidden, bias=config.bias)
        self.w2 = nn.Linear(hidden, config.d_model, bias=config.bias)
        self.w3 = nn.Linear(config.d_model, hidden, bias=config.bias)  # gate
        self.dropout = nn.Dropout(config.dropout)

    def forward(self, x):
        # SiLU(x @ W1) * (x @ W3), then project back
        return self.dropout(self.w2(F.silu(self.w1(x)) * self.w3(x)))
```
2.4 Transformer Block
```python
class TransformerBlock(nn.Module):
    """A single Transformer block with Pre-LN ordering."""

    def __init__(self, config: TransformerConfig):
        super().__init__()
        self.ln1 = nn.LayerNorm(config.d_model)
        self.attn = CausalSelfAttention(config)
        self.ln2 = nn.LayerNorm(config.d_model)
        self.ffn = FeedForward(config)

    def forward(self, x):
        # Pre-LN: normalize before each sub-layer
        x = x + self.attn(self.ln1(x))  # residual + attention
        x = x + self.ffn(self.ln2(x))   # residual + FFN
        return x
```
This is remarkably simple: two lines of actual computation, each following the pattern `x = x + SubLayer(LayerNorm(x))`. The residual connection is the `x +` at the beginning; the Pre-LN ordering means we normalize the input to each sub-layer, not the output.
2.5 The Complete Model
```python
class MiniTransformer(nn.Module):
    """Decoder-only Transformer for character-level language modeling."""

    def __init__(self, config: TransformerConfig):
        super().__init__()
        self.config = config
        # Token and position embeddings
        self.token_emb = nn.Embedding(config.vocab_size, config.d_model)
        self.pos_emb = nn.Embedding(config.block_size, config.d_model)
        self.drop = nn.Dropout(config.dropout)
        # Stack of Transformer blocks
        self.blocks = nn.ModuleList([
            TransformerBlock(config) for _ in range(config.n_layers)
        ])
        # Final layer norm (needed with Pre-LN)
        self.ln_final = nn.LayerNorm(config.d_model)
        # Output head: project from d_model to vocab_size
        self.lm_head = nn.Linear(config.d_model, config.vocab_size, bias=False)
        # Weight tying: share embedding and output weights
        self.token_emb.weight = self.lm_head.weight
        # Initialize weights
        self.apply(self._init_weights)
        # Scale residual projections
        for block in self.blocks:
            nn.init.normal_(
                block.attn.out_proj.weight,
                mean=0.0,
                std=0.02 / math.sqrt(2 * config.n_layers)
            )
            nn.init.normal_(
                block.ffn.fc2.weight,
                mean=0.0,
                std=0.02 / math.sqrt(2 * config.n_layers)
            )

    def _init_weights(self, module):
        if isinstance(module, nn.Linear):
            nn.init.normal_(module.weight, mean=0.0, std=0.02)
            if module.bias is not None:
                nn.init.zeros_(module.bias)
        elif isinstance(module, nn.Embedding):
            nn.init.normal_(module.weight, mean=0.0, std=0.02)
        elif isinstance(module, nn.LayerNorm):
            nn.init.ones_(module.weight)
            nn.init.zeros_(module.bias)

    def forward(self, idx, targets=None):
        """
        Args:
            idx: (B, T) tensor of token indices
            targets: (B, T) tensor of target token indices (optional)
        Returns:
            logits: (B, T, vocab_size)
            loss: scalar cross-entropy loss (only if targets provided)
        """
        B, T = idx.shape
        assert T <= self.config.block_size, \
            f"Sequence length {T} exceeds block_size {self.config.block_size}"
        # Token embeddings + positional embeddings
        positions = torch.arange(0, T, device=idx.device)  # (T,)
        x = self.token_emb(idx) + self.pos_emb(positions)  # (B, T, d_model)
        x = self.drop(x)
        # Pass through all Transformer blocks
        for block in self.blocks:
            x = block(x)
        # Final normalization
        x = self.ln_final(x)
        # Project to vocabulary
        logits = self.lm_head(x)  # (B, T, vocab_size)
        # Compute loss if targets are provided
        loss = None
        if targets is not None:
            loss = F.cross_entropy(
                logits.view(-1, logits.size(-1)),
                targets.view(-1)
            )
        return logits, loss

    @torch.no_grad()
    def generate(self, idx, max_new_tokens, temperature=1.0, top_k=None):
        """
        Auto-regressive generation.
        Args:
            idx: (B, T) conditioning sequence
            max_new_tokens: number of tokens to generate
            temperature: softmax temperature (lower = more deterministic)
            top_k: if set, only sample from top-k most likely tokens
        """
        for _ in range(max_new_tokens):
            # Crop context to block_size if needed
            idx_cond = idx[:, -self.config.block_size:]
            # Forward pass
            logits, _ = self(idx_cond)
            # Take logits at the last position and apply temperature
            logits = logits[:, -1, :] / temperature
            # Optional top-k filtering
            if top_k is not None:
                v, _ = torch.topk(logits, min(top_k, logits.size(-1)))
                logits[logits < v[:, [-1]]] = float('-inf')
            # Sample from the distribution
            probs = F.softmax(logits, dim=-1)
            next_token = torch.multinomial(probs, num_samples=1)
            # Append to sequence
            idx = torch.cat([idx, next_token], dim=1)
        return idx
```
The line self.token_emb.weight = self.lm_head.weight shares the embedding matrix
with the output projection. This is standard practice in language models. It means the model
uses the same representation for "what does this token mean?" (embedding) and "what token should
come next?" (output logits). This reduces parameter count by
vocab_size × d_model and provides a regularization effect.
Press and Wolf showed that tying the input embedding and output projection weights is not just a memory optimization; it acts as a regularizer that improves perplexity. The intuition: by forcing the model to use a single vector space for both input and output, it learns embeddings where tokens that should be predicted in similar contexts also have similar input representations. For a 50K vocabulary with d=512, weight tying saves 25 million parameters. Nearly all modern language models (GPT-2, GPT-3, LLaMA, Mistral) use this technique.
Press, O. & Wolf, L. (2017). "Using the Output Embedding to Improve Language Models." EACL 2017.
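A small standalone sketch makes the mechanics concrete: after tying, both modules reference a single Parameter object, so the shared matrix is counted once.

```python
import torch.nn as nn

vocab_size, d_model = 65, 128
emb = nn.Embedding(vocab_size, d_model)
head = nn.Linear(d_model, vocab_size, bias=False)

# Untied: two independent (vocab_size x d_model) matrices
untied = emb.weight.numel() + head.weight.numel()

# Tie them: both modules now point at the same Parameter object
# (shapes match: Embedding stores (vocab, d), Linear stores (out, in))
emb.weight = head.weight
assert emb.weight is head.weight

# Counting unique parameter tensors now finds only one matrix
unique = {id(p): p for p in [emb.weight, head.weight]}
tied = sum(p.numel() for p in unique.values())
```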
3. Data Preparation
For training, we use a simple character-level dataset. Any plain text file will work. We will use a small text corpus (a few hundred KB) for quick experimentation.
```python
class CharDataset:
    """Character-level dataset that produces (input, target) pairs."""

    def __init__(self, text, block_size):
        self.block_size = block_size
        # Build character vocabulary
        chars = sorted(set(text))
        self.vocab_size = len(chars)
        self.stoi = {ch: i for i, ch in enumerate(chars)}
        self.itos = {i: ch for ch, i in self.stoi.items()}
        # Encode entire text as integers
        self.data = torch.tensor([self.stoi[c] for c in text], dtype=torch.long)

    def __len__(self):
        return len(self.data) - self.block_size

    def __getitem__(self, idx):
        chunk = self.data[idx : idx + self.block_size + 1]
        x = chunk[:-1]  # input: characters 0..block_size-1
        y = chunk[1:]   # target: characters 1..block_size
        return x, y

    def decode(self, indices):
        """Convert list of integer indices back to string."""
        return ''.join(self.itos[i] for i in indices)

    def encode(self, text):
        """Convert string to list of integer indices."""
        return [self.stoi[c] for c in text]
```
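To see what the slicing in `__getitem__` produces, here is the same logic applied to a toy string (a standalone sketch, not using the class): the target sequence is the input shifted left by one character.

```python
import torch

text = "to be or not to be"
chars = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(chars)}
data = torch.tensor([stoi[c] for c in text], dtype=torch.long)

block_size = 8
chunk = data[0 : block_size + 1]   # block_size + 1 characters
x, y = chunk[:-1], chunk[1:]

# The target at position t is the input at position t+1
assert torch.equal(x[1:], y[:-1])

# Number of (x, y) examples available from this text
n_examples = len(data) - block_size
```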
4. The Training Loop
Next-token prediction is classification. At each position in the sequence, the model
performs a V-way classification over the entire vocabulary, where V is the vocabulary size.
The cross-entropy loss from Section 0.1 applies directly here: we compare the model's
predicted probability distribution over all possible next tokens against the one-hot target
(the actual next token in the training data). This is why the code below uses
F.cross_entropy to compute the loss, treating every position as an independent
classification problem.
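The following sketch (standalone, with random logits) confirms that `F.cross_entropy` on flattened `(B*T, V)` logits is exactly the mean of the per-position negative log-probabilities of the correct tokens:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
B, T, V = 2, 4, 65
logits = torch.randn(B, T, V)
targets = torch.randint(0, V, (B, T))

# Library call on flattened logits, as in the model's forward()
loss = F.cross_entropy(logits.view(-1, V), targets.view(-1))

# Manual version: mean of -log p(correct token) over all B*T positions
log_probs = F.log_softmax(logits, dim=-1)            # (B, T, V)
picked = log_probs.gather(-1, targets.unsqueeze(-1)) # (B, T, 1)
manual = -picked.mean()

assert torch.allclose(loss, manual, atol=1e-6)
```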
```python
import time
from torch.utils.data import DataLoader


def train(config=None, text_path='input.txt', max_steps=5000,
          batch_size=64, learning_rate=3e-4, eval_interval=500):
    """Complete training procedure."""
    # ---- Setup ----
    device = 'cuda' if torch.cuda.is_available() else 'cpu'
    print(f"Using device: {device}")

    # Load text data
    with open(text_path, 'r', encoding='utf-8') as f:
        text = f.read()

    # Create dataset
    dataset = CharDataset(text, block_size=128)
    print(f"Vocabulary size: {dataset.vocab_size}")
    print(f"Dataset size: {len(dataset):,} examples")

    # Create config with correct vocab size
    if config is None:
        config = TransformerConfig(vocab_size=dataset.vocab_size)
    else:
        config.vocab_size = dataset.vocab_size

    # Create model
    model = MiniTransformer(config).to(device)
    n_params = sum(p.numel() for p in model.parameters())
    print(f"Model parameters: {n_params:,}")

    # Optimizer
    optimizer = torch.optim.AdamW(
        model.parameters(),
        lr=learning_rate,
        betas=(0.9, 0.95),
        weight_decay=0.1
    )

    # Data loader
    loader = DataLoader(dataset, batch_size=batch_size, shuffle=True,
                        num_workers=0, pin_memory=True)
    data_iter = iter(loader)

    # ---- Training ----
    model.train()
    t0 = time.time()
    for step in range(max_steps):
        # Get batch (cycle through data)
        try:
            xb, yb = next(data_iter)
        except StopIteration:
            data_iter = iter(loader)
            xb, yb = next(data_iter)
        xb, yb = xb.to(device), yb.to(device)

        # Forward pass
        logits, loss = model(xb, yb)

        # Backward pass
        optimizer.zero_grad(set_to_none=True)
        loss.backward()

        # Gradient clipping (standard practice)
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        optimizer.step()

        # Logging
        if step % eval_interval == 0 or step == max_steps - 1:
            dt = time.time() - t0
            print(f"step {step:5d} | loss {loss.item():.4f} | "
                  f"time {dt:.1f}s")

    # ---- Generation ----
    model.eval()
    prompt = "\n"
    context = torch.tensor(
        [dataset.encode(prompt)], dtype=torch.long, device=device
    )
    generated = model.generate(context, max_new_tokens=500, temperature=0.8)
    print("\n" + "=" * 50)
    print("Generated text:")
    print("=" * 50)
    print(dataset.decode(generated[0].tolist()))

    return model, dataset


if __name__ == '__main__':
    train()
```
Gradient clipping (clip_grad_norm_ with max_norm=1.0) prevents
training instability from occasional large gradient spikes. This is standard in all Transformer
training pipelines.
AdamW (Adam with decoupled weight decay) is the optimizer of choice. The betas (0.9, 0.95) and weight_decay (0.1) follow common LLM training conventions. The learning rate 3e-4 works well for small models; larger models typically use lower rates with warmup schedules.
5. Understanding the Shapes
Tracking tensor shapes is one of the most valuable debugging skills when working with Transformers. Here is a shape trace through the forward pass:
| Variable | Shape | Description |
|---|---|---|
| `idx` | (B, T) | Input token indices |
| `token_emb(idx)` | (B, T, d_model) | Token embeddings |
| `pos_emb(positions)` | (T, d_model) | Positional embeddings (broadcast over B) |
| `x` after embedding | (B, T, d_model) | Sum of token + position embeddings |
| `qkv` | (B, T, 3*d_model) | Fused QKV projection output |
| `q`, `k`, `v` after reshape | (B, n_heads, T, d_k) | Per-head queries, keys, values |
| `scores` | (B, n_heads, T, T) | Attention scores (before masking) |
| `attn_weights` | (B, n_heads, T, T) | Attention probabilities (after softmax) |
| `out` from attention | (B, T, d_model) | Concatenated head outputs after out_proj |
| FFN output | (B, T, d_model) | Feed-forward output |
| `logits` | (B, T, vocab_size) | Raw prediction scores for each position |
The attention scores have shape (B, n_heads, T, T). This is where the quadratic
cost of attention lives. For T=128, this is 128 × 128 = 16,384 entries per head per
example. For T=4096 (a moderate context window), that grows to 16.7 million. Section 4.3 covers
techniques to reduce this cost.
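The shape trace can be checked mechanically with random tensors, which is exactly the debugging technique this section recommends. A standalone sketch (small toy dimensions, mirroring the reshapes in `CausalSelfAttention`):

```python
import torch

B, T, n_heads, d_k = 2, 16, 4, 8
d_model = n_heads * d_k

qkv = torch.randn(B, T, 3 * d_model)            # stand-in for qkv_proj output
q, k, v = qkv.split(d_model, dim=2)
q = q.view(B, T, n_heads, d_k).transpose(1, 2)  # (B, n_heads, T, d_k)
k = k.view(B, T, n_heads, d_k).transpose(1, 2)
scores = q @ k.transpose(-2, -1)                # (B, n_heads, T, T)
assert scores.shape == (B, n_heads, T, T)

# The quadratic term: T * T score entries per head per example
assert 128 * 128 == 16_384
assert 4096 * 4096 == 16_777_216
```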
6. Running the Lab
6.1 Getting Data
Download a small text file for training. Shakespeare's collected works (~1.1 MB) is the classic choice:
```python
# Download the tiny Shakespeare dataset
import urllib.request

url = "https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt"
urllib.request.urlretrieve(url, "input.txt")
```
6.2 Training
```python
# Train with default settings
model, dataset = train(max_steps=5000)

# Expected output after 5000 steps (loss around 1.4-1.5):
# step     0 | loss 4.1742 | time 0.0s
# step   500 | loss 1.9831 | time 12.3s
# step  1000 | loss 1.6524 | time 24.7s
# ...
# step  5000 | loss 1.4208 | time 123.5s
```
6.3 Evaluating the Output
After training, the model will generate text that resembles the style of the training data. At ~5000 steps with our small model, you should see recognizable words, approximate sentence structure, and character-level patterns that match the training corpus. The text will not be coherent, but it should clearly be "trying" to produce English in the style of the training data.
6.4 Experiments to Try
- Increase n_layers from 4 to 6 or 8. Does the loss improve? How much slower is training?
- Increase d_model from 128 to 256. Compare the parameter count and training speed.
- Remove positional embeddings entirely. What happens to the generated text?
- Switch to SwiGLU (replace `FeedForward` with `SwiGLUFeedForward`). Does the loss curve change?
- Remove the causal mask. The model can now "cheat" by looking at future tokens. What happens to the training loss? What happens to generation quality?
- Try temperature values of 0.5, 1.0, and 1.5 during generation. Observe the diversity/quality tradeoff.
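For the temperature experiment, a quick standalone sketch shows the effect on the sampling distribution: dividing logits by a temperature below 1 sharpens the softmax (lower entropy, less diverse samples), while a temperature above 1 flattens it (higher entropy, more diverse samples).

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([2.0, 1.0, 0.5, 0.0])  # toy next-token logits

def entropy(temperature):
    p = F.softmax(logits / temperature, dim=-1)
    return -(p * p.log()).sum().item()

# Entropy grows monotonically with temperature
assert entropy(0.5) < entropy(1.0) < entropy(1.5)
```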
7. Common Bugs and Debugging
When implementing Transformers from scratch, certain bugs appear repeatedly. Here are the most common ones and how to detect them:
| Symptom | Likely Cause | Fix |
|---|---|---|
| Loss stays flat at ~ln(vocab_size) | Gradients are not flowing; possible shape mismatch or detached computation | Check that no .detach() calls break the computation graph. Verify loss computation. |
| Loss drops fast then NaN | Learning rate too high or no gradient clipping | Add gradient clipping (max_norm=1.0). Reduce learning rate. Check for missing layer norm. |
| Generated text is repetitive gibberish | Missing or incorrect causal mask | Verify the mask is lower-triangular and correctly applied before softmax. |
| Generated text is random characters | Insufficient training or broken positional encoding | Train longer. Verify pos_emb is added, not concatenated. |
| All generated tokens are the same | Temperature too low or top_k=1 | Increase temperature. Use top_k > 1 or remove top_k filtering. |
Before training on the full dataset, verify your model can overfit a single batch. Take one batch of data and train for 100 steps. The loss should drop to near zero. If it does not, there is a bug in your model or training loop. This simple sanity check saves hours of debugging.
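The overfit-one-batch check looks like this in miniature (a standalone sketch using a small MLP as a stand-in; the same loop applies unchanged to `MiniTransformer` with one `(xb, yb)` batch):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
# Stand-in model; substitute your Transformer here
model = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 5))
opt = torch.optim.AdamW(model.parameters(), lr=1e-2)

# One fixed batch, reused every step
x = torch.randn(16, 8)
y = torch.randint(0, 5, (16,))

initial = F.cross_entropy(model(x), y).item()  # roughly ln(5) ~ 1.6
for _ in range(300):
    loss = F.cross_entropy(model(x), y)
    opt.zero_grad(set_to_none=True)
    loss.backward()
    opt.step()

# A working model/loop memorizes the batch: loss falls toward zero
final = loss.item()
```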
Key Takeaways
- A decoder-only Transformer can be implemented in ~300 lines of clear, modular PyTorch.
- The architecture has four main components: embeddings, causal self-attention, feed-forward networks, and layer normalization, all connected by residual connections.
- Fused QKV projections and weight tying are standard efficiency tricks with no loss in model quality.
- Careful initialization (especially scaling residual projections) is critical for stable training.
- Gradient clipping, AdamW with appropriate hyperparameters, and the Pre-LN ordering are standard practice.
- Tracking tensor shapes through the forward pass is the single most effective debugging technique.
Check Your Understanding
1. Why do we combine Q, K, V into a single linear projection rather than using three separate layers?
2. What does weight tying do and why is it beneficial?
3. Why is the final LayerNorm necessary in a Pre-LN Transformer?
4. What would happen if you removed the causal mask during training?
5. The attention scores tensor has shape (B, n_heads, T, T). What does each element represent?