Module 00 · Section 0.3

PyTorch Tutorial

From tensors to trained models: a hands-on guide to the framework behind modern LLMs

I used to write for loops. Then I discovered tensors, and now I judge everyone who still writes for loops.

Vectorized Vanessa, a reformed Python scripter
◆ Big Picture

PyTorch is the language we will use to build, train, and understand LLMs throughout this course. Every transformer layer, every attention head, and every training loop in the chapters ahead will be expressed in PyTorch. Investing time here pays compound interest in every module that follows.

PyTorch is a Python library for numerical computation on tensors with two superpowers: automatic differentiation and seamless GPU acceleration. If NumPy gives you a fast calculator, PyTorch gives you a fast calculator that can also compute its own derivatives and run on a graphics card. This section walks through every concept you need, starting from the lowest level (tensors) and building up to a complete training pipeline.

1. Tensors: The Fundamental Data Structure

A tensor is a multi-dimensional array. Scalars, vectors, matrices, and higher-dimensional arrays are all tensors. PyTorch tensors behave like NumPy arrays but carry extra metadata: a dtype, a device (CPU or GPU), and an optional link to a computational graph for gradient computation.

1.1 Creating Tensors

import torch

# From Python lists
a = torch.tensor([1.0, 2.0, 3.0])
print(a, a.dtype)

# Common factory functions
zeros = torch.zeros(2, 3)           # 2x3 of zeros
ones  = torch.ones(2, 3)            # 2x3 of ones
rand  = torch.randn(2, 3)           # 2x3 from N(0,1)
seq   = torch.arange(0, 10, 2)      # [0, 2, 4, 6, 8]

# From NumPy (shares memory; no copy!)
import numpy as np
np_arr = np.array([1, 2, 3])
t = torch.from_numpy(np_arr)
print(t)
Output
tensor([1., 2., 3.]) torch.float32
tensor([1, 2, 3])
◆ Key Insight

PyTorch defaults to float32 for floating-point tensors. This matters because GPUs are optimized for 32-bit arithmetic, and most deep learning happens at this precision. When you need to save memory (as we will with large language models), you can use float16 or bfloat16.
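The memory trade-off is easy to verify directly. A quick sketch comparing footprints (the tensor size is arbitrary, chosen only to make the byte counts round):

```python
import torch

x32 = torch.randn(1024, 1024)       # default floating-point dtype: float32
x16 = x32.to(torch.bfloat16)        # half the memory; wider exponent range than float16

# element_size() reports bytes per element
print(x32.dtype, x32.element_size() * x32.numel(), "bytes")   # torch.float32 4194304 bytes
print(x16.dtype, x16.element_size() * x16.numel(), "bytes")   # torch.bfloat16 2097152 bytes
```

Halving the dtype halves the storage, which is exactly the lever used when serving large models.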

1.2 Indexing, Slicing, and Reshaping

x = torch.arange(12).reshape(3, 4)
print("Original:\n", x)
print("Row 0:   ", x[0])          # first row
print("Col 1:   ", x[:, 1])       # second column
print("Subset:  ", x[0:2, 1:3])  # rows 0-1, cols 1-2

# Reshape vs. view
flat = x.view(-1)               # flatten; requires contiguous memory
same = x.reshape(-1)            # like view, but copies if the tensor is non-contiguous
print("Flat:    ", flat)

# Unsqueeze / Squeeze for adding/removing dimensions
row = torch.tensor([1, 2, 3])
print("Shape before unsqueeze:", row.shape)
print("Shape after unsqueeze(0):", row.unsqueeze(0).shape)
Output
Original:
 tensor([[ 0,  1,  2,  3],
        [ 4,  5,  6,  7],
        [ 8,  9, 10, 11]])
Row 0:    tensor([0, 1, 2, 3])
Col 1:    tensor([1, 5, 9])
Subset:   tensor([[1, 2],
        [5, 6]])
Flat:     tensor([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11])
Shape before unsqueeze: torch.Size([3])
Shape after unsqueeze(0): torch.Size([1, 3])

1.3 Broadcasting

Broadcasting lets PyTorch perform element-wise operations on tensors of different shapes by automatically expanding dimensions. The rules mirror NumPy: dimensions are compared from right to left, and a dimension of size 1 is stretched to match the other tensor.

# Add a row vector to every row of a matrix
matrix = torch.ones(3, 3)
row_vec = torch.tensor([10, 20, 30])
result = matrix + row_vec      # row_vec broadcasts across dim 0
print(result)
Output
tensor([[11., 21., 31.],
        [11., 21., 31.],
        [11., 21., 31.]])
⚠ Warning: Silent Shape Bugs

Broadcasting can mask bugs. If you add tensors of shapes (3, 1) and (1, 4), PyTorch happily produces a (3, 4) result with no error. Always verify shapes with print(tensor.shape) when debugging unexpected results.
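To see the warning in action, here is a minimal sketch of the (3, 1) + (1, 4) case:

```python
import torch

a = torch.zeros(3, 1)
b = torch.zeros(1, 4)
c = a + b                      # no error: silently broadcasts to (3, 4)
print(c.shape)                 # torch.Size([3, 4])

# Defensive check when a shape is load-bearing:
assert c.shape == (3, 4), f"unexpected shape {c.shape}"
```

An `assert` on the expected shape turns a silent broadcast into a loud failure at the exact line where the bug enters.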

1.4 Device Management (CPU/GPU)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("Using device:", device)

# Move tensors to the chosen device
x = torch.randn(3, 3, device=device)

# Or move an existing tensor
y = torch.randn(3, 3).to(device)

# Operations require BOTH tensors on the same device
z = x + y  # works because both on 'device'
✗ Common Mistake: Device Mismatch

Trying cpu_tensor + gpu_tensor raises RuntimeError: Expected all tensors to be on the same device. The fix: move everything to the same device before operating. A good pattern is to define device once at the top of your script and use .to(device) everywhere.

2. Autograd: Automatic Differentiation

Autograd is PyTorch's engine for computing gradients. When you set requires_grad=True on a tensor, PyTorch records every operation performed on it in a directed acyclic graph (DAG). Calling .backward() on the final scalar output traverses that graph in reverse to compute the gradient of the output with respect to every leaf tensor.

2.1 A Minimal Example

x = torch.tensor(3.0, requires_grad=True)
y = x**2 + 2*x + 1   # y = x^2 + 2x + 1
y.backward()            # dy/dx = 2x + 2 = 8 at x=3
print(x.grad)
Output tensor(8.)

2.2 The Computational Graph

Every operation creates a node in the graph. Intermediate tensors store a .grad_fn that records how they were created. The graph below shows what happens for a simple loss computation.

[Figure: computational graph. Leaf tensors x, w, b feed z = w * x (MulBackward), then y = z + b (AddBackward), then loss = MSE(y, target); .backward() traverses the graph in reverse.]
Figure 1. Computational graph for a linear operation followed by MSE loss. Leaf tensors (blue) have requires_grad=True. Yellow nodes record the operation for backward traversal.

2.3 Gradient Accumulation

Gradients in PyTorch accumulate by default. If you call .backward() twice without zeroing gradients, the second set of gradients is added to the first. This is intentional (it enables gradient accumulation across mini-batches), but forgetting to zero gradients is the most common autograd bug.

x = torch.tensor(2.0, requires_grad=True)

# First forward + backward
y = x * 3
y.backward()
print("After 1st backward:", x.grad)   # 3.0

# Second forward + backward WITHOUT zeroing
y = x * 3
y.backward()
print("After 2nd backward:", x.grad)   # 6.0 (accumulated!)

# The fix: always zero gradients before each backward pass
x.grad.zero_()
y = x * 3
y.backward()
print("After zeroing:     ", x.grad)   # 3.0
Output
After 1st backward: tensor(3.)
After 2nd backward: tensor(6.)
After zeroing:      tensor(3.)
ⓘ Note: torch.no_grad()

During inference (or any time you do not need gradients), wrap your code in with torch.no_grad():. This disables graph construction, reduces memory usage, and speeds up computation. You will see this in every evaluation loop.
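A minimal sketch of the difference:

```python
import torch

x = torch.tensor(2.0, requires_grad=True)

y = x * 3                      # graph is recorded for this operation
print(y.requires_grad)         # True

with torch.no_grad():
    z = x * 3                  # no graph: cheaper, but z.backward() is impossible
print(z.requires_grad)         # False
```

Inside the context manager, results are ordinary tensors detached from the autograd graph, which is exactly what evaluation loops want.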

3. Building Models with nn.Module

Raw tensors and autograd are powerful, but PyTorch provides torch.nn to organize parameters, layers, and forward computations into reusable modules. Every model you build in this course, from simple classifiers to full transformer architectures, will subclass nn.Module.

3.1 Your First nn.Module

import torch.nn as nn

class SimpleNet(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim):
        super().__init__()
        self.fc1 = nn.Linear(input_dim, hidden_dim)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(hidden_dim, output_dim)

    def forward(self, x):
        x = self.fc1(x)
        x = self.relu(x)
        x = self.fc2(x)
        return x

model = SimpleNet(input_dim=784, hidden_dim=128, output_dim=10)
print(model)

# Count parameters
total_params = sum(p.numel() for p in model.parameters())
print(f"Total parameters: {total_params:,}")
Output
SimpleNet(
  (fc1): Linear(in_features=784, out_features=128, bias=True)
  (relu): ReLU()
  (fc2): Linear(in_features=128, out_features=10, bias=True)
)
Total parameters: 101,770
◆ Key Insight

The __init__ method declares layers; the forward method defines the computation. Never call model.forward(x) directly. Instead, call model(x), which runs forward along with any registered hooks.

4. Data Loading: Dataset and DataLoader

PyTorch decouples data storage from data loading through two abstractions. Dataset defines how to access individual samples. DataLoader wraps a dataset to provide batching, shuffling, and parallel loading.

from torch.utils.data import Dataset, DataLoader
import torchvision.transforms as transforms
from torchvision.datasets import FashionMNIST

# Define a transform pipeline
transform = transforms.Compose([
    transforms.ToTensor(),                 # PIL image -> tensor, scales to [0,1]
    transforms.Normalize((0.2860,), (0.3530,))  # FashionMNIST stats
])

# Download and load training data
train_dataset = FashionMNIST(
    root="./data", train=True, download=True, transform=transform
)

# Create a DataLoader
train_loader = DataLoader(
    train_dataset, batch_size=64, shuffle=True, num_workers=2
)

# Iterate to see the shape of a batch
images, labels = next(iter(train_loader))
print(f"Batch images shape: {images.shape}")
print(f"Batch labels shape: {labels.shape}")
Output
Batch images shape: torch.Size([64, 1, 28, 28])
Batch labels shape: torch.Size([64])

4.1 Custom Datasets

When your data is not a standard benchmark, subclass Dataset and implement __len__ and __getitem__:

class MyDataset(Dataset):
    def __init__(self, X, y):
        self.X = torch.tensor(X, dtype=torch.float32)
        self.y = torch.tensor(y, dtype=torch.long)

    def __len__(self):
        return len(self.X)

    def __getitem__(self, idx):
        return self.X[idx], self.y[idx]
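Wiring this into a DataLoader works exactly as with a built-in dataset. A standalone sketch (the class is repeated so the snippet runs on its own; the arrays are illustrative toy data):

```python
import numpy as np
import torch
from torch.utils.data import DataLoader, Dataset

class MyDataset(Dataset):               # same class as above, repeated for a standalone run
    def __init__(self, X, y):
        self.X = torch.tensor(X, dtype=torch.float32)
        self.y = torch.tensor(y, dtype=torch.long)

    def __len__(self):
        return len(self.X)

    def __getitem__(self, idx):
        return self.X[idx], self.y[idx]

X = np.random.randn(100, 8)             # 100 toy samples, 8 features each
y = np.random.randint(0, 2, size=100)   # binary labels

loader = DataLoader(MyDataset(X, y), batch_size=32, shuffle=True)
xb, yb = next(iter(loader))
print(xb.shape, yb.shape)               # torch.Size([32, 8]) torch.Size([32])
```

Note that `DataLoader` calls `__getitem__` once per sample and stacks the results into a batch for you.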

5. The Training Loop

Training a neural network follows a rhythmic four-step pattern: forward pass, compute loss, backward pass, optimizer step. Every training loop you write, from a simple classifier to a billion-parameter LLM, follows this same skeleton.

[Figure: the training loop cycle — 0. zero gradients (optimizer.zero_grad()) → 1. forward pass (predictions = model(x)) → 2. compute loss (loss = criterion(pred, y)) → 3. backward pass (loss.backward()) → 4. optimizer step (optimizer.step()) → repeat. The whole cycle nests inside "for epoch in range(num_epochs): for batch_x, batch_y in train_loader: ...".]
Figure 2. The canonical training loop. Step 0 (zero gradients) prevents gradient accumulation. Steps 1 through 4 repeat for every mini-batch in every epoch.

5.1 Complete Training Loop

Understanding Optimizers: SGD, Adam, and AdamW

Before we write our first training loop, let us understand the optimizer that drives learning. Momentum smooths out noisy gradients by maintaining an exponential moving average of past gradients, preventing the optimizer from oscillating on noisy surfaces. Adaptive learning rates give each parameter its own learning rate, scaled by the history of its gradients; parameters with consistently large gradients get smaller steps, and vice versa. Adam combines both ideas. AdamW improves on Adam by decoupling weight decay from the gradient update, which produces better generalization and is now the preferred optimizer for training large language models.

Optimizer | Learning Rate          | Momentum                  | Weight Decay                      | Best For
SGD       | Single global rate     | Optional (off by default) | Coupled with gradient             | Convex problems, fine control
Adam      | Per-parameter adaptive | Built in (first moment)   | Coupled with gradient             | Fast prototyping, general use
AdamW     | Per-parameter adaptive | Built in (first moment)   | Decoupled (proper regularization) | LLM pretraining, best generalization
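Constructing the three optimizers differs only in one line. A sketch (the hyperparameter values are illustrative, not recommendations):

```python
import torch.nn as nn
import torch.optim as optim

model = nn.Linear(4, 2)   # any nn.Module works here

sgd   = optim.SGD(model.parameters(), lr=1e-2, momentum=0.9, weight_decay=1e-4)
adam  = optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)   # decay folded into the gradient
adamw = optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)  # decay applied directly to the weights
```

Because the constructors share an interface, swapping optimizers later is a one-line change in any training script.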
import torch
import torch.nn as nn
import torch.optim as optim

# Assume model, train_loader, device are already defined
model = model.to(device)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=1e-3)

num_epochs = 3
for epoch in range(num_epochs):
    model.train()                         # set training mode
    running_loss = 0.0

    for batch_idx, (images, labels) in enumerate(train_loader):
        images, labels = images.to(device), labels.to(device)

        # Flatten 28x28 images to vectors of length 784
        images = images.view(images.size(0), -1)

        # Step 0: Zero gradients from previous step
        optimizer.zero_grad()

        # Step 1: Forward pass
        outputs = model(images)

        # Step 2: Compute loss
        loss = criterion(outputs, labels)

        # Step 3: Backward pass (compute gradients)
        loss.backward()

        # Step 4: Update weights
        optimizer.step()

        running_loss += loss.item()

    avg_loss = running_loss / len(train_loader)
    print(f"Epoch [{epoch+1}/{num_epochs}], Loss: {avg_loss:.4f}")
Output (approximate)
Epoch [1/3], Loss: 0.5312
Epoch [2/3], Loss: 0.3845
Epoch [3/3], Loss: 0.3421
⚠ Warning: model.train() vs model.eval()

Always call model.train() before training and model.eval() before evaluation. These toggle behaviors of layers like Dropout and BatchNorm. Forgetting model.eval() during validation leads to noisy, unreliable metrics.
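The effect is easy to demonstrate with Dropout alone. A sketch (p=0.5 is chosen here purely to make the difference obvious):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
drop = nn.Dropout(p=0.5)
x = torch.ones(8)

drop.train()                   # training mode: entries are zeroed with prob p, survivors scaled by 1/(1-p)
print(drop(x))                 # entries are 0.0 or 2.0

drop.eval()                    # eval mode: Dropout becomes the identity
print(drop(x))                 # all ones
```

The same toggle is what switches BatchNorm from batch statistics to running statistics, which is why skipping `model.eval()` corrupts validation metrics.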

6. Saving and Loading Models

PyTorch stores learned parameters in a dictionary called the state_dict. Saving the state dict (rather than the full model object) is the recommended approach because it is architecture-independent and portable.

# Save model weights
torch.save(model.state_dict(), "model_weights.pth")

# Load into a fresh model instance
loaded_model = SimpleNet(input_dim=784, hidden_dim=128, output_dim=10)
loaded_model.load_state_dict(torch.load("model_weights.pth", weights_only=True))
loaded_model.eval()

# Save a full checkpoint (weights + optimizer + epoch) for resumable training
checkpoint = {
    "epoch": epoch,
    "model_state_dict": model.state_dict(),
    "optimizer_state_dict": optimizer.state_dict(),
    "loss": avg_loss,
}
torch.save(checkpoint, "checkpoint.pth")

# Resume from checkpoint
ckpt = torch.load("checkpoint.pth", weights_only=True)
model.load_state_dict(ckpt["model_state_dict"])
optimizer.load_state_dict(ckpt["optimizer_state_dict"])
start_epoch = ckpt["epoch"] + 1
ⓘ Note

Always pass weights_only=True to torch.load() in modern PyTorch (1.13+). This prevents arbitrary code execution from untrusted checkpoint files. If you need to load optimizer state or other non-tensor data, use weights_only=False only with files you trust.

7. Debugging: Hooks, Gradient Inspection, and Profiling

When your model does not train, you need tools to look inside. PyTorch provides several mechanisms for introspection.

7.1 Inspecting Gradients

# Check gradients after a backward pass
for name, param in model.named_parameters():
    if param.grad is not None:
        print(f"{name:20s} grad mean={param.grad.mean():.6f}  "
              f"std={param.grad.std():.6f}")
Output (example)
fc1.weight           grad mean=-0.000132  std=0.004521
fc1.bias             grad mean=-0.001207  std=0.000000
fc2.weight           grad mean=0.000041   std=0.012843
fc2.bias             grad mean=0.000523   std=0.000000

7.2 Forward and Backward Hooks

Hooks let you inspect (or modify) data flowing through a module without changing its code. This is invaluable for debugging and later for techniques like activation patching in interpretability research.

# Register a forward hook that prints the output shape
def print_shape_hook(module, input, output):
    print(f"{module.__class__.__name__:15s} output shape: {output.shape}")

hooks = []
for name, layer in model.named_children():
    h = layer.register_forward_hook(print_shape_hook)
    hooks.append(h)

# Run one forward pass to see shapes
dummy = torch.randn(1, 784).to(device)
_ = model(dummy)

# Clean up hooks when done
for h in hooks:
    h.remove()
Output
Linear          output shape: torch.Size([1, 128])
ReLU            output shape: torch.Size([1, 128])
Linear          output shape: torch.Size([1, 10])

7.3 Profiling with torch.profiler

from torch.profiler import profile, ProfilerActivity

with profile(activities=[ProfilerActivity.CPU], record_shapes=True) as prof:
    for i, (images, labels) in enumerate(train_loader):
        images = images.view(images.size(0), -1).to(device)
        labels = labels.to(device)   # keep data on the same device as the model
        outputs = model(images)
        loss = criterion(outputs, labels)
        loss.backward()
        if i >= 4:
            break

print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=5))
◆ Key Insight

Profiling reveals where time is actually spent. In small models, data loading often dominates. In larger models, matrix multiplications dominate. Knowing this guides your optimization effort: increase num_workers for data-bound training, or use mixed precision for compute-bound training.
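For compute-bound training, mixed precision is the usual lever. A minimal sketch using torch.autocast (shown on CPU with bfloat16 so it runs anywhere; on GPU you would pass device_type="cuda" and typically pair float16 with a GradScaler):

```python
import torch
import torch.nn as nn

model = nn.Linear(784, 10)
x = torch.randn(32, 784)

# Inside the autocast region, eligible ops (like matmul) run in lower precision
with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    out = model(x)

print(out.shape, out.dtype)    # torch.Size([32, 10]) torch.bfloat16
```

Autocast picks the precision per operation, so numerically sensitive ops (e.g. reductions) can stay in float32 while the heavy matmuls drop to 16 bits.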

8. Common Mistakes and How to Fix Them

Symptom | Cause | Fix
RuntimeError: mat1 and mat2 shapes cannot be multiplied | Input tensor shape does not match the layer's expected input dimension | Print shapes with print(x.shape) before each layer; ensure you flatten or reshape correctly
Loss is nan after a few steps | Learning rate too high, or numerical overflow | Lower the learning rate; add gradient clipping with torch.nn.utils.clip_grad_norm_
Loss never decreases | Forgot optimizer.zero_grad() or wrong loss function | Verify the training loop skeleton; try overfitting on a single batch first
Expected all tensors to be on the same device | Model is on GPU but data is on CPU (or vice versa) | Call .to(device) on both model and data
Validation accuracy worse than training | Forgot model.eval() or torch.no_grad() | Always wrap evaluation in model.eval() and with torch.no_grad():

9. Lab: Build and Train a FashionMNIST Classifier

Let us put everything together. In this lab you will build a fully connected neural network that classifies FashionMNIST images into 10 categories (T-shirt, Trouser, Pullover, Dress, Coat, Sandal, Shirt, Sneaker, Bag, Ankle boot). The complete script below is copy-pasteable and runnable.

#!/usr/bin/env python3
"""Lab 0.3: FashionMNIST Classifier in PyTorch (from scratch)."""

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# ── Hyperparameters ──────────────────────────────────────────
BATCH_SIZE   = 64
LEARNING_RATE = 1e-3
NUM_EPOCHS   = 10
HIDDEN_DIM   = 256

# ── Device ───────────────────────────────────────────────────
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Training on: {device}")

# ── Data ─────────────────────────────────────────────────────
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.2860,), (0.3530,)),
])

train_data = datasets.FashionMNIST("./data", train=True,  download=True, transform=transform)
test_data  = datasets.FashionMNIST("./data", train=False, download=True, transform=transform)

train_loader = DataLoader(train_data, batch_size=BATCH_SIZE, shuffle=True)
test_loader  = DataLoader(test_data,  batch_size=BATCH_SIZE, shuffle=False)

# ── Model ────────────────────────────────────────────────────
class FashionClassifier(nn.Module):
    def __init__(self, hidden_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Flatten(),                          # (B,1,28,28) -> (B,784)
            nn.Linear(784, hidden_dim),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(hidden_dim, 10),
        )

    def forward(self, x):
        return self.net(x)

model = FashionClassifier(HIDDEN_DIM).to(device)
print(model)
print(f"Parameters: {sum(p.numel() for p in model.parameters()):,}")

# ── Loss and Optimizer ───────────────────────────────────────
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=LEARNING_RATE)

# ── Training ─────────────────────────────────────────────────
def train_one_epoch(model, loader, criterion, optimizer, device):
    model.train()
    total_loss, correct, total = 0.0, 0, 0

    for images, labels in loader:
        images, labels = images.to(device), labels.to(device)

        optimizer.zero_grad()
        outputs = model(images)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()

        total_loss += loss.item() * labels.size(0)
        correct += (outputs.argmax(1) == labels).sum().item()
        total += labels.size(0)

    return total_loss / total, correct / total

# ── Evaluation ───────────────────────────────────────────────
def evaluate(model, loader, criterion, device):
    model.eval()
    total_loss, correct, total = 0.0, 0, 0

    with torch.no_grad():
        for images, labels in loader:
            images, labels = images.to(device), labels.to(device)
            outputs = model(images)
            loss = criterion(outputs, labels)

            total_loss += loss.item() * labels.size(0)
            correct += (outputs.argmax(1) == labels).sum().item()
            total += labels.size(0)

    return total_loss / total, correct / total

# ── Run ──────────────────────────────────────────────────────
for epoch in range(NUM_EPOCHS):
    train_loss, train_acc = train_one_epoch(model, train_loader, criterion, optimizer, device)
    test_loss, test_acc   = evaluate(model, test_loader, criterion, device)

    print(f"Epoch {epoch+1:2d}/{NUM_EPOCHS}  "
          f"Train Loss: {train_loss:.4f}  Acc: {train_acc:.4f}  "
          f"Test Loss: {test_loss:.4f}  Acc: {test_acc:.4f}")

# ── Save ─────────────────────────────────────────────────────
torch.save({
    "model_state_dict": model.state_dict(),
    "optimizer_state_dict": optimizer.state_dict(),
    "test_acc": test_acc,
}, "fashion_classifier_checkpoint.pth")
print(f"\nModel saved. Final test accuracy: {test_acc:.4f}")
Expected Output (approximate)
Training on: cuda
FashionClassifier(
  (net): Sequential(
    (0): Flatten(start_dim=1, end_dim=-1)
    (1): Linear(in_features=784, out_features=256, bias=True)
    (2): ReLU()
    (3): Dropout(p=0.2, inplace=False)
    (4): Linear(in_features=256, out_features=256, bias=True)
    (5): ReLU()
    (6): Dropout(p=0.2, inplace=False)
    (7): Linear(in_features=256, out_features=10, bias=True)
  )
)
Parameters: 269,322
Epoch  1/10  Train Loss: 0.5298  Acc: 0.8109  Test Loss: 0.4213  Acc: 0.8505
Epoch  2/10  Train Loss: 0.3876  Acc: 0.8590  Test Loss: 0.3887  Acc: 0.8586
Epoch  3/10  Train Loss: 0.3510  Acc: 0.8712  Test Loss: 0.3601  Acc: 0.8684
...
Epoch 10/10  Train Loss: 0.2623  Acc: 0.9019  Test Loss: 0.3294  Acc: 0.8832

Model saved. Final test accuracy: 0.8832

9.1 Lab Discussion

Let us dissect the key design decisions:

  1. nn.Flatten as the first layer keeps the (B, 1, 28, 28) → (B, 784) reshape inside the model, so callers never flatten manually.
  2. Dropout(0.2) after each hidden layer regularizes the network; this is also why toggling model.train() and model.eval() correctly matters here.
  3. CrossEntropyLoss consumes raw logits, so the model ends with a plain Linear layer and no softmax.
  4. Weighting each batch loss by labels.size(0) before averaging keeps the reported epoch loss exact even when the final batch is smaller than BATCH_SIZE.

9.2 Exercises for Further Practice

  1. Overfit a single batch: Take one batch from the train loader and train on it for 100 steps. Can you drive the loss to zero? If yes, your model and training loop are correct. If no, you have a bug.
  2. Add a learning rate scheduler: Use torch.optim.lr_scheduler.StepLR to decay the learning rate by 0.1 every 5 epochs. Does test accuracy improve?
  3. Switch to a CNN: Replace the fully connected layers with convolutional layers (nn.Conv2d, nn.MaxPool2d). You should be able to reach over 90% test accuracy.
  4. Add gradient clipping: Insert torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0) before optimizer.step(). Monitor the gradient norms before and after clipping.
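For exercise 1, a starting-point sketch. It uses a small stand-in model and random data so it runs standalone; substitute your real model and one batch from train_loader (the architecture, learning rate, and step count here are illustrative):

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(784, 64), nn.ReLU(), nn.Linear(64, 10))
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=1e-2)

# One fixed batch: a correct model + loop should memorize it easily
xb = torch.randn(64, 784)
yb = torch.randint(0, 10, (64,))

for step in range(200):
    optimizer.zero_grad()
    loss = criterion(model(xb), yb)
    loss.backward()
    optimizer.step()

print(f"final loss: {loss.item():.4f}")   # should fall far below the initial ~2.3
```

If the loss refuses to collapse on a single memorizable batch, the bug is in the model or the loop, not the data.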

Key Takeaways

  1. Tensors are the atomic data structure. Master creation, reshaping, indexing, and device management before anything else.
  2. Autograd builds a computational graph dynamically. Calling .backward() walks the graph in reverse to compute gradients. Always remember to zero gradients between iterations.
  3. nn.Module organizes your model. Define layers in __init__, wire them in forward, and call the model (not .forward() directly) to benefit from hooks and other machinery.
  4. DataLoader handles batching, shuffling, and parallel loading. Pair it with Dataset for standard or custom data.
  5. The training loop follows a fixed rhythm: zero gradients, forward, loss, backward, step. Every neural network training (from this classifier to GPT) follows this pattern.
  6. Checkpointing saves both model and optimizer state so you can resume training after interruptions. Use state_dict for portability.
  7. Debugging tools (hooks, gradient inspection, profiler) are not luxuries. Use them early and often. A few minutes of profiling can save hours of guessing.
  8. Start simple. Overfit a single batch. Then scale to the full dataset. Then tune. This progression catches bugs at the cheapest possible stage.