I used to write for loops. Then I discovered tensors, and now I judge everyone who still writes for loops.
Vectorized Vanessa, a reformed Python scripter
PyTorch is the language we will use to build, train, and understand LLMs throughout this course. Every transformer layer, every attention head, and every training loop in the chapters ahead will be expressed in PyTorch. Investing time here pays compound interest in every module that follows.
PyTorch is a Python library for numerical computation on tensors with two superpowers: automatic differentiation and seamless GPU acceleration. If NumPy gives you a fast calculator, PyTorch gives you a fast calculator that can also compute its own derivatives and run on a graphics card. This section walks through every concept you need, starting from the lowest level (tensors) and building up to a complete training pipeline.
A tensor is a multi-dimensional array. Scalars, vectors, matrices, and higher-dimensional arrays are all tensors. PyTorch tensors behave like NumPy arrays but carry extra metadata: a dtype, a device (CPU or GPU), and an optional link to a computational graph for gradient computation.
import torch
# From Python lists
a = torch.tensor([1.0, 2.0, 3.0])
print(a, a.dtype)
# Common factory functions
zeros = torch.zeros(2, 3) # 2x3 of zeros
ones = torch.ones(2, 3) # 2x3 of ones
rand = torch.randn(2, 3) # 2x3 from N(0,1)
seq = torch.arange(0, 10, 2) # [0, 2, 4, 6, 8]
# From NumPy (shares memory; no copy!)
import numpy as np
np_arr = np.array([1, 2, 3])
t = torch.from_numpy(np_arr)
print(t)
PyTorch defaults to float32 for floating-point tensors. This matters because GPUs are optimized for 32-bit arithmetic, and most deep learning happens at this precision. When you need to save memory (as we will with large language models), you can use float16 or bfloat16.
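To make the defaults concrete, here is a short sketch of checking and down-casting dtypes (the specific values are illustrative; note that bfloat16 keeps float32's dynamic range while float16 does not):

```python
import torch

# float32 is the default for floating-point tensors
x = torch.tensor([1.0, 2.0, 3.0])
print(x.dtype)  # torch.float32

# Integer inputs default to int64
i = torch.tensor([1, 2, 3])
print(i.dtype)  # torch.int64

# Cast down to half precision to save memory
half = x.to(torch.float16)
bhalf = x.to(torch.bfloat16)
print(half.dtype, bhalf.dtype)

# Each element shrinks from 4 bytes to 2
print(x.element_size(), half.element_size())  # 4 2
```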
x = torch.arange(12).reshape(3, 4)
print("Original:\n", x)
print("Row 0: ", x[0]) # first row
print("Col 1: ", x[:, 1]) # second column
print("Subset: ", x[0:2, 1:3]) # rows 0-1, cols 1-2
# Reshape vs. view
flat = x.view(-1)      # flatten; view requires contiguous memory (no copy)
flat2 = x.reshape(-1)  # reshape also handles non-contiguous tensors (may copy)
print("Flat: ", flat)
# Unsqueeze / squeeze for adding / removing size-1 dimensions
row = torch.tensor([1, 2, 3])
print("Shape before unsqueeze:", row.shape)
print("Shape after unsqueeze(0):", row.unsqueeze(0).shape)
print("Shape after squeeze(0): ", row.unsqueeze(0).squeeze(0).shape)
Broadcasting lets PyTorch perform element-wise operations on tensors of different shapes by automatically expanding dimensions. The rules mirror NumPy: dimensions are compared from right to left, and a dimension of size 1 is stretched to match the other tensor.
# Add a row vector to every row of a matrix
matrix = torch.ones(3, 3)
row_vec = torch.tensor([10, 20, 30])
result = matrix + row_vec # row_vec broadcasts across dim 0
print(result)
Broadcasting can mask bugs. If you add tensors of shapes (3, 1) and (1, 4), PyTorch happily produces a (3, 4) result with no error. Always verify shapes with print(tensor.shape) when debugging unexpected results.
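A minimal demonstration of the silent expansion described above:

```python
import torch

a = torch.zeros(3, 1)
b = torch.zeros(1, 4)
c = a + b          # no error: both size-1 dimensions are stretched
print(c.shape)     # torch.Size([3, 4])

# An explicit shape check catches this class of bug early
assert c.shape == (3, 4)
```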
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("Using device:", device)
# Move tensors to the chosen device
x = torch.randn(3, 3, device=device)
# Or move an existing tensor
y = torch.randn(3, 3).to(device)
# Operations require BOTH tensors on the same device
z = x + y # works because both on 'device'
Trying cpu_tensor + gpu_tensor raises RuntimeError: Expected all tensors to be on the same device. The fix: move everything to the same device before operating. A good pattern is to define device once at the top of your script and use .to(device) everywhere.
Autograd is PyTorch's engine for computing gradients. When you set requires_grad=True on a tensor, PyTorch records every operation performed on it in a directed acyclic graph (DAG). Calling .backward() on the final scalar output traverses that graph in reverse to compute the gradient of the output with respect to every leaf tensor.
x = torch.tensor(3.0, requires_grad=True)
y = x**2 + 2*x + 1 # y = x^2 + 2x + 1
y.backward() # dy/dx = 2x + 2 = 8 at x=3
print(x.grad)
Every operation creates a node in the graph. Intermediate tensors store a .grad_fn that records how they were created. The graph below shows what happens for a simple loss computation.
(Figure: the computational graph for a simple loss. Leaf tensors are created with requires_grad=True; yellow nodes record the operation for backward traversal.)

Gradients in PyTorch accumulate by default. If you call .backward() twice without zeroing gradients, the second set of gradients is added to the first. This is intentional (it enables gradient accumulation across mini-batches), but forgetting to zero gradients is the most common autograd bug.
x = torch.tensor(2.0, requires_grad=True)
# First forward + backward
y = x * 3
y.backward()
print("After 1st backward:", x.grad) # 3.0
# Second forward + backward WITHOUT zeroing
y = x * 3
y.backward()
print("After 2nd backward:", x.grad) # 6.0 (accumulated!)
# The fix: always zero gradients before each backward pass
x.grad.zero_()
y = x * 3
y.backward()
print("After zeroing: ", x.grad) # 3.0
During inference (or any time you do not need gradients), wrap your code in with torch.no_grad():. This disables graph construction, reduces memory usage, and speeds up computation. You will see this in every evaluation loop.
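A small sketch of the pattern; the result computed inside the context manager carries no graph:

```python
import torch

w = torch.randn(3, requires_grad=True)

# Normal forward pass: the result is attached to the graph
y = (w * 2).sum()
print(y.requires_grad)  # True

# Inside no_grad, no graph is built and no gradients can flow
with torch.no_grad():
    z = (w * 2).sum()
print(z.requires_grad)  # False
```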
Raw tensors and autograd are powerful, but PyTorch provides torch.nn to organize parameters, layers, and forward computations into reusable modules. Every model you build in this course, from simple classifiers to full transformer architectures, will subclass nn.Module.
import torch.nn as nn
class SimpleNet(nn.Module):
def __init__(self, input_dim, hidden_dim, output_dim):
super().__init__()
self.fc1 = nn.Linear(input_dim, hidden_dim)
self.relu = nn.ReLU()
self.fc2 = nn.Linear(hidden_dim, output_dim)
def forward(self, x):
x = self.fc1(x)
x = self.relu(x)
x = self.fc2(x)
return x
model = SimpleNet(input_dim=784, hidden_dim=128, output_dim=10)
print(model)
# Count parameters
total_params = sum(p.numel() for p in model.parameters())
print(f"Total parameters: {total_params:,}")
The __init__ method declares layers; the forward method defines the computation. Never call model.forward(x) directly. Instead, call model(x), which runs forward along with any registered hooks.
PyTorch decouples data storage from data loading through two abstractions. Dataset defines how to access individual samples. DataLoader wraps a dataset to provide batching, shuffling, and parallel loading.
from torch.utils.data import Dataset, DataLoader
import torchvision.transforms as transforms
from torchvision.datasets import FashionMNIST
# Define a transform pipeline
transform = transforms.Compose([
transforms.ToTensor(), # PIL image -> tensor, scales to [0,1]
transforms.Normalize((0.2860,), (0.3530,)) # FashionMNIST stats
])
# Download and load training data
train_dataset = FashionMNIST(
root="./data", train=True, download=True, transform=transform
)
# Create a DataLoader
train_loader = DataLoader(
train_dataset, batch_size=64, shuffle=True, num_workers=2
)
# Iterate to see the shape of a batch
images, labels = next(iter(train_loader))
print(f"Batch images shape: {images.shape}")
print(f"Batch labels shape: {labels.shape}")
When your data is not a standard benchmark, subclass Dataset and implement __len__ and __getitem__:
class MyDataset(Dataset):
def __init__(self, X, y):
self.X = torch.tensor(X, dtype=torch.float32)
self.y = torch.tensor(y, dtype=torch.long)
def __len__(self):
return len(self.X)
def __getitem__(self, idx):
return self.X[idx], self.y[idx]
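Wrapping such a custom dataset in a DataLoader then gives you batching and shuffling for free. A self-contained sketch with toy data (the sizes here are purely illustrative):

```python
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader

class MyDataset(Dataset):
    def __init__(self, X, y):
        self.X = torch.tensor(X, dtype=torch.float32)
        self.y = torch.tensor(y, dtype=torch.long)
    def __len__(self):
        return len(self.X)
    def __getitem__(self, idx):
        return self.X[idx], self.y[idx]

# Toy data: 100 samples, 5 features, 3 classes
X = np.random.randn(100, 5)
y = np.random.randint(0, 3, size=100)

loader = DataLoader(MyDataset(X, y), batch_size=16, shuffle=True)
xb, yb = next(iter(loader))
print(xb.shape, yb.shape)  # torch.Size([16, 5]) torch.Size([16])
```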
Training a neural network follows a rhythmic four-step pattern: forward pass, compute loss, backward pass, optimizer step. Every training loop you write, from a simple classifier to a billion-parameter LLM, follows this same skeleton.
Before we write our first training loop, let us understand the optimizer that drives learning. Momentum smooths out noisy gradients by maintaining an exponential moving average of past gradients, preventing the optimizer from oscillating on noisy surfaces. Adaptive learning rates give each parameter its own learning rate, scaled by the history of its gradients; parameters with consistently large gradients get smaller steps, and vice versa. Adam combines both ideas. AdamW improves on Adam by decoupling weight decay from the gradient update, which produces better generalization and is now the preferred optimizer for training large language models.
| Optimizer | Learning Rate | Momentum | Weight Decay | Best For |
|---|---|---|---|---|
| SGD | Single global rate | Optional (off by default) | Coupled with gradient | Convex problems, fine control |
| Adam | Per-parameter adaptive | Built in (first moment) | Coupled with gradient | Fast prototyping, general use |
| AdamW | Per-parameter adaptive | Built in (first moment) | Decoupled (proper regularization) | LLM pretraining, best generalization |
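In code, the three optimizers from the table are constructed almost identically; the difference is in how each applies its update. The hyperparameter values below are illustrative, not recommendations:

```python
import torch.nn as nn
import torch.optim as optim

model = nn.Linear(10, 2)

# SGD: single global rate; momentum must be turned on explicitly
sgd = optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)

# Adam: per-parameter adaptive rates, momentum built in
adam = optim.Adam(model.parameters(), lr=1e-3)

# AdamW: weight decay applied directly to the weights,
# not mixed into the gradient as L2 regularization
adamw = optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)
print(type(adamw).__name__)  # AdamW
```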
import torch
import torch.nn as nn
import torch.optim as optim
# Assume model, train_loader, device are already defined
model = model.to(device)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=1e-3)
num_epochs = 3
for epoch in range(num_epochs):
model.train() # set training mode
running_loss = 0.0
for batch_idx, (images, labels) in enumerate(train_loader):
images, labels = images.to(device), labels.to(device)
# Flatten 28x28 images to vectors of length 784
images = images.view(images.size(0), -1)
# Step 0: Zero gradients from previous step
optimizer.zero_grad()
# Step 1: Forward pass
outputs = model(images)
# Step 2: Compute loss
loss = criterion(outputs, labels)
# Step 3: Backward pass (compute gradients)
loss.backward()
# Step 4: Update weights
optimizer.step()
running_loss += loss.item()
avg_loss = running_loss / len(train_loader)
print(f"Epoch [{epoch+1}/{num_epochs}], Loss: {avg_loss:.4f}")
Always call model.train() before training and model.eval() before evaluation. These toggle behaviors of layers like Dropout and BatchNorm. Forgetting model.eval() during validation leads to noisy, unreliable metrics.
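A concrete way to see the toggle: in training mode, Dropout zeroes activations at random and scales the survivors; in eval mode it is the identity. A minimal sketch:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
drop = nn.Dropout(p=0.5)
x = torch.ones(8)

drop.train()     # what model.train() sets on every submodule
print(drop(x))   # roughly half the entries zeroed, the rest scaled to 2.0

drop.eval()      # what model.eval() sets
print(drop(x))   # identical to x: dropout is a no-op in eval mode
```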
PyTorch stores learned parameters in a dictionary called the state_dict. Saving the state dict (rather than the full model object) is the recommended approach because it is architecture-independent and portable.
# Save model weights
torch.save(model.state_dict(), "model_weights.pth")
# Load into a fresh model instance
loaded_model = SimpleNet(input_dim=784, hidden_dim=128, output_dim=10)
loaded_model.load_state_dict(torch.load("model_weights.pth", weights_only=True))
loaded_model.eval()
# Save a full checkpoint (weights + optimizer + epoch) for resumable training
checkpoint = {
"epoch": epoch,
"model_state_dict": model.state_dict(),
"optimizer_state_dict": optimizer.state_dict(),
"loss": avg_loss,
}
torch.save(checkpoint, "checkpoint.pth")
# Resume from checkpoint
ckpt = torch.load("checkpoint.pth", weights_only=True)
model.load_state_dict(ckpt["model_state_dict"])
optimizer.load_state_dict(ckpt["optimizer_state_dict"])
start_epoch = ckpt["epoch"] + 1
Always pass weights_only=True to torch.load() in modern PyTorch (1.13+). This prevents arbitrary code execution from untrusted checkpoint files. If you need to load optimizer state or other non-tensor data, use weights_only=False only with files you trust.
When your model does not train, you need tools to look inside. PyTorch provides several mechanisms for introspection.
# Check gradients after a backward pass
for name, param in model.named_parameters():
if param.grad is not None:
print(f"{name:20s} grad mean={param.grad.mean():.6f} "
f"std={param.grad.std():.6f}")
Hooks let you inspect (or modify) data flowing through a module without changing its code. This is invaluable for debugging and later for techniques like activation patching in interpretability research.
# Register a forward hook that prints the output shape
def print_shape_hook(module, input, output):
print(f"{module.__class__.__name__:15s} output shape: {output.shape}")
hooks = []
for name, layer in model.named_children():
h = layer.register_forward_hook(print_shape_hook)
hooks.append(h)
# Run one forward pass to see shapes
dummy = torch.randn(1, 784).to(device)
_ = model(dummy)
# Clean up hooks when done
for h in hooks:
h.remove()
from torch.profiler import profile, ProfilerActivity
with profile(activities=[ProfilerActivity.CPU], record_shapes=True) as prof:
    for i, (images, labels) in enumerate(train_loader):
        # keep model and data on the same device while profiling
        images = images.view(images.size(0), -1).to(device)
        labels = labels.to(device)
        outputs = model(images)
        loss = criterion(outputs, labels)
        loss.backward()
        if i >= 4:  # profile only the first five batches
            break
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=5))
Profiling reveals where time is actually spent. In small models, data loading often dominates. In larger models, matrix multiplications dominate. Knowing this guides your optimization effort: increase num_workers for data-bound training, or use mixed precision for compute-bound training.
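For the compute-bound case, here is a hedged sketch of mixed precision with autocast. It uses CPU bfloat16 so it runs anywhere; on a GPU you would pass device_type="cuda" and typically pair float16 with a gradient scaler:

```python
import torch
import torch.nn as nn

model = nn.Linear(64, 32)
x = torch.randn(4, 64)

# Autocast runs eligible ops (matmuls, linear layers) in lower precision
with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    out = model(x)

print(out.dtype)  # torch.bfloat16 inside the autocast region
```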
| Symptom | Cause | Fix |
|---|---|---|
| RuntimeError: mat1 and mat2 shapes cannot be multiplied | Input tensor shape does not match the layer's expected input dimension | Print shapes with print(x.shape) before each layer; ensure you flatten or reshape correctly |
| Loss is nan after a few steps | Learning rate is too high, or numerical overflow | Lower the learning rate; add gradient clipping with torch.nn.utils.clip_grad_norm_ |
| Loss never decreases | Forgot optimizer.zero_grad() or wrong loss function | Verify the training loop skeleton; try overfitting on a single batch first |
| Expected all tensors to be on the same device | Model is on GPU but data is on CPU (or vice versa) | Call .to(device) on both model and data |
| Validation accuracy worse than training | Forgot model.eval() or torch.no_grad() | Always wrap evaluation in model.eval() and with torch.no_grad(): |
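The "overfit a single batch" sanity check is worth spelling out: if a model cannot drive the loss toward zero on one fixed batch, the training loop itself is broken. A sketch with toy data and illustrative hyperparameters:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 2))
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)

x = torch.randn(16, 10)           # one fixed batch
y = torch.randint(0, 2, (16,))    # random labels: memorizable, not learnable

for step in range(200):
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()

print(f"final loss: {loss.item():.4f}")  # should approach zero
```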
Let us put everything together. In this lab you will build a fully connected neural network that classifies FashionMNIST images into 10 categories (T-shirt, Trouser, Pullover, Dress, Coat, Sandal, Shirt, Sneaker, Bag, Ankle boot). The complete script below is copy-pasteable and runnable.
#!/usr/bin/env python3
"""Lab 0.3: FashionMNIST Classifier in PyTorch (from scratch)."""
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader
from torchvision import datasets, transforms
# ── Hyperparameters ──────────────────────────────────────────
BATCH_SIZE = 64
LEARNING_RATE = 1e-3
NUM_EPOCHS = 10
HIDDEN_DIM = 256
# ── Device ───────────────────────────────────────────────────
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Training on: {device}")
# ── Data ─────────────────────────────────────────────────────
transform = transforms.Compose([
transforms.ToTensor(),
transforms.Normalize((0.2860,), (0.3530,)),
])
train_data = datasets.FashionMNIST("./data", train=True, download=True, transform=transform)
test_data = datasets.FashionMNIST("./data", train=False, download=True, transform=transform)
train_loader = DataLoader(train_data, batch_size=BATCH_SIZE, shuffle=True)
test_loader = DataLoader(test_data, batch_size=BATCH_SIZE, shuffle=False)
# ── Model ────────────────────────────────────────────────────
class FashionClassifier(nn.Module):
def __init__(self, hidden_dim):
super().__init__()
self.net = nn.Sequential(
nn.Flatten(), # (B,1,28,28) -> (B,784)
nn.Linear(784, hidden_dim),
nn.ReLU(),
nn.Dropout(0.2),
nn.Linear(hidden_dim, hidden_dim),
nn.ReLU(),
nn.Dropout(0.2),
nn.Linear(hidden_dim, 10),
)
def forward(self, x):
return self.net(x)
model = FashionClassifier(HIDDEN_DIM).to(device)
print(model)
print(f"Parameters: {sum(p.numel() for p in model.parameters()):,}")
# ── Loss and Optimizer ───────────────────────────────────────
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=LEARNING_RATE)
# ── Training ─────────────────────────────────────────────────
def train_one_epoch(model, loader, criterion, optimizer, device):
model.train()
total_loss, correct, total = 0.0, 0, 0
for images, labels in loader:
images, labels = images.to(device), labels.to(device)
optimizer.zero_grad()
outputs = model(images)
loss = criterion(outputs, labels)
loss.backward()
optimizer.step()
total_loss += loss.item() * labels.size(0)
correct += (outputs.argmax(1) == labels).sum().item()
total += labels.size(0)
return total_loss / total, correct / total
# ── Evaluation ───────────────────────────────────────────────
def evaluate(model, loader, criterion, device):
model.eval()
total_loss, correct, total = 0.0, 0, 0
with torch.no_grad():
for images, labels in loader:
images, labels = images.to(device), labels.to(device)
outputs = model(images)
loss = criterion(outputs, labels)
total_loss += loss.item() * labels.size(0)
correct += (outputs.argmax(1) == labels).sum().item()
total += labels.size(0)
return total_loss / total, correct / total
# ── Run ──────────────────────────────────────────────────────
for epoch in range(NUM_EPOCHS):
train_loss, train_acc = train_one_epoch(model, train_loader, criterion, optimizer, device)
test_loss, test_acc = evaluate(model, test_loader, criterion, device)
print(f"Epoch {epoch+1:2d}/{NUM_EPOCHS} "
f"Train Loss: {train_loss:.4f} Acc: {train_acc:.4f} "
f"Test Loss: {test_loss:.4f} Acc: {test_acc:.4f}")
# ── Save ─────────────────────────────────────────────────────
torch.save({
"model_state_dict": model.state_dict(),
"optimizer_state_dict": optimizer.state_dict(),
"test_acc": test_acc,
}, "fashion_classifier_checkpoint.pth")
print(f"\nModel saved. Final test accuracy: {test_acc:.4f}")
Let us dissect the key design decisions:
- Flatten inside the model: FashionMNIST batches arrive as (B, 1, 28, 28) tensors. Using nn.Flatten() inside the model (rather than .view() outside) keeps the reshaping logic self-contained.

Exercises:
- Add torch.optim.lr_scheduler.StepLR to decay the learning rate by 0.1 every 5 epochs. Does test accuracy improve?
- Replace the fully connected layers with convolutions (nn.Conv2d, nn.MaxPool2d). You should be able to reach over 90% test accuracy.
- Add gradient clipping with torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0) before optimizer.step(). Monitor the gradient norms before and after clipping.

Key takeaways:
- Autograd: .backward() walks the graph in reverse to compute gradients. Always remember to zero gradients between iterations.
- Modules: declare layers in __init__, wire them in forward, and call the model (not .forward() directly) to benefit from hooks and other machinery.
- Data: use a Dataset for standard or custom data.
- Checkpoints: save the state_dict for portability.