Module 26 · Section 26.11

Machine Unlearning

Unlearning methods, gradient ascent, LOKA, task vectors, evaluation of forgetting quality, and regulatory motivations
★ Big Picture

Machine unlearning is the ability to remove specific knowledge from a trained model without retraining from scratch. This capability is driven by three needs: GDPR right-to-erasure compliance (removing personal data), copyright compliance (removing copyrighted content), and safety alignment (removing dangerous knowledge). While retraining from scratch on a filtered dataset is the gold standard, it is prohibitively expensive for large models. Approximate unlearning methods trade off forgetting guarantees for computational efficiency.

1. Motivations for Unlearning

| Motivation | What to Remove | Verification Challenge |
| --- | --- | --- |
| GDPR right to erasure | Individual's personal data | Prove the model cannot reproduce the specific data |
| Copyright compliance | Copyrighted text, code, images | Verify no verbatim or near-verbatim reproduction |
| Safety alignment | Dangerous knowledge (bioweapons, hacking) | Ensure knowledge is not recoverable via fine-tuning |
| Model updates | Outdated or incorrect information | Confirm old facts are replaced, not just suppressed |
[Figure: taxonomy of machine unlearning methods]
- Exact unlearning: retrain from scratch on a filtered dataset. Guarantee: complete. Cost: prohibitive for large LLMs. The gold standard.
- Approximate unlearning: gradient ascent on the forget set. Guarantee: partial. Cost: moderate (a few epochs). Most practical.
- Weight editing: task vectors, LOKA, representation surgery. Guarantee: targeted. Cost: low (no training). Emerging research.
Figure 26.11.1: Unlearning methods trade off between forgetting guarantees and computational cost.

2. Gradient Ascent Unlearning

import torch
from torch.utils.data import DataLoader

def gradient_ascent_unlearn(model, forget_loader: DataLoader,
                            retain_loader: DataLoader,
                            epochs: int = 3, lr: float = 1e-5,
                            alpha: float = 0.5):
    """Unlearn via gradient ascent on the forget set + descent on the retain set."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    model.train()

    for epoch in range(epochs):
        total_loss = 0.0
        num_steps = min(len(forget_loader), len(retain_loader))
        forget_iter = iter(forget_loader)
        retain_iter = iter(retain_loader)

        for _ in range(num_steps):
            # Gradient ASCENT on forget data (maximize loss = forget)
            forget_batch = next(forget_iter)
            forget_out = model(**forget_batch, labels=forget_batch["input_ids"])
            forget_loss = -forget_out.loss  # negate for ascent

            # Gradient DESCENT on retain data (minimize loss = keep)
            retain_batch = next(retain_iter)
            retain_out = model(**retain_batch, labels=retain_batch["input_ids"])
            retain_loss = retain_out.loss

            # Combined loss: balance forgetting against retention
            loss = alpha * forget_loss + (1 - alpha) * retain_loss

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            total_loss += loss.item()

        print(f"Epoch {epoch + 1}: avg loss = {total_loss / num_steps:.4f}")

    return model
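The interplay of the ascent and descent terms can be seen on a toy model: a single-parameter linear regressor stands in for the LLM, one data point plays the forget set and one the retain set. All values and hyperparameters here are illustrative, not from the loop above.

```python
import torch

# Toy stand-in for the unlearning loop: one weight, one "fact" to forget
# and one to retain. All data and hyperparameters are illustrative.
w = torch.tensor([1.0], requires_grad=True)
forget_x, forget_y = torch.tensor([1.0]), torch.tensor([2.0])  # push loss UP
retain_x, retain_y = torch.tensor([1.0]), torch.tensor([1.0])  # keep loss low
opt = torch.optim.SGD([w], lr=0.05)
alpha = 0.1  # small alpha so the retain term keeps the update bounded

def mse(x, y):
    return ((w * x - y) ** 2).mean()

before = mse(forget_x, forget_y).item()  # 1.0 at w = 1
for _ in range(50):
    loss = alpha * (-mse(forget_x, forget_y)) + (1 - alpha) * mse(retain_x, retain_y)
    opt.zero_grad()
    loss.backward()
    opt.step()

print(mse(forget_x, forget_y).item() > before)  # True: forget loss rose
print(mse(retain_x, retain_y).item() < 0.1)     # True: retain loss stayed low
```

Note that with a pure ascent term and no retain anchor (alpha = 1), this objective is unbounded below and the weight diverges; the retain term is what keeps the optimization stable.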

3. Task Vector Unlearning

import torch
from collections import OrderedDict

def compute_task_vector(base_weights: dict, finetuned_weights: dict) -> dict:
    """Compute the task vector (difference between fine-tuned and base)."""
    task_vector = OrderedDict()
    for key in base_weights:
        task_vector[key] = finetuned_weights[key] - base_weights[key]
    return task_vector

def negate_task_vector(base_weights: dict, task_vector: dict,
                       scale: float = 1.0) -> dict:
    """Remove a capability by negating the task vector."""
    result = OrderedDict()
    for key in base_weights:
        result[key] = base_weights[key] - scale * task_vector[key]
    return result

# Conceptual example:
# 1. Fine-tune base model on "toxic content generation"
# 2. Compute task_vector = finetuned_weights - base_weights
# 3. Subtract task_vector from base: unlearned = base - scale * task_vector
# Result: model with reduced ability to generate toxic content
print("Task vector unlearning: subtract the 'skill vector' to remove capability")
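The two helpers above reduce to elementwise subtraction, which can be checked on toy state dicts (the tensors below are made up for illustration):

```python
import torch
from collections import OrderedDict

# Toy state dicts standing in for full model checkpoints.
base = OrderedDict(w=torch.tensor([1.0, 2.0]))
finetuned = OrderedDict(w=torch.tensor([1.5, 2.5]))  # learned an unwanted skill

# Same arithmetic as compute_task_vector / negate_task_vector.
task_vector = OrderedDict((k, finetuned[k] - base[k]) for k in base)
unlearned = OrderedDict((k, base[k] - 1.0 * task_vector[k]) for k in base)

print(task_vector["w"])  # tensor([0.5000, 0.5000])
print(unlearned["w"])    # tensor([0.5000, 1.5000])
```

The `scale` factor controls how aggressively the capability is removed; values above 1.0 subtract more than was added, which can damage unrelated capabilities.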

4. Evaluating Unlearning Quality

from dataclasses import dataclass

@dataclass
class UnlearningEvaluation:
    """Evaluate the quality of machine unlearning."""
    forget_accuracy: float     # lower is better (model forgot)
    retain_accuracy: float     # higher is better (model remembers)
    membership_inference_auc: float  # closer to 0.5 is better

    @property
    def forget_quality(self) -> str:
        if self.forget_accuracy < 0.1 and self.retain_accuracy > 0.9:
            return "excellent"
        elif self.forget_accuracy < 0.3 and self.retain_accuracy > 0.8:
            return "good"
        return "insufficient"

    @property
    def privacy_leakage(self) -> str:
        deviation = abs(self.membership_inference_auc - 0.5)
        if deviation < 0.05:
            return "minimal"
        elif deviation < 0.15:
            return "moderate"
        return "significant"

eval_result = UnlearningEvaluation(
    forget_accuracy=0.08, retain_accuracy=0.92,
    membership_inference_auc=0.53
)
print(f"Forget quality: {eval_result.forget_quality}")
print(f"Privacy leakage: {eval_result.privacy_leakage}")
Forget quality: excellent
Privacy leakage: minimal
[Figure: three axes of unlearning evaluation]
- Forget quality: can the model still reproduce the data? Metric: accuracy on the forget set (lower = better). Target: near random.
- Retain quality: does the model still work on other tasks? Metric: accuracy on the retain set (higher = better). Target: unchanged.
- Privacy: can an attacker detect the removed data? Metric: MIA AUC (closer to 0.5 = better). Target: 0.5 (random).
Figure 26.11.2: Good unlearning must score well on all three axes: forgetting the target data, retaining general capability, and resisting membership inference attacks.
⚠ Warning

Approximate unlearning methods (gradient ascent, task vectors) do not provide the same guarantees as retraining from scratch. Recent research has shown that "unlearned" knowledge can sometimes be recovered through targeted fine-tuning or carefully crafted prompts. For high-stakes regulatory compliance, these methods should be combined with other controls (access restrictions, output filtering) rather than relied upon alone.

📝 Note

LOKA (Localized Knowledge Ablation) identifies the specific neurons or attention heads that encode the target knowledge and zeroes out or modifies only those parameters. This surgical approach minimizes collateral damage to other capabilities but requires interpretability tools to locate the relevant parameters.
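A minimal sketch of the localized-ablation idea (illustrative only, not the published LOKA procedure): once a localization step has identified which neurons encode the target knowledge, zero their incoming weights and biases. The layer size and neuron indices below are hypothetical stand-ins for the output of an interpretability tool.

```python
import torch
import torch.nn as nn

# Sketch of localized ablation: zero the rows of an MLP weight matrix
# for neurons identified (by some interpretability tool) as encoding
# the target knowledge. Layer size and indices here are hypothetical.
torch.manual_seed(0)
mlp = nn.Linear(8, 16)       # toy "knowledge-bearing" layer
target_neurons = [3, 7, 11]  # assumed output of a localization step

with torch.no_grad():
    mlp.weight[target_neurons] = 0.0  # ablate incoming weights
    mlp.bias[target_neurons] = 0.0    # and the neurons' biases

x = torch.randn(1, 8)
out = mlp(x)
print(out[0, target_neurons])  # ablated neurons now output exactly 0
```

The surgical appeal is visible here: only 3 of 16 neurons are touched, so the rest of the layer's behavior is unchanged; the hard part in practice is the localization step, not the ablation itself.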

★ Key Insight

The evaluation of unlearning is as important as the unlearning itself. A model that simply refuses to answer questions about the target topic (output suppression) has not truly unlearned; the knowledge is still encoded in the weights and may leak through indirect queries or after fine-tuning. True unlearning must pass membership inference attacks, not just behavioral tests.
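A simple loss-based membership inference check makes this concrete: score forget-set and held-out examples by per-example loss and compute the AUC of "lower loss implies member." The losses below are synthetic (drawn from the same distribution, as they should be after ideal unlearning); a real evaluation would use per-example losses from the unlearned model.

```python
import numpy as np

# Toy loss-based membership inference: forget-set vs. held-out examples,
# scored by per-example loss. AUC near 0.5 = attacker cannot tell them
# apart. Losses here are synthetic and identically distributed.
rng = np.random.default_rng(0)
forget_losses = rng.normal(2.0, 0.5, 200)   # after ideal unlearning...
holdout_losses = rng.normal(2.0, 0.5, 200)  # ...indistinguishable from unseen

def mia_auc(member, nonmember):
    """AUC of 'lower loss => member', via pairwise comparisons."""
    member, nonmember = np.asarray(member), np.asarray(nonmember)
    wins = (member[:, None] < nonmember[None, :]).mean()
    ties = (member[:, None] == nonmember[None, :]).mean()
    return wins + 0.5 * ties

print(round(mia_auc(forget_losses, holdout_losses), 2))  # close to 0.5
```

If unlearning were incomplete, forget-set losses would sit systematically below the held-out losses and the AUC would rise well above 0.5, flagging residual memorization even when the model's outputs look clean.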

Knowledge Check

1. What are the three main motivations for machine unlearning in LLMs?

GDPR right to erasure (removing individual personal data), copyright compliance (removing copyrighted content from model knowledge), and safety alignment (removing dangerous capabilities like bioweapon synthesis instructions). Each motivation has different verification requirements and acceptable tradeoffs.

2. How does gradient ascent achieve unlearning?

Gradient ascent maximizes the loss on the forget set (the data to be removed) while minimizing the loss on the retain set (data that should be preserved). This pushes the model away from being able to correctly predict or reproduce the forget data while maintaining performance on everything else. The balance between forget and retain is controlled by the alpha hyperparameter.

3. What is a task vector and how can it be used for unlearning?

A task vector is the weight difference between a fine-tuned model and the base model: task_vector = finetuned_weights - base_weights. It encodes the "skill" learned during fine-tuning. For unlearning, you first fine-tune the base model specifically on the knowledge to remove, compute the task vector, then subtract it from the original model weights. This removes the encoded capability.

4. Why is membership inference AUC an important metric for unlearning evaluation?

Membership inference attacks try to determine whether a specific example was in the training set. After successful unlearning, an attacker should not be able to distinguish forgotten examples from never-seen examples, yielding an AUC near 0.5 (random chance). An AUC significantly above 0.5 indicates that the model still retains detectable traces of the supposedly forgotten data, meaning the unlearning was incomplete.

5. Why is output suppression (refusing to answer) not the same as true unlearning?

Output suppression trains the model to refuse questions about the target topic, but the knowledge remains encoded in the model's weights. This surface-level behavior can be bypassed through jailbreaking, indirect questioning, or fine-tuning the refusal behavior away. True unlearning removes the knowledge from the weights themselves, so it cannot be recovered by any prompting strategy or subsequent training.

Key Takeaways