Module 26 · Section 26.11

Machine Unlearning

Unlearning methods, gradient ascent, LOKA, task vectors, evaluation of forgetting quality, and regulatory motivations
★ Big Picture

Machine unlearning is the ability to remove specific knowledge from a trained model without retraining from scratch. This capability is driven by three needs: GDPR right-to-erasure compliance (removing personal data), copyright compliance (removing copyrighted content), and safety alignment (removing dangerous knowledge). While retraining from scratch on a filtered dataset is the gold standard, it is prohibitively expensive for large models. Approximate unlearning methods trade off forgetting guarantees for computational efficiency.

1. Motivations for Unlearning

| Motivation | What to Remove | Verification Challenge |
| --- | --- | --- |
| GDPR right to erasure | Individual's personal data | Prove the model cannot reproduce the specific data |
| Copyright compliance | Copyrighted text, code, images | Verify no verbatim or near-verbatim reproduction |
| Safety alignment | Dangerous knowledge (bioweapons, hacking) | Ensure knowledge is not recoverable via fine-tuning |
| Model updates | Outdated or incorrect information | Confirm old facts are replaced, not just suppressed |
[Figure: taxonomy of machine unlearning methods]
- Exact unlearning: retrain from scratch on a filtered dataset. Guarantee: complete. Cost: prohibitive for large LLMs. The gold standard.
- Approximate unlearning: gradient ascent on the forget set. Guarantee: partial. Cost: moderate (a few epochs). Most practical.
- Weight editing: task vectors, LOKA, representation surgery. Guarantee: targeted. Cost: low (no training). Emerging research.
Figure 26.11.1: Unlearning methods trade off between forgetting guarantees and computational cost.

2. Gradient Ascent Unlearning

import torch
from torch.utils.data import DataLoader

def gradient_ascent_unlearn(model, forget_loader: DataLoader,
                            retain_loader: DataLoader,
                            epochs: int = 3, lr: float = 1e-5,
                            alpha: float = 0.5):
    """Unlearn via gradient ascent on the forget set + descent on the retain set."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    model.train()

    for epoch in range(epochs):
        total_loss = 0.0
        num_steps = min(len(forget_loader), len(retain_loader))
        forget_iter = iter(forget_loader)
        retain_iter = iter(retain_loader)

        for _ in range(num_steps):
            # Gradient ASCENT on forget data (maximize loss = forget)
            forget_batch = next(forget_iter)
            forget_out = model(**forget_batch, labels=forget_batch["input_ids"])
            forget_loss = -forget_out.loss  # negate for ascent

            # Gradient DESCENT on retain data (minimize loss = keep)
            retain_batch = next(retain_iter)
            retain_out = model(**retain_batch, labels=retain_batch["input_ids"])
            retain_loss = retain_out.loss

            # Combined loss: balance forgetting against retention
            loss = alpha * forget_loss + (1 - alpha) * retain_loss

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            total_loss += loss.item()

        print(f"Epoch {epoch + 1}: avg loss = {total_loss / num_steps:.4f}")

    return model
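The interplay of the ascent and descent terms can be seen on a toy model: a single-parameter linear regressor stands in for the LLM, one data point plays the forget set and one the retain set. All values and hyperparameters here are illustrative, not from the loop above.

```python
import torch

# Toy stand-in for the unlearning loop: one weight, one "fact" to forget
# and one to retain. All data and hyperparameters are illustrative.
w = torch.tensor([1.0], requires_grad=True)
forget_x, forget_y = torch.tensor([1.0]), torch.tensor([2.0])  # push loss UP
retain_x, retain_y = torch.tensor([1.0]), torch.tensor([1.0])  # keep loss low
opt = torch.optim.SGD([w], lr=0.05)
alpha = 0.1  # small alpha so the retain term keeps the update bounded

def mse(x, y):
    return ((w * x - y) ** 2).mean()

before = mse(forget_x, forget_y).item()  # 1.0 at w = 1
for _ in range(50):
    loss = alpha * (-mse(forget_x, forget_y)) + (1 - alpha) * mse(retain_x, retain_y)
    opt.zero_grad()
    loss.backward()
    opt.step()

print(mse(forget_x, forget_y).item() > before)  # True: forget loss rose
print(mse(retain_x, retain_y).item() < 0.1)     # True: retain loss stayed low
```

Note that with a pure ascent term and no retain anchor (alpha = 1), this objective is unbounded below and the weight diverges; the retain term is what keeps the optimization stable.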

3. Task Vector Unlearning

import torch
from collections import OrderedDict

def compute_task_vector(base_weights: dict, finetuned_weights: dict) -> dict:
    """Compute the task vector (difference between fine-tuned and base)."""
    task_vector = OrderedDict()
    for key in base_weights:
        task_vector[key] = finetuned_weights[key] - base_weights[key]
    return task_vector

def negate_task_vector(base_weights: dict, task_vector: dict,
                       scale: float = 1.0) -> dict:
    """Remove a capability by negating the task vector."""
    result = OrderedDict()
    for key in base_weights:
        result[key] = base_weights[key] - scale * task_vector[key]
    return result

# Conceptual example:
# 1. Fine-tune base model on "toxic content generation"
# 2. Compute task_vector = finetuned_weights - base_weights
# 3. Subtract task_vector from base: unlearned = base - scale * task_vector
# Result: model with reduced ability to generate toxic content
print("Task vector unlearning: subtract the 'skill vector' to remove capability")
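The two helpers above reduce to elementwise subtraction, which can be checked on toy state dicts (the tensors below are made up for illustration):

```python
import torch
from collections import OrderedDict

# Toy state dicts standing in for full model checkpoints.
base = OrderedDict(w=torch.tensor([1.0, 2.0]))
finetuned = OrderedDict(w=torch.tensor([1.5, 2.5]))  # learned an unwanted skill

# Same arithmetic as compute_task_vector / negate_task_vector.
task_vector = OrderedDict((k, finetuned[k] - base[k]) for k in base)
unlearned = OrderedDict((k, base[k] - 1.0 * task_vector[k]) for k in base)

print(task_vector["w"])  # tensor([0.5000, 0.5000])
print(unlearned["w"])    # tensor([0.5000, 1.5000])
```

The `scale` factor controls how aggressively the capability is removed; values above 1.0 subtract more than was added, which can damage unrelated capabilities.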

4. Evaluating Unlearning Quality

from dataclasses import dataclass

@dataclass
class UnlearningEvaluation:
    """Evaluate the quality of machine unlearning."""
    forget_accuracy: float     # lower is better (model forgot)
    retain_accuracy: float     # higher is better (model remembers)
    membership_inference_auc: float  # closer to 0.5 is better

    @property
    def forget_quality(self) -> str:
        if self.forget_accuracy < 0.1 and self.retain_accuracy > 0.9:
            return "excellent"
        elif self.forget_accuracy < 0.3 and self.retain_accuracy > 0.8:
            return "good"
        return "insufficient"

    @property
    def privacy_leakage(self) -> str:
        deviation = abs(self.membership_inference_auc - 0.5)
        if deviation < 0.05:
            return "minimal"
        elif deviation < 0.15:
            return "moderate"
        return "significant"

eval_result = UnlearningEvaluation(
    forget_accuracy=0.08, retain_accuracy=0.92,
    membership_inference_auc=0.53
)
print(f"Forget quality: {eval_result.forget_quality}")
print(f"Privacy leakage: {eval_result.privacy_leakage}")
Forget quality: excellent
Privacy leakage: minimal
[Figure: three axes of unlearning evaluation]
- Forget quality: can the model still reproduce the data? Metric: accuracy on the forget set (lower = better). Target: near random.
- Retain quality: does the model still work on other tasks? Metric: accuracy on the retain set (higher = better). Target: unchanged.
- Privacy: can an attacker detect the removed data? Metric: MIA AUC (closer to 0.5 = better). Target: 0.5 (random).
Figure 26.11.2: Good unlearning must score well on all three axes: forgetting the target data, retaining general capability, and resisting membership inference attacks.
⚠ Warning

Approximate unlearning methods (gradient ascent, task vectors) do not provide the same guarantees as retraining from scratch. Recent research has shown that "unlearned" knowledge can sometimes be recovered through targeted fine-tuning or carefully crafted prompts. For high-stakes regulatory compliance, these methods should be combined with other controls (access restrictions, output filtering) rather than relied upon alone.

📝 Note

LOKA (Localized Knowledge Ablation) identifies the specific neurons or attention heads that encode the target knowledge and zeroes out or modifies only those parameters. This surgical approach minimizes collateral damage to other capabilities but requires interpretability tools to locate the relevant parameters.
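A minimal sketch of the localized-ablation idea (illustrative only, not the published LOKA procedure): once a localization step has identified which neurons encode the target knowledge, zero their incoming weights and biases. The layer size and neuron indices below are hypothetical stand-ins for the output of an interpretability tool.

```python
import torch
import torch.nn as nn

# Sketch of localized ablation: zero the rows of an MLP weight matrix
# for neurons identified (by some interpretability tool) as encoding
# the target knowledge. Layer size and indices here are hypothetical.
torch.manual_seed(0)
mlp = nn.Linear(8, 16)       # toy "knowledge-bearing" layer
target_neurons = [3, 7, 11]  # assumed output of a localization step

with torch.no_grad():
    mlp.weight[target_neurons] = 0.0  # ablate incoming weights
    mlp.bias[target_neurons] = 0.0    # and the neurons' biases

x = torch.randn(1, 8)
out = mlp(x)
print(out[0, target_neurons])  # ablated neurons now output exactly 0
```

The surgical appeal is visible here: only 3 of 16 neurons are touched, so the rest of the layer's behavior is unchanged; the hard part in practice is the localization step, not the ablation itself.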

★ Key Insight

The evaluation of unlearning is as important as the unlearning itself. A model that simply refuses to answer questions about the target topic (output suppression) has not truly unlearned; the knowledge is still encoded in the weights and may leak through indirect queries or after fine-tuning. True unlearning must pass membership inference attacks, not just behavioral tests.
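A simple loss-based membership inference check makes this concrete: score forget-set and held-out examples by per-example loss and compute the AUC of "lower loss implies member." The losses below are synthetic (drawn from the same distribution, as they should be after ideal unlearning); a real evaluation would use per-example losses from the unlearned model.

```python
import numpy as np

# Toy loss-based membership inference: forget-set vs. held-out examples,
# scored by per-example loss. AUC near 0.5 = attacker cannot tell them
# apart. Losses here are synthetic and identically distributed.
rng = np.random.default_rng(0)
forget_losses = rng.normal(2.0, 0.5, 200)   # after ideal unlearning...
holdout_losses = rng.normal(2.0, 0.5, 200)  # ...indistinguishable from unseen

def mia_auc(member, nonmember):
    """AUC of 'lower loss => member', via pairwise comparisons."""
    member, nonmember = np.asarray(member), np.asarray(nonmember)
    wins = (member[:, None] < nonmember[None, :]).mean()
    ties = (member[:, None] == nonmember[None, :]).mean()
    return wins + 0.5 * ties

print(round(mia_auc(forget_losses, holdout_losses), 2))  # close to 0.5
```

If unlearning were incomplete, forget-set losses would sit systematically below the held-out losses and the AUC would rise well above 0.5, flagging residual memorization even when the model's outputs look clean.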

Knowledge Check

1. What are the three main motivations for machine unlearning in LLMs?

GDPR right to erasure (removing individual personal data), copyright compliance (removing copyrighted content from model knowledge), and safety alignment (removing dangerous capabilities like bioweapon synthesis instructions). Each motivation has different verification requirements and acceptable tradeoffs.

2. How does gradient ascent achieve unlearning?

Gradient ascent maximizes the loss on the forget set (the data to be removed) while minimizing the loss on the retain set (data that should be preserved). This pushes the model away from being able to correctly predict or reproduce the forget data while maintaining performance on everything else. The balance between forget and retain is controlled by the alpha hyperparameter.

3. What is a task vector and how can it be used for unlearning?

A task vector is the weight difference between a fine-tuned model and the base model: task_vector = finetuned_weights - base_weights. It encodes the "skill" learned during fine-tuning. For unlearning, you first fine-tune the base model specifically on the knowledge to remove, compute the task vector, then subtract it from the original model weights. This removes the encoded capability.

4. Why is membership inference AUC an important metric for unlearning evaluation?

Membership inference attacks try to determine whether a specific example was in the training set. After successful unlearning, an attacker should not be able to distinguish forgotten examples from never-seen examples, yielding an AUC near 0.5 (random chance). An AUC significantly above 0.5 indicates that the model still retains detectable traces of the supposedly forgotten data, meaning the unlearning was incomplete.

5. Why is output suppression (refusing to answer) not the same as true unlearning?

Output suppression trains the model to refuse questions about the target topic, but the knowledge remains encoded in the model's weights. This surface-level behavior can be bypassed through jailbreaking, indirect questioning, or fine-tuning the refusal behavior away. True unlearning removes the knowledge from the weights themselves, so it cannot be recovered by any prompting strategy or subsequent training.

Key Takeaways