Module 13 · Section 13.6

Fine-Tuning for Classification & Sequence Tasks

Adding classification heads; handling single-label, multi-label, and token-level tasks with Hugging Face AutoModel classes; and strategies for class imbalance
★ Big Picture

Classification is the most common fine-tuning task in production NLP. Sentiment analysis, spam detection, intent classification, named entity recognition, and content moderation all reduce to some form of classification. The approach is different from generative fine-tuning (SFT): instead of training the model to generate text, you add a classification head on top of the pre-trained model and train it to predict discrete labels. Hugging Face's AutoModel classes make this straightforward, but there are important decisions around architecture, loss functions, and class imbalance that determine whether your classifier works well in practice.

1. Classification Head Architecture

When fine-tuning a transformer for classification, you keep the pre-trained encoder body and add a small classification head on top. The head is typically a linear layer (or a small MLP) that maps the model's hidden representation to class logits. The entire model (encoder plus head) is trained end-to-end, but the head learns from scratch while the encoder benefits from pre-trained knowledge.

[Figure: "[CLS] This movie was absolutely fantastic [SEP]" is fed through a pre-trained transformer encoder (BERT-base: 12 layers, 768 hidden dim); the [CLS] hidden state (768d) passes through dropout (0.1) and a linear layer (768 -> num_labels) to produce logits, e.g. [2.1, -0.8, 0.3] -> "Positive".]
Figure 13.13: The [CLS] token representation is passed through a dropout layer and linear classification head
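The head in Figure 13.13 is small enough to write out directly. A minimal PyTorch sketch, with the encoder stubbed out by a random hidden state (dimensions follow BERT-base; the class name is illustrative):

```python
import torch
import torch.nn as nn

class ClassificationHead(nn.Module):
    """Dropout + linear head mapping the [CLS] representation to class logits."""

    def __init__(self, hidden_dim: int = 768, num_labels: int = 3, dropout: float = 0.1):
        super().__init__()
        self.dropout = nn.Dropout(dropout)
        self.classifier = nn.Linear(hidden_dim, num_labels)

    def forward(self, cls_hidden_state: torch.Tensor) -> torch.Tensor:
        # cls_hidden_state: (batch_size, hidden_dim), the [CLS] token's final hidden state
        return self.classifier(self.dropout(cls_hidden_state))

head = ClassificationHead()
cls_state = torch.randn(4, 768)  # stand-in for the encoder's [CLS] output
logits = head(cls_state)
print(logits.shape)  # torch.Size([4, 3])
```

In `AutoModelForSequenceClassification`, an equivalent head is added on top of the pre-trained encoder and both are trained end-to-end.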

2. Single-Label Classification

Single-label classification is the simplest case: each input belongs to exactly one class. Examples include sentiment analysis (positive/negative/neutral), intent classification (booking/cancellation/inquiry), and content moderation (safe/unsafe). Hugging Face provides AutoModelForSequenceClassification that handles the architecture automatically.

from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    TrainingArguments,
    Trainer,
)
from datasets import load_dataset
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, classification_report

# Load model with classification head
model_name = "bert-base-uncased"
num_labels = 2  # SST-2 is binary: negative, positive

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=num_labels,
    problem_type="single_label_classification",
    # Map label indices to human-readable names
    id2label={0: "negative", 1: "positive"},
    label2id={"negative": 0, "positive": 1},
)

# Load and tokenize dataset
dataset = load_dataset("sst2")  # Stanford Sentiment Treebank (binary sentiment)

def tokenize_function(examples):
    return tokenizer(
        examples["sentence"],
        padding="max_length",
        truncation=True,
        max_length=128,
    )

tokenized = dataset.map(tokenize_function, batched=True)

# Define metrics
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return {
        "accuracy": accuracy_score(labels, predictions),
        "f1_macro": f1_score(labels, predictions, average="macro"),
        "f1_weighted": f1_score(labels, predictions, average="weighted"),
    }

# Training configuration
training_args = TrainingArguments(
    output_dir="./checkpoints/sentiment-bert",
    num_train_epochs=3,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=64,
    learning_rate=2e-5,
    weight_decay=0.01,
    warmup_ratio=0.1,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="f1_macro",
    report_to="wandb",
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    compute_metrics=compute_metrics,
)

trainer.train()
📝 Note

AutoModelForSequenceClassification handles the head for you. When you specify num_labels, the model adds a randomly initialized classification head of the correct size. The loss follows problem_type: "single_label_classification" uses cross-entropy (softmax over classes), "multi_label_classification" uses binary cross-entropy with logits (sigmoid per class), and num_labels=1 is treated as regression with MSE loss. The id2label and label2id mappings are saved with the model, so you do not need to track them separately at inference time.

3. Multi-Label Classification

In multi-label classification, each input can belong to zero, one, or multiple classes simultaneously. Examples include topic tagging (an article can be about both "politics" and "economics"), content warning systems (a post can be both "violent" and "sexually explicit"), and skill detection (a resume can mention both "Python" and "machine learning"). The key difference is using sigmoid activation instead of softmax and binary cross-entropy loss instead of cross-entropy.

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Multi-label setup
labels = ["politics", "economics", "technology", "sports", "entertainment"]
num_labels = len(labels)

model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=num_labels,
    problem_type="multi_label_classification",  # Key difference
    id2label={i: l for i, l in enumerate(labels)},
    label2id={l: i for i, l in enumerate(labels)},
)

# Multi-label data preparation
def prepare_multi_label_example(text: str, active_labels: list) -> dict:
    """Convert a multi-label example to training format."""
    # Create binary label vector
    label_vector = [1.0 if l in active_labels else 0.0 for l in labels]
    return {
        "text": text,
        "labels": label_vector,  # Float tensor for BCE loss
    }

# Example data
examples = [
    prepare_multi_label_example(
        "The new AI regulation bill passed the Senate today",
        ["politics", "technology"]
    ),
    prepare_multi_label_example(
        "Stock markets rallied after the Fed announced rate cuts",
        ["economics"]
    ),
    prepare_multi_label_example(
        "The tech startup raised $50M for its AI-powered sports analytics platform",
        ["technology", "sports", "economics"]
    ),
]

# Multi-label metrics
from sklearn.metrics import (
    f1_score, precision_score, recall_score,
    hamming_loss, classification_report,
)

def compute_multi_label_metrics(eval_pred):
    logits, labels = eval_pred
    # Apply sigmoid (not softmax) for multi-label
    predictions = (torch.sigmoid(torch.tensor(logits)) > 0.5).int().numpy()

    return {
        "f1_micro": f1_score(labels, predictions, average="micro"),
        "f1_macro": f1_score(labels, predictions, average="macro"),
        "f1_samples": f1_score(labels, predictions, average="samples"),
        "hamming_loss": hamming_loss(labels, predictions),
        "precision_macro": precision_score(labels, predictions, average="macro"),
        "recall_macro": recall_score(labels, predictions, average="macro"),
    }
🔑 Key Insight

Sigmoid vs. softmax: the critical distinction. Single-label uses softmax, which normalizes logits into a probability distribution that sums to 1 (the model must pick one class). Multi-label uses sigmoid, which independently maps each logit to a probability between 0 and 1 (each class is an independent binary decision). Using the wrong activation function is a common bug that leads to poor multi-label performance.
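The difference is easy to see numerically. A quick numpy sketch with made-up logits for three classes:

```python
import numpy as np

logits = np.array([2.0, 1.5, -3.0])  # hypothetical logits for three classes

# Softmax (single-label): probabilities compete and sum to 1
softmax = np.exp(logits) / np.exp(logits).sum()

# Sigmoid (multi-label): each class is an independent yes/no decision
sigmoid = 1 / (1 + np.exp(-logits))

print(softmax.round(3))  # ~[0.62  0.376 0.004] -- forced to sum to 1
print(sigmoid.round(3))  # ~[0.881 0.818 0.047] -- two classes can both be "on"
```

Note that softmax suppresses the second class even though its logit is strongly positive, while sigmoid lets the first two classes both fire; that is exactly why softmax is the wrong choice for multi-label tasks.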

4. Token Classification (NER, POS)

Token classification assigns a label to each individual token in the input sequence. Named Entity Recognition (NER) is the most common example: labeling each token as a person, organization, location, or none. Part-of-speech (POS) tagging is another classic token classification task. The model architecture uses every token's representation (not just [CLS]) and applies a classification head to each one.

[Figure: for "John works at Google in Mountain View", the transformer encoder produces per-token representations, and a linear classification head applied to each token yields B-PER O O B-ORG O B-LOC I-LOC (B = Begin entity, I = Inside entity, O = Outside, no entity).]
Figure 13.14: Token classification applies a classification head to each token, using BIO tagging for entity boundaries
from transformers import AutoModelForTokenClassification, AutoTokenizer
from datasets import load_dataset

# NER label scheme (BIO format)
label_list = [
    "O", "B-PER", "I-PER", "B-ORG", "I-ORG",
    "B-LOC", "I-LOC", "B-MISC", "I-MISC"
]
label2id = {l: i for i, l in enumerate(label_list)}
id2label = {i: l for i, l in enumerate(label_list)}

# Load model for token classification
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=len(label_list),
    id2label=id2label,
    label2id=label2id,
)
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Load CoNLL-2003 NER dataset
dataset = load_dataset("conll2003")  # on newer datasets versions this may need the full id "eriktks/conll2003"

def tokenize_and_align_labels(examples):
    """Tokenize and align NER labels with subword tokens."""
    tokenized = tokenizer(
        examples["tokens"],
        truncation=True,
        is_split_into_words=True,  # Input is already tokenized
        padding="max_length",
        max_length=128,
    )

    labels = []
    for i, label_ids in enumerate(examples["ner_tags"]):
        word_ids = tokenized.word_ids(batch_index=i)
        label_row = []
        previous_word_id = None

        for word_id in word_ids:
            if word_id is None:
                # Special tokens ([CLS], [SEP], [PAD])
                label_row.append(-100)
            elif word_id != previous_word_id:
                # First subword token of a word: use the word's label
                label_row.append(label_ids[word_id])
            else:
                # Subsequent subword tokens: use I- tag or -100
                original_label = label_ids[word_id]
                # Convert B- to I- for continuation tokens
                label_name = label_list[original_label]
                if label_name.startswith("B-"):
                    i_label = label_name.replace("B-", "I-")
                    label_row.append(label2id.get(i_label, original_label))
                else:
                    label_row.append(original_label)
            previous_word_id = word_id

        labels.append(label_row)

    tokenized["labels"] = labels
    return tokenized

tokenized_dataset = dataset.map(tokenize_and_align_labels, batched=True)
⚠ Warning

Subword tokenization breaks word boundaries. A critical challenge in token classification is that the tokenizer may split a single word into multiple subword tokens. The word "Mountain" might become ["Mount", "##ain"]. You must carefully align the original word-level labels with the subword tokens. The standard approach is to assign the label to the first subword and use -100 (ignore) or the corresponding I- tag for continuation subwords.
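The alignment logic can be checked without loading a model. A pure-Python sketch using a hypothetical word_ids sequence like the one a tokenizer would return for [CLS] John works at Mount ##ain View [SEP]:

```python
# Word-level labels for: "John works at Mountain View"
word_labels = ["B-PER", "O", "O", "B-LOC", "I-LOC"]
label2id = {"O": 0, "B-PER": 1, "I-PER": 2, "B-LOC": 3, "I-LOC": 4}

# Hypothetical word_ids: None marks special tokens; "Mountain" split into two subwords
word_ids = [None, 0, 1, 2, 3, 3, 4, None]

aligned = []
previous = None
for wid in word_ids:
    if wid is None:
        aligned.append(-100)  # special tokens: ignored by the loss
    elif wid != previous:
        aligned.append(label2id[word_labels[wid]])  # first subword keeps the word's label
    else:
        # continuation subword: convert B- to I- so the entity is not "re-opened"
        name = word_labels[wid]
        name = name.replace("B-", "I-") if name.startswith("B-") else name
        aligned.append(label2id[name])
    previous = wid

print(aligned)  # [-100, 1, 0, 0, 3, 4, 4, -100]
```

The second subword of "Mountain" gets I-LOC (4), not B-LOC (3), exactly as in the tokenize_and_align_labels function above.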

5. Sequence-Pair Tasks

Some classification tasks require comparing two input texts. Natural language inference (NLI) classifies the relationship between a premise and hypothesis as entailment, contradiction, or neutral. Semantic textual similarity (STS) scores the similarity between two sentences. Question answering classification determines whether a passage contains the answer to a question. All of these are handled with the same AutoModelForSequenceClassification by providing both texts to the tokenizer.

# Sequence pair classification (NLI example)
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=3,
    id2label={0: "entailment", 1: "neutral", 2: "contradiction"},
    label2id={"entailment": 0, "neutral": 1, "contradiction": 2},
)
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Tokenize a sentence pair
def tokenize_nli(examples):
    """Tokenize premise-hypothesis pairs."""
    return tokenizer(
        examples["premise"],
        examples["hypothesis"],
        padding="max_length",
        truncation=True,
        max_length=256,
    )

# The tokenizer automatically adds [SEP] between the two inputs:
# [CLS] premise tokens [SEP] hypothesis tokens [SEP]
sample = tokenizer(
    "A man is playing guitar on stage.",
    "A musician is performing live.",
    return_tensors="pt",
)
print(f"Input: {tokenizer.decode(sample['input_ids'][0])}")
print(f"Token type IDs: {sample['token_type_ids'][0][:20].tolist()}")
# token_type_ids: 0 for premise tokens, 1 for hypothesis tokens

6. Handling Class Imbalance

Real-world classification datasets are almost always imbalanced. Fraud detection might have 0.1% positive examples; medical diagnosis datasets often have rare conditions representing less than 1% of cases. Without mitigation, the model will learn to predict the majority class and ignore rare but important classes.

Strategy        | How It Works                                      | When to Use
Weighted loss   | Assign higher loss weight to minority classes     | Moderate imbalance (5:1 to 20:1)
Oversampling    | Duplicate minority class examples                 | Small datasets where more data helps
Undersampling   | Remove majority class examples                    | Very large datasets with extreme imbalance
Focal loss      | Down-weight easy examples, focus on hard ones     | Extreme imbalance (100:1+)
Synthetic data  | Generate additional minority examples with LLMs   | When real minority data is scarce
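Focal loss from the table is not built into transformers, but it is short to implement. A sketch of the standard formulation, FL(p_t) = -(1 - p_t)^gamma * log(p_t), for single-label classification:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FocalLoss(nn.Module):
    """Cross-entropy scaled by (1 - p_t)^gamma so easy examples contribute less."""

    def __init__(self, gamma: float = 2.0):
        super().__init__()
        self.gamma = gamma

    def forward(self, logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
        ce = F.cross_entropy(logits, targets, reduction="none")  # per-example -log(p_t)
        p_t = torch.exp(-ce)                                     # probability of the true class
        return ((1 - p_t) ** self.gamma * ce).mean()

logits = torch.tensor([[4.0, -2.0], [0.1, 0.0]])  # one easy, one hard example
targets = torch.tensor([0, 0])
loss = FocalLoss(gamma=2.0)(logits, targets)  # hard example dominates the total
```

With gamma=0 this reduces exactly to standard cross-entropy; larger gamma pushes the loss toward the hard examples. It drops into the WeightedTrainer pattern below in place of nn.CrossEntropyLoss.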
import torch
import torch.nn as nn
import numpy as np
from torch.utils.data import WeightedRandomSampler
from transformers import Trainer

class WeightedTrainer(Trainer):
    """Custom Trainer with class-weighted loss for imbalanced data."""

    def __init__(self, class_weights=None, **kwargs):
        super().__init__(**kwargs)
        if class_weights is not None:
            self.class_weights = torch.tensor(
                class_weights, dtype=torch.float32
            )
        else:
            self.class_weights = None

    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        labels = inputs.pop("labels")
        outputs = model(**inputs)
        logits = outputs.logits

        if self.class_weights is not None:
            weight = self.class_weights.to(logits.device)
            loss_fn = nn.CrossEntropyLoss(weight=weight)
        else:
            loss_fn = nn.CrossEntropyLoss()

        loss = loss_fn(logits.view(-1, self.model.config.num_labels), labels.view(-1))
        return (loss, outputs) if return_outputs else loss
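The WeightedRandomSampler imported above is the oversampling counterpart: instead of reweighting the loss, draw minority examples more often. A minimal sketch on a toy 90/10 label list (the tensors are placeholders for real features):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

labels = torch.tensor([0] * 90 + [1] * 10)      # 90/10 imbalance
class_counts = torch.bincount(labels).float()
sample_weights = 1.0 / class_counts[labels]     # per-example weight = 1 / class frequency

sampler = WeightedRandomSampler(
    weights=sample_weights,
    num_samples=len(labels),
    replacement=True,  # minority examples are drawn repeatedly
)

dataset = TensorDataset(torch.randn(100, 8), labels)
loader = DataLoader(dataset, batch_size=20, sampler=sampler)

# Each batch is now roughly class-balanced in expectation
_, batch_labels = next(iter(loader))
print(batch_labels.float().mean())  # typically near 0.5 rather than 0.1
```

In a Trainer workflow, you would plug such a sampler in by overriding get_train_dataloader. Avoid combining aggressive oversampling with heavy class weights: the two stack multiplicatively.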

# Calculate class weights from data distribution
def compute_class_weights(labels: list, strategy: str = "inverse") -> list:
    """Compute class weights for imbalanced datasets."""
    unique, counts = np.unique(labels, return_counts=True)
    total = len(labels)

    if strategy == "inverse":
        # Weight inversely proportional to frequency
        weights = total / (len(unique) * counts)
    elif strategy == "sqrt_inverse":
        # Softer version: square root of inverse frequency
        weights = np.sqrt(total / (len(unique) * counts))
    elif strategy == "effective":
        # Effective number of samples (Class-Balanced Loss)
        beta = 0.9999
        effective_num = 1.0 - np.power(beta, counts)
        weights = (1.0 - beta) / effective_num
    else:
        raise ValueError(f"Unknown weighting strategy: {strategy}")

    # Normalize so mean weight = 1
    weights = weights / weights.mean()
    return weights.tolist()

# Example: imbalanced dataset
labels = [0]*9000 + [1]*800 + [2]*200  # 90% / 8% / 2% distribution
weights = compute_class_weights(labels, strategy="sqrt_inverse")
print(f"Class weights: {[f'{w:.2f}' for w in weights]}")
print(f"Class 0 (90%): {weights[0]:.2f}x")
print(f"Class 2 (2%):  {weights[2]:.2f}x")
Class weights: ['0.27', '0.91', '1.82']
Class 0 (90%): 0.27x
Class 2 (2%):  1.82x
🔑 Key Insight

Use F1 macro (not accuracy) for imbalanced datasets. Accuracy is misleading when classes are imbalanced: a model that always predicts the majority class achieves 90% accuracy on a 90/10 split. F1 macro averages the F1 score across all classes equally, giving equal weight to minority classes. Always track per-class precision and recall to understand where the model is failing.
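The numbers in this insight are easy to verify by hand. A pure-Python sketch of the degenerate majority-class predictor on a 90/10 split:

```python
# 90 majority (class 0) and 10 minority (class 1) examples;
# the degenerate model always predicts class 0
y_true = [0] * 90 + [1] * 10
y_pred = [0] * 100

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def f1(cls):
    """Per-class F1 = 2*TP / (2*TP + FP + FN)."""
    tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
    fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
    fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

f1_macro = (f1(0) + f1(1)) / 2

print(f"accuracy: {accuracy:.2f}")  # 0.90 -- looks healthy
print(f"f1_macro: {f1_macro:.2f}")  # 0.47 -- exposes the zero F1 on the minority class
```

This is the same f1_score(average="macro") used in compute_metrics above; the minority class's F1 of 0.0 drags the macro average down even though accuracy stays at 90%.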

Section 13.6 Quiz

1. What is the key difference between the loss function used for single-label and multi-label classification?
Show Answer
Single-label classification uses cross-entropy loss with softmax, which normalizes logits into a probability distribution that sums to 1 (exactly one class must be selected). Multi-label classification uses binary cross-entropy loss with sigmoid, which independently maps each class logit to a probability between 0 and 1 (any combination of classes can be active). Using softmax for multi-label tasks forces the model to trade off between classes and prevents it from assigning high probability to multiple classes simultaneously.
2. In NER with BIO tagging, why do we need to align labels with subword tokens?
Show Answer
Transformers use subword tokenization, which can split a single word into multiple tokens. For example, "Washington" might become ["Wash", "##ington"]. NER labels are defined at the word level, so we need a strategy to assign labels to each subword. The standard approach assigns the original B- label to the first subword and either the corresponding I- label or -100 (ignore in loss) to continuation subwords. Without this alignment, the model would receive incorrect training signals.
3. A fraud detection dataset has 99.9% legitimate transactions and 0.1% fraudulent ones. Which class imbalance strategy would you recommend?
Show Answer
For such extreme imbalance (1000:1), use a combination of strategies: (1) Focal loss to down-weight the overwhelming easy majority examples and focus learning on the hard minority examples. (2) Oversampling of the fraud class, potentially combined with synthetic data generation using an LLM to create additional realistic fraud examples. (3) Always evaluate using precision, recall, and F1 for the fraud class specifically, not overall accuracy. Additionally, consider whether the task can be reformulated as anomaly detection rather than classification.
4. How does AutoModelForSequenceClassification handle sentence-pair tasks like NLI?
Show Answer
The tokenizer handles sentence pairs by concatenating them with a [SEP] token: [CLS] premise [SEP] hypothesis [SEP]. It also generates token_type_ids that mark which tokens belong to the first sentence (0) and which belong to the second (1). The model's attention mechanism can then attend across both sentences, and the [CLS] token representation captures the relationship between them. The same AutoModelForSequenceClassification class works for both single-text and pair tasks.
5. Why should you use F1 macro instead of accuracy when evaluating on imbalanced datasets?
Show Answer
Accuracy is misleading on imbalanced datasets because a trivial model that always predicts the majority class achieves high accuracy. For example, on a 95%/5% split, always predicting the majority class yields 95% accuracy while completely failing on the minority class. F1 macro computes the F1 score for each class independently and then averages them, giving equal weight to all classes regardless of their frequency. This ensures that poor performance on minority classes is reflected in the overall metric.

Key Takeaways