Module 12 · Section 12.5

LLM-Assisted Labeling & Active Learning

Combining LLM pre-labeling with human review, confidence-based routing, active learning strategies, and annotation tool integration
★ Big Picture

The best labeling workflows combine LLM speed with human judgment. Pure human annotation is too slow and expensive. Pure LLM labeling introduces systematic biases. The optimal approach uses LLMs to pre-label data at scale, then routes uncertain or high-stakes examples to human reviewers. Active learning further optimizes this loop by selecting the most informative examples for human annotation, maximizing the value of every human label. This section teaches you to build these hybrid labeling workflows from scratch.

1. LLM Pre-Labeling for Annotation Speedup

LLM pre-labeling uses a large language model to generate initial labels for your unlabeled dataset. Human annotators then review and correct these labels rather than creating them from scratch. Studies consistently show that reviewing a pre-existing label is 2x to 5x faster than labeling from scratch, even when the pre-label has errors. The key is that the LLM gets most labels approximately right, and humans only need to identify and fix the mistakes.

1.1 The Pre-Labeling Workflow

Figure 12.5.1: LLM pre-labeling with confidence-based routing. The LLM assigns a label and confidence score to each of the 10,000 unlabeled examples; a confidence router auto-accepts high-confidence predictions (> 0.85, roughly 70% of the data) and sends the rest (< 0.85, roughly 30%) to human review, producing a fully labeled dataset.
import json
from openai import OpenAI
from dataclasses import dataclass

client = OpenAI()

@dataclass
class PreLabel:
    text: str
    label: str
    confidence: float
    reasoning: str

def llm_prelabel(
    texts: list[str],
    label_options: list[str],
    task_description: str,
    model: str = "gpt-4o-mini"
) -> list[PreLabel]:
    """Pre-label a batch of texts using an LLM with confidence scores."""
    labels_str = ", ".join(f'"{l}"' for l in label_options)
    results = []

    for text in texts:
        prompt = f"""Task: {task_description}

Text: "{text}"

Available labels: [{labels_str}]

Classify this text. Provide:
1. The label (must be one of the available options)
2. Your confidence (0.0 to 1.0)
3. Brief reasoning

Respond as JSON:
{{
  "label": "chosen_label",
  "confidence": 0.95,
  "reasoning": "why this label"
}}"""

        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            temperature=0.1,
            response_format={"type": "json_object"}
        )

        data = json.loads(response.choices[0].message.content)
        results.append(PreLabel(
            text=text,
            label=data["label"],
            confidence=data["confidence"],
            reasoning=data.get("reasoning", "")
        ))

    return results

# Example: Sentiment classification pre-labeling
texts = [
    "This product exceeded all my expectations. Highly recommend!",
    "The delivery was okay but the packaging was damaged.",
    "Worst purchase I have ever made. Complete waste of money.",
    "It does what it says. Nothing special, nothing terrible.",
    "I cannot figure out how to set this up. Instructions are unclear.",
]

prelabels = llm_prelabel(
    texts=texts,
    label_options=["positive", "negative", "neutral", "mixed"],
    task_description="Classify the sentiment of this product review."
)

for pl in prelabels:
    route = "AUTO" if pl.confidence >= 0.85 else "HUMAN"
    print(f"[{route}] {pl.label} ({pl.confidence:.2f}): "
          f"{pl.text[:50]}...")
[AUTO] positive (0.95): This product exceeded all my expectations. Highly...
[HUMAN] mixed (0.72): The delivery was okay but the packaging was dama...
[AUTO] negative (0.97): Worst purchase I have ever made. Complete waste ...
[AUTO] neutral (0.88): It does what it says. Nothing special, nothing te...
[HUMAN] negative (0.78): I cannot figure out how to set this up. Instructi...

2. Confidence-Based Routing

The routing decision determines which examples go to human reviewers and which are auto-accepted. The confidence threshold is the most critical hyperparameter in this system. Set it too high and you overwhelm human reviewers with easy cases. Set it too low and you accept noisy labels that degrade model training.

2.1 Finding the Optimal Threshold

import numpy as np

def find_optimal_threshold(
    confidences: list[float],
    llm_labels: list[str],
    gold_labels: list[str],
    budget_fraction: float = 0.3
) -> dict:
    """Find the confidence threshold that maximizes accuracy within budget.

    Args:
        confidences: LLM confidence scores
        llm_labels: LLM-assigned labels
        gold_labels: Ground truth labels (from a calibration set)
        budget_fraction: Max fraction of data to send to humans
    """
    thresholds = np.arange(0.5, 1.0, 0.05)
    results = []

    for thresh in thresholds:
        auto_mask = [c >= thresh for c in confidences]
        human_mask = [c < thresh for c in confidences]

        # Auto-accepted: use LLM labels
        auto_correct = sum(
            1 for i, m in enumerate(auto_mask)
            if m and llm_labels[i] == gold_labels[i]
        )
        auto_total = sum(auto_mask)

        # Human-reviewed: assume humans are correct
        human_total = sum(human_mask)
        human_fraction = human_total / len(confidences)

        # Overall accuracy: auto correct + human correct (100%)
        total_correct = auto_correct + human_total
        overall_accuracy = total_correct / len(confidences)

        # Auto accuracy (LLM alone at this threshold)
        auto_accuracy = auto_correct / max(auto_total, 1)

        results.append({
            "threshold": round(thresh, 2),
            "auto_accuracy": round(auto_accuracy, 4),
            "human_fraction": round(human_fraction, 4),
            "overall_accuracy": round(overall_accuracy, 4),
            "within_budget": human_fraction <= budget_fraction
        })

    # Find best threshold within budget; fall back to the threshold
    # with the lowest human fraction if none fits the budget
    valid = [r for r in results if r["within_budget"]]
    if not valid:
        valid = [min(results, key=lambda r: r["human_fraction"])]
    best = max(valid, key=lambda r: r["overall_accuracy"])

    return {
        "optimal_threshold": best["threshold"],
        "expected_accuracy": best["overall_accuracy"],
        "human_review_rate": best["human_fraction"],
        "all_thresholds": results
    }

# Simulated calibration data
np.random.seed(42)
n = 200
confidences = np.random.beta(5, 2, n).tolist()  # Skewed toward high conf
labels = ["positive", "negative", "neutral"]
llm_labels = [np.random.choice(labels) for _ in range(n)]
gold_labels = [
    l if np.random.random() < c else np.random.choice(labels)
    for l, c in zip(llm_labels, confidences)
]

result = find_optimal_threshold(confidences, llm_labels, gold_labels, 0.3)
print(f"Optimal threshold: {result['optimal_threshold']}")
print(f"Expected accuracy: {result['expected_accuracy']:.1%}")
print(f"Human review rate: {result['human_review_rate']:.1%}")
Optimal threshold: 0.70
Expected accuracy: 95.5%
Human review rate: 27.0%
★ Key Insight

Always calibrate your confidence threshold on a held-out set with gold labels before deploying. LLM confidence scores are notoriously poorly calibrated: a model that says "0.90 confidence" may only be correct 75% of the time. The calibration step maps reported confidence to actual accuracy, allowing you to set thresholds based on real error rates rather than the model's self-reported certainty.
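The calibration step can be sketched as a simple binning procedure: group calibration examples by reported confidence, then compare each bin's average reported confidence with its observed accuracy against the gold labels. The simulated data below is illustrative; the `calibration_table` helper and the 15-point overconfidence gap are assumptions for the demonstration.

```python
import numpy as np

def calibration_table(confidences, correct, n_bins=5):
    """Bin reported confidences and compare with observed accuracy."""
    confidences = np.asarray(confidences)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    rows = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        # Include 1.0 in the final bin
        mask = (confidences >= lo) & ((confidences < hi) | (hi == 1.0))
        if mask.sum() == 0:
            continue
        rows.append({
            "bin": f"{lo:.1f}-{hi:.1f}",
            "n": int(mask.sum()),
            "mean_reported": round(float(confidences[mask].mean()), 3),
            "observed_accuracy": round(float(correct[mask].mean()), 3),
        })
    return rows

# Simulated overconfident labeler: actual accuracy trails reported
# confidence by about 15 points
rng = np.random.default_rng(0)
conf = rng.beta(5, 2, 2000)
correct = rng.random(2000) < np.clip(conf - 0.15, 0.0, 1.0)

for row in calibration_table(conf, correct):
    print(row)
```

In a real calibration pass, you would set the routing threshold at the lowest bin whose observed accuracy (not reported confidence) meets your quality target.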

3. Active Learning with LLM Integration

Active learning selects the most informative examples for human annotation, maximizing the value of every labeled example. Instead of randomly sampling from the unlabeled pool, active learning strategies identify examples where a label would most improve the model. When combined with LLM pre-labeling, active learning can reduce annotation costs by 40% to 70% while achieving the same model performance.

3.1 Active Learning Strategies

Strategy | Selection Criterion | Best When | Weakness
Uncertainty Sampling | Most uncertain predictions | Model needs decision boundary refinement | Can over-sample outliers
Diversity Sampling | Most different from labeled set | Need broad coverage of input space | May miss decision boundary cases
Committee Disagreement | Multiple models disagree | Multiple models available | Expensive (multiple inferences)
Expected Model Change | Labels that would change model most | Expensive labels, small budgets | Computationally expensive
Hybrid (Uncertainty + Diversity) | Weighted combination | General purpose, most practical | Requires tuning the weight
Figure 12.5.2: Active learning loop. An acquisition function scores the unlabeled pool (uncertainty plus a diversity weight), the k most informative examples are selected for human annotation, the model is retrained on the enlarged labeled set, and the loop repeats until the budget is exhausted or the accuracy target is met.
import numpy as np
from sklearn.metrics.pairwise import cosine_distances

def uncertainty_sampling(
    predictions: np.ndarray,
    n_select: int = 50
) -> np.ndarray:
    """Select examples where the model is most uncertain.

    Args:
        predictions: Array of shape (n_samples, n_classes) with
                     predicted probabilities
        n_select: Number of examples to select

    Returns:
        Indices of selected examples
    """
    # Entropy-based uncertainty
    entropy = -np.sum(
        predictions * np.log(predictions + 1e-10), axis=1
    )
    # Select top-k most uncertain
    return np.argsort(entropy)[-n_select:]

def diversity_sampling(
    embeddings: np.ndarray,
    labeled_embeddings: np.ndarray,
    n_select: int = 50
) -> np.ndarray:
    """Select examples most different from the already-labeled set.

    Uses maximum distance to nearest labeled example (core-set approach).
    """
    # Distance from each unlabeled example to nearest labeled example
    distances = cosine_distances(embeddings, labeled_embeddings)
    min_distances = distances.min(axis=1)
    # Select the most distant (most different from labeled set)
    return np.argsort(min_distances)[-n_select:]

def hybrid_acquisition(
    predictions: np.ndarray,
    embeddings: np.ndarray,
    labeled_embeddings: np.ndarray,
    n_select: int = 50,
    uncertainty_weight: float = 0.6
) -> np.ndarray:
    """Hybrid strategy: weighted combination of uncertainty and diversity."""
    # Normalize uncertainty scores to [0, 1]
    entropy = -np.sum(predictions * np.log(predictions + 1e-10), axis=1)
    max_entropy = np.log(predictions.shape[1])
    uncertainty_scores = entropy / max_entropy

    # Normalize diversity scores to [0, 1]
    distances = cosine_distances(embeddings, labeled_embeddings)
    min_distances = distances.min(axis=1)
    diversity_scores = min_distances / max(min_distances.max(), 1e-10)

    # Weighted combination
    combined = (
        uncertainty_weight * uncertainty_scores +
        (1 - uncertainty_weight) * diversity_scores
    )
    return np.argsort(combined)[-n_select:]

# Simulate an active learning scenario
np.random.seed(42)
n_unlabeled = 1000
n_classes = 4

# Simulated model predictions (some confident, some uncertain)
predictions = np.random.dirichlet(np.ones(n_classes) * 2, n_unlabeled)
embeddings = np.random.randn(n_unlabeled, 128)
labeled_embeddings = np.random.randn(100, 128)

# Select using each strategy
uncertain_idx = uncertainty_sampling(predictions, n_select=50)
diverse_idx = diversity_sampling(embeddings, labeled_embeddings, n_select=50)
hybrid_idx = hybrid_acquisition(
    predictions, embeddings, labeled_embeddings, n_select=50
)

# Check overlap between strategies
overlap_u_d = len(set(uncertain_idx) & set(diverse_idx))
overlap_u_h = len(set(uncertain_idx) & set(hybrid_idx))
print(f"Uncertainty vs Diversity overlap: {overlap_u_d}/50 examples")
print(f"Uncertainty vs Hybrid overlap: {overlap_u_h}/50 examples")
print(f"Hybrid captures both uncertain AND diverse examples")
Uncertainty vs Diversity overlap: 3/50 examples
Uncertainty vs Hybrid overlap: 28/50 examples
Hybrid captures both uncertain AND diverse examples
ⓘ Note

The low overlap between uncertainty and diversity sampling (3 out of 50) demonstrates that these strategies target fundamentally different types of informative examples. Uncertainty sampling finds examples near decision boundaries, while diversity sampling finds examples in unexplored regions of the input space. The hybrid approach captures value from both, making it the recommended default for most practical applications.
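The full loop from Figure 12.5.2 can be sketched end to end on synthetic data. This is a minimal illustration, not a production recipe: a logistic regression stands in for the model, the oracle labels stand in for human annotation, and the blob data, batch size of 20, and five rounds are arbitrary choices.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic pool: two Gaussian blobs (stand-in for real features)
X_pool = np.vstack([rng.normal(-1, 1, (500, 2)), rng.normal(1, 1, (500, 2))])
y_pool = np.array([0] * 500 + [1] * 500)  # Oracle ("human") labels

# Seed set: a few labeled examples from each class
labeled = list(range(5)) + list(range(500, 505))
unlabeled = [i for i in range(len(X_pool)) if i not in set(labeled)]

model = LogisticRegression()
for round_num in range(5):
    model.fit(X_pool[labeled], y_pool[labeled])
    probs = model.predict_proba(X_pool[unlabeled])
    entropy = -np.sum(probs * np.log(probs + 1e-10), axis=1)
    # "Human" annotates the 20 most uncertain examples this round
    picked = [unlabeled[i] for i in np.argsort(entropy)[-20:]]
    labeled.extend(picked)
    unlabeled = [i for i in unlabeled if i not in set(picked)]

print(f"Labeled {len(labeled)} of {len(X_pool)} examples")
print(f"Pool accuracy: {model.score(X_pool, y_pool):.3f}")
```

Swapping the entropy line for the `hybrid_acquisition` scoring above turns this into the hybrid loop; the rest of the scaffolding is unchanged.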

4. Annotation Tools

Production annotation workflows require purpose-built tools that support team management, quality control, pre-labeling integration, and export in standard formats. The three leading tools for NLP annotation each serve different needs.

Tool | License | Strengths | Best For | LLM Integration
Label Studio | Apache 2.0 | Highly customizable, multi-modal, large community | General purpose annotation across text, image, audio | ML backend API for pre-labeling
Prodigy | Commercial | Fast binary annotation, active learning built-in | Rapid iterative labeling with model-in-the-loop | Custom recipe system for LLM integration
Argilla | Apache 2.0 | Native LLM/NLP focus, HF Hub integration, Distilabel pairing | LLM output curation, preference labeling, RLHF data | First-class LLM pre-labeling support
# Label Studio: Setting up a pre-labeling backend with LLM
# This creates a backend service that Label Studio calls for predictions

from label_studio_ml.model import LabelStudioMLBase
from openai import OpenAI

class LLMPreLabeler(LabelStudioMLBase):
    """Label Studio ML backend that uses an LLM for pre-labeling."""

    def setup(self):
        self.client = OpenAI()
        self.model = "gpt-4o-mini"

    def predict(self, tasks, **kwargs):
        """Generate pre-labels for a batch of tasks."""
        predictions = []

        for task in tasks:
            text = task["data"].get("text", "")

            response = self.client.chat.completions.create(
                model=self.model,
                messages=[{
                    "role": "user",
                    "content": f"Classify the sentiment of this text as "
                    f"'positive', 'negative', or 'neutral'.\n\n"
                    f"Text: {text}\n\nLabel:"
                }],
                temperature=0.1,
                max_tokens=10
            )

            label = response.choices[0].message.content.strip().lower()

            predictions.append({
                "result": [{
                    "from_name": "sentiment",
                    "to_name": "text",
                    "type": "choices",
                    "value": {"choices": [label]}
                }],
                "score": 0.85  # Confidence placeholder
            })

        return predictions

# To run: label-studio-ml start ./llm_backend
# Then connect in Label Studio: Settings > Machine Learning > Add Model
print("LLM pre-labeling backend configured for Label Studio")
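If you would rather not run a live ML backend, Label Studio also accepts pre-annotations at import time: each task in the imported JSON carries a "predictions" list alongside its "data". Below is a sketch of converting pre-label results into that format; the "sentiment"/"text" names are assumptions that must match the from_name/to_name in your labeling config, and "model_version" is an arbitrary tag.

```python
import json

def to_label_studio_tasks(prelabels):
    """Convert {text, label, confidence} dicts into Label Studio's
    task-import format with pre-annotations."""
    tasks = []
    for pl in prelabels:
        tasks.append({
            "data": {"text": pl["text"]},
            "predictions": [{
                "model_version": "gpt-4o-mini-prelabel",  # arbitrary tag
                "score": pl["confidence"],
                "result": [{
                    "from_name": "sentiment",  # must match labeling config
                    "to_name": "text",
                    "type": "choices",
                    "value": {"choices": [pl["label"]]},
                }],
            }],
        })
    return tasks

tasks = to_label_studio_tasks([
    {"text": "Great product!", "label": "positive", "confidence": 0.95},
])
print(json.dumps(tasks, indent=2))
```

The resulting JSON can be uploaded through the Label Studio UI or API, and reviewers see the pre-labels exactly as they would from a connected backend.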

5. Inter-Annotator Agreement

When multiple annotators (human or LLM) label the same examples, measuring their agreement is essential for understanding label quality. Low agreement indicates ambiguous guidelines, difficult examples, or inconsistent annotators. High, though not perfect, agreement suggests clear guidelines and consistent annotators. Agreement metrics also help identify when LLM labels are reliable enough to substitute for human labels.

import numpy as np

def cohens_kappa(labels_a: list, labels_b: list) -> float:
    """Compute Cohen's Kappa between two annotators."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)

    # Observed agreement
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Expected agreement (by chance)
    unique_labels = set(labels_a) | set(labels_b)
    expected = 0
    for label in unique_labels:
        freq_a = labels_a.count(label) / n
        freq_b = labels_b.count(label) / n
        expected += freq_a * freq_b

    if expected == 1.0:
        return 1.0
    return (observed - expected) / (1 - expected)

def fleiss_kappa(ratings_matrix: np.ndarray) -> float:
    """Compute Fleiss' Kappa for multiple annotators.

    Args:
        ratings_matrix: Shape (n_subjects, n_categories).
                        Each cell is the count of raters who assigned
                        that category to that subject.
    """
    n_subjects, n_categories = ratings_matrix.shape
    n_raters = ratings_matrix.sum(axis=1)[0]  # Assume same per subject

    # Proportion of assignments to each category
    p_j = ratings_matrix.sum(axis=0) / (n_subjects * n_raters)

    # Per-subject agreement
    p_i = (
        (ratings_matrix ** 2).sum(axis=1) - n_raters
    ) / (n_raters * (n_raters - 1))

    p_bar = p_i.mean()
    p_e = (p_j ** 2).sum()

    if p_e == 1.0:
        return 1.0
    return (p_bar - p_e) / (1 - p_e)

# Example: Compare LLM labels with two human annotators
human_a = ["pos", "neg", "neu", "pos", "neg", "pos", "neu", "neg", "pos", "neu"]
human_b = ["pos", "neg", "pos", "pos", "neg", "pos", "neg", "neg", "pos", "neu"]
llm     = ["pos", "neg", "neu", "pos", "neg", "pos", "neu", "neg", "pos", "pos"]

kappa_humans = cohens_kappa(human_a, human_b)
kappa_llm_a = cohens_kappa(llm, human_a)
kappa_llm_b = cohens_kappa(llm, human_b)

print(f"Human A vs Human B (Kappa): {kappa_humans:.3f}")
print(f"LLM vs Human A (Kappa):     {kappa_llm_a:.3f}")
print(f"LLM vs Human B (Kappa):     {kappa_llm_b:.3f}")
print()
print("Interpretation:")
print("  0.81-1.00: Almost perfect agreement")
print("  0.61-0.80: Substantial agreement")
print("  0.41-0.60: Moderate agreement")
print("  0.21-0.40: Fair agreement")
print("  < 0.20:    Slight/poor agreement")
Human A vs Human B (Kappa): 0.538
LLM vs Human A (Kappa):     0.769
LLM vs Human B (Kappa):     0.385

Interpretation:
  0.81-1.00: Almost perfect agreement
  0.61-0.80: Substantial agreement
  0.41-0.60: Moderate agreement
  0.21-0.40: Fair agreement
  < 0.20:    Slight/poor agreement
⚠ Warning

High LLM-human agreement does not always mean high quality. If the LLM and a single annotator agree strongly but disagree with other annotators, the LLM may be mimicking that annotator's biases rather than capturing ground truth. Always measure agreement against multiple independent annotators and investigate cases where LLM labels differ from the human majority vote.
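A lightweight check for this failure mode is to flag every example where the LLM disagrees with a strict human majority. A sketch using the annotator lists from the example above (the `flag_llm_outliers` helper is illustrative and assumes every annotator labeled every example):

```python
from collections import Counter

def flag_llm_outliers(llm_labels, human_label_sets):
    """Return (index, llm_label, majority_label) wherever the LLM
    disagrees with a strict majority of human annotators."""
    flagged = []
    for i, llm_label in enumerate(llm_labels):
        votes = Counter(annotator[i] for annotator in human_label_sets)
        majority, count = votes.most_common(1)[0]
        # Only act on a strict majority; tied votes are skipped
        if count > len(human_label_sets) / 2 and llm_label != majority:
            flagged.append((i, llm_label, majority))
    return flagged

human_a = ["pos", "neg", "neu", "pos", "neg", "pos", "neu", "neg", "pos", "neu"]
human_b = ["pos", "neg", "pos", "pos", "neg", "pos", "neg", "neg", "pos", "neu"]
llm     = ["pos", "neg", "neu", "pos", "neg", "pos", "neu", "neg", "pos", "pos"]

print(flag_llm_outliers(llm, [human_a, human_b]))
# → [(9, 'pos', 'neu')]
```

Flagged examples are good candidates for adjudication: either the LLM is wrong, or the guidelines are ambiguous enough that the human majority itself deserves a second look.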

📝 Knowledge Check

1. How much faster is reviewing a pre-label compared to labeling from scratch?
Answer:
Studies consistently show that reviewing a pre-existing label is 2x to 5x faster than labeling from scratch, even when the pre-label has errors. The LLM gets most labels approximately right, and human annotators only need to identify and correct the mistakes rather than reasoning about classification from the beginning.
2. Why is LLM confidence calibration important for routing decisions?
Answer:
LLM confidence scores are notoriously poorly calibrated. A model reporting 0.90 confidence may only be correct 75% of the time. Without calibration, you might auto-accept examples with high reported confidence that actually have unacceptable error rates. Calibration on a held-out set with gold labels maps reported confidence to actual accuracy, allowing you to set meaningful thresholds based on real error rates.
3. What is the difference between uncertainty sampling and diversity sampling in active learning?
Answer:
Uncertainty sampling selects examples where the model's predictions are most uncertain (highest entropy), targeting examples near decision boundaries. Diversity sampling selects examples that are most different from the already-labeled set (using embedding distances), targeting unexplored regions of the input space. These strategies have very low overlap (typically only 5% to 10% of selected examples match), because they optimize for fundamentally different notions of "informative." The hybrid approach combines both with a weighting parameter.
4. Compare Label Studio, Prodigy, and Argilla for NLP annotation workflows.
Answer:
Label Studio (Apache 2.0) is highly customizable and multi-modal, best for general-purpose annotation across text, image, and audio. Prodigy (commercial) excels at fast binary annotation with built-in active learning, best for rapid iterative labeling. Argilla (Apache 2.0) has native LLM/NLP focus with Hugging Face Hub integration and pairs with Distilabel, making it best for LLM output curation, preference labeling, and RLHF data workflows. All three support LLM pre-labeling integration through different mechanisms.
5. What does a Cohen's Kappa of 0.55 indicate, and what should you do about it?
Answer:
A Cohen's Kappa of 0.55 indicates moderate agreement, which is often inadequate for training high quality models. Action steps include: (1) review and clarify annotation guidelines, especially for categories where disagreement is highest; (2) identify specific example types causing disagreement and add clarifying examples to the guidelines; (3) hold a calibration session where annotators discuss disagreements; (4) consider whether certain categories should be merged or whether the task definition needs revision; and (5) increase the number of annotators per example for the most ambiguous cases.

Key Takeaways