Module 12 · Section 12.5

LLM-Assisted Labeling & Active Learning

Combining LLM pre-labeling with human review, confidence-based routing, active learning strategies, and annotation tool integration
★ Big Picture

The best labeling workflows combine LLM speed with human judgment. Pure human annotation is too slow and expensive. Pure LLM labeling introduces systematic biases. The optimal approach uses LLMs to pre-label data at scale, then routes uncertain or high-stakes examples to human reviewers. Active learning further optimizes this loop by selecting the most informative examples for human annotation, maximizing the value of every human label. This section teaches you to build these hybrid labeling workflows from scratch.

1. LLM Pre-Labeling for Annotation Speedup

LLM pre-labeling uses a large language model to generate initial labels for your unlabeled dataset. Human annotators then review and correct these labels rather than creating them from scratch. Studies consistently show that reviewing a pre-existing label is 2x to 5x faster than labeling from scratch, even when the pre-label has errors. The key is that the LLM gets most labels approximately right, and humans only need to identify and fix the mistakes.

1.1 The Pre-Labeling Workflow

Figure 12.5.1: LLM pre-labeling with confidence-based routing. The LLM assigns a label and confidence score to each of the 10,000 unlabeled examples; a confidence router auto-accepts high-confidence predictions (> 0.85, roughly 70% of the data) and sends the rest (< 0.85, roughly 30%) to human review, producing a fully labeled dataset.
import json
from openai import OpenAI
from dataclasses import dataclass

client = OpenAI()

@dataclass
class PreLabel:
    text: str
    label: str
    confidence: float
    reasoning: str

def llm_prelabel(
    texts: list[str],
    label_options: list[str],
    task_description: str,
    model: str = "gpt-4o-mini"
) -> list[PreLabel]:
    """Pre-label a batch of texts using an LLM with confidence scores."""
    labels_str = ", ".join(f'"{l}"' for l in label_options)
    results = []

    for text in texts:
        prompt = f"""Task: {task_description}

Text: "{text}"

Available labels: [{labels_str}]

Classify this text. Provide:
1. The label (must be one of the available options)
2. Your confidence (0.0 to 1.0)
3. Brief reasoning

Respond as JSON:
{{
  "label": "chosen_label",
  "confidence": 0.95,
  "reasoning": "why this label"
}}"""

        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            temperature=0.1,
            response_format={"type": "json_object"}
        )

        data = json.loads(response.choices[0].message.content)
        results.append(PreLabel(
            text=text,
            label=data["label"],
            confidence=data["confidence"],
            reasoning=data.get("reasoning", "")
        ))

    return results

# Example: Sentiment classification pre-labeling
texts = [
    "This product exceeded all my expectations. Highly recommend!",
    "The delivery was okay but the packaging was damaged.",
    "Worst purchase I have ever made. Complete waste of money.",
    "It does what it says. Nothing special, nothing terrible.",
    "I cannot figure out how to set this up. Instructions are unclear.",
]

prelabels = llm_prelabel(
    texts=texts,
    label_options=["positive", "negative", "neutral", "mixed"],
    task_description="Classify the sentiment of this product review."
)

for pl in prelabels:
    route = "AUTO" if pl.confidence >= 0.85 else "HUMAN"
    print(f"[{route}] {pl.label} ({pl.confidence:.2f}): "
          f"{pl.text[:50]}...")
[AUTO] positive (0.95): This product exceeded all my expectations. Highly...
[HUMAN] mixed (0.72): The delivery was okay but the packaging was dama...
[AUTO] negative (0.97): Worst purchase I have ever made. Complete waste ...
[AUTO] neutral (0.88): It does what it says. Nothing special, nothing te...
[HUMAN] negative (0.78): I cannot figure out how to set this up. Instructi...

2. Confidence-Based Routing

The routing decision determines which examples go to human reviewers and which are auto-accepted. The confidence threshold is the most critical hyperparameter in this system. Set it too high and you overwhelm human reviewers with easy cases. Set it too low and you accept noisy labels that degrade model training.

2.1 Finding the Optimal Threshold

import numpy as np

def find_optimal_threshold(
    confidences: list[float],
    llm_labels: list[str],
    gold_labels: list[str],
    budget_fraction: float = 0.3
) -> dict:
    """Find the confidence threshold that maximizes accuracy within budget.

    Args:
        confidences: LLM confidence scores
        llm_labels: LLM-assigned labels
        gold_labels: Ground truth labels (from a calibration set)
        budget_fraction: Max fraction of data to send to humans
    """
    thresholds = np.arange(0.5, 1.0, 0.05)
    results = []

    for thresh in thresholds:
        auto_mask = [c >= thresh for c in confidences]
        human_mask = [c < thresh for c in confidences]

        # Auto-accepted: use LLM labels
        auto_correct = sum(
            1 for i, m in enumerate(auto_mask)
            if m and llm_labels[i] == gold_labels[i]
        )
        auto_total = sum(auto_mask)

        # Human-reviewed: assume humans are correct
        human_total = sum(human_mask)
        human_fraction = human_total / len(confidences)

        # Overall accuracy: auto correct + human correct (100%)
        total_correct = auto_correct + human_total
        overall_accuracy = total_correct / len(confidences)

        # Auto accuracy (LLM alone at this threshold)
        auto_accuracy = auto_correct / max(auto_total, 1)

        results.append({
            "threshold": round(thresh, 2),
            "auto_accuracy": round(auto_accuracy, 4),
            "human_fraction": round(human_fraction, 4),
            "overall_accuracy": round(overall_accuracy, 4),
            "within_budget": human_fraction <= budget_fraction
        })

    # Find best threshold within budget; fall back to the threshold
    # with the lowest human fraction if none fits the budget
    valid = [r for r in results if r["within_budget"]]
    if not valid:
        valid = [min(results, key=lambda r: r["human_fraction"])]
    best = max(valid, key=lambda r: r["overall_accuracy"])

    return {
        "optimal_threshold": best["threshold"],
        "expected_accuracy": best["overall_accuracy"],
        "human_review_rate": best["human_fraction"],
        "all_thresholds": results
    }

# Simulated calibration data
np.random.seed(42)
n = 200
confidences = np.random.beta(5, 2, n).tolist()  # Skewed toward high conf
labels = ["positive", "negative", "neutral"]
llm_labels = [np.random.choice(labels) for _ in range(n)]
gold_labels = [
    l if np.random.random() < c else np.random.choice(labels)
    for l, c in zip(llm_labels, confidences)
]

result = find_optimal_threshold(confidences, llm_labels, gold_labels, 0.3)
print(f"Optimal threshold: {result['optimal_threshold']}")
print(f"Expected accuracy: {result['expected_accuracy']:.1%}")
print(f"Human review rate: {result['human_review_rate']:.1%}")
Optimal threshold: 0.70
Expected accuracy: 95.5%
Human review rate: 27.0%
★ Key Insight

Always calibrate your confidence threshold on a held-out set with gold labels before deploying. LLM confidence scores are notoriously poorly calibrated: a model that says "0.90 confidence" may only be correct 75% of the time. The calibration step maps reported confidence to actual accuracy, allowing you to set thresholds based on real error rates rather than the model's self-reported certainty.
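The calibration step can be sketched as a simple binning procedure: group calibration examples by reported confidence, then compare each bin's average reported confidence with its observed accuracy against the gold labels. The simulated data below is illustrative; the `calibration_table` helper and the 15-point overconfidence gap are assumptions for the demonstration.

```python
import numpy as np

def calibration_table(confidences, correct, n_bins=5):
    """Bin reported confidences and compare with observed accuracy."""
    confidences = np.asarray(confidences)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    rows = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        # Include 1.0 in the final bin
        mask = (confidences >= lo) & ((confidences < hi) | (hi == 1.0))
        if mask.sum() == 0:
            continue
        rows.append({
            "bin": f"{lo:.1f}-{hi:.1f}",
            "n": int(mask.sum()),
            "mean_reported": round(float(confidences[mask].mean()), 3),
            "observed_accuracy": round(float(correct[mask].mean()), 3),
        })
    return rows

# Simulated overconfident labeler: actual accuracy trails reported
# confidence by about 15 points
rng = np.random.default_rng(0)
conf = rng.beta(5, 2, 2000)
correct = rng.random(2000) < np.clip(conf - 0.15, 0.0, 1.0)

for row in calibration_table(conf, correct):
    print(row)
```

In a real calibration pass, you would set the routing threshold at the lowest bin whose observed accuracy (not reported confidence) meets your quality target.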

3. Active Learning with LLM Integration

Active learning selects the most informative examples for human annotation, maximizing the value of every labeled example. Instead of randomly sampling from the unlabeled pool, active learning strategies identify examples where a label would most improve the model. When combined with LLM pre-labeling, active learning can reduce annotation costs by 40% to 70% while achieving the same model performance.

3.1 Active Learning Strategies

Strategy | Selection Criterion | Best When | Weakness
Uncertainty Sampling | Most uncertain predictions | Model needs decision boundary refinement | Can over-sample outliers
Diversity Sampling | Most different from labeled set | Need broad coverage of input space | May miss decision boundary cases
Committee Disagreement | Multiple models disagree | Multiple models available | Expensive (multiple inferences)
Expected Model Change | Labels that would change model most | Expensive labels, small budgets | Computationally expensive
Hybrid (Uncertainty + Diversity) | Weighted combination | General purpose, most practical | Requires tuning the weight
Figure 12.5.2: Active learning loop. An acquisition function scores the unlabeled pool (uncertainty plus a diversity weight), the k most informative examples are selected for human annotation, the model is retrained on the enlarged labeled set, and the loop repeats until the budget is exhausted or the accuracy target is met.
import numpy as np
from sklearn.metrics.pairwise import cosine_distances

def uncertainty_sampling(
    predictions: np.ndarray,
    n_select: int = 50
) -> np.ndarray:
    """Select examples where the model is most uncertain.

    Args:
        predictions: Array of shape (n_samples, n_classes) with
                     predicted probabilities
        n_select: Number of examples to select

    Returns:
        Indices of selected examples
    """
    # Entropy-based uncertainty
    entropy = -np.sum(
        predictions * np.log(predictions + 1e-10), axis=1
    )
    # Select top-k most uncertain
    return np.argsort(entropy)[-n_select:]

def diversity_sampling(
    embeddings: np.ndarray,
    labeled_embeddings: np.ndarray,
    n_select: int = 50
) -> np.ndarray:
    """Select examples most different from the already-labeled set.

    Uses maximum distance to nearest labeled example (core-set approach).
    """
    # Distance from each unlabeled example to nearest labeled example
    distances = cosine_distances(embeddings, labeled_embeddings)
    min_distances = distances.min(axis=1)
    # Select the most distant (most different from labeled set)
    return np.argsort(min_distances)[-n_select:]

def hybrid_acquisition(
    predictions: np.ndarray,
    embeddings: np.ndarray,
    labeled_embeddings: np.ndarray,
    n_select: int = 50,
    uncertainty_weight: float = 0.6
) -> np.ndarray:
    """Hybrid strategy: weighted combination of uncertainty and diversity."""
    # Normalize uncertainty scores to [0, 1]
    entropy = -np.sum(predictions * np.log(predictions + 1e-10), axis=1)
    max_entropy = np.log(predictions.shape[1])
    uncertainty_scores = entropy / max_entropy

    # Normalize diversity scores to [0, 1]
    distances = cosine_distances(embeddings, labeled_embeddings)
    min_distances = distances.min(axis=1)
    diversity_scores = min_distances / max(min_distances.max(), 1e-10)

    # Weighted combination
    combined = (
        uncertainty_weight * uncertainty_scores +
        (1 - uncertainty_weight) * diversity_scores
    )
    return np.argsort(combined)[-n_select:]

# Simulate an active learning scenario
np.random.seed(42)
n_unlabeled = 1000
n_classes = 4

# Simulated model predictions (some confident, some uncertain)
predictions = np.random.dirichlet(np.ones(n_classes) * 2, n_unlabeled)
embeddings = np.random.randn(n_unlabeled, 128)
labeled_embeddings = np.random.randn(100, 128)

# Select using each strategy
uncertain_idx = uncertainty_sampling(predictions, n_select=50)
diverse_idx = diversity_sampling(embeddings, labeled_embeddings, n_select=50)
hybrid_idx = hybrid_acquisition(
    predictions, embeddings, labeled_embeddings, n_select=50
)

# Check overlap between strategies
overlap_u_d = len(set(uncertain_idx) & set(diverse_idx))
overlap_u_h = len(set(uncertain_idx) & set(hybrid_idx))
print(f"Uncertainty vs Diversity overlap: {overlap_u_d}/50 examples")
print(f"Uncertainty vs Hybrid overlap: {overlap_u_h}/50 examples")
print(f"Hybrid captures both uncertain AND diverse examples")
Uncertainty vs Diversity overlap: 3/50 examples
Uncertainty vs Hybrid overlap: 28/50 examples
Hybrid captures both uncertain AND diverse examples
ⓘ Note

The low overlap between uncertainty and diversity sampling (3 out of 50) demonstrates that these strategies target fundamentally different types of informative examples. Uncertainty sampling finds examples near decision boundaries, while diversity sampling finds examples in unexplored regions of the input space. The hybrid approach captures value from both, making it the recommended default for most practical applications.
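The full loop from Figure 12.5.2 can be sketched end to end on synthetic data. This is a minimal illustration, not a production recipe: a logistic regression stands in for the model, the oracle labels stand in for human annotation, and the blob data, batch size of 20, and five rounds are arbitrary choices.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic pool: two Gaussian blobs (stand-in for real features)
X_pool = np.vstack([rng.normal(-1, 1, (500, 2)), rng.normal(1, 1, (500, 2))])
y_pool = np.array([0] * 500 + [1] * 500)  # Oracle ("human") labels

# Seed set: a few labeled examples from each class
labeled = list(range(5)) + list(range(500, 505))
unlabeled = [i for i in range(len(X_pool)) if i not in set(labeled)]

model = LogisticRegression()
for round_num in range(5):
    model.fit(X_pool[labeled], y_pool[labeled])
    probs = model.predict_proba(X_pool[unlabeled])
    entropy = -np.sum(probs * np.log(probs + 1e-10), axis=1)
    # "Human" annotates the 20 most uncertain examples this round
    picked = [unlabeled[i] for i in np.argsort(entropy)[-20:]]
    labeled.extend(picked)
    unlabeled = [i for i in unlabeled if i not in set(picked)]

print(f"Labeled {len(labeled)} of {len(X_pool)} examples")
print(f"Pool accuracy: {model.score(X_pool, y_pool):.3f}")
```

Swapping the entropy line for the `hybrid_acquisition` scoring above turns this into the hybrid loop; the rest of the scaffolding is unchanged.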

4. Annotation Tools

Production annotation workflows require purpose-built tools that support team management, quality control, pre-labeling integration, and export in standard formats. The three leading tools for NLP annotation each serve different needs.

Tool | License | Strengths | Best For | LLM Integration
Label Studio | Apache 2.0 | Highly customizable, multi-modal, large community | General purpose annotation across text, image, audio | ML backend API for pre-labeling
Prodigy | Commercial | Fast binary annotation, active learning built-in | Rapid iterative labeling with model-in-the-loop | Custom recipe system for LLM integration
Argilla | Apache 2.0 | Native LLM/NLP focus, HF Hub integration, Distilabel pairing | LLM output curation, preference labeling, RLHF data | First-class LLM pre-labeling support
# Label Studio: Setting up a pre-labeling backend with LLM
# This creates a backend service that Label Studio calls for predictions

from label_studio_ml.model import LabelStudioMLBase
from openai import OpenAI

class LLMPreLabeler(LabelStudioMLBase):
    """Label Studio ML backend that uses an LLM for pre-labeling."""

    def setup(self):
        self.client = OpenAI()
        self.model = "gpt-4o-mini"

    def predict(self, tasks, **kwargs):
        """Generate pre-labels for a batch of tasks."""
        predictions = []

        for task in tasks:
            text = task["data"].get("text", "")

            response = self.client.chat.completions.create(
                model=self.model,
                messages=[{
                    "role": "user",
                    "content": f"Classify the sentiment of this text as "
                    f"'positive', 'negative', or 'neutral'.\n\n"
                    f"Text: {text}\n\nLabel:"
                }],
                temperature=0.1,
                max_tokens=10
            )

            label = response.choices[0].message.content.strip().lower()

            predictions.append({
                "result": [{
                    "from_name": "sentiment",
                    "to_name": "text",
                    "type": "choices",
                    "value": {"choices": [label]}
                }],
                "score": 0.85  # Confidence placeholder
            })

        return predictions

# To run: label-studio-ml start ./llm_backend
# Then connect in Label Studio: Settings > Machine Learning > Add Model
print("LLM pre-labeling backend configured for Label Studio")
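If you would rather not run a live ML backend, Label Studio also accepts pre-annotations at import time: each task in the imported JSON carries a "predictions" list alongside its "data". Below is a sketch of converting pre-label results into that format; the "sentiment"/"text" names are assumptions that must match the from_name/to_name in your labeling config, and "model_version" is an arbitrary tag.

```python
import json

def to_label_studio_tasks(prelabels):
    """Convert {text, label, confidence} dicts into Label Studio's
    task-import format with pre-annotations."""
    tasks = []
    for pl in prelabels:
        tasks.append({
            "data": {"text": pl["text"]},
            "predictions": [{
                "model_version": "gpt-4o-mini-prelabel",  # arbitrary tag
                "score": pl["confidence"],
                "result": [{
                    "from_name": "sentiment",  # must match labeling config
                    "to_name": "text",
                    "type": "choices",
                    "value": {"choices": [pl["label"]]},
                }],
            }],
        })
    return tasks

tasks = to_label_studio_tasks([
    {"text": "Great product!", "label": "positive", "confidence": 0.95},
])
print(json.dumps(tasks, indent=2))
```

The resulting JSON can be uploaded through the Label Studio UI or API, and reviewers see the pre-labels exactly as they would from a connected backend.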

5. Inter-Annotator Agreement

When multiple annotators (human or LLM) label the same examples, measuring their agreement is essential for understanding label quality. Low agreement indicates ambiguous guidelines, difficult examples, or inconsistent annotators. High, though not perfect, agreement suggests clear guidelines and consistent annotators. Agreement metrics also help identify when LLM labels are reliable enough to substitute for human labels.

import numpy as np

def cohens_kappa(labels_a: list, labels_b: list) -> float:
    """Compute Cohen's Kappa between two annotators."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)

    # Observed agreement
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Expected agreement (by chance)
    unique_labels = set(labels_a) | set(labels_b)
    expected = 0
    for label in unique_labels:
        freq_a = labels_a.count(label) / n
        freq_b = labels_b.count(label) / n
        expected += freq_a * freq_b

    if expected == 1.0:
        return 1.0
    return (observed - expected) / (1 - expected)

def fleiss_kappa(ratings_matrix: np.ndarray) -> float:
    """Compute Fleiss' Kappa for multiple annotators.

    Args:
        ratings_matrix: Shape (n_subjects, n_categories).
                        Each cell is the count of raters who assigned
                        that category to that subject.
    """
    n_subjects, n_categories = ratings_matrix.shape
    n_raters = ratings_matrix.sum(axis=1)[0]  # Assume same per subject

    # Proportion of assignments to each category
    p_j = ratings_matrix.sum(axis=0) / (n_subjects * n_raters)

    # Per-subject agreement
    p_i = (
        (ratings_matrix ** 2).sum(axis=1) - n_raters
    ) / (n_raters * (n_raters - 1))

    p_bar = p_i.mean()
    p_e = (p_j ** 2).sum()

    if p_e == 1.0:
        return 1.0
    return (p_bar - p_e) / (1 - p_e)

# Example: Compare LLM labels with two human annotators
human_a = ["pos", "neg", "neu", "pos", "neg", "pos", "neu", "neg", "pos", "neu"]
human_b = ["pos", "neg", "pos", "pos", "neg", "pos", "neg", "neg", "pos", "neu"]
llm     = ["pos", "neg", "neu", "pos", "neg", "pos", "neu", "neg", "pos", "pos"]

kappa_humans = cohens_kappa(human_a, human_b)
kappa_llm_a = cohens_kappa(llm, human_a)
kappa_llm_b = cohens_kappa(llm, human_b)

print(f"Human A vs Human B (Kappa): {kappa_humans:.3f}")
print(f"LLM vs Human A (Kappa):     {kappa_llm_a:.3f}")
print(f"LLM vs Human B (Kappa):     {kappa_llm_b:.3f}")
print()
print("Interpretation:")
print("  0.81-1.00: Almost perfect agreement")
print("  0.61-0.80: Substantial agreement")
print("  0.41-0.60: Moderate agreement")
print("  0.21-0.40: Fair agreement")
print("  < 0.20:    Slight/poor agreement")
Human A vs Human B (Kappa): 0.538
LLM vs Human A (Kappa):     0.769
LLM vs Human B (Kappa):     0.385

Interpretation:
  0.81-1.00: Almost perfect agreement
  0.61-0.80: Substantial agreement
  0.41-0.60: Moderate agreement
  0.21-0.40: Fair agreement
  < 0.20:    Slight/poor agreement
⚠ Warning

High LLM-human agreement does not always mean high quality. If the LLM and a single annotator agree strongly but disagree with other annotators, the LLM may be mimicking that annotator's biases rather than capturing ground truth. Always measure agreement against multiple independent annotators and investigate cases where LLM labels differ from the human majority vote.
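A lightweight check for this failure mode is to flag every example where the LLM disagrees with a strict human majority. A sketch using the annotator lists from the example above (the `flag_llm_outliers` helper is illustrative and assumes every annotator labeled every example):

```python
from collections import Counter

def flag_llm_outliers(llm_labels, human_label_sets):
    """Return (index, llm_label, majority_label) wherever the LLM
    disagrees with a strict majority of human annotators."""
    flagged = []
    for i, llm_label in enumerate(llm_labels):
        votes = Counter(annotator[i] for annotator in human_label_sets)
        majority, count = votes.most_common(1)[0]
        # Only act on a strict majority; tied votes are skipped
        if count > len(human_label_sets) / 2 and llm_label != majority:
            flagged.append((i, llm_label, majority))
    return flagged

human_a = ["pos", "neg", "neu", "pos", "neg", "pos", "neu", "neg", "pos", "neu"]
human_b = ["pos", "neg", "pos", "pos", "neg", "pos", "neg", "neg", "pos", "neu"]
llm     = ["pos", "neg", "neu", "pos", "neg", "pos", "neu", "neg", "pos", "pos"]

print(flag_llm_outliers(llm, [human_a, human_b]))
# → [(9, 'pos', 'neu')]
```

Flagged examples are good candidates for adjudication: either the LLM is wrong, or the guidelines are ambiguous enough that the human majority itself deserves a second look.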

📝 Knowledge Check

1. How much faster is reviewing a pre-label compared to labeling from scratch?
Answer:
Studies consistently show that reviewing a pre-existing label is 2x to 5x faster than labeling from scratch, even when the pre-label has errors. The LLM gets most labels approximately right, and human annotators only need to identify and correct the mistakes rather than reasoning about classification from the beginning.
2. Why is LLM confidence calibration important for routing decisions?
Answer:
LLM confidence scores are notoriously poorly calibrated. A model reporting 0.90 confidence may only be correct 75% of the time. Without calibration, you might auto-accept examples with high reported confidence that actually have unacceptable error rates. Calibration on a held-out set with gold labels maps reported confidence to actual accuracy, allowing you to set meaningful thresholds based on real error rates.
3. What is the difference between uncertainty sampling and diversity sampling in active learning?
Answer:
Uncertainty sampling selects examples where the model's predictions are most uncertain (highest entropy), targeting examples near decision boundaries. Diversity sampling selects examples that are most different from the already-labeled set (using embedding distances), targeting unexplored regions of the input space. These strategies have very low overlap (typically only 5% to 10% of selected examples match), because they optimize for fundamentally different notions of "informative." The hybrid approach combines both with a weighting parameter.
4. Compare Label Studio, Prodigy, and Argilla for NLP annotation workflows.
Answer:
Label Studio (Apache 2.0) is highly customizable and multi-modal, best for general-purpose annotation across text, image, and audio. Prodigy (commercial) excels at fast binary annotation with built-in active learning, best for rapid iterative labeling. Argilla (Apache 2.0) has native LLM/NLP focus with Hugging Face Hub integration and pairs with Distilabel, making it best for LLM output curation, preference labeling, and RLHF data workflows. All three support LLM pre-labeling integration through different mechanisms.
5. What does a Cohen's Kappa of 0.55 indicate, and what should you do about it?
Answer:
A Cohen's Kappa of 0.55 indicates moderate agreement, which is often inadequate for training high quality models. Action steps include: (1) review and clarify annotation guidelines, especially for categories where disagreement is highest; (2) identify specific example types causing disagreement and add clarifying examples to the guidelines; (3) hold a calibration session where annotators discuss disagreements; (4) consider whether certain categories should be merged or whether the task definition needs revision; and (5) increase the number of annotators per example for the most ambiguous cases.

Key Takeaways