Module 12 · Section 12.6

Weak Supervision & Programmatic Labeling

Scaling labeling with labeling functions, noise-aware aggregation, and combining weak supervision with LLM-generated labels
★ Big Picture

Replace hand-labeling with programming. Weak supervision flips the traditional annotation paradigm: instead of labeling examples one at a time, you write labeling functions that encode heuristics, patterns, and domain knowledge as code. Each function is noisy and incomplete on its own, but a label aggregation model combines their outputs into probabilistic labels that approach human quality. When combined with LLM-generated labels as an additional signal source, weak supervision creates a powerful, scalable, and maintainable labeling system. This section covers the Snorkel paradigm, practical labeling function design, aggregation models, and cost-quality tradeoff analysis.

1. Weak Supervision Fundamentals

Traditional supervised learning requires a clean, fully labeled training set. Weak supervision relaxes this requirement by accepting noisy, partial, and potentially conflicting labels from multiple imperfect sources. The core insight is that while any single labeling heuristic is unreliable, the collective signal from many independent heuristics can produce high-quality labels when aggregated properly.
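The aggregation insight can be demonstrated with a toy simulation (the numbers here are hypothetical, chosen only for illustration): several independent voters, each just 70% accurate on a binary task, are combined by simple majority vote, and accuracy climbs with the voter count.

```python
import numpy as np

# Toy demonstration: independent voters that are each only 70% accurate,
# combined by simple majority vote on a binary labeling task.
rng = np.random.default_rng(0)
n_examples, p_correct = 10_000, 0.70
truth = rng.integers(0, 2, n_examples)

accuracies = {}
for k in (1, 5, 15):  # odd voter counts avoid ties
    # Each voter independently agrees with the truth with probability 0.70
    agree = rng.random((n_examples, k)) < p_correct
    votes = np.where(agree, truth[:, None], 1 - truth[:, None])
    majority = (votes.mean(axis=1) > 0.5).astype(int)
    accuracies[k] = (majority == truth).mean()
    print(f"{k:>2} voters -> majority-vote accuracy {accuracies[k]:.3f}")
```

With one voter, accuracy sits near 0.70; with fifteen independent voters it approaches 0.95. Real labeling functions are not fully independent, which is precisely why the label model discussed below must learn their correlation structure.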

1.1 The Snorkel Paradigm

The Snorkel framework, developed at Stanford, formalized weak supervision into a three-stage pipeline: (1) write labeling functions that encode heuristics, (2) train a label model that learns the accuracy and correlation structure of these functions, and (3) use the resulting probabilistic labels to train a downstream classifier. This approach has been adopted widely in industry, powering labeling systems at Google, Apple, Intel, and many startups.

[Figure: pipeline diagram — unlabeled data (N examples) → labeling functions (LF1: keyword match, LF2: regex pattern, LF3: heuristic rule, LF4: external KB, LF5: LLM classifier) → label matrix (N × M matrix of votes from M labeling functions; sparse, many abstains) → label model (learns LF accuracies and correlations; outputs P(y | LFs))]
Figure 12.6.1: The Snorkel paradigm: labeling functions produce a sparse label matrix, and a label model aggregates votes into probabilistic labels.
ⓘ Note

The label model in Snorkel does not simply take a majority vote. It learns a generative model of how each labeling function relates to the true (unobserved) label, accounting for the fact that some functions are more accurate than others and that some pairs of functions are correlated (for example, two keyword-based functions may make the same mistakes on the same examples). This produces better labels than naive majority voting, typically improving accuracy by 5 to 15 percentage points.
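To make the Note above concrete, here is a toy version of such a generative model. This sketch is a deliberate simplification, not Snorkel's actual implementation: it treats the true label as latent, assumes the labeling functions are conditionally independent given that label, and runs plain EM — estimate a posterior over labels from current accuracy guesses, then re-estimate each function's accuracy from that posterior, with no gold labels anywhere. The name `em_label_model` and all parameters are illustrative.

```python
import numpy as np

def em_label_model(L, n_classes, n_iter=50, abstain=-1):
    """Toy EM label model. L is an (examples x LFs) vote matrix with
    `abstain` marking no vote. Assumes LFs are conditionally independent
    given the latent true label (Snorkel's real model also learns
    correlations). Returns (per-LF accuracy estimates, P(y | votes))."""
    n, m = L.shape
    acc = np.full(m, 0.7)                        # init above chance to fix polarity
    prior = np.full(n_classes, 1.0 / n_classes)  # class prior, re-estimated
    for _ in range(n_iter):
        # E-step: posterior over the latent label for every example
        log_post = np.tile(np.log(prior), (n, 1))
        for j in range(m):
            fired = L[:, j] != abstain
            for y in range(n_classes):
                agree = L[fired, j] == y
                p = np.where(agree, acc[j], (1 - acc[j]) / (n_classes - 1))
                log_post[fired, y] += np.log(p)
        post = np.exp(log_post - log_post.max(axis=1, keepdims=True))
        post /= post.sum(axis=1, keepdims=True)
        # M-step: an LF's accuracy is its expected rate of agreement
        # with the latent label, averaged over examples where it fired
        for j in range(m):
            fired = L[:, j] != abstain
            if fired.any():
                agree_prob = post[fired, L[fired, j]]
                acc[j] = np.clip(agree_prob.mean(), 1e-3, 1 - 1e-3)
        prior = post.mean(axis=0)
    return acc, post
```

On simulated vote matrices with genuinely independent functions, this recovers accuracy estimates close to the true values; correlated functions violate the independence assumption, which is exactly the gap Snorkel's richer model closes.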

2. Writing Labeling Functions

A labeling function (LF) takes an example as input and returns either a label or an abstain signal. Good labeling functions are narrow and precise: they should have high accuracy on the examples they label, even if they abstain on most of the dataset. A function that labels only 5% of examples but is 95% accurate is more valuable than one that labels everything at 60% accuracy.

2.1 Types of Labeling Functions

Type           Approach                           Typical Accuracy   Typical Coverage   Example
Keyword        Check for specific words/phrases   80-95%             5-20%              "refund" in text implies complaint
Pattern/Regex  Match structural patterns          85-95%             3-15%              Email regex for contact detection
Heuristic      Domain-specific rules              70-90%             10-40%             Length > 500 chars implies detailed review
External KB    Lookup in knowledge base           90-99%             5-30%              Company name in CRM implies B2B
Model-based    Small classifier or LLM            75-90%             60-100%            Sentiment classifier output
LLM            LLM zero/few-shot classification   80-92%             90-100%            GPT-4o-mini topic classification
import re
from enum import IntEnum

# Define label space
class Sentiment(IntEnum):
    ABSTAIN = -1
    NEGATIVE = 0
    NEUTRAL = 1
    POSITIVE = 2

# ---- Keyword-based labeling functions ----

def lf_positive_keywords(text: str) -> int:
    """Label as positive if strong positive keywords present."""
    positive_words = {"excellent", "amazing", "love", "fantastic",
                      "outstanding", "perfect", "wonderful", "great"}
    # Tokenize on word characters so punctuation ("amazing!") doesn't hide matches
    words = set(re.findall(r"[a-z']+", text.lower()))
    if words & positive_words:
        return Sentiment.POSITIVE
    return Sentiment.ABSTAIN

def lf_negative_keywords(text: str) -> int:
    """Label as negative if strong negative keywords present."""
    negative_words = {"terrible", "awful", "horrible", "worst",
                      "waste", "broken", "useless", "disappointed"}
    words = set(re.findall(r"[a-z']+", text.lower()))
    if words & negative_words:
        return Sentiment.NEGATIVE
    return Sentiment.ABSTAIN

# ---- Pattern-based labeling functions ----

def lf_exclamation_positive(text: str) -> int:
    """Exclamation marks with positive context suggest positive."""
    if re.search(r"!\s*$", text) and any(
        w in text.lower() for w in ["recommend", "love", "best"]
    ):
        return Sentiment.POSITIVE
    return Sentiment.ABSTAIN

def lf_question_neutral(text: str) -> int:
    """Questions without strong sentiment are often neutral."""
    if text.strip().endswith("?") and not any(
        w in text.lower()
        for w in ["terrible", "amazing", "worst", "best"]
    ):
        return Sentiment.NEUTRAL
    return Sentiment.ABSTAIN

# ---- Heuristic labeling functions ----

def lf_short_negative(text: str) -> int:
    """Very short reviews tend to be negative complaints."""
    if len(text.split()) < 8 and any(
        w in text.lower() for w in ["bad", "no", "not", "don't"]
    ):
        return Sentiment.NEGATIVE
    return Sentiment.ABSTAIN

def lf_star_rating(text: str) -> int:
    """Extract star ratings mentioned in text."""
    match = re.search(r"(\d)\s*(?:out of 5|/5|stars?)", text.lower())
    if match:
        stars = int(match.group(1))
        if stars >= 4:
            return Sentiment.POSITIVE
        elif stars <= 2:
            return Sentiment.NEGATIVE
        else:
            return Sentiment.NEUTRAL
    return Sentiment.ABSTAIN

# Collect all labeling functions
LABELING_FUNCTIONS = [
    lf_positive_keywords,
    lf_negative_keywords,
    lf_exclamation_positive,
    lf_question_neutral,
    lf_short_negative,
    lf_star_rating,
]

# Apply to sample data
sample_texts = [
    "This product is amazing! Highly recommend it!",
    "Worst purchase ever. Complete waste of money.",
    "It works okay. Nothing special.",
    "Is this compatible with iPhone?",
    "4 out of 5 stars. Good value for the price.",
    "Bad. Don't buy.",
]

print(f"{'Text':<50} ", end="")
for lf in LABELING_FUNCTIONS:
    print(f"{lf.__name__[3:12]:>12}", end="")
print()
print("-" * 122)

for text in sample_texts:
    print(f"{text[:50]:<50} ", end="")
    for lf in LABELING_FUNCTIONS:
        result = lf(text)
        label = {-1: ".", 0: "NEG", 1: "NEU", 2: "POS"}[result]
        print(f"{label:>12}", end="")
    print()
Text                                               positive_ke negative_ke  exclamatio  question_n  short_nega  star_ratin
--------------------------------------------------------------------------------------------------------------------------
This product is amazing! Highly recommend it!              POS           .         POS           .           .           .
Worst purchase ever. Complete waste of money.                .         NEG           .           .           .           .
It works okay. Nothing special.                              .           .           .           .           .           .
Is this compatible with iPhone?                              .           .           .         NEU           .           .
4 out of 5 stars. Good value for the price.                  .           .           .           .           .         POS
Bad. Don't buy.                                              .           .           .           .         NEG           .

3. Label Aggregation

Once labeling functions have produced a label matrix (examples as rows, functions as columns), a label aggregation model combines the votes into a single probabilistic label per example. The aggregation model must handle three challenges: varying function accuracy, correlated errors between functions, and sparse coverage (many abstains).

3.1 Aggregation Approaches

[Figure: three panels — Majority vote: simple and interpretable, equal weight to all LFs; weakness: ignores accuracy differences; typical accuracy 70-80%. Snorkel label model: learns LF accuracies and correlations; strength: probabilistic output with confidence; typical accuracy 80-88%. FlyingSquid: triplet-based method, much faster training; strength: scales to millions of examples; typical accuracy 78-86%.]
Figure 12.6.2: Three label aggregation approaches with increasing sophistication and typical accuracy ranges.
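FlyingSquid's speed comes from closed-form moment matching rather than iterative training. The sketch below illustrates the underlying triplet trick under strong simplifying assumptions (binary labels coded as ±1, no abstains, at least three conditionally independent functions; the name `triplet_accuracies` is mine): pairwise agreement moments satisfy E[λᵢλⱼ] = aᵢaⱼ, where aᵢ = E[λᵢy] is function i's correlation with the truth, so each aᵢ is recoverable from three observable moments without any labels.

```python
import numpy as np

def triplet_accuracies(L):
    """Estimate per-LF accuracies for binary LFs coded as +/-1 with no
    abstains, assuming conditional independence given the true label.
    Uses E[li*lj] = ai*aj, hence ai = sqrt(O[i,j] * O[i,k] / O[j,k])."""
    n, m = L.shape
    O = (L.T @ L) / n                      # empirical pairwise moments E[li*lj]
    a = np.zeros(m)
    for i in range(m):
        j, k = [x for x in range(m) if x != i][:2]
        a[i] = np.sqrt(abs(O[i, j] * O[i, k] / O[j, k]))
    return (a + 1) / 2                     # correlation with truth -> accuracy

# Simulated check: three LFs with known accuracies 0.9, 0.8, 0.7
rng = np.random.default_rng(1)
n = 20_000
y = rng.choice([-1, 1], n)
L = np.column_stack(
    [np.where(rng.random(n) < p, y, -y) for p in (0.9, 0.8, 0.7)]
)
print(triplet_accuracies(L))   # estimates close to (0.9, 0.8, 0.7)
```

Because every quantity is an average over the data, the whole estimate is a handful of matrix operations, which is why this family of methods scales to millions of examples.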
import numpy as np

def majority_vote(
    label_matrix: np.ndarray,
    abstain_value: int = -1
) -> np.ndarray:
    """Simple majority vote aggregation (baseline)."""
    n_examples = label_matrix.shape[0]
    labels = np.full(n_examples, abstain_value)

    for i in range(n_examples):
        votes = label_matrix[i][label_matrix[i] != abstain_value]
        if len(votes) > 0:
            values, counts = np.unique(votes, return_counts=True)
            labels[i] = values[np.argmax(counts)]

    return labels

def weighted_vote(
    label_matrix: np.ndarray,
    accuracies: np.ndarray,
    abstain_value: int = -1,
    n_classes: int = 3
) -> np.ndarray:
    """Accuracy-weighted vote aggregation."""
    n_examples = label_matrix.shape[0]
    probs = np.zeros((n_examples, n_classes))

    for i in range(n_examples):
        for j in range(label_matrix.shape[1]):
            vote = label_matrix[i, j]
            if vote != abstain_value:
                # Weight by estimated accuracy
                probs[i, int(vote)] += accuracies[j]

        # Normalize to probabilities
        total = probs[i].sum()
        if total > 0:
            probs[i] /= total
        else:
            probs[i] = 1.0 / n_classes  # Uniform if no votes

    return probs

def estimate_lf_accuracies(
    label_matrix: np.ndarray,
    gold_labels: np.ndarray,
    abstain_value: int = -1
) -> np.ndarray:
    """Estimate labeling function accuracies from a small gold set."""
    n_lfs = label_matrix.shape[1]
    accuracies = np.zeros(n_lfs)

    for j in range(n_lfs):
        mask = label_matrix[:, j] != abstain_value
        if mask.sum() > 0:
            correct = (label_matrix[mask, j] == gold_labels[mask]).sum()
            accuracies[j] = correct / mask.sum()
        else:
            accuracies[j] = 0.5  # Default for unused LFs

    return accuracies

# Example: Build and aggregate a label matrix
np.random.seed(42)
n_examples, n_lfs = 100, 6

# Simulate label matrix (lots of abstains, marked as -1)
label_matrix = np.full((n_examples, n_lfs), -1)
true_labels = np.random.choice([0, 1, 2], n_examples)

# Each LF has different coverage and accuracy
lf_configs = [
    {"coverage": 0.15, "accuracy": 0.92},  # Keyword, high precision
    {"coverage": 0.12, "accuracy": 0.90},  # Pattern
    {"coverage": 0.25, "accuracy": 0.78},  # Heuristic
    {"coverage": 0.10, "accuracy": 0.95},  # External KB
    {"coverage": 0.70, "accuracy": 0.75},  # Small model
    {"coverage": 0.90, "accuracy": 0.82},  # LLM classifier
]

for j, cfg in enumerate(lf_configs):
    for i in range(n_examples):
        if np.random.random() < cfg["coverage"]:
            if np.random.random() < cfg["accuracy"]:
                label_matrix[i, j] = true_labels[i]
            else:
                label_matrix[i, j] = np.random.choice(
                    [l for l in [0, 1, 2] if l != true_labels[i]]
                )

# Compare aggregation methods
mv_labels = majority_vote(label_matrix)
labeled_mask = mv_labels != -1
mv_accuracy = (mv_labels[labeled_mask] == true_labels[labeled_mask]).mean()

# Estimate accuracies from small gold set (first 20 examples)
gold_size = 20
est_accuracies = estimate_lf_accuracies(
    label_matrix[:gold_size], true_labels[:gold_size]
)
wv_probs = weighted_vote(label_matrix, est_accuracies)
wv_labels = wv_probs.argmax(axis=1)
wv_accuracy = (wv_labels == true_labels).mean()

print(f"Majority vote accuracy: {mv_accuracy:.1%} "
      f"(coverage: {labeled_mask.mean():.1%})")
print(f"Weighted vote accuracy: {wv_accuracy:.1%} (coverage: 100%)")
print(f"\nEstimated LF accuracies:")
for j, (cfg, est) in enumerate(zip(lf_configs, est_accuracies)):
    print(f"  LF{j+1}: true={cfg['accuracy']:.2f}, "
          f"estimated={est:.2f}, coverage={cfg['coverage']:.0%}")
Majority vote accuracy: 82.4% (coverage: 93.0%)
Weighted vote accuracy: 85.0% (coverage: 100%)

Estimated LF accuracies:
  LF1: true=0.92, estimated=0.88, coverage=15%
  LF2: true=0.90, estimated=1.00, coverage=12%
  LF3: true=0.78, estimated=0.83, coverage=25%
  LF4: true=0.95, estimated=1.00, coverage=10%
  LF5: true=0.75, estimated=0.73, coverage=70%
  LF6: true=0.82, estimated=0.83, coverage=90%
★ Key Insight

Notice that the LLM classifier (LF6) has the highest coverage (90%) but only moderate accuracy (82%). The keyword-based functions have much lower coverage but higher accuracy. The weighted aggregation exploits this complementarity: the LLM provides broad coverage while the precise heuristics correct its mistakes on the examples they fire on. This is why combining LLMs with traditional labeling functions outperforms either approach alone.

4. Combining Weak Supervision with LLM Labels

LLM-generated labels are simply another labeling function in the weak supervision framework, but a particularly powerful one. LLM labels have uniquely high coverage (they can label every example) and competitive accuracy (80% to 92% depending on task complexity). However, they introduce correlated errors: the LLM makes systematic mistakes on certain types of inputs. The label model must account for these correlations to avoid over-weighting the LLM signal.
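The over-weighting risk is easy to see in a small simulation (hypothetical numbers): compare three genuinely independent 80%-accurate voters against the same 80%-accurate LLM counted three times, as if three prompt variants shared identical mistakes.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 5_000
truth = rng.integers(0, 2, n)

def noisy_labeler(acc):
    """One labeler that agrees with the truth with probability `acc`."""
    return np.where(rng.random(n) < acc, truth, 1 - truth)

llm = noisy_labeler(0.80)
independent = np.column_stack([llm, noisy_labeler(0.80), noisy_labeler(0.80)])
duplicated = np.column_stack([llm, llm, llm])   # perfectly correlated "votes"

for name, M in [("3 independent voters", independent),
                ("same LLM voted 3x   ", duplicated)]:
    mv = (M.mean(axis=1) > 0.5).astype(int)
    print(name, f"majority-vote accuracy: {(mv == truth).mean():.3f}")
```

The independent panel gains roughly ten accuracy points over any single voter, while the duplicated panel gains nothing: three copies of the same signal are still one vote's worth of evidence. A label model that estimates correlations discounts the duplicated columns accordingly.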

from openai import OpenAI
import json

client = OpenAI()

def llm_labeling_function(
    texts: list[str],
    label_options: list[str],
    task_description: str,
    model: str = "gpt-4o-mini"
) -> list[dict]:
    """Use an LLM as a labeling function with confidence scores."""
    labels_str = ", ".join(label_options)
    results = []

    for text in texts:
        prompt = f"""{task_description}

Text: "{text}"

Labels: [{labels_str}]

Respond as JSON: {{"label": "...", "confidence": 0.0-1.0}}
If you are very unsure, set confidence below 0.5."""

        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            temperature=0.1,
            max_tokens=50,
            response_format={"type": "json_object"}
        )

        data = json.loads(response.choices[0].message.content)
        results.append({
            "label": data["label"],
            "confidence": data.get("confidence", 0.8)
        })

    return results

def build_hybrid_label_matrix(
    texts: list[str],
    rule_lfs: list[callable],
    llm_labels: list[dict],
    label_map: dict,
    confidence_threshold: float = 0.6
) -> np.ndarray:
    """Combine rule-based LFs with LLM labels into a single matrix."""
    n = len(texts)
    n_rule_lfs = len(rule_lfs)
    n_lfs = n_rule_lfs + 1  # +1 for LLM

    matrix = np.full((n, n_lfs), -1)

    # Apply rule-based labeling functions
    for j, lf in enumerate(rule_lfs):
        for i, text in enumerate(texts):
            matrix[i, j] = lf(text)

    # Apply LLM labels (with confidence thresholding)
    for i, llm_result in enumerate(llm_labels):
        if llm_result["confidence"] >= confidence_threshold:
            label_str = llm_result["label"].lower()
            if label_str in label_map:
                matrix[i, n_rule_lfs] = label_map[label_str]
            # Else abstain (unknown label)

    return matrix

# Example usage
texts = [
    "Absolutely love this product! Best purchase ever!",
    "Completely broken on arrival. Terrible quality.",
    "Decent for the price, works as expected.",
]

# Get LLM labels
llm_labels = llm_labeling_function(
    texts,
    label_options=["positive", "negative", "neutral"],
    task_description="Classify the sentiment of this product review."
)

# Combine with rule-based LFs
label_map = {"positive": 2, "negative": 0, "neutral": 1}
hybrid_matrix = build_hybrid_label_matrix(
    texts,
    rule_lfs=[lf_positive_keywords, lf_negative_keywords, lf_star_rating],
    llm_labels=llm_labels,
    label_map=label_map
)

print("Hybrid label matrix (rows=texts, cols=[keyword_pos, keyword_neg, "
      "star_rating, LLM]):")
print("(-1 = abstain, 0 = negative, 1 = neutral, 2 = positive)")
print(hybrid_matrix)
Hybrid label matrix (rows=texts, cols=[keyword_pos, keyword_neg, star_rating, LLM]):
(-1 = abstain, 0 = negative, 1 = neutral, 2 = positive)
[[ 2 -1 -1  2]
 [-1  0 -1  0]
 [-1 -1 -1  1]]

5. Cost and Quality Tradeoffs

Every labeling strategy involves a tradeoff between cost and label quality. The choice depends on your budget, timeline, required accuracy, and the nature of your task. The table below compares the primary approaches covered in this module.

Approach            Cost per 10K Labels   Typical Accuracy   Time to Label 10K   Maintainability
Expert annotation   $10,000 - $50,000     90-97%             2-6 weeks           Low (rerun for changes)
Crowd annotation    $1,000 - $5,000       80-90%             1-2 weeks           Low (rerun for changes)
LLM-only labeling   $10 - $200            78-92%             1-4 hours           Medium (update prompts)
LLM + human review  $500 - $3,000         88-95%             3-7 days            Medium
Weak supervision    $50 - $500            75-88%             1-3 days            High (update functions)
Hybrid (WS + LLM)   $100 - $700           82-92%             1-3 days            High
⚠ Warning

Do not choose based on accuracy alone. A labeling approach that achieves 95% accuracy but takes 6 weeks and costs $50,000 may be worse for your project than one that achieves 88% accuracy in 2 days for $500. The 88% labels let you iterate faster: train a model, evaluate, improve your labeling functions, and retrain. Rapid iteration often produces a better final model than a single round of expensive, high-quality labels, because you can discover and fix systematic errors in your task definition and data pipeline.

from dataclasses import dataclass

@dataclass
class LabelingStrategy:
    name: str
    cost_per_10k: float     # In dollars
    accuracy: float          # 0-1
    time_days: float         # Days to label 10K examples
    maintainability: str     # "low", "medium", "high"

def compare_strategies(
    strategies: list[LabelingStrategy],
    dataset_size: int = 50000,
    accuracy_threshold: float = 0.85,
    budget: float = 5000,
    deadline_days: float = 14
) -> list[dict]:
    """Evaluate labeling strategies against constraints."""
    results = []

    for s in strategies:
        scale_factor = dataset_size / 10000
        total_cost = s.cost_per_10k * scale_factor
        total_time = s.time_days * scale_factor

        feasible = (
            s.accuracy >= accuracy_threshold and
            total_cost <= budget and
            total_time <= deadline_days
        )

        # Value metric: accuracy per dollar
        value = s.accuracy / max(total_cost, 1)

        results.append({
            "name": s.name,
            "total_cost": f"${total_cost:,.0f}",
            "accuracy": f"{s.accuracy:.0%}",
            "total_days": f"{total_time:.1f}",
            "feasible": feasible,
            "value_score": round(value * 1000, 2)
        })

    # Sort by value (feasible first)
    results.sort(key=lambda r: (not r["feasible"], -r["value_score"]))
    return results

strategies = [
    LabelingStrategy("Expert", 30000, 0.95, 28, "low"),
    LabelingStrategy("Crowd", 3000, 0.85, 10, "low"),
    LabelingStrategy("LLM-only", 100, 0.84, 0.2, "medium"),
    LabelingStrategy("LLM+Human", 1500, 0.91, 5, "medium"),
    LabelingStrategy("Weak Supervision", 200, 0.82, 2, "high"),
    LabelingStrategy("Hybrid WS+LLM", 400, 0.88, 2, "high"),
]

results = compare_strategies(
    strategies,
    dataset_size=50000,
    accuracy_threshold=0.85,
    budget=10000,
    deadline_days=14
)

print(f"{'Strategy':<20} {'Cost':<12} {'Accuracy':<10} {'Days':<8} "
      f"{'Feasible':<10} {'Value'}")
print("-" * 72)
for r in results:
    print(f"{r['name']:<20} {r['total_cost']:<12} {r['accuracy']:<10} "
          f"{r['total_days']:<8} {'YES' if r['feasible'] else 'no':<10} "
          f"{r['value_score']}")
Strategy             Cost         Accuracy   Days     Feasible   Value
------------------------------------------------------------------------
Hybrid WS+LLM        $2,000       88%        10.0     YES        0.44
LLM-only             $500         84%        1.0      no         1.68
Weak Supervision     $1,000       82%        10.0     no         0.82
LLM+Human            $7,500       91%        25.0     no         0.12
Crowd                $15,000      85%        50.0     no         0.06
Expert               $150,000     95%        140.0    no         0.01

📝 Knowledge Check

1. What is the core insight behind weak supervision?
Answer:
While any single labeling heuristic is noisy and incomplete, the collective signal from many independent heuristics can produce high-quality labels when aggregated properly. Weak supervision exploits this by encoding domain knowledge as code (labeling functions), then using a generative model to learn the accuracy and correlation structure of these functions and combine their votes into probabilistic labels. This replaces hand-labeling with programming.
2. What makes a good labeling function?
Answer:
Good labeling functions are narrow and precise: they should have high accuracy on the examples they label, even if they abstain on most of the dataset. A function that labels only 5% of examples at 95% accuracy is more valuable than one that labels everything at 60% accuracy. The key properties are: (1) high precision on labeled examples, (2) appropriate use of abstain for uncertain cases, (3) independence from other labeling functions to provide complementary signal, and (4) encoding genuine domain knowledge rather than arbitrary rules.
3. How does the Snorkel label model differ from simple majority voting?
Answer:
The Snorkel label model learns a generative model of how each labeling function relates to the true (unobserved) label. Unlike majority voting, which gives equal weight to all functions, the label model: (1) learns the accuracy of each function and weights votes accordingly, (2) detects and accounts for correlations between functions (e.g., two keyword functions that make the same mistakes), and (3) produces probabilistic labels with confidence estimates. This typically improves accuracy by 5 to 15 percentage points over majority voting.
4. Why does combining LLM labels with traditional labeling functions outperform either alone?
Answer:
LLM labels have high coverage (they label every example) but moderate accuracy and correlated errors. Traditional labeling functions have low coverage but high precision on the examples they fire on. The combination exploits complementarity: the LLM provides broad coverage to ensure every example gets at least one vote, while the precise heuristics correct the LLM's systematic mistakes on the subset of examples they match. The label model learns when to trust each source, producing labels that are better than either source alone.
5. Why might rapid iteration with 88% accuracy labels outperform a single round of 95% accuracy labels?
Answer:
With 88% accuracy labels obtained in 2 days, you can train a model, evaluate it, discover systematic errors in your task definition or data pipeline, fix your labeling functions, and retrain multiple times. Each iteration uncovers issues (ambiguous categories, edge cases, distribution gaps) that improve both the labels and the model. A single round of 95% accuracy labels taking 6 weeks provides no opportunity for this iterative discovery. The final model from multiple fast iterations often outperforms the model from a single expensive labeling round, because the task definition itself improves through iteration.

Key Takeaways