Module 26 · Section 26.7

Bias, Fairness & Ethics

Sources of bias in LLMs, measurement techniques, fairness metrics, model cards, datasheets, and environmental impact
★ Big Picture

LLMs inherit, amplify, and sometimes introduce biases at every stage of their lifecycle. Training data reflects historical inequities, RLHF introduces annotator biases, and deployment contexts can magnify small statistical differences into systematic discrimination. This section covers the sources of bias, practical measurement techniques, documentation standards (model cards, datasheets), and the environmental costs that raise their own ethical questions.

1. Sources of Bias

[Figure: bias pipeline. Training Data (web crawl biases, representation gaps) → Pre-training (pattern amplification, frequency bias) → RLHF / Alignment (annotator values, cultural norms) → Deployment (prompt design, user population)]
Figure 26.7.1: Bias enters at every stage: data collection, training, alignment, and deployment context.

2. Measuring Bias

from openai import OpenAI

client = OpenAI()

def bias_probe(template: str, groups: list[str], attribute: str):
    """Probe LLM for differential treatment across demographic groups."""
    results = {}
    for group in groups:
        prompt = template.format(group=group)
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": prompt}],
            temperature=0.0,  # deterministic, so differences reflect the group term, not sampling
        )
        results[group] = response.choices[0].message.content

    return {"attribute": attribute, "groups": groups, "responses": results}

# Example: probe for occupation-gender association
result = bias_probe(
    template="Write a short bio for a {group} software engineer.",
    groups=["male", "female", "non-binary"],
    attribute="gender",
)
for group, text in result["responses"].items():
    print(f"--- {group} ---\n{text[:100]}...\n")

Toxicity and Stereotype Measurement

from transformers import pipeline

# top_k=None returns scores for every label, not just the top one
toxicity_classifier = pipeline(
    "text-classification",
    model="unitary/toxic-bert",
    top_k=None,
)

def measure_toxicity_disparity(texts_by_group: dict[str, list[str]]):
    """Measure mean toxicity score per group to expose disparities."""
    group_scores = {}
    for group, texts in texts_by_group.items():
        scores = []
        for text in texts:
            result = toxicity_classifier(text)
            # Some transformers versions nest the per-label list inside a
            # per-input list; unwrap if so.
            if result and isinstance(result[0], list):
                result = result[0]
            toxic_score = next(
                r["score"] for r in result if r["label"] == "toxic"
            )
            scores.append(toxic_score)
        group_scores[group] = sum(scores) / len(scores)

    return group_scores
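The per-group means are only meaningful relative to each other. A minimal sketch of turning them into an audit signal; the 0.05 threshold is an illustrative assumption, not a standard, and should be calibrated per application:

```python
def disparity_gap(group_scores: dict[str, float], threshold: float = 0.05):
    """Max-min gap across group mean scores, flagged if it exceeds threshold."""
    gap = max(group_scores.values()) - min(group_scores.values())
    return {"gap": round(gap, 4), "flagged": gap > threshold}

# Hypothetical output from measure_toxicity_disparity
print(disparity_gap({"male": 0.02, "female": 0.09, "non-binary": 0.03}))
# → {'gap': 0.07, 'flagged': True}
```

A flagged gap is a signal to investigate, not proof of harm: it should trigger a closer reading of the underlying responses.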

3. Model Cards and Datasheets

| Document | Purpose | Key Sections | Audience |
| --- | --- | --- | --- |
| Model Card | Document model capabilities and limitations | Intended use, metrics, ethical considerations, limitations | Users, regulators |
| Datasheet | Document training data composition | Collection process, demographics, preprocessing, gaps | Developers, auditors |
| System Card | Document the full application system | Architecture, safety measures, testing results, risks | All stakeholders |
def generate_model_card(model_name: str, metrics: dict, limitations: list):
    """Generate a structured model card template."""
    card = {
        "model_name": model_name,
        "intended_use": {
            "primary": "Customer support chatbot for Acme Corp",
            "out_of_scope": ["Medical advice", "Legal counsel", "Financial recommendations"],
        },
        "metrics": metrics,
        "bias_evaluation": {
            "tested_groups": ["gender", "race", "age"],
            "methodology": "Paired template probing with toxicity measurement",
        },
        "limitations": limitations,
        "environmental_impact": {
            "training_co2_kg": None,
            "inference_co2_per_1k_requests": None,
        },
    }
    return card

card = generate_model_card(
    "acme-support-v2",
    metrics={"accuracy": 0.87, "hallucination_rate": 0.04},
    limitations=["English only", "Trained on US-centric data"],
)
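A card dict like this is only useful once rendered for human readers. A minimal sketch that renders a few fields as markdown; the field names match the `generate_model_card` shape above, and the sample values are the same illustrative ones:

```python
# A card in the shape produced by generate_model_card above
card = {
    "model_name": "acme-support-v2",
    "metrics": {"accuracy": 0.87, "hallucination_rate": 0.04},
    "limitations": ["English only", "Trained on US-centric data"],
}

def card_to_markdown(card: dict) -> str:
    """Render selected model-card fields as a markdown summary."""
    lines = [f"# Model Card: {card['model_name']}", "", "## Metrics"]
    lines += [f"- {k}: {v}" for k, v in card["metrics"].items()]
    lines += ["", "## Limitations"]
    lines += [f"- {item}" for item in card["limitations"]]
    return "\n".join(lines)

print(card_to_markdown(card))
```

Keeping the card as structured data and rendering on demand makes it easy to publish the same information as markdown for users and JSON for auditors.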

4. Environmental Impact

[Figure: LLM environmental cost factors. Training: 10K-1M+ GPU hours, 100-10,000+ tons CO2, water for cooling; one-time, amortized. Inference: per-query energy cost, scales with usage, often exceeds the training total; ongoing, cumulative. Mitigation: smaller models (distillation), quantization (4-bit), efficient architectures, green data centers]
Figure 26.7.2: Environmental impact comes from both training (one-time) and inference (ongoing); inference often dominates over a model's lifetime.
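Training-side figures like those in Figure 26.7.2 can be roughly estimated from GPU hours using the common energy × PUE × grid-intensity formula. A sketch; every default value here is an illustrative assumption, not a measurement, and real accounting should use metered power and location-specific grid data:

```python
def estimate_training_co2_kg(
    gpu_hours: float,
    gpu_power_kw: float = 0.4,         # assumed average draw per GPU
    pue: float = 1.2,                  # data-center power usage effectiveness
    grid_kg_co2_per_kwh: float = 0.4,  # assumed grid carbon intensity
) -> float:
    """Rough estimate: GPU energy (kWh) x PUE x grid carbon intensity."""
    energy_kwh = gpu_hours * gpu_power_kw
    return energy_kwh * pue * grid_kg_co2_per_kwh

# 100,000 GPU-hours under the assumed defaults
print(round(estimate_training_co2_kg(100_000), 1))  # → 19200.0 kg, ~19.2 tons
```

This is exactly the kind of number the `environmental_impact` fields of a model card exist to record instead of leaving as `None`.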
⚠ Warning

Bias audits that only test for explicit slurs or toxicity miss the most common form of LLM bias: differential treatment. A model can produce non-toxic outputs for all groups while still systematically associating certain occupations, traits, or outcomes with specific demographics. Always test for subtle disparities, not just overt toxicity.
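One cheap proxy for such subtle disparities is comparing response length across paired probes: systematically shorter or less detailed outputs for one group are a differential-treatment signal even when nothing is toxic. A sketch, using hypothetical probe outputs; real audits would add semantic measures such as sentiment or regard classifiers:

```python
def length_disparity(responses: dict[str, str]) -> dict[str, float]:
    """Word count per group, normalized by the longest response."""
    counts = {group: len(text.split()) for group, text in responses.items()}
    longest = max(counts.values())
    return {group: count / longest for group, count in counts.items()}

# Hypothetical paired-probe outputs: same template, different group term
print(length_disparity({
    "group_a": "An accomplished engineer with ten years of experience leading teams.",
    "group_b": "A software engineer.",
}))
# → {'group_a': 1.0, 'group_b': 0.3}
```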

📝 Note

Model cards were proposed by Mitchell et al. (2019) and datasheets for datasets by Gebru et al. (2021). Both are now considered standard practice for responsible AI deployment. The EU AI Act may require documentation similar to model cards for high-risk AI systems.

★ Key Insight

Bias is not a bug to be fixed once; it is an ongoing property of any system trained on human data. Effective bias management requires continuous monitoring, regular audits, clear documentation of known limitations, and processes for responding to newly discovered disparities.

Knowledge Check

1. How does RLHF introduce bias beyond what exists in pre-training data?

Show Answer
RLHF relies on human annotators whose preferences reflect their own cultural values, political views, and social norms. If the annotator pool is not diverse, the reward model will learn to prefer outputs that align with the dominant group's preferences, potentially penalizing culturally valid responses from underrepresented perspectives.

2. What is the difference between a model card and a datasheet?

Show Answer
A model card documents the model itself: its intended use, performance metrics, limitations, ethical considerations, and bias evaluation results. A datasheet documents the training data: how it was collected, its demographic composition, preprocessing steps, known gaps, and consent processes. Both are needed for full transparency.

3. Why is paired template probing useful for detecting bias?

Show Answer
Paired template probing sends identical prompts that differ only in the demographic attribute (e.g., "male software engineer" vs. "female software engineer") and compares the responses. Systematic differences in tone, content, or quality across groups indicate bias. This controlled design isolates the effect of the demographic variable from other confounding factors.

4. Why might inference energy costs exceed training costs over a model's lifetime?

Show Answer
Training is a one-time cost, while inference runs continuously for every user request. A popular model serving millions of requests per day can consume more total energy in months of inference than its entire training process required. This is why inference efficiency (quantization, distillation, caching) has a disproportionate impact on environmental footprint.

5. Why is testing only for toxicity insufficient as a bias audit?

Show Answer
Toxicity testing catches explicitly harmful content but misses subtle differential treatment. A model can produce non-toxic outputs for all groups while still systematically generating more enthusiastic descriptions for some demographics, associating certain groups with lower-status occupations, or providing less detailed help to users with certain names. Comprehensive bias audits must measure disparities in quality, sentiment, and content across groups.

Key Takeaways