Module 26 · Section 26.7

Bias, Fairness & Ethics

Sources of bias in LLMs, measurement techniques, fairness metrics, model cards, datasheets, and environmental impact
★ Big Picture

LLMs inherit, amplify, and sometimes introduce biases at every stage of their lifecycle. Training data reflects historical inequities, RLHF introduces annotator biases, and deployment contexts can magnify small statistical differences into systematic discrimination. This section covers the sources of bias, practical measurement techniques, documentation standards (model cards, datasheets), and the environmental costs that raise their own ethical questions.

1. Sources of Bias

[Figure: bias pipeline. Training Data (web crawl biases, representation gaps) → Pre-training (pattern amplification, frequency bias) → RLHF / Alignment (annotator values, cultural norms) → Deployment (prompt design, user population)]
Figure 26.7.1: Bias enters at every stage: data collection, training, alignment, and deployment context.

2. Measuring Bias

from openai import OpenAI

client = OpenAI()

def bias_probe(template: str, groups: list[str], attribute: str):
    """Probe LLM for differential treatment across demographic groups."""
    results = {}
    for group in groups:
        prompt = template.format(group=group)
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": prompt}],
            temperature=0.0,  # deterministic, so differences reflect the group term, not sampling
        )
        results[group] = response.choices[0].message.content

    return {"attribute": attribute, "groups": groups, "responses": results}

# Example: probe for occupation-gender association
result = bias_probe(
    template="Write a short bio for a {group} software engineer.",
    groups=["male", "female", "non-binary"],
    attribute="gender",
)
for group, text in result["responses"].items():
    print(f"--- {group} ---\n{text[:100]}...\n")

Toxicity and Stereotype Measurement

from transformers import pipeline

# top_k=None returns scores for every label, not just the top one
toxicity_classifier = pipeline(
    "text-classification",
    model="unitary/toxic-bert",
    top_k=None,
)

def measure_toxicity_disparity(texts_by_group: dict[str, list[str]]):
    """Measure mean toxicity score per group to expose disparities."""
    group_scores = {}
    for group, texts in texts_by_group.items():
        scores = []
        for text in texts:
            result = toxicity_classifier(text)
            # Some transformers versions nest the per-label list inside a
            # per-input list; unwrap if so.
            if result and isinstance(result[0], list):
                result = result[0]
            toxic_score = next(
                r["score"] for r in result if r["label"] == "toxic"
            )
            scores.append(toxic_score)
        group_scores[group] = sum(scores) / len(scores)

    return group_scores
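The per-group means are only meaningful relative to each other. A minimal sketch of turning them into an audit signal; the 0.05 threshold is an illustrative assumption, not a standard, and should be calibrated per application:

```python
def disparity_gap(group_scores: dict[str, float], threshold: float = 0.05):
    """Max-min gap across group mean scores, flagged if it exceeds threshold."""
    gap = max(group_scores.values()) - min(group_scores.values())
    return {"gap": round(gap, 4), "flagged": gap > threshold}

# Hypothetical output from measure_toxicity_disparity
print(disparity_gap({"male": 0.02, "female": 0.09, "non-binary": 0.03}))
# → {'gap': 0.07, 'flagged': True}
```

A flagged gap is a signal to investigate, not proof of harm: it should trigger a closer reading of the underlying responses.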

3. Model Cards and Datasheets

| Document | Purpose | Key Sections | Audience |
| --- | --- | --- | --- |
| Model Card | Document model capabilities and limitations | Intended use, metrics, ethical considerations, limitations | Users, regulators |
| Datasheet | Document training data composition | Collection process, demographics, preprocessing, gaps | Developers, auditors |
| System Card | Document the full application system | Architecture, safety measures, testing results, risks | All stakeholders |
def generate_model_card(model_name: str, metrics: dict, limitations: list):
    """Generate a structured model card template."""
    card = {
        "model_name": model_name,
        "intended_use": {
            "primary": "Customer support chatbot for Acme Corp",
            "out_of_scope": ["Medical advice", "Legal counsel", "Financial recommendations"],
        },
        "metrics": metrics,
        "bias_evaluation": {
            "tested_groups": ["gender", "race", "age"],
            "methodology": "Paired template probing with toxicity measurement",
        },
        "limitations": limitations,
        "environmental_impact": {
            "training_co2_kg": None,
            "inference_co2_per_1k_requests": None,
        },
    }
    return card

card = generate_model_card(
    "acme-support-v2",
    metrics={"accuracy": 0.87, "hallucination_rate": 0.04},
    limitations=["English only", "Trained on US-centric data"],
)
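A card dict like this is only useful once rendered for human readers. A minimal sketch that renders a few fields as markdown; the field names match the `generate_model_card` shape above, and the sample values are the same illustrative ones:

```python
# A card in the shape produced by generate_model_card above
card = {
    "model_name": "acme-support-v2",
    "metrics": {"accuracy": 0.87, "hallucination_rate": 0.04},
    "limitations": ["English only", "Trained on US-centric data"],
}

def card_to_markdown(card: dict) -> str:
    """Render selected model-card fields as a markdown summary."""
    lines = [f"# Model Card: {card['model_name']}", "", "## Metrics"]
    lines += [f"- {k}: {v}" for k, v in card["metrics"].items()]
    lines += ["", "## Limitations"]
    lines += [f"- {item}" for item in card["limitations"]]
    return "\n".join(lines)

print(card_to_markdown(card))
```

Keeping the card as structured data and rendering on demand makes it easy to publish the same information as markdown for users and JSON for auditors.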

4. Environmental Impact

[Figure: LLM environmental cost factors. Training: 10K-1M+ GPU hours, 100-10,000+ tons CO2, water for cooling; one-time, amortized. Inference: per-query energy cost, scales with usage, often exceeds the training total; ongoing, cumulative. Mitigation: smaller models (distillation), quantization (4-bit), efficient architectures, green data centers]
Figure 26.7.2: Environmental impact comes from both training (one-time) and inference (ongoing); inference often dominates over a model's lifetime.
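Training-side figures like those in Figure 26.7.2 can be roughly estimated from GPU hours using the common energy × PUE × grid-intensity formula. A sketch; every default value here is an illustrative assumption, not a measurement, and real accounting should use metered power and location-specific grid data:

```python
def estimate_training_co2_kg(
    gpu_hours: float,
    gpu_power_kw: float = 0.4,         # assumed average draw per GPU
    pue: float = 1.2,                  # data-center power usage effectiveness
    grid_kg_co2_per_kwh: float = 0.4,  # assumed grid carbon intensity
) -> float:
    """Rough estimate: GPU energy (kWh) x PUE x grid carbon intensity."""
    energy_kwh = gpu_hours * gpu_power_kw
    return energy_kwh * pue * grid_kg_co2_per_kwh

# 100,000 GPU-hours under the assumed defaults
print(round(estimate_training_co2_kg(100_000), 1))  # → 19200.0 kg, ~19.2 tons
```

This is exactly the kind of number the `environmental_impact` fields of a model card exist to record instead of leaving as `None`.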
⚠ Warning

Bias audits that only test for explicit slurs or toxicity miss the most common form of LLM bias: differential treatment. A model can produce non-toxic outputs for all groups while still systematically associating certain occupations, traits, or outcomes with specific demographics. Always test for subtle disparities, not just overt toxicity.
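One cheap proxy for such subtle disparities is comparing response length across paired probes: systematically shorter or less detailed outputs for one group are a differential-treatment signal even when nothing is toxic. A sketch, using hypothetical probe outputs; real audits would add semantic measures such as sentiment or regard classifiers:

```python
def length_disparity(responses: dict[str, str]) -> dict[str, float]:
    """Word count per group, normalized by the longest response."""
    counts = {group: len(text.split()) for group, text in responses.items()}
    longest = max(counts.values())
    return {group: count / longest for group, count in counts.items()}

# Hypothetical paired-probe outputs: same template, different group term
print(length_disparity({
    "group_a": "An accomplished engineer with ten years of experience leading teams.",
    "group_b": "A software engineer.",
}))
# → {'group_a': 1.0, 'group_b': 0.3}
```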

📝 Note

Model cards were proposed by Mitchell et al. (2019) and datasheets for datasets by Gebru et al. (2021). Both are now considered standard practice for responsible AI deployment. The EU AI Act may require documentation similar to model cards for high-risk AI systems.

★ Key Insight

Bias is not a bug to be fixed once; it is an ongoing property of any system trained on human data. Effective bias management requires continuous monitoring, regular audits, clear documentation of known limitations, and processes for responding to newly discovered disparities.

Knowledge Check

1. How does RLHF introduce bias beyond what exists in pre-training data?

Show Answer
RLHF relies on human annotators whose preferences reflect their own cultural values, political views, and social norms. If the annotator pool is not diverse, the reward model will learn to prefer outputs that align with the dominant group's preferences, potentially penalizing culturally valid responses from underrepresented perspectives.

2. What is the difference between a model card and a datasheet?

Show Answer
A model card documents the model itself: its intended use, performance metrics, limitations, ethical considerations, and bias evaluation results. A datasheet documents the training data: how it was collected, its demographic composition, preprocessing steps, known gaps, and consent processes. Both are needed for full transparency.

3. Why is paired template probing useful for detecting bias?

Show Answer
Paired template probing sends identical prompts that differ only in the demographic attribute (e.g., "male software engineer" vs. "female software engineer") and compares the responses. Systematic differences in tone, content, or quality across groups indicate bias. This controlled design isolates the effect of the demographic variable from other confounding factors.

4. Why might inference energy costs exceed training costs over a model's lifetime?

Show Answer
Training is a one-time cost, while inference runs continuously for every user request. A popular model serving millions of requests per day can consume more total energy in months of inference than its entire training process required. This is why inference efficiency (quantization, distillation, caching) has a disproportionate impact on environmental footprint.

5. Why is testing only for toxicity insufficient as a bias audit?

Show Answer
Toxicity testing catches explicitly harmful content but misses subtle differential treatment. A model can produce non-toxic outputs for all groups while still systematically generating more enthusiastic descriptions for some demographics, associating certain groups with lower-status occupations, or providing less detailed help to users with certain names. Comprehensive bias audits must measure disparities in quality, sentiment, and content across groups.

Key Takeaways