Section 16.3: Constitutional AI & Self-Alignment

★ Big Picture

Constitutional AI replaces thousands of human preference labels with a small set of written principles. Instead of hiring annotators to judge every response pair, CAI asks the model itself to critique and revise its outputs according to a "constitution" of behavioral rules. The model generates, self-critiques, revises, and then these revised outputs serve as training data. This approach (developed by Anthropic) dramatically reduces the cost of alignment data collection and allows alignment behavior to be specified declaratively through principles rather than implicitly through examples.

1. The Human Annotation Bottleneck

Standard RLHF requires large volumes of human preference data. OpenAI's InstructGPT used roughly 33,000 human comparisons. As models grow more capable, the annotation challenge intensifies: annotators need domain expertise to evaluate complex outputs, agreement rates drop on subtle quality distinctions, and the cost per comparison rises. Furthermore, human preferences are inherently inconsistent; different annotators often disagree on which response is better, and individual annotators may be inconsistent across sessions.

Constitutional AI addresses this bottleneck by replacing most human annotation with AI-generated feedback. The key insight is that a sufficiently capable model can evaluate its own outputs against explicit principles, and these self-evaluations can serve as a training signal. The human role shifts from labeling individual examples to writing the principles (the "constitution") that guide evaluation.

2. The Constitutional AI Framework

Constitutional AI operates in two phases. The first phase generates training data through self-critique and revision. The second phase trains a preference model on AI-generated comparisons, replacing the human labelers in standard RLHF.

Figure 16.6: The Constitutional AI pipeline. Phase 1 uses self-critique and revision to generate SFT data. Phase 2 uses AI-generated preference judgments (RLAIF) to train the reward model or run DPO.

2.1 Phase 1: Critique-Revision Pairs

In Phase 1, the model is presented with potentially harmful or low-quality prompts and generates an initial response. The model then critiques its own response against a specific constitutional principle and produces a revised version. This critique-revision loop can be repeated multiple times, producing progressively better responses.

# Constitutional AI: Phase 1 - Self-Critique and Revision
from dataclasses import dataclass
from typing import List

@dataclass
class ConstitutionalPrinciple:
    name: str
    critique_prompt: str
    revision_prompt: str

# Example constitution (simplified from Anthropic's approach)
CONSTITUTION = [
    ConstitutionalPrinciple(
        name="helpfulness",
        critique_prompt=(
            "Identify specific ways in which the assistant's response "
            "is unhelpful, incomplete, or fails to address the user's "
            "actual question."
        ),
        revision_prompt=(
            "Revise the response to be more helpful, complete, and "
            "directly address the user's question."
        ),
    ),
    ConstitutionalPrinciple(
        name="harmlessness",
        critique_prompt=(
            "Identify any content in the response that could be "
            "harmful, dangerous, unethical, or that provides "
            "instructions for illegal activities."
        ),
        revision_prompt=(
            "Revise the response to remove harmful content while "
            "still being as helpful as possible for legitimate uses."
        ),
    ),
    ConstitutionalPrinciple(
        name="honesty",
        critique_prompt=(
            "Identify any claims in the response that are likely "
            "false, misleading, or presented with unwarranted "
            "confidence. Note where uncertainty should be expressed."
        ),
        revision_prompt=(
            "Revise the response to be more truthful, express "
            "appropriate uncertainty, and avoid presenting "
            "speculation as fact."
        ),
    ),
]

def critique_and_revise(model, tokenizer, prompt, response, principle):
    """Apply one critique-revision step using a constitutional principle."""
    # Step 1: Generate critique
    critique_input = (
        f"Here is a conversation:\n\n"
        f"Human: {prompt}\n"
        f"Assistant: {response}\n\n"
        f"Critique request: {principle.critique_prompt}\n"
        f"Critique:"
    )
    critique = model.generate(tokenizer.encode(critique_input))

    # Step 2: Generate revision
    revision_input = (
        f"Here is a conversation:\n\n"
        f"Human: {prompt}\n"
        f"Assistant: {response}\n\n"
        f"Critique: {critique}\n\n"
        f"Revision request: {principle.revision_prompt}\n"
        f"Revised response:"
    )
    revised = model.generate(tokenizer.encode(revision_input))

    return {"critique": critique, "revised_response": revised}

def build_cai_sft_dataset(model, tokenizer, prompts, constitution, rounds=3):
    """Build SFT data from iterative critique-revision."""
    import random
    sft_data = []

    for prompt in prompts:
        # Generate initial (potentially problematic) response
        response = model.generate(tokenizer.encode(prompt))

        # Apply multiple rounds of critique-revision
        for _ in range(rounds):
            principle = random.choice(constitution)
            result = critique_and_revise(
                model, tokenizer, prompt, response, principle
            )
            response = result["revised_response"]

        # Final revised response becomes the SFT target
        sft_data.append({"prompt": prompt, "response": response})

    return sft_data

2.2 Phase 2: RLAIF (RL from AI Feedback)

In Phase 2, the model acts as a preference annotator. Given a prompt and two candidate responses, the model is asked which response better adheres to the constitutional principles. These AI-generated preferences replace human preference labels in the standard RLHF pipeline. The resulting preference dataset can be used to train a reward model (for PPO) or directly for DPO training.

# Constitutional AI: Phase 2 - RLAIF Preference Generation
import random

def generate_ai_preference(
    judge_model,
    tokenizer,
    prompt: str,
    response_a: str,
    response_b: str,
    principles: List[ConstitutionalPrinciple],
) -> dict:
    """Use the model itself to judge which response is better."""

    # Select a random principle for this comparison
    principle = random.choice(principles)

    judge_prompt = (
        f"Consider the following principle: {principle.name}\n"
        f"{principle.critique_prompt}\n\n"
        f"Human: {prompt}\n\n"
        f"Response A: {response_a}\n\n"
        f"Response B: {response_b}\n\n"
        f"Which response better follows the principle above? "
        f"Answer with just 'A' or 'B' and explain briefly."
    )

    judgment = judge_model.generate(
        tokenizer.encode(judge_prompt),
        max_new_tokens=100,
        temperature=0.0,
    )
    judgment_text = tokenizer.decode(judgment)

    # Parse the judgment
    winner = "A" if judgment_text.strip().startswith("A") else "B"

    if winner == "A":
        return {"prompt": prompt, "chosen": response_a, "rejected": response_b}
    else:
        return {"prompt": prompt, "chosen": response_b, "rejected": response_a}


def build_rlaif_dataset(
    model, tokenizer, prompts, principles, samples_per_prompt=4
):
    """Build a full RLAIF preference dataset."""
    import itertools

    preference_pairs = []

    for prompt in prompts:
        # Generate multiple candidate responses
        responses = []
        for _ in range(samples_per_prompt):
            resp = model.generate(
                tokenizer.encode(prompt),
                do_sample=True,
                temperature=0.8,
            )
            responses.append(tokenizer.decode(resp))

        # Create pairwise comparisons
        for a, b in itertools.combinations(responses, 2):
            pair = generate_ai_preference(
                model, tokenizer, prompt, a, b, principles
            )
            preference_pairs.append(pair)

    return preference_pairs

💡 Key Insight

The power of CAI is that alignment behavior becomes declarative. Instead of implicitly defining "good behavior" through thousands of labeled examples, you explicitly state the rules. This makes alignment auditable (you can read the constitution), modifiable (change a principle, retrain), and transparent (the model's self-critiques explain its reasoning). The downside is that the constitution must be carefully written; vague or contradictory principles produce inconsistent behavior.

3. RLAIF: Scaling AI Feedback

RLAIF (Reinforcement Learning from AI Feedback) generalizes the CAI approach. Any strong model can serve as the feedback provider, not just the model being trained. Google's research showed that RLAIF can match or exceed RLHF quality when the AI feedback provider is sufficiently capable. This opens up a scalable pipeline where a frontier model provides the preference signal for training smaller models.

Aspect	RLHF	RLAIF / CAI
Feedback source	Human annotators	AI model (self or stronger model)
Cost per comparison	$0.50 to $5.00	$0.001 to $0.01 (API cost)
Throughput	100s per day per annotator	10,000s per hour
Consistency	Variable (inter-annotator disagreement)	High (deterministic at temp=0)
Bias risk	Cultural, demographic, personal	Model-specific (verbosity, sycophancy)
Domain coverage	Limited by annotator expertise	Broad but shallow understanding
Adaptability	Slow (retrain annotators)	Fast (update principles)

4. Self-Play and Iterative Self-Improvement

Beyond single-round CAI, researchers have explored iterative self-improvement where a model's outputs from one training round become the training data for the next. This creates a self-play dynamic similar to AlphaGo's self-improvement through self-play games.

Figure 16.7: Iterative self-improvement through multiple rounds of self-critique and training. Quality improves initially but faces diminishing returns and potential degradation with excessive iterations.

# Iterative Self-Improvement Pipeline
def iterative_self_improvement(
    base_model_path: str,
    constitution: List[ConstitutionalPrinciple],
    prompts: List[str],
    num_rounds: int = 3,
    eval_fn=None,
):
    """Run multiple rounds of self-improvement."""
    from transformers import AutoModelForCausalLM, AutoTokenizer

    current_model_path = base_model_path
    results_per_round = []

    for round_num in range(num_rounds):
        print(f"Round {round_num + 1}/{num_rounds}")

        model = AutoModelForCausalLM.from_pretrained(current_model_path)
        tokenizer = AutoTokenizer.from_pretrained(current_model_path)

        # Phase 1: Generate critique-revision SFT data
        sft_data = build_cai_sft_dataset(
            model, tokenizer, prompts, constitution, rounds=2
        )
        print(f"  Generated {len(sft_data)} SFT examples")

        # Phase 2: Generate RLAIF preferences
        pref_data = build_rlaif_dataset(
            model, tokenizer, prompts, constitution
        )
        print(f"  Generated {len(pref_data)} preference pairs")

        # Train: SFT on revised responses, then DPO on preferences
        new_model_path = f"./cai-round-{round_num + 1}"
        train_sft(model, sft_data, output_dir=f"{new_model_path}-sft")
        train_dpo(
            f"{new_model_path}-sft", pref_data,
            output_dir=new_model_path
        )

        # Evaluate
        if eval_fn:
            metrics = eval_fn(new_model_path)
            results_per_round.append(metrics)
            print(f"  Eval: {metrics}")

            # Early stopping if quality degrades
            if round_num > 0:
                prev = results_per_round[-2]
                curr = results_per_round[-1]
                if curr["quality"] < prev["quality"] * 0.95:
                    print("  Quality degradation detected, stopping.")
                    break

        current_model_path = new_model_path

    return current_model_path, results_per_round

5. The Alignment Tax

A persistent concern in alignment research is the "alignment tax": the cost in general capabilities that alignment training imposes. Models trained with RLHF or CAI sometimes perform worse on benchmarks that measure raw knowledge, reasoning, or coding ability compared to their unaligned base models. This creates a tension between safety and capability.

⚠ Warning

The alignment tax is real but often overstated. Careful alignment training with appropriate KL penalties preserves most general capabilities. The bigger risk is over-alignment, where the model becomes excessively cautious, refusing legitimate requests or hedging every statement with unnecessary disclaimers. Finding the right balance requires continuous evaluation across both safety and capability benchmarks.

5.1 Measuring the Alignment Tax

# Measuring alignment tax across capability dimensions
from dataclasses import dataclass
from typing import Dict

@dataclass
class AlignmentTaxReport:
    model_name: str
    base_scores: Dict[str, float]
    aligned_scores: Dict[str, float]

    def compute_tax(self) -> Dict[str, float]:
        """Compute per-benchmark alignment tax."""
        tax = {}
        for benchmark in self.base_scores:
            base = self.base_scores[benchmark]
            aligned = self.aligned_scores.get(benchmark, 0)
            tax[benchmark] = (base - aligned) / base * 100
        return tax

    def report(self):
        tax = self.compute_tax()
        print(f"Alignment Tax Report: {self.model_name}")
        print("-" * 55)
        for bench, pct in tax.items():
            direction = "regression" if pct > 0 else "improvement"
            print(f"  {bench:25s}: {abs(pct):5.1f}% {direction}")
        avg_tax = sum(tax.values()) / len(tax)
        print(f"  {'Average tax':25s}: {avg_tax:5.1f}%")

# Example: comparing base vs. aligned model
report = AlignmentTaxReport(
    model_name="Llama-3.1-8B-Instruct vs Base",
    base_scores={
        "MMLU": 65.2,
        "HumanEval": 42.1,
        "GSM8K": 56.8,
        "TruthfulQA": 38.5,
        "HellaSwag": 78.3,
    },
    aligned_scores={
        "MMLU": 63.8,         # small regression
        "HumanEval": 40.5,    # small regression
        "GSM8K": 58.2,        # improvement (instruction following helps)
        "TruthfulQA": 52.1,   # large improvement (alignment goal)
        "HellaSwag": 76.9,    # small regression
    },
)
report.report()

Alignment Tax Report: Llama-3.1-8B-Instruct vs Base ------------------------------------------------------- MMLU : 2.1% regression HumanEval : 3.8% regression GSM8K : 2.5% improvement TruthfulQA : 35.3% improvement HellaSwag : 1.8% regression Average tax : -5.8%

6. Shallow Safety Alignment

Research has revealed a concerning phenomenon: safety alignment in current models may be more superficial than it appears. Studies have shown that safety training can be undone with minimal fine-tuning (sometimes as few as 10 to 100 examples of harmful content), suggesting that alignment modifies surface-level behavior rather than deeply changing the model's representations.

📝 Note

The fragility of safety alignment has significant implications for open-weight models. If alignment can be reversed with trivial fine-tuning, then releasing aligned open-weight models provides only a modest speed bump against misuse. This observation motivates research into more robust alignment methods that modify deeper representations, as well as complementary approaches like inference-time guardrails and output filtering.

Figure 16.8: Shallow alignment adds a thin safety layer that can be removed by fine-tuning. Deep alignment (the research goal) integrates safety into the model's core representations.

💡 Key Insight

Current alignment techniques (RLHF, DPO, CAI) primarily teach the model when to refuse rather than removing the underlying capability to generate harmful content. This is analogous to teaching someone not to pick locks rather than making them forget how locks work. True robust alignment likely requires deeper modifications to model representations, which is an active area of research in mechanistic interpretability (Module 17).

📝 Section Quiz

1. What are the two phases of Constitutional AI, and what does each produce?

Show Answer

Phase 1 (Supervised Self-Critique) generates SFT training data by having the model critique and revise its own responses against constitutional principles. Phase 2 (RLAIF) generates preference data by having the model judge which of two responses better adheres to the constitution. Phase 1 produces (prompt, revised_response) pairs for SFT. Phase 2 produces (prompt, chosen, rejected) triples for reward model training or DPO.

2. How does the human role differ between RLHF and Constitutional AI?

Show Answer

In RLHF, humans label individual preference pairs (comparing specific responses). In Constitutional AI, humans write the constitution (a set of high-level principles). The human effort shifts from labeling thousands of examples to crafting a small set of well-defined behavioral rules. This makes the alignment specification explicit, auditable, and modifiable.

3. What is the alignment tax, and how can it be measured?

Show Answer

The alignment tax is the reduction in general capabilities (knowledge, reasoning, coding) that results from alignment training. It is measured by comparing the aligned model's performance on standard benchmarks (MMLU, HumanEval, GSM8K) against the unaligned base model. A well-tuned alignment process minimizes this tax while maximizing safety improvements on benchmarks like TruthfulQA.

4. Why is shallow safety alignment a concern for open-weight models?

Show Answer

Research shows that safety alignment can be reversed with minimal fine-tuning (sometimes 10 to 100 harmful examples). For open-weight models where anyone can fine-tune, this means safety training provides only a modest barrier against misuse. The underlying harmful capabilities remain in the model's weights and can be re-exposed with trivial effort.

5. What advantage does RLAIF have over human-labeled RLHF in terms of cost and throughput?

Show Answer

RLAIF costs roughly $0.001 to $0.01 per comparison (API token cost) versus $0.50 to $5.00 for human annotation. Throughput increases from hundreds per annotator per day to tens of thousands per hour. RLAIF is also more consistent (no inter-annotator disagreement) and more easily adaptable (update the constitution). The tradeoff is that AI judges may have systematic biases different from human biases.

✅ Key Takeaways

Constitutional AI replaces human preference annotation with AI self-critique guided by explicit written principles, reducing cost by 100x or more.
The two-phase approach (self-critique for SFT data, AI judging for preferences) can match RLHF quality when the AI feedback provider is capable enough.
RLAIF generalizes CAI by using any strong model as a feedback provider, enabling scalable alignment data generation.
Iterative self-improvement shows initial gains but faces diminishing returns and potential capability degradation after 2 to 3 rounds.
The alignment tax is real but manageable; well-tuned alignment preserves most capabilities while improving safety benchmarks significantly.
Current alignment is shallow: safety behavior can be removed with minimal fine-tuning, motivating research into more robust alignment approaches.