Constitutional AI replaces thousands of human preference labels with a small set of written principles. Instead of hiring annotators to judge every response pair, CAI asks the model itself to critique and revise its outputs according to a "constitution" of behavioral rules. The model generates, self-critiques, revises, and then these revised outputs serve as training data. This approach (developed by Anthropic) dramatically reduces the cost of alignment data collection and allows alignment behavior to be specified declaratively through principles rather than implicitly through examples.
1. The Human Annotation Bottleneck
Standard RLHF requires large volumes of human preference data. OpenAI's InstructGPT used roughly 33,000 human comparisons. As models grow more capable, the annotation challenge intensifies: annotators need domain expertise to evaluate complex outputs, agreement rates drop on subtle quality distinctions, and the cost per comparison rises. Furthermore, human preferences are inherently inconsistent; different annotators often disagree on which response is better, and individual annotators may be inconsistent across sessions.
Constitutional AI addresses this bottleneck by replacing most human annotation with AI-generated feedback. The key insight is that a sufficiently capable model can evaluate its own outputs against explicit principles, and these self-evaluations can serve as a training signal. The human role shifts from labeling individual examples to writing the principles (the "constitution") that guide evaluation.
2. The Constitutional AI Framework
Constitutional AI operates in two phases. The first phase generates training data through self-critique and revision. The second phase trains a preference model on AI-generated comparisons, replacing the human labelers in standard RLHF.
2.1 Phase 1: Critique-Revision Pairs
In Phase 1, the model is presented with potentially harmful or low-quality prompts and generates an initial response. The model then critiques its own response against a specific constitutional principle and produces a revised version. This critique-revision loop can be repeated multiple times, producing progressively better responses.
# Constitutional AI: Phase 1 - Self-Critique and Revision
from dataclasses import dataclass
from typing import List
@dataclass
class ConstitutionalPrinciple:
name: str
critique_prompt: str
revision_prompt: str
# Example constitution (simplified from Anthropic's approach)
CONSTITUTION = [
ConstitutionalPrinciple(
name="helpfulness",
critique_prompt=(
"Identify specific ways in which the assistant's response "
"is unhelpful, incomplete, or fails to address the user's "
"actual question."
),
revision_prompt=(
"Revise the response to be more helpful, complete, and "
"directly address the user's question."
),
),
ConstitutionalPrinciple(
name="harmlessness",
critique_prompt=(
"Identify any content in the response that could be "
"harmful, dangerous, unethical, or that provides "
"instructions for illegal activities."
),
revision_prompt=(
"Revise the response to remove harmful content while "
"still being as helpful as possible for legitimate uses."
),
),
ConstitutionalPrinciple(
name="honesty",
critique_prompt=(
"Identify any claims in the response that are likely "
"false, misleading, or presented with unwarranted "
"confidence. Note where uncertainty should be expressed."
),
revision_prompt=(
"Revise the response to be more truthful, express "
"appropriate uncertainty, and avoid presenting "
"speculation as fact."
),
),
]
def critique_and_revise(model, tokenizer, prompt, response, principle):
"""Apply one critique-revision step using a constitutional principle."""
# Step 1: Generate critique
critique_input = (
f"Here is a conversation:\n\n"
f"Human: {prompt}\n"
f"Assistant: {response}\n\n"
f"Critique request: {principle.critique_prompt}\n"
f"Critique:"
)
critique = model.generate(tokenizer.encode(critique_input))
# Step 2: Generate revision
revision_input = (
f"Here is a conversation:\n\n"
f"Human: {prompt}\n"
f"Assistant: {response}\n\n"
f"Critique: {critique}\n\n"
f"Revision request: {principle.revision_prompt}\n"
f"Revised response:"
)
revised = model.generate(tokenizer.encode(revision_input))
return {"critique": critique, "revised_response": revised}
def build_cai_sft_dataset(model, tokenizer, prompts, constitution, rounds=3):
"""Build SFT data from iterative critique-revision."""
import random
sft_data = []
for prompt in prompts:
# Generate initial (potentially problematic) response
response = model.generate(tokenizer.encode(prompt))
# Apply multiple rounds of critique-revision
for _ in range(rounds):
principle = random.choice(constitution)
result = critique_and_revise(
model, tokenizer, prompt, response, principle
)
response = result["revised_response"]
# Final revised response becomes the SFT target
sft_data.append({"prompt": prompt, "response": response})
return sft_data
2.2 Phase 2: RLAIF (RL from AI Feedback)
In Phase 2, the model acts as a preference annotator. Given a prompt and two candidate responses, the model is asked which response better adheres to the constitutional principles. These AI-generated preferences replace human preference labels in the standard RLHF pipeline. The resulting preference dataset can be used to train a reward model (for PPO) or directly for DPO training.
# Constitutional AI: Phase 2 - RLAIF Preference Generation
import random
def generate_ai_preference(
judge_model,
tokenizer,
prompt: str,
response_a: str,
response_b: str,
principles: List[ConstitutionalPrinciple],
) -> dict:
"""Use the model itself to judge which response is better."""
# Select a random principle for this comparison
principle = random.choice(principles)
judge_prompt = (
f"Consider the following principle: {principle.name}\n"
f"{principle.critique_prompt}\n\n"
f"Human: {prompt}\n\n"
f"Response A: {response_a}\n\n"
f"Response B: {response_b}\n\n"
f"Which response better follows the principle above? "
f"Answer with just 'A' or 'B' and explain briefly."
)
judgment = judge_model.generate(
tokenizer.encode(judge_prompt),
max_new_tokens=100,
temperature=0.0,
)
judgment_text = tokenizer.decode(judgment)
# Parse the judgment
winner = "A" if judgment_text.strip().startswith("A") else "B"
if winner == "A":
return {"prompt": prompt, "chosen": response_a, "rejected": response_b}
else:
return {"prompt": prompt, "chosen": response_b, "rejected": response_a}
def build_rlaif_dataset(
model, tokenizer, prompts, principles, samples_per_prompt=4
):
"""Build a full RLAIF preference dataset."""
import itertools
preference_pairs = []
for prompt in prompts:
# Generate multiple candidate responses
responses = []
for _ in range(samples_per_prompt):
resp = model.generate(
tokenizer.encode(prompt),
do_sample=True,
temperature=0.8,
)
responses.append(tokenizer.decode(resp))
# Create pairwise comparisons
for a, b in itertools.combinations(responses, 2):
pair = generate_ai_preference(
model, tokenizer, prompt, a, b, principles
)
preference_pairs.append(pair)
return preference_pairs
The power of CAI is that alignment behavior becomes declarative. Instead of implicitly defining "good behavior" through thousands of labeled examples, you explicitly state the rules. This makes alignment auditable (you can read the constitution), modifiable (change a principle, retrain), and transparent (the model's self-critiques explain its reasoning). The downside is that the constitution must be carefully written; vague or contradictory principles produce inconsistent behavior.
3. RLAIF: Scaling AI Feedback
RLAIF (Reinforcement Learning from AI Feedback) generalizes the CAI approach. Any strong model can serve as the feedback provider, not just the model being trained. Google's research showed that RLAIF can match or exceed RLHF quality when the AI feedback provider is sufficiently capable. This opens up a scalable pipeline where a frontier model provides the preference signal for training smaller models.
| Aspect | RLHF | RLAIF / CAI |
|---|---|---|
| Feedback source | Human annotators | AI model (self or stronger model) |
| Cost per comparison | $0.50 to $5.00 | $0.001 to $0.01 (API cost) |
| Throughput | 100s per day per annotator | 10,000s per hour |
| Consistency | Variable (inter-annotator disagreement) | High (deterministic at temp=0) |
| Bias risk | Cultural, demographic, personal | Model-specific (verbosity, sycophancy) |
| Domain coverage | Limited by annotator expertise | Broad but shallow understanding |
| Adaptability | Slow (retrain annotators) | Fast (update principles) |
4. Self-Play and Iterative Self-Improvement
Beyond single-round CAI, researchers have explored iterative self-improvement where a model's outputs from one training round become the training data for the next. This creates a self-play dynamic similar to AlphaGo's self-improvement through self-play games.
# Iterative Self-Improvement Pipeline
def iterative_self_improvement(
base_model_path: str,
constitution: List[ConstitutionalPrinciple],
prompts: List[str],
num_rounds: int = 3,
eval_fn=None,
):
"""Run multiple rounds of self-improvement."""
from transformers import AutoModelForCausalLM, AutoTokenizer
current_model_path = base_model_path
results_per_round = []
for round_num in range(num_rounds):
print(f"Round {round_num + 1}/{num_rounds}")
model = AutoModelForCausalLM.from_pretrained(current_model_path)
tokenizer = AutoTokenizer.from_pretrained(current_model_path)
# Phase 1: Generate critique-revision SFT data
sft_data = build_cai_sft_dataset(
model, tokenizer, prompts, constitution, rounds=2
)
print(f" Generated {len(sft_data)} SFT examples")
# Phase 2: Generate RLAIF preferences
pref_data = build_rlaif_dataset(
model, tokenizer, prompts, constitution
)
print(f" Generated {len(pref_data)} preference pairs")
# Train: SFT on revised responses, then DPO on preferences
new_model_path = f"./cai-round-{round_num + 1}"
train_sft(model, sft_data, output_dir=f"{new_model_path}-sft")
train_dpo(
f"{new_model_path}-sft", pref_data,
output_dir=new_model_path
)
# Evaluate
if eval_fn:
metrics = eval_fn(new_model_path)
results_per_round.append(metrics)
print(f" Eval: {metrics}")
# Early stopping if quality degrades
if round_num > 0:
prev = results_per_round[-2]
curr = results_per_round[-1]
if curr["quality"] < prev["quality"] * 0.95:
print(" Quality degradation detected, stopping.")
break
current_model_path = new_model_path
return current_model_path, results_per_round
5. The Alignment Tax
A persistent concern in alignment research is the "alignment tax": the cost in general capabilities that alignment training imposes. Models trained with RLHF or CAI sometimes perform worse on benchmarks that measure raw knowledge, reasoning, or coding ability compared to their unaligned base models. This creates a tension between safety and capability.
The alignment tax is real but often overstated. Careful alignment training with appropriate KL penalties preserves most general capabilities. The bigger risk is over-alignment, where the model becomes excessively cautious, refusing legitimate requests or hedging every statement with unnecessary disclaimers. Finding the right balance requires continuous evaluation across both safety and capability benchmarks.
5.1 Measuring the Alignment Tax
# Measuring alignment tax across capability dimensions
from dataclasses import dataclass
from typing import Dict
@dataclass
class AlignmentTaxReport:
model_name: str
base_scores: Dict[str, float]
aligned_scores: Dict[str, float]
def compute_tax(self) -> Dict[str, float]:
"""Compute per-benchmark alignment tax."""
tax = {}
for benchmark in self.base_scores:
base = self.base_scores[benchmark]
aligned = self.aligned_scores.get(benchmark, 0)
tax[benchmark] = (base - aligned) / base * 100
return tax
def report(self):
tax = self.compute_tax()
print(f"Alignment Tax Report: {self.model_name}")
print("-" * 55)
for bench, pct in tax.items():
direction = "regression" if pct > 0 else "improvement"
print(f" {bench:25s}: {abs(pct):5.1f}% {direction}")
avg_tax = sum(tax.values()) / len(tax)
print(f" {'Average tax':25s}: {avg_tax:5.1f}%")
# Example: comparing base vs. aligned model
report = AlignmentTaxReport(
model_name="Llama-3.1-8B-Instruct vs Base",
base_scores={
"MMLU": 65.2,
"HumanEval": 42.1,
"GSM8K": 56.8,
"TruthfulQA": 38.5,
"HellaSwag": 78.3,
},
aligned_scores={
"MMLU": 63.8, # small regression
"HumanEval": 40.5, # small regression
"GSM8K": 58.2, # improvement (instruction following helps)
"TruthfulQA": 52.1, # large improvement (alignment goal)
"HellaSwag": 76.9, # small regression
},
)
report.report()
6. Shallow Safety Alignment
Research has revealed a concerning phenomenon: safety alignment in current models may be more superficial than it appears. Studies have shown that safety training can be undone with minimal fine-tuning (sometimes as few as 10 to 100 examples of harmful content), suggesting that alignment modifies surface-level behavior rather than deeply changing the model's representations.
The fragility of safety alignment has significant implications for open-weight models. If alignment can be reversed with trivial fine-tuning, then releasing aligned open-weight models provides only a modest speed bump against misuse. This observation motivates research into more robust alignment methods that modify deeper representations, as well as complementary approaches like inference-time guardrails and output filtering.
Current alignment techniques (RLHF, DPO, CAI) primarily teach the model when to refuse rather than removing the underlying capability to generate harmful content. This is analogous to teaching someone not to pick locks rather than making them forget how locks work. True robust alignment likely requires deeper modifications to model representations, which is an active area of research in mechanistic interpretability (Module 17).
📝 Section Quiz
Show Answer
Show Answer
Show Answer
Show Answer
Show Answer
✅ Key Takeaways
- Constitutional AI replaces human preference annotation with AI self-critique guided by explicit written principles, reducing cost by 100x or more.
- The two-phase approach (self-critique for SFT data, AI judging for preferences) can match RLHF quality when the AI feedback provider is capable enough.
- RLAIF generalizes CAI by using any strong model as a feedback provider, enabling scalable alignment data generation.
- Iterative self-improvement shows initial gains but faces diminishing returns and potential capability degradation after 2 to 3 rounds.
- The alignment tax is real but manageable; well-tuned alignment preserves most capabilities while improving safety benchmarks significantly.
- Current alignment is shallow: safety behavior can be removed with minimal fine-tuning, motivating research into more robust alignment approaches.