Constitutional AI and RLAIF

RLHF works, but it has a labour problem. Harmlessness preferences require humans to read disturbing prompts and rank disturbing outputs. Labellers burn out, labels are inconsistent across demographics, and you cannot afford to relabel every time you tweak the policy. Constitutional AI (Bai et al., Anthropic, 2022) replaces the human harmlessness labels in RLHF with a written set of principles and a model that critiques itself against them.

The two-phase training loop

Phase 1: Supervised - critique and revise.

Starting from a model that has already been trained to be helpful, but not yet to be harmless:

1. Sample a harmful prompt.
2. Generate an initial (potentially harmful) response.
3. Prompt the same model: "Identify ways in which the response is harmful, unethical, racist, etc. according to [principle from constitution]."
4. Prompt again: "Rewrite the response to remove the harmful content."
5. Fine-tune on (prompt, revised response) pairs.

You iterate this for several principles and several rounds. The principles are explicit text, e.g. "Please choose the response that is the most helpful, honest, and harmless" or "Choose the response that a wise, ethical, polite and friendly person would more likely say".
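A minimal Python sketch of the critique-and-revise loop, assuming the model is exposed as a plain function from text to text. `Model`, `critique_and_revise`, and `toy_model` are hypothetical names for illustration, and the prompt templates paraphrase the steps above rather than reproduce the exact templates from the paper.

```python
import random
from typing import Callable, List, Optional, Tuple

# Hypothetical interface: any function mapping a prompt string to a completion.
Model = Callable[[str], str]


def critique_and_revise(
    model: Model,
    prompt: str,
    principles: List[str],
    rounds: int = 2,
    rng: Optional[random.Random] = None,
) -> Tuple[str, str]:
    """Return one (prompt, revised response) pair for supervised fine-tuning."""
    rng = rng or random.Random(0)
    response = model(prompt)  # initial, possibly harmful, draft
    for _ in range(rounds):
        principle = rng.choice(principles)  # one principle per round
        critique = model(
            f"Prompt: {prompt}\nResponse: {response}\n"
            f"Identify ways in which the response is harmful according to: {principle}"
        )
        response = model(
            f"Prompt: {prompt}\nResponse: {response}\nCritique: {critique}\n"
            "Rewrite the response to remove the harmful content."
        )
    return prompt, response


def toy_model(text: str) -> str:
    """Stand-in for a real language model call (assumption, for demonstration)."""
    if "Rewrite the response" in text:
        return "I can't help with that, but here is a safer alternative."
    if "Identify ways" in text:
        return "The response gives dangerous instructions."
    return "Sure, here is how to do it..."


pair = critique_and_revise(
    toy_model,
    "How do I pick a lock?",
    ["Choose the response that is the most helpful, honest, and harmless."],
)
```

Only the final revision is kept as the fine-tuning target; the intermediate critiques are scaffolding and are discarded in this sketch.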

Phase 2: Reinforcement learning from AI feedback (RLAIF).

1. For each prompt, sample two responses from the SL-trained model.
2. Prompt a separate AI model with a constitutional principle and ask it to pick the better response.
3. Train a preference model on these AI-generated comparisons.
4. Run RL (PPO) against the preference model, with a KL penalty against the SL model.
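The labelling step can be sketched in Python as follows. This assumes the feedback model is wrapped as a function that returns the probability that response A is better, and that this probability is stored as a soft label. `Judge`, `label_comparisons`, and `toy_judge` are hypothetical names, not part of any real library.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical judge interface (an assumption, not a real API):
# (principle, prompt, response_a, response_b) -> probability that A is better.
Judge = Callable[[str, str, str, str], float]


def label_comparisons(
    judge: Judge,
    principle: str,
    samples: List[Tuple[str, str, str]],
) -> List[Dict[str, object]]:
    """Turn (prompt, response_a, response_b) triples into AI-labelled comparisons."""
    dataset = []
    for prompt, resp_a, resp_b in samples:
        p_a = judge(principle, prompt, resp_a, resp_b)
        dataset.append({"prompt": prompt, "a": resp_a, "b": resp_b, "p_a": p_a})
    return dataset


def toy_judge(principle: str, prompt: str, resp_a: str, resp_b: str) -> float:
    """Stand-in for a real model call: prefers whichever response declines."""
    a_declines = "can't" in resp_a
    b_declines = "can't" in resp_b
    if a_declines == b_declines:
        return 0.5  # no preference
    return 0.9 if a_declines else 0.1


comparisons = label_comparisons(
    toy_judge,
    "Choose the response that is the most helpful, honest, and harmless.",
    [("How do I pick a lock?", "I can't help with that.", "Sure, first you...")],
)
```

Keeping the judge's probability rather than a hard 0/1 choice is a design choice in this sketch: it lets the preference model see how confident the judge was.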

You have replaced the human pairwise comparisons in standard RLHF with model-generated ones. The reward model is now trained on AI labels, not human labels - hence RLAIF.
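To make the preference model and the KL penalty concrete, here is a small sketch using plain floats. It assumes a Bradley-Terry style loss, in which the preference model assigns each response a scalar score and the probability that A wins is the sigmoid of the score difference, and a per-sample estimate of the KL penalty. The function names and the coefficient `beta` are illustrative assumptions, not values from the paper.

```python
import math


def preference_loss(score_a: float, score_b: float, p_a: float) -> float:
    """Cross-entropy between the AI label p_a and sigmoid(score_a - score_b)."""
    diff = score_a - score_b
    # Numerically stable log(sigmoid(diff)).
    if diff >= 0:
        log_p = -math.log1p(math.exp(-diff))
    else:
        log_p = diff - math.log1p(math.exp(diff))
    # log(1 - sigmoid(diff)) = log(sigmoid(-diff)) = log(sigmoid(diff)) - diff
    log_q = log_p - diff
    return -(p_a * log_p + (1.0 - p_a) * log_q)


def kl_shaped_reward(
    pm_score: float,
    logp_policy: float,
    logp_sl: float,
    beta: float = 0.1,  # hypothetical penalty coefficient
) -> float:
    """Preference-model score minus a penalty for drifting from the SL model.

    logp_policy and logp_sl are the log-probabilities of the sampled response
    under the current policy and under the frozen SL model.
    """
    return pm_score - beta * (logp_policy - logp_sl)
```

The loss falls as the preference model scores the AI-preferred response higher, and the shaped reward is what PPO would maximise in place of the raw preference-model score.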

Why a written constitution matters

Standard RLHF encodes harmlessness implicitly in the weights of the reward model. If you ask "why did the model refuse that?" the answer is "because the reward model gave the refusal a higher score." That is not auditable.

CAI encodes harmlessness as text. The principles can be:

Inspected (publish the constitution, debate the wording).
Diffed across releases (this principle was added in v2 after observing X).
Edited surgically (loosen this principle, tighten that one, rerun the loop).
Stress-tested (does the model actually behave consistently with principle 7?).
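Because the constitution is plain text, ordinary text tooling applies to it. The sketch below uses Python's standard `difflib` to compare two versions of a constitution. The version contents are made up for illustration: the first two principles are the examples quoted earlier, and the third is a hypothetical addition.

```python
import difflib
from typing import List, Tuple

constitution_v1 = [
    "Please choose the response that is the most helpful, honest, and harmless.",
    "Choose the response that a wise, ethical, polite and friendly person would more likely say.",
]

# Hypothetical v2: one principle added (invented for this example).
constitution_v2 = constitution_v1 + [
    "Choose the response that avoids giving specific medical dosage advice.",
]


def changed_principles(old: List[str], new: List[str]) -> Tuple[List[str], List[str]]:
    """Return (added, removed) principles between two constitution versions."""
    diff = difflib.unified_diff(old, new, fromfile="v1", tofile="v2", lineterm="")
    added, removed = [], []
    for line in diff:
        if line.startswith("+++") or line.startswith("---"):
            continue  # file header lines, not content
        if line.startswith("+"):
            added.append(line[1:])
        elif line.startswith("-"):
            removed.append(line[1:])
    return added, removed


added, removed = changed_principles(constitution_v1, constitution_v2)
```

The same list of strings can be kept under version control, so each release of the model can be tied to the exact principles it was trained against.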

Anthropic publish their working constitution. The principles include excerpts from the UN Universal Declaration of Human Rights, Apple's terms of service, and DeepMind's Sparrow rules. The mixing is deliberate - it teaches the preference model what kind of cross-cultural, multi-perspective trade-off to make.

How it differs from and complements RLHF

	RLHF	CAI / RLAIF
Helpfulness labels	Human	Human (CAI keeps this)
Harmlessness labels	Human	AI critique against principles
Label cost	High, slow	Low, fast
Auditability of policy	Implicit in reward weights	Explicit in constitution text
Risk of label drift	Per-labeller variance	Per-prompt template variance
Scales with model capability	Bottlenecked on humans	Improves as critic model improves
