The Constitutional AI Data Loop

Anthropic's Claude models were trained on a feedback signal that was, in large part, written by Claude itself. Roughly 96 principles - the "constitution" - governed what the model was allowed to say, and the labelling work that would ordinarily require thousands of human annotators was delegated back to the model under instruction. That is the Constitutional AI (CAI) loop: a closed synthetic data pipeline where the policy model is simultaneously the student, the critic, and most of the grader.

The Two-Stage Architecture

CAI separates training into two sequential stages, each producing a distinct dataset.

Stage 1 - Supervised learning from revisions (SL-CAI).
The initial model (call it M_0) is shown a harmful or borderline prompt. It generates a response, then is immediately prompted to critique that response against a randomly sampled constitutional principle, for example: "Identify specific ways in which the assistant's last response is harmful, unethical, racist, sexist, toxic, dangerous, or illegal." The model produces a critique, then a revised response; the critique-and-revise step can be repeated. The resulting (prompt, final-revision) pair becomes a supervised fine-tuning example. After enough such pairs have accumulated, M_0 is fine-tuned on them to produce M_SL. The supervision signal is entirely self-generated: no human labeller writes or ranks these examples.
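
The Stage 1 loop can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the pipeline Anthropic used: `generate` stands in for any text-completion call, `stub_model` is a hypothetical canned model so the sketch runs offline, and the revision-request wording and prompt template are invented for the example.

```python
import random
from typing import Callable, Dict, List, Tuple

# Each entry pairs a critique request with a revision request.
# The critique text is quoted from the article; the revision text is an assumption.
PRINCIPLES: List[Tuple[str, str]] = [
    (
        "Identify specific ways in which the assistant's last response is harmful, "
        "unethical, racist, sexist, toxic, dangerous, or illegal.",
        "Rewrite the response to remove the problems identified in the critique.",
    ),
]


def critique_and_revise(
    generate: Callable[[str], str],
    prompt: str,
    rng: random.Random,
    n_rounds: int = 1,
) -> Dict[str, str]:
    """Return one (prompt, final revision) SFT example."""
    response = generate(f"Human: {prompt}\n\nAssistant:")
    for _ in range(n_rounds):
        # Sample one principle at random for this round.
        critique_req, revision_req = rng.choice(PRINCIPLES)
        context = (
            f"Human: {prompt}\n\nAssistant: {response}"
            f"\n\nCritique request: {critique_req}\n\nCritique:"
        )
        critique = generate(context)
        # The revision replaces the previous response for the next round.
        response = generate(
            f"{context} {critique}\n\nRevision request: {revision_req}\n\nRevision:"
        )
    return {"prompt": prompt, "completion": response}


def stub_model(text: str) -> str:
    """Hypothetical stand-in for M_0: canned outputs keyed on the prompt suffix."""
    if text.endswith("Revision:"):
        return "I can't help with that, but I can explain how locks work."
    if text.endswith("Critique:"):
        return "The response gives instructions that could enable a break-in."
    return "Sure, here is how to do it."


example = critique_and_revise(stub_model, "How do I pick a lock?", random.Random(0))
```

Only the original prompt and the last revision are kept; the intermediate critique is discarded, which matches the (prompt, final-revision) pairing described above.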

Stage 2 - Reinforcement learning from AI feedback (RL-CAI / RLAIF).
M_SL generates pairs of responses to each prompt. A separate "feedback model" (typically a frozen LM, often a larger one) is prompted to choose which response better satisfies a constitutional principle, framed as a multiple-choice question such as "Which of these responses is less harmful?" The resulting AI preference labels train a preference model (PM). M_SL is then optimised against the PM's reward using Proximal Policy Optimisation (PPO), yielding the final model, M_RL.
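
The labelling step can be sketched as follows. Everything here is illustrative: `option_logprobs` is a hypothetical callable that returns the feedback model's log-probabilities for answering "(A)" or "(B)", the prompt template is invented, and keeping a soft probability rather than a hard choice is one design option assumed for this sketch.

```python
import math
from typing import Callable, Dict


def format_comparison(prompt: str, response_a: str, response_b: str, principle: str) -> str:
    """Build a multiple-choice question for the feedback model (assumed template)."""
    return (
        f"Consider the following conversation:\nHuman: {prompt}\n\n"
        f"{principle}\n(A) {response_a}\n(B) {response_b}\n\nThe answer is:"
    )


def soft_preference(logprob_a: float, logprob_b: float) -> float:
    """Normalise two option log-probs into P(A is preferred)."""
    m = max(logprob_a, logprob_b)  # subtract the max for numerical stability
    ea, eb = math.exp(logprob_a - m), math.exp(logprob_b - m)
    return ea / (ea + eb)


def label_pair(
    option_logprobs: Callable[[str], Dict[str, float]],
    prompt: str,
    response_a: str,
    response_b: str,
    principle: str = "Which of these responses is less harmful?",
) -> Dict[str, object]:
    """Produce one preference-model training record from AI feedback."""
    question = format_comparison(prompt, response_a, response_b, principle)
    lp = option_logprobs(question)
    return {
        "prompt": prompt,
        "response_a": response_a,
        "response_b": response_b,
        "p_a": soft_preference(lp["A"], lp["B"]),
    }


def stub_feedback(question: str) -> Dict[str, float]:
    """Hypothetical feedback model that prefers option A three times out of four."""
    return {"A": math.log(0.75), "B": math.log(0.25)}


record = label_pair(stub_feedback, "How do I pick a lock?", "I can't help with that.", "Sure, first...")
```

A hard label can be recovered by thresholding `p_a` at 0.5; keeping the probability preserves how confident the feedback model was.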

Prompt
  └─► M_0  ─── critique(principle) ──► revised response
                                              │
                       SFT on (prompt, revision) pairs
                                              │
                                          M_SL
                                              │
              M_SL generates response_A, response_B
                                              │
         Feedback LM labels: "A is better" / "B is better"
                                              │
                          Train preference model PM
                                              │
                        RL (PPO): optimise M_SL against PM
                                              │
                                          M_RL
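
The "train preference model" box in the diagram is commonly implemented with a pairwise logistic (Bradley-Terry style) loss. The sketch below assumes that form; the article does not specify the loss, and the function names are hypothetical. `score_a` and `score_b` are the scalar scores the PM assigns to the two responses, and `p_a` is the feedback model's probability that A is the better one.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def pairwise_pm_loss(score_a: float, score_b: float, p_a: float = 1.0) -> float:
    """Cross-entropy between the label p_a and sigmoid(score_a - score_b)."""
    margin = score_a - score_b
    return -(p_a * log_sigmoid(margin) + (1.0 - p_a) * log_sigmoid(-margin))


# The loss is small when the PM agrees with the label and large when it disagrees.
agree = pairwise_pm_loss(2.0, 0.0, p_a=1.0)
disagree = pairwise_pm_loss(0.0, 2.0, p_a=1.0)
```

During RL, the trained PM's scalar score for a sampled response serves as the reward that PPO maximises.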

Why the Constitution Matters

The word "constitutional" is precise: the principles are explicit, versioned, and auditable. In conventional RLHF, by contrast, the criteria behind human preferences are largely tacit and can vary from one annotator to the next. When a labeller chooses response A over B, you cannot inspect their reasoning; when a principle says "prefer the response that is least likely to contain false information," that criterion is visible in the training pipeline.

This has two practical consequences:

Scalable oversight. A single list of principles, applied by a capable LM, can label millions of preference pairs. Lee et al. (2024) reported that RLAIF reaches win rates comparable to RLHF on summarisation and dialogue tasks and, in some settings, exceeds them.
Alignment transparency. Teams can diff the constitution between model versions the same way they diff code. Behavioural changes become partially traceable to principle changes.
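
To make the "diff the constitution" idea concrete, here is a minimal sketch using Python's standard `difflib`. It assumes a constitution is stored as a plain list of principle strings; the two versions and their names are hypothetical.

```python
import difflib
from typing import List


def diff_constitutions(
    old: List[str],
    new: List[str],
    old_name: str = "constitution_v1",
    new_name: str = "constitution_v2",
) -> List[str]:
    """Return a unified diff between two lists of principles."""
    return list(
        difflib.unified_diff(old, new, fromfile=old_name, tofile=new_name, lineterm="")
    )


# Hypothetical versions: the second principle is reworded in v2.
v1 = [
    "Prefer the response that is least likely to contain false information.",
    "Prefer the response that is less harmful.",
]
v2 = [
    "Prefer the response that is least likely to contain false information.",
    "Prefer the response that is less harmful and more honest about uncertainty.",
]

for line in diff_constitutions(v1, v2):
    print(line)
```

Lines prefixed with `-` and `+` show which principles were removed or added, so a reviewer can check them against observed behavioural changes.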
