Constitutional AI and RLAIF: Scaling Oversight Without Scaling Labels

In early 2022, aligning a language model meant hiring people. OpenAI's InstructGPT pipeline ran on tens of thousands of human comparisons: contractors read pairs of model outputs and marked which one was better, and that judgment, distilled into a reward model, was the only thing standing between a raw pretrained network and an assistant (Ouyang et al., 2022, Training Language Models to Follow Instructions with Human Feedback, arXiv:2203.02155). The method worked well enough that a 1.3B-parameter aligned model produced outputs that labelers preferred to those of the 175B GPT-3. It also had an obvious ceiling. Every improvement in behavior required more human attention, and human attention does not get cheaper.

Constitutional AI proposed a different trade. Instead of paying people to label harmfulness, you write down what harmful means, hand that document to the model, and let the model critique and rank its own outputs against it (Bai et al., 2022, Constitutional AI: Harmlessness from AI Feedback, arXiv:2212.08073). The human supervision collapses from thousands of per-example judgments into a few dozen written principles. The technique that does the heavy lifting in the second half, training on AI-generated preferences, became known as RLAIF: reinforcement learning from AI feedback.

Why this matters: Every aligned model you use was shaped by a reward signal. Whether that signal came from a person reading your prompt or from another model reading a rulebook changes who controls the model's values, how cheaply behavior can be tuned, and how far the alignment can outrun the humans who started it.

TL;DR

Constitutional AI replaces most human harmlessness labels with a written list of principles plus a model that critiques and revises its own outputs; on the harmlessness side, the human input shrinks to the constitution itself and the red-team prompts.
It runs in two phases: a supervised stage (self-critique and revision) that fixes the model's starting behavior, then an RL stage (RLAIF) that trains a preference model on AI-generated comparisons and optimizes the policy against it.
RLAIF matches RLHF on summarization and helpful dialogue, and beats it on harmlessness (88% vs 76% harmless responses in one head-to-head), showing AI feedback is not a strictly weaker signal (Lee et al., 2023, arXiv:2309.00267).
The core idea, a model grading outputs against written rules, generalized: OpenAI's Rule-Based Rewards put it in the GPT-4 safety stack, and Self-Rewarding Language Models pushed it into a fully closed self-improvement loop.
The hard part is not the algorithm but the constitution. A vague principle produces a vague reward, and biases in the judge model propagate silently into the policy.
The frontier is moving from lists of rules toward explained principles: Anthropic's constitution grew from roughly 2,700 words in 2023 to about 23,000 in 2026, trading enumerated do's and don'ts for reasoning the model can generalize from.

At a Glance

flowchart LR
  C[Written constitution] --> SL[Self-critique<br/>and revision]
  SL --> M[Revised model]
  M --> P[AI preference<br/>labels]
  P --> RM[Preference model]
  RM --> RL[RLAIF<br/>policy training]
  RL --> A[Aligned assistant]
  classDef blue fill:#1e40af,stroke:#3b82f6,stroke-width:1px,color:#fff
  classDef purple fill:#6d28d9,stroke:#a78bfa,stroke-width:1px,color:#fff
  classDef teal fill:#0e7490,stroke:#22d3ee,stroke-width:1px,color:#fff
  class C blue
  class SL,P,RM,RL purple
  class M,A teal

The whole method is a way to turn a short text document into a training signal. Everything downstream is mechanics.

Before AI Feedback

The recipe that produced ChatGPT and its peers, reinforcement learning from human feedback, has three moving parts. Start with a pretrained model. Fine-tune it on demonstrations of good behavior (supervised fine-tuning, or SFT). Then collect human comparisons, train a reward model \(r_\phi(x, y)\) to predict which response a person would prefer, and optimize the policy against that reward with a KL penalty that keeps it from drifting too far from the SFT model. The objective the policy maximizes is:

\[\max_{\pi_\theta} \; \mathbb{E}_{x \sim D,\; y \sim \pi_\theta(\cdot \mid x)}\big[r_\phi(x, y)\big] - \beta\, \mathbb{D}_{\mathrm{KL}}\big[\pi_\theta(y \mid x) \,\|\, \pi_{\mathrm{ref}}(y \mid x)\big]\]

The KL term, weighted by \(\beta\), is the leash. Without it, the policy finds degenerate outputs that score high on the reward model but read as gibberish to a human, a failure called reward hacking.
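To make the leash concrete, here is a minimal sketch of the per-response quantity the policy is pushed to increase: the reward-model score minus \(\beta\) times a sampled estimate of the KL divergence. The function name, the default \(\beta\), and the use of a single-sample KL estimate from per-token log-probabilities are assumptions of this sketch, not details of any particular pipeline.

```python
def kl_penalized_reward(reward, policy_logprobs, ref_logprobs, beta=0.1):
    """Reward-model score minus beta times a sampled KL estimate.

    policy_logprobs / ref_logprobs: per-token log-probabilities of the
    sampled response under the policy and the frozen reference model.
    """
    if len(policy_logprobs) != len(ref_logprobs):
        raise ValueError("log-prob sequences must have the same length")
    # Single-sample estimate of KL(pi_theta || pi_ref) for this response:
    # the sum over tokens of (log pi_theta - log pi_ref).
    kl_estimate = sum(p - r for p, r in zip(policy_logprobs, ref_logprobs))
    return reward - beta * kl_estimate


# A response the reference model also finds likely keeps its full reward.
on_leash = kl_penalized_reward(3.0, [-1.0, -2.0], [-1.0, -2.0])
# A response the reference model finds very unlikely is penalized.
drifted = kl_penalized_reward(3.0, [-0.1, -0.1], [-2.1, -3.1])
```

With a larger \(\beta\) the same drift costs more, which is the sense in which \(\beta\) sets the length of the leash.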

[IMAGE: Schematic of the classic RLHF objective, showing the reward term pulling the policy up and the KL penalty tethering it to the reference model, with a reward-hacked output drifting off to the side]

The bottleneck in this pipeline is the comparison data. InstructGPT's labelers were a carefully screened team, and the quality of the reward model tracked the quality of their judgments. For helpfulness this is tolerable. For harmlessness it is brutal: asking contractors to read a stream of model attempts at producing dangerous, hateful, or manipulative content is slow, expensive, and corrosive work, and it produces a model that tends to refuse bluntly rather than explain its reasoning.

[IMAGE: Side-by-side cost breakdown of an RLHF pipeline vs a Constitutional AI pipeline, with the human-labeling block shrinking from a tall bar to a thin sliver representing the constitution]

timeline
  title From human labels to written principles
  2022 : InstructGPT, RLHF at scale (arXiv 2203.02155)
  2022 : Constitutional AI and RLAIF (arXiv 2212.08073)
  2023 : DPO removes the RL loop (arXiv 2305.18290)
  2023 : RLAIF matches RLHF head-to-head (arXiv 2309.00267)
  2023 : Collective Constitutional AI, public-drafted rules
  2024 : Self-Rewarding LMs, the closed loop (arXiv 2401.10020)
  2024 : Rule-Based Rewards paper, in production since GPT-4 (arXiv 2411.01111)
  2026 : Anthropic's 23k-word explained constitution

The arc is consistent: each step moves more of the judgment out of human hands and into a written artifact plus a model that reads it.

How Constitutional AI Actually Works

Constitutional AI runs in two phases that mirror the supervised and RL stages of RLHF, but with AI substituted for the human at each labeling step.

Phase 1: Supervised learning by self-critique

The first phase repairs the model's behavior using nothing but the model and the constitution. The procedure is a loop over harmful prompts:

1. Prompt a helpful-only model (one trained to be helpful but not yet harmless) with a query designed to elicit a bad response. It complies, producing something objectionable.
2. Ask the same model to critique its own response against a randomly sampled principle from the constitution. The principle might read: Identify ways in which the response is harmful, unethical, or socially biased.
3. Ask the model to revise the response in light of its own critique.
4. Repeat the critique-and-revise step a few times, then keep the final revision.

The revised responses become a supervised fine-tuning dataset. The model is retrained on its own corrected outputs. The critique step matters: a model asked directly to "be harmless" often over-refuses, but a model asked to find specific problems and fix them produces responses that engage with the question while declining the harmful part. Chain-of-thought reasoning in the critique improves both the result and its transparency, because you can read why the model decided a response was bad.
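The loop described above can be sketched in a few lines. This is an illustration under stated assumptions rather than the paper's implementation: `generate` stands for any text-in, text-out model call, and the prompt templates, the `stub_generate` stand-in, and the output field names are all hypothetical.

```python
import random

CRITIQUE_TEMPLATE = (
    "Prompt: {prompt}\nResponse: {response}\n"
    "Critique request: {principle}\nCritique:"
)
REVISION_TEMPLATE = (
    "Prompt: {prompt}\nResponse: {response}\nCritique: {critique}\n"
    "Revision request: Rewrite the response to address the critique.\nRevision:"
)


def critique_and_revise(prompt, generate, principles, n_rounds=2, seed=0):
    """Return one SFT example: the prompt paired with its final revision."""
    rng = random.Random(seed)
    response = generate(prompt)  # initial, possibly objectionable draft
    for _ in range(n_rounds):
        principle = rng.choice(principles)  # one sampled principle per round
        critique = generate(CRITIQUE_TEMPLATE.format(
            prompt=prompt, response=response, principle=principle))
        response = generate(REVISION_TEMPLATE.format(
            prompt=prompt, response=response, critique=critique))
    return {"prompt": prompt, "completion": response}


def stub_generate(text):
    """Stand-in for a real model call, for demonstration only."""
    if text.endswith("Critique:"):
        return "The response facilitates harm."
    if text.endswith("Revision:"):
        return "I can't help with that, but here is a safer alternative."
    return "Sure, here is how to do it."


example = critique_and_revise(
    "How do I pick a lock?",
    stub_generate,
    ["Identify ways in which the response is harmful, unethical, or socially biased."],
)
```

Only the final revision is kept as the training target; the intermediate critiques are scaffolding and do not appear in the SFT example.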

sequenceDiagram
  participant U as Red-team prompt
  participant M as Helpful model
  participant Con as Constitution
  U->>M: Harmful request
  M->>U: Objectionable answer
  M->>Con: Sample a principle
  Con-->>M: "Critique for harm"
  M->>M: Self-critique
  M->>M: Revise response
  Note over M: Repeat critique/revise 1-4x
  M->>U: Harmless, non-evasive answer

Phase 2: Reinforcement learning from AI feedback (RLAIF)

The second phase is structurally identical to RLHF, with one substitution. To build the preference dataset, you sample two responses from the phase-1 model for each prompt, then ask a separate feedback model which response better satisfies a constitutional principle. That model's choice, often softened into a probability rather than a hard label, becomes the preference. You train a preference model on this AI-labeled dataset, then run standard RL (PPO) against it.

The feedback model is shown the principle, the prompt, and both candidate responses, and asked something like: Which of these responses is less harmful, more honest, and more in keeping with the principle above? Aggregating over many principles and many prompts yields a preference model that encodes the constitution's values without a single human comparison in the harmlessness data.
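A sketch of how a soft preference can be read from the feedback model, assuming the judge exposes log-probabilities for the two answer tokens "A" and "B". The prompt wording and both function names are illustrative, not taken from the paper.

```python
import math


def build_feedback_prompt(principle, prompt, response_a, response_b):
    """Assemble the multiple-choice question shown to the feedback model."""
    return (
        f"Consider the following conversation:\n{prompt}\n\n"
        f"{principle}\n"
        f"(A) {response_a}\n"
        f"(B) {response_b}\n"
        "The answer is:"
    )


def soft_preference(logprob_a, logprob_b):
    """P(A preferred), from the judge's log-probs for the tokens 'A' and 'B'.

    The two log-probs are renormalized so the result sums to 1 with
    P(B preferred), even if the judge put mass on other tokens.
    """
    m = max(logprob_a, logprob_b)  # subtract the max for numerical stability
    ea = math.exp(logprob_a - m)
    eb = math.exp(logprob_b - m)
    return ea / (ea + eb)


# If the judge puts 0.6 on "A" and 0.2 on "B", the soft label for A is 0.75.
label = soft_preference(math.log(0.6), math.log(0.2))
```

Keeping the probability rather than rounding it to 0 or 1 is what "softened" means here.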

[IMAGE: The two-phase Constitutional AI pipeline as a labeled diagram, phase 1 (SL via critique/revise) feeding phase 2 (RLAIF preference modeling and PPO), with the constitution as a shared input to both]

The architecture below shows where each component sits and which signals flow where.

graph TD
  Con[Constitution<br/>principles] --> FB[Feedback model]
  Con --> CR[Critique/revise module]
  P0[Helpful-only model] --> CR
  CR --> SFT[SL-CAI model]
  SFT --> Gen[Sample response pairs]
  Gen --> FB
  FB --> PM[Preference model]
  PM --> PPO[PPO trainer]
  SFT --> PPO
  PPO --> Pol[RL-CAI policy]
  classDef blue fill:#1e40af,stroke:#3b82f6,stroke-width:1px,color:#fff
  classDef purple fill:#6d28d9,stroke:#a78bfa,stroke-width:1px,color:#fff
  classDef teal fill:#0e7490,stroke:#22d3ee,stroke-width:1px,color:#fff
  classDef slate fill:#334155,stroke:#64748b,stroke-width:1px,color:#e2e8f0
  class Con,P0 blue
  class CR,FB,PM,PPO purple
  class SFT,Pol teal
  class Gen slate

The elegant part is what the human does not do. A person writes the constitution and writes the red-team prompts. After that, the loop runs on model-generated critiques and model-generated preferences. The supervision is real, but it is concentrated in a document a single team can edit in an afternoon rather than spread across a labeling workforce.

Seeing It in Motion

Two views clarify what is actually happening during training. The first is the lifecycle of a single response as it moves through the critique loop.

stateDiagram-v2
  [*] --> Drafted
  Drafted --> Critiqued: sample principle
  Critiqued --> Revised: apply critique
  Revised --> Critiqued: still flawed
  Revised --> Accepted: clean
  Accepted --> [*]

The second is the decision the feedback model makes when it scores a pair, which is where AI judgment substitutes for human judgment.

flowchart TD
  S[Two candidate responses] --> Q{Which better<br/>fits the principle?}
  Q -->|Response A clearer| RA[Prefer A]
  Q -->|Response B safer| RB[Prefer B]
  Q -->|Too close| Soft[Soft label near 0.5]
  RA --> Lab[Preference label]
  RB --> Lab
  Soft --> Lab
  classDef purple fill:#6d28d9,stroke:#a78bfa,stroke-width:1px,color:#fff
  classDef teal fill:#0e7490,stroke:#22d3ee,stroke-width:1px,color:#fff
  classDef amber fill:#b45309,stroke:#fbbf24,stroke-width:1px,color:#fff
  class Q purple
  class Lab teal
  class Soft amber

The soft-label path is more than a detail. Using the feedback model's probability rather than a forced binary choice carries the model's uncertainty into the preference dataset, which makes the resulting preference model better calibrated on genuinely ambiguous cases.
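One way to see the effect is in the loss. The sketch below assumes the preference model assigns a scalar score to each response and predicts P(A preferred) as the sigmoid of the score difference (a Bradley-Terry form, which is an assumption here and not stated in the text), trained with cross-entropy against the soft label.

```python
import math


def preference_loss(score_a, score_b, target_a):
    """Cross-entropy between the soft label and the predicted P(A preferred)."""
    diff = score_a - score_b
    # log sigmoid(diff), computed in a numerically stable way
    if diff >= 0:
        log_p_a = -math.log1p(math.exp(-diff))
    else:
        log_p_a = diff - math.log1p(math.exp(diff))
    log_p_b = log_p_a - diff  # log sigmoid(-diff)
    return -(target_a * log_p_a + (1.0 - target_a) * log_p_b)


# With a soft label of 0.5, the loss is lowest when the scores are equal,
# so the preference model is not pushed to invent a gap on an ambiguous pair.
ambiguous_equal = preference_loss(0.0, 0.0, 0.5)
ambiguous_gap = preference_loss(2.0, 0.0, 0.5)
# With a hard label of 1.0, the same gap is rewarded instead.
hard_equal = preference_loss(0.0, 0.0, 1.0)
hard_gap = preference_loss(2.0, 0.0, 1.0)
```

A forced binary label on a near-tie would train the model toward a confident score gap that the judge never actually expressed.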

[IMAGE: A single harmful prompt traced through the critique-and-revise loop, with the draft, critique text, and revision shown as three stacked cards and an arrow looping back for the second pass]

By the Numbers

The central empirical question is whether AI feedback is as good as human feedback or merely cheaper. The most direct evidence comes from a controlled comparison across three tasks, with all final judgments made by human evaluators (Lee et al., 2023, RLAIF vs. RLHF, arXiv:2309.00267).

Task	RLAIF	RLHF	Notes
Summarization (win rate vs SFT)	71%	73%	Difference not statistically significant
Helpful dialogue (win rate vs SFT)	63%	64%	Difference not statistically significant
Harmless dialogue (% harmless)	88%	76%	RLAIF clearly ahead

The summarization and helpful-dialogue gaps are within noise, which is the headline result: AI feedback is not a degraded substitute. On harmlessness, the AI-labeled model was rated harmless more often, plausibly because a consistent rule applied by a model is more uniform than the judgments of a rotating human labeling pool. The same work showed RLAIF can beat a supervised baseline even when the AI labeler is the same size as the policy being trained, and introduced direct-RLAIF (d-RLAIF), which skips the separate preference model and reads rewards straight from an off-the-shelf model during RL.
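Reading a reward straight from an off-the-shelf model can be sketched as follows. The sketch assumes the judge is prompted to rate a response on an integer scale and exposes a log-probability for each score token; the 1-to-10 scale, the rescaling to [-1, 1], and the function name are assumptions of this example rather than details given above.

```python
import math


def direct_reward(score_logprobs):
    """Turn per-score log-probs into a scalar reward in [-1, 1].

    score_logprobs: dict mapping each integer score to the judge's
    log-probability of emitting that score token.
    """
    lo, hi = min(score_logprobs), max(score_logprobs)
    if lo == hi:
        raise ValueError("need at least two distinct scores")
    m = max(score_logprobs.values())  # for numerical stability
    weights = {s: math.exp(lp - m) for s, lp in score_logprobs.items()}
    total = sum(weights.values())
    # Expected score under the renormalized distribution over score tokens.
    expected = sum(s * w for s, w in weights.items()) / total
    # Rescale from [lo, hi] to [-1, 1].
    return 2.0 * (expected - lo) / (hi - lo) - 1.0


# A judge that is indifferent across scores 1..10 yields a neutral reward.
neutral = direct_reward({s: -2.0 for s in range(1, 11)})
```

Because no preference model is trained, the judge is queried for every sampled response during RL, which moves cost from a one-off labeling pass into the training loop.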

A related line of work quantifies the rule-based variant of this idea. OpenAI's Rule-Based Rewards reported an F1 of 97.1 on a safety-behavior benchmark, against 91.7 for a human-feedback baseline, while using far less human data (Mu, Helyar et al., 2024, Rule Based Rewards for Language Model Safety, arXiv:2411.01111).

Method	Human labels needed	Reward source	Reported safety result
RLHF (InstructGPT-style)	High (tens of thousands of comparisons)	Human preference model	Baseline
Constitutional AI / RLAIF	Low (constitution + red-team prompts)	Model judging vs principles	88% harmless vs 76% (RLAIF vs RLHF)
Rule-Based Rewards	Low (rules + few-shot grader)	Fixed grader scoring rules	F1 97.1 vs 91.7 human baseline

The cost asymmetry is the point. Writing and revising a constitution is bounded human effort. Generating preferences from it scales with compute, which is the resource that does fall in price.

[IMAGE: Grouped bar chart of RLAIF vs RLHF win rates across the three tasks, with the harmlessness bars annotated to highlight the 88% vs 76% gap]

A Concrete Example

Walk one harmful prompt through phase 1 to see the state change. Suppose the red-team prompt is a request for help writing a message to pressure a coworker into covering up a mistake.

The helpful-only model, optimized only to be useful, drafts a persuasive manipulative message. Call this \(y_0\).

A principle is sampled at random from the constitution: Choose the response that is least likely to be used to deceive, manipulate, or pressure another person. The model is asked to critique \(y_0\) against it. Its critique (\(c_0\)) reads, roughly: This message coaches the reader to conceal an error and applies social pressure on a colleague. It facilitates deception and could harm the coworker and the organization.

The model revises. The new response (\(y_1\)) declines to write the pressuring message and instead offers to help draft an honest disclosure of the mistake and a plan to fix it. A second critique pass finds no remaining problem, so \(y_1\) is accepted into the SFT set.

The state of the example through the loop:

Step	Artifact	Content (summarized)
Draft	\(y_0\)	Manipulative message coaching a cover-up
Critique	\(c_0\)	Flags deception and coercion of a colleague
Revision	\(y_1\)	Declines, offers honest-disclosure help instead
Re-critique	\(c_1\)	No remaining harm found
Accept	\(y_1\)	Added to SFT dataset

Now phase 2. For a fresh prompt, the model samples two responses: \(y_A\), which refuses curtly, and \(y_B\), which declines but explains why and offers a constructive alternative. The feedback model is shown both plus the principle and asked which is better. It assigns roughly 0.85 probability to \(y_B\), because the constitution rewards being non-evasive as well as harmless. That 0.85 is the soft preference label. Thousands of such labels train the preference model, and PPO then nudges the policy toward the \(y_B\) style: helpful refusals over blunt ones. The model learns not just to avoid harm but to decline gracefully, and no human ever read either candidate.

Where It Breaks

The failure modes of Constitutional AI are mostly failures of the constitution and the judge, not of the optimizer.

The first is that a vague principle yields a vague reward. "Be helpful" and "avoid harm" pull in opposite directions on many prompts, and if the constitution does not say how to resolve the tension, the feedback model resolves it arbitrarily and inconsistently. The reward signal inherits that inconsistency.

The second is judge bias. The feedback model has its own quirks: a tendency to prefer longer answers, a preference for responses positioned first, and self-preference, the documented habit of models rating text in their own style more highly. These biases are not noise that averages out; they are systematic, so they get baked into the preference model and then into the policy. If the judge subtly favors verbose hedging, the trained model becomes verbose and hedging.
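Position bias is the easiest of these to counteract mechanically: query the judge with both orderings and average. The sketch below assumes a `judge` callable that returns the probability that the first-shown response is better; that signature and the toy `position_biased_judge` are hypothetical and exist only to show the cancellation.

```python
def debiased_preference(judge, principle, prompt, resp_1, resp_2):
    """P(resp_1 preferred), averaged over both presentation orders."""
    p_forward = judge(principle, prompt, resp_1, resp_2)   # resp_1 shown first
    p_backward = judge(principle, prompt, resp_2, resp_1)  # resp_2 shown first
    # p_backward is P(resp_2 preferred), so flip it before averaging.
    return 0.5 * (p_forward + (1.0 - p_backward))


def position_biased_judge(principle, prompt, first, second):
    """Toy judge: a real preference plus a fixed bonus for whatever is first."""
    base = 0.6 if "because" in first else 0.4
    return base + 0.1  # the 0.1 is the position bias


explained = "No, because it is unsafe."
curt = "No."
# Shown first, the explained answer scores 0.7; averaging recovers 0.6.
p = debiased_preference(position_biased_judge, "principle", "prompt", explained, curt)
```

This cancels an additive first-position bonus at the price of doubling judge calls. It does nothing for length bias or self-preference, which have to be measured and handled separately.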

The third is value lock-in and narrowness of authorship. When a constitution is written by one team, that team's blind spots become the model's blind spots, applied at scale to everyone who uses it. This is the explicit motivation behind Collective Constitutional AI, in which roughly 1,000 US adults drafted principles through the Polis deliberation platform; the publicly sourced constitution overlapped only about 50% with the in-house one and leaned more toward promoting good behavior than prohibiting bad (Anthropic, 2023, Collective Constitutional AI: Aligning a Language Model with Public Input).

The fourth is reward hacking, which RLAIF does not cure. The policy can still find outputs that satisfy the letter of a principle while violating its spirit, and now the auditor is a model that may share the policy's blind spots rather than a human who might notice. The KL penalty constrains drift but does not guarantee the judge is measuring what you meant.

[IMAGE: Diagram of judge-model bias propagation, showing length bias and self-preference in the feedback model flowing into the preference model and then the policy]

Alternative Designs

Constitutional AI sits inside a wider design space of how to source and apply a preference signal. The alternatives are not strictly competitors; production systems often combine them.

Approach	Strengths	Weaknesses	Best when
RLHF (human preference model + PPO)	Grounded in real human judgment; well understood	Expensive labels; inconsistent labelers; corrosive for harmful content	Helpfulness and nuanced taste where human judgment is the gold standard
Constitutional AI / RLAIF	Cheap to scale; transparent values in a document; uniform	Quality capped by constitution and judge bias	Harmlessness and any behavior expressible as written principles
DPO	No RL loop, no separate reward model; stable and simple	Still needs preference pairs; less flexible than online RL	Teams wanting RLHF-quality alignment without PPO machinery
Rule-Based Rewards	Precise, auditable rules; tiny human footprint	Rules are brittle outside their authored scope	Safety behaviors with crisp specifications (refuse, do not lecture)
Self-Rewarding LMs	Closed loop; can exceed the labels it started with	Risk of compounding the model's own biases	Pushing past the ceiling of a fixed preference dataset

Direct Preference Optimization deserves special mention because it changed the surrounding machinery. DPO showed that the RLHF objective has a closed-form solution that lets you train directly on preference pairs with a simple classification loss, no reward model and no RL loop (Rafailov et al., 2023, Direct Preference Optimization, arXiv:2305.18290). RLAIF and DPO are orthogonal: RLAIF is about where preferences come from (a model, not a human), DPO is about how you optimize against them (directly, not via PPO). You can run RLAIF-generated preferences through DPO and get the cheapness of AI feedback with the simplicity of direct optimization.
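The classification loss in question is short enough to write out. For a preferred response \(y_w\) and a rejected response \(y_l\), the DPO paper's loss is \(-\log \sigma\big(\beta[\log \tfrac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \log \tfrac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}]\big)\). The sketch below computes it for one pair from sequence log-probabilities; the function name and default \(\beta\) are illustrative, and the pair could just as well have been labeled by an AI judge as by a person.

```python
import math


def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one preference pair, from sequence log-probabilities."""
    # How much more the policy favors each response than the reference does.
    chosen_logratio = policy_logp_chosen - ref_logp_chosen
    rejected_logratio = policy_logp_rejected - ref_logp_rejected
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log sigmoid(margin), computed in a numerically stable way
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Before training, policy == reference, so the margin is 0 and the loss is log 2.
initial = dpo_loss(-12.0, -15.0, -12.0, -15.0)
# Raising the chosen response's log-prob relative to the reference lowers the loss.
improved = dpo_loss(-10.0, -15.0, -12.0, -15.0)
```

The reference log-probabilities play the role the KL penalty plays in the RLHF objective: the loss only rewards moving away from the reference in the direction the preference indicates.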

Self-Rewarding Language Models close the loop entirely: a single model acts as both the response generator and, via LLM-as-judge prompting, its own reward signal, then trains on the preferences it generated and repeats. Three iterations on Llama 2 70B produced a model that outperformed several strong systems on AlpacaEval 2.0 (Yuan et al., 2024, Self-Rewarding Language Models, arXiv:2401.10020). This is RLAIF with the separate, fixed judge removed: the model that grades is the model being trained, so the judge shifts with every iteration. That makes the central risk vivid: a model improving against its own judgment can drift somewhere no human chose.

How It Is Used in Practice

The pattern moved from research into shipping systems quickly. Anthropic's Claude models are trained with Constitutional AI as a core method, and the constitution is a public artifact rather than an internal secret. OpenAI states that Rule-Based Rewards have been part of its safety stack since GPT-4 and GPT-4o, combining a rule-grading model with a helpful-only reward model so that safety behavior is enforced by auditable rules rather than only by collected human comparisons.

In practice the engineering concerns are concrete. The feedback model is an inference cost that scales with the size of the preference dataset, so teams cache judgments, batch them, and sometimes use a smaller judge than the policy. Constitutions are versioned like code, because changing a principle changes the reward and therefore the model's behavior, and you want to be able to attribute a behavior regression to a specific edit. Red-team prompt sets are curated and expanded continuously, since the supervised phase only fixes behaviors that some prompt elicited. And because the judge model's biases propagate, mature pipelines audit the judge separately, checking for length bias and position bias before trusting its labels.
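Caching and versioning fit together naturally if the constitution version is part of the cache key. The sketch below is one hypothetical way to do that; the class, the key layout, and the `judge` signature are assumptions of this example, not a description of any team's tooling.

```python
import hashlib
import json


def judgment_key(constitution_version, principle, prompt, response_a, response_b):
    """Stable key that changes whenever any input to the judgment changes."""
    payload = json.dumps(
        [constitution_version, principle, prompt, response_a, response_b],
        ensure_ascii=False,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class JudgmentCache:
    """In-memory cache of judge outputs, keyed by constitution version."""

    def __init__(self, judge):
        self._judge = judge  # callable: (principle, prompt, a, b) -> P(a preferred)
        self._store = {}
        self.misses = 0      # number of real judge calls made

    def get(self, constitution_version, principle, prompt, response_a, response_b):
        key = judgment_key(constitution_version, principle, prompt,
                           response_a, response_b)
        if key not in self._store:
            self.misses += 1
            self._store[key] = self._judge(principle, prompt, response_a, response_b)
        return self._store[key]


cache = JudgmentCache(lambda principle, prompt, a, b: 0.85)
cache.get("v1", "Be non-evasive.", "prompt", "curt refusal", "explained refusal")
cache.get("v1", "Be non-evasive.", "prompt", "curt refusal", "explained refusal")
cache.get("v2", "Be non-evasive.", "prompt", "curt refusal", "explained refusal")
```

Because the version is part of the key, bumping it forces fresh judgments and leaves the old ones in place, so a behavior regression can be compared against labels produced under the previous constitution.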

The most consequential production shift is philosophical. Anthropic's constitution grew from roughly 2,700 words in 2023 to about 23,000 in 2026, and the change was not just length. The earlier version was largely a list of standalone principles; the new one explains the reasoning behind each, on the stated theory that a model which understands why a behavior is wanted generalizes better to situations the authors never anticipated (Anthropic, 2023, Claude's Constitution; Anthropic, 2026, Claude's new constitution). The document is released under a permissive license as a transparency artifact, so outside readers can distinguish intended behavior from bugs. The supervision is still concentrated in a text file. The bet is that explanation generalizes where enumeration does not.

[IMAGE: Timeline of constitution length and philosophy, 2,700-word list of rules in 2023 expanding to a 23,000-word explained document in 2026]

Insights Worth Remembering

The deepest idea here is that supervision can be compressed. A reward model is a lossy compression of human values learned from examples; a constitution is a different compression, learned from a written specification. Constitutional AI is the claim that for many behaviors the written form is both cheaper and more consistent.

AI feedback is not obviously weaker than human feedback, and on harmlessness it can be stronger, because a rule applied by a model is more uniform than a rule applied by a rotating pool of tired people.

The bottleneck moved but did not disappear. It went from the quantity of human labels to the quality of the constitution and the reliability of the judge. You traded a labor problem for a specification problem.

The judge is now part of the trust boundary. When a model grades a model, the grader's biases become training signal, so auditing the judge matters as much as auditing the data once did.

Constitutional AI made values legible. The thing that decides the model's behavior is a document a person can read, argue with, and edit, which is a real gain for governance even before you ask whether the model follows it perfectly.

The more closed the loop, the higher the stakes. Self-rewarding setups can exceed the ceiling of a fixed dataset, and can also amplify a flaw nobody chose, with no human in the loop to catch it.

Open Questions

It is established that RLAIF matches RLHF on several tasks and that rule-based grading ships in production. What remains genuinely open is how far self-supervision can run before it degrades. Self-Rewarding and Meta-Rewarding results show improvement over a few iterations (Wu et al., 2024, Meta-Rewarding Language Models, arXiv:2407.19594), but whether the loop converges, plateaus, or quietly drifts over many iterations is not settled by current evidence.

Whether AI feedback can supervise a model substantially more capable than the judge, the scalable-oversight question, is unresolved. RLAIF works when the judge is competent on the task; the harder regime is judging outputs a human or a weaker model cannot fully evaluate, and the field does not yet have a reliable method there.

How to legitimately author a constitution is an open governance question as much as a technical one. Collective Constitutional AI showed that public input produces meaningfully different rules, but whose values a globally deployed model should encode, and through what process, has no consensus answer.

The bet on explanation over enumeration in the 2026 constitution is, as of now, a hypothesis. The claim that explained principles generalize better than listed ones is plausible and motivated, but the public evidence that a 23,000-word reasoned document yields measurably better generalization than a concise list is still thin, and worth watching rather than assuming.