RLAIF and Constitutional Feedback

Anthropic's 2022 Constitutional AI paper trained a model to be helpful and harmless using zero human labels for harmful content. The entire preference signal came from another language model reading a plain-text list of principles and deciding which of two candidate responses was less harmful. That is the kernel of RLAIF: replace the human rater with an AI rater.

The appeal is immediate. Human labelling is slow, expensive, and psychologically costly for the labellers who must read harmful material at scale. An AI rater runs at inference speed, costs fractions of a cent per comparison, and can be audited by reading its instruction prompt. The question, of course, is whether the AI rater is trustworthy enough to produce useful training signal - and that is where Constitutional AI's two-phase design does real work.

What RLAIF Changes (and What It Inherits)

Standard RLHF has three stages: supervised fine-tuning (SFT), reward-model training on human preference pairs, and RL optimisation against the reward model. RLAIF swaps the human annotation in stage two for an LLM that reads a prompt containing the context, two candidate responses labelled A and B, and a question asking which response is better.

The LLM outputs a probability distribution over {A, B}. Those soft probabilities become the preference labels that train the reward model. Everything downstream - reward-model architecture, PPO loop, KL penalty against the reference policy - is identical to RLHF. RLAIF is a drop-in substitution at the labelling stage, not a new RL algorithm.
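As a minimal sketch of how the judge's output might become a soft label, assume we can read the judge's log-probabilities for the answer tokens "A" and "B". The function names here are illustrative, and the Bradley-Terry form of the reward-model loss is a common choice rather than something this article specifies.

```python
import math


def soft_preference(logprob_a: float, logprob_b: float) -> float:
    """Return P(A preferred), renormalised over the two answer tokens."""
    m = max(logprob_a, logprob_b)  # subtract the max for numerical stability
    ea = math.exp(logprob_a - m)
    eb = math.exp(logprob_b - m)
    return ea / (ea + eb)


def soft_label_loss(reward_a: float, reward_b: float, p_a: float) -> float:
    """Cross-entropy between the soft label p_a and sigmoid(reward_a - reward_b).

    Assumes a Bradley-Terry style reward model (an assumption of this sketch).
    """
    diff = reward_a - reward_b
    # log(sigmoid(diff)), computed without overflow for either sign of diff
    if diff >= 0:
        log_sig_pos = -math.log1p(math.exp(-diff))
    else:
        log_sig_pos = diff - math.log1p(math.exp(diff))
    # log(sigmoid(-diff)) = log(sigmoid(diff)) - diff
    log_sig_neg = log_sig_pos - diff
    return -(p_a * log_sig_pos + (1.0 - p_a) * log_sig_neg)
```

With a hard human label, p_a would be 0 or 1. The soft label keeps the judge's uncertainty, so a near-tie between A and B contributes only a weak gradient to the reward model.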

A Google DeepMind study first posted in 2023 (Lee et al., ICML 2024) tested RLAIF and RLHF head-to-head on summarisation and dialogue. The two methods achieved comparable win rates against supervised fine-tuning baselines, and a variant called direct-RLAIF - where the reward signal comes straight from the LLM during RL training rather than from a separately trained reward model - matched or exceeded the canonical RLAIF setup. The result suggests that much of RLHF's value comes from the structure of the preference signal rather than specifically from human judgement.
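One way a direct reward could be read off an LLM is sketched below. It assumes the judge is asked to rate a response on an integer scale and that we can read a log-probability for each score token. The 1-10 scale, the rescaling to [-1, 1], and the function name are choices made for this sketch and should not be taken as the study's exact recipe.

```python
import math
from typing import Dict


def direct_reward(score_logprobs: Dict[int, float], low: int = 1, high: int = 10) -> float:
    """Turn per-score log-probabilities into a scalar reward in [-1, 1].

    score_logprobs maps each integer score to the judge's log-probability
    for that score token (hypothetical input format).
    """
    m = max(score_logprobs.values())  # stabilise the softmax
    weights = {s: math.exp(lp - m) for s, lp in score_logprobs.items()}
    total = sum(weights.values())
    # Probability-weighted average score
    expected = sum(s * w for s, w in weights.items()) / total
    # Rescale linearly so that `low` maps to -1 and `high` maps to +1
    return 2.0 * (expected - low) / (high - low) - 1.0
```

Because the reward is computed at RL time, no separate reward model is trained, at the price of one judge call per sampled response.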

Constitutional AI: Adding a Written Ruleset

Constitutional AI (CAI) goes one step further. Rather than giving the AI judge one fixed instruction, it gives the judge a constitution: a short document of 10-20 numbered principles covering helpfulness, honesty, harm avoidance, and non-deceptiveness, from which one principle is sampled for each comparison. The Anthropic team published their principles alongside the paper; the constitution they later published for Claude draws from the UN Universal Declaration of Human Rights, Apple's terms of service, and their own alignment research.

CAI's training pipeline has two distinct phases.

Phase 1 - Supervised Learning from Self-Critique (SL-CAI)

Prompt a helpful-only SFT model with a potentially harmful query and sample its initial response.
Ask the model to critique its own response against a randomly sampled constitutional principle.
Ask the model to revise the response to address the critique.
Repeat the critique-revise loop up to a small fixed number of times (typically 1-4 rounds).
Fine-tune on the final revised responses.
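The steps above can be sketched as a short loop. This is an illustration under assumptions: `generate` stands in for any text-completion call, the prompt wording is invented for the sketch, and the default of two rounds is arbitrary.

```python
import random
from typing import Callable, List, Optional


def critique_and_revise(
    generate: Callable[[str], str],   # hypothetical model call: prompt -> text
    query: str,
    principles: List[str],
    rounds: int = 2,
    rng: Optional[random.Random] = None,
) -> str:
    """Return the final revised response after `rounds` critique-revise passes."""
    rng = rng or random.Random()
    response = generate(query)  # initial, possibly harmful, response
    for _ in range(rounds):
        principle = rng.choice(principles)  # one principle sampled per round
        critique = generate(
            f"Query: {query}\nResponse: {response}\n"
            f"Critique the response according to this principle: {principle}"
        )
        response = generate(
            f"Query: {query}\nResponse: {response}\nCritique: {critique}\n"
            "Rewrite the response to address the critique."
        )
    return response
```

The (query, final response) pairs collected this way form the fine-tuning set for the last step; the intermediate critiques are discarded.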

The SL-CAI model after this phase is more conservative than the raw SFT model but much less evasive than a model trained to simply refuse everything.

Phase 2 - Reinforcement Learning from AI Feedback (RL-CAI)

Sample pairs of responses from the SL-CAI model.
Ask a feedback model to compare each pair against a constitutional principle, producing a soft preference label.
Train a preference model (PM) on these AI-labelled pairs.
Run PPO against the PM, with a KL penalty that keeps the policy from drifting too far from the reference policy.
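The role of the KL penalty in the last step can be made concrete with a small sketch. It assumes the commonly used shaping in which the PM score is reduced by a coefficient times a sample-based KL estimate (the summed difference of per-token log-probabilities). The function name and the default `beta` are illustrative, not values from the paper.

```python
from typing import Sequence


def kl_shaped_reward(
    pm_score: float,
    policy_logprobs: Sequence[float],  # log pi(token | context) under the current policy
    ref_logprobs: Sequence[float],     # same tokens scored by the frozen reference policy
    beta: float = 0.1,                 # penalty strength (assumed value)
) -> float:
    """Return the PM score minus beta times a sample-based KL estimate."""
    if len(policy_logprobs) != len(ref_logprobs):
        raise ValueError("log-prob sequences must cover the same tokens")
    # Summed log-ratio over the sampled tokens estimates KL(policy || reference)
    kl_estimate = sum(p - r for p, r in zip(policy_logprobs, ref_logprobs))
    return pm_score - beta * kl_estimate
```

When the policy assigns its samples much higher probability than the reference does, the estimate grows and the shaped reward falls, which discourages the policy from exploiting quirks of the PM far from the reference distribution.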

The feedback prompt explicitly cites the principle being evaluated, so the preference signal is interpretable - you can audit why the PM learnt to prefer one response over another by reading the constitution it was evaluated against.
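A minimal sketch of such a feedback prompt follows. The template wording and the function name are hypothetical; the point is only that the sampled principle is embedded in the prompt and returned alongside it, so each label can be logged with the principle that produced it.

```python
import random
from typing import List, Optional, Tuple


def build_feedback_prompt(
    query: str,
    response_a: str,
    response_b: str,
    principles: List[str],
    rng: Optional[random.Random] = None,
) -> Tuple[str, str]:
    """Return (prompt, principle) for one pairwise comparison."""
    rng = rng or random.Random()
    principle = rng.choice(principles)  # one principle per comparison
    prompt = (
        f"Consider the following query:\n{query}\n\n"
        f"{principle}\n\n"
        f"(A) {response_a}\n"
        f"(B) {response_b}\n\n"
        "The answer is: ("
    )
    return prompt, principle
```

Storing the returned principle with each preference label is what makes the later audit possible: a surprising label can be traced to the exact wording the judge was shown.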

Why the Constitution Matters

A raw AI judge (RLAIF without a constitution) has no explicit specification of what "better" means. Its preferences reflect whatever biases are baked into its pretraining and RLHF history. A constitution externalises the specification into plain text. That has three practical consequences.

Auditability. Disagreements about model behaviour can be traced to a specific principle. If the model refuses a request that should be allowed, you can check whether any principle plausibly covers it - and edit the constitution if not.

Consistency. Sampling a random principle per comparison introduces variance, but across thousands of comparisons the distribution covers the full constitution. This is preferable to a monolithic "is this response better?" prompt, which collapses all principles into a single unconstrained preference.

Iteration speed. Updating the model's values requires editing a text file, not re-running a human annotation campaign. Principle additions, removals, and rewording can be tested cheaply before committing to a full RL run.
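The coverage claim in the consistency point can be checked with a little arithmetic. The sketch below assumes principles are sampled uniformly and independently for each comparison, which the text implies but does not state; the function names are illustrative.

```python
def prob_principle_never_sampled(num_principles: int, num_comparisons: int) -> float:
    """Probability that one given principle is never drawn (uniform sampling assumed)."""
    return (1.0 - 1.0 / num_principles) ** num_comparisons


def expected_uses_per_principle(num_principles: int, num_comparisons: int) -> float:
    """Expected number of comparisons that cite any one principle."""
    return num_comparisons / num_principles
```

For a constitution of 16 principles, the chance that a given principle is missed shrinks geometrically with the number of comparisons, so after thousands of comparisons every principle has, in practice, been applied many times.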

The table below compares the labelling strategies:

Approach	Label source	Auditability	Cost at scale
RLHF	Human raters	Low (rater reasoning opaque)	High
Raw RLAIF	LLM (no explicit principle)	Medium	Very low
Constitutional AI (RL-CAI)	LLM + written constitution	High	Very low

When It Falls Down

Constitution blind spots. Principles are written by researchers with specific cultural, professional, and linguistic backgrounds. Harms that fall outside the framers' experience may be absent from the constitution entirely. A model trained on such a constitution will be systematically less cautious about those blind-spot categories, with no obvious signal that anything is wrong.

Principle conflicts. Helpfulness and harm avoidance pull in opposite directions on a large fraction of real queries. The constitution does not specify a resolution order. The feedback model resolves conflicts however its own pretraining disposes it to - which may differ from what the constitution's authors intended.

Compounding model errors. The critique model, the revision model, and the feedback model are all LLMs that can hallucinate, misapply principles, or add subtle biases. Errors compound across the pipeline. A misleading critique leads to a poorly revised response, which becomes training data, which shifts the SL-CAI model, which generates the pairs the feedback model later compares. There is no hard error-correction mechanism at any stage.

Over-refusal. Empirically, CAI-trained models tend to be more cautious than the underlying human-preference signal would justify. The self-critique phase can spiral: the model critiques a safe response as potentially problematic because a principle could be read to cover it, then revises away useful content. This over-refusal can persist into the RL phase if the preference model also inherits a cautious prior.

Reward model generalisation. The PM is trained on AI-labelled comparisons drawn from a limited prompt distribution. Out-of-distribution queries - unusual languages, niche technical domains, adversarial jailbreak formats - may receive unreliable preference scores. This is a standard reward-hacking risk in all RL-from-feedback pipelines, not unique to RLAIF, but the absence of human review of the labels removes one error-catching layer.

AI judge size sensitivity. Lee et al. (2024) found that using a smaller AI judge than the policy being trained can produce noisy, uninformative preferences. The AI judge needs to be at least as capable as the model it is evaluating for the labels to carry useful signal. This partially limits the cost savings: if you are training a frontier-class model, your AI judge must itself be frontier-class.
