Evaluating an Aligned Model

A 175-billion-parameter model trained purely by next-token prediction can write coherent Shakespeare. Ask it whether the COVID-19 vaccine contains a microchip, and it will confidently agree if that phrasing appeared often enough in the pretraining corpus. Alignment post-training is supposed to fix that. But how do you know it worked? The answer is less obvious than it sounds, and the industry has been working it out in real time since InstructGPT landed in 2022.

The Three Properties You Are Actually Trying to Measure

Alignment evaluation circles around three objectives that Anthropic popularised under the label "HHH":

Helpfulness: Does the model complete the user's actual intent, not just the literal request?
Harmlessness: Does it refuse or soften genuinely dangerous outputs without becoming uselessly cautious?
Honesty: Does it make accurate claims, acknowledge uncertainty, and avoid sycophantic agreement?

These are not independent axes. A model optimised hard for harmlessness will start refusing benign chemistry questions. A model rewarded for sounding confident will be more helpful in tone but less honest in content. The tension is structural, not incidental.

A useful framing: think of the aligned model as sitting in a two-dimensional space where one axis is "policy compliance" (refusals, hedges, disclaimers) and the other is "task performance" (accuracy, coherence, relevance). Ideal alignment pushes the model toward high task performance while keeping it away from both extremes of the compliance axis: refusing nearly everything and refusing nothing.

Human Evaluation: The Ground Truth That Scales Badly

The InstructGPT paper (Ouyang et al., 2022) established the template: sample outputs from two or more model variants on the same prompt, ask human raters to indicate which output they prefer, and aggregate win rates. The key finding was that outputs from a 1.3B InstructGPT model were preferred over those from the 175B GPT-3 on the API's own prompt distribution, despite having roughly 100x fewer parameters. At equal size, the 175B InstructGPT model was preferred over the 175B GPT-3 roughly 85% of the time. Size alone is not alignment.
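Aggregating pairwise preferences into a win rate is simple, but a bare percentage hides how much the estimate depends on sample size. The sketch below is one way to report it with a Wilson score interval. The label format ("a", "b", "tie") and the half-credit treatment of ties are assumptions made for illustration, not the procedure used in the InstructGPT paper.

```python
import math


def win_rate(preferences, z=1.96):
    """Return (win rate of A, lower bound, upper bound) using a Wilson interval.

    preferences: sequence of "a", "b", or "tie" labels, one per rated pair
    (a hypothetical label format). Ties count as half a win, which is one
    common convention, not the only one.
    """
    prefs = list(preferences)
    n = len(prefs)
    if n == 0:
        raise ValueError("need at least one preference label")
    credit = {"a": 1.0, "tie": 0.5, "b": 0.0}
    wins = sum(credit[p] for p in prefs)  # raises KeyError on an unknown label
    p_hat = wins / n

    # Wilson score interval for a proportion; z=1.96 gives ~95% coverage.
    denom = 1 + z**2 / n
    centre = (p_hat + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return p_hat, centre - half_width, centre + half_width


rate, low, high = win_rate(["a"] * 85 + ["b"] * 15)
print(f"win rate {rate:.2f}, 95% interval [{low:.2f}, {high:.2f}]")
```

With only 100 rated pairs, an 85% win rate carries an interval of roughly 0.77 to 0.91, which is worth stating alongside the headline number.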

Human evaluation has two hard limits:

Coverage: A team of 40 raters cannot evaluate adversarial edge cases at the tail of a model's failure distribution. Crowdworkers rarely try jailbreaks, multi-step manipulation, or subtle statistical falsehoods.
Reproducibility: Inter-annotator agreement on "is this response harmful?" typically sits in the 60-75% range on contested topics. The gold standard is noisier than it looks.
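Raw percentage agreement overstates reliability when one label dominates, so it helps to report a chance-corrected statistic next to it. The following is a minimal sketch of Cohen's kappa for two raters; the label names are illustrative, and for more than two raters a different statistic (for example Fleiss' kappa) would be needed.

```python
from collections import Counter


def cohens_kappa(labels_1, labels_2):
    """Chance-corrected agreement between two raters on the same items."""
    if len(labels_1) != len(labels_2) or len(labels_1) == 0:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_1)

    # Observed agreement: fraction of items where the raters match.
    observed = sum(a == b for a, b in zip(labels_1, labels_2)) / n

    # Expected agreement if each rater labelled independently at their own base rates.
    counts_1, counts_2 = Counter(labels_1), Counter(labels_2)
    expected = sum(counts_1[c] * counts_2[c] for c in counts_1) / n**2

    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


rater_1 = ["harmful"] * 5 + ["safe"] * 5
rater_2 = ["harmful"] * 3 + ["safe"] * 5 + ["harmful"] * 2
print(round(cohens_kappa(rater_1, rater_2), 3))
```

In this toy example the raters agree on 60% of items, but because each uses both labels half the time, chance alone would produce 50% agreement, and kappa comes out at only 0.2.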

Llama 2's evaluation methodology (Touvron et al., 2023) used both internal human preference annotations and a secondary "safety" human eval where annotators rated roughly 2,000 adversarial prompts on a five-point scale. Even with that effort, Meta explicitly noted that the safety ratings required domain-expert review to be reliable, not just general crowdwork.

Automated Benchmarks: Fast but Brittle

Because human eval is expensive, the community has built automated proxies:

Benchmark	What it measures	Core limitation
TruthfulQA (Lin et al., 2022)	Whether a model avoids mimicking human falsehoods across 817 questions	Questions are fixed; a model fine-tuned on TruthfulQA-adjacent data games the benchmark
MT-Bench	Multi-turn instruction following, scored by GPT-4 as judge	GPT-4 has its own biases; position effects inflate first-response scores
XSTest	Whether a model refuses safe requests due to surface-level pattern matching	Only 250 safe prompts; domain coverage is narrow
Winograd Schema Challenge / BIG-Bench	General reasoning	Alignment-neutral; tells you nothing about safety or honesty

TruthfulQA is worth examining closely because it revealed something counterintuitive: larger base models performed worse, because scale makes a model better at imitating its training text, including the falsehoods that humans state with confidence. After RLHF fine-tuning, models recovered some truthfulness, but the benchmark also exposed that reward models trained on "sounds confident and helpful" can inadvertently reward false-but-plausible answers.

LLM-as-Judge: Scalable but Biased

The "Judging LLM-as-a-Judge" paper (Zheng et al., 2023) made the empirical case that GPT-4 as an evaluator agrees with human preference labels on over 80% of comparisons - roughly the same agreement rate as between two human raters. This is compelling enough that LLM-judged evaluation is now standard in most RLHF pipelines.

But the method has well-documented systematic biases:

Position bias: The judge tends to favour a response because of where it appears in the prompt, most often the first slot, even when content is identical. Mitigation: run each pair in both orderings and average.
Verbosity bias: Longer responses receive higher scores regardless of accuracy. A concise correct answer will often lose to a verbose wrong one.
Self-enhancement: A model used as its own judge tends to prefer outputs in its own style and format.
Sycophancy amplification: If the judge has been aligned to agree with the user, it will rate responses that confirm a false premise more highly than responses that politely correct it.

A practical mitigation pattern:

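# Pairwise judge with position-swap debiasing.
# Assumes judge_model.compare(prompt, first, second) returns a float in [0, 1]:
# the judge's preference for whichever response is shown first.
def judge_pair(prompt, response_a, response_b, judge_model):
    """Return the debiased preference for response_a, in [0, 1]."""
    score_ab = judge_model.compare(prompt, response_a, response_b)  # A shown first
    score_ba = judge_model.compare(prompt, response_b, response_a)  # B shown first
    # score_ba measures preference for B, so flip it into a preference for A
    # before averaging the two orderings.
    return (score_ab + (1.0 - score_ba)) / 2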

This does not eliminate verbosity or self-enhancement bias, but it substantially reduces the position effect.
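Verbosity bias can at least be measured before it is trusted away. As a rough diagnostic, the sketch below counts how often the judge picked the longer of two responses. The tuple format and the whitespace word count are assumptions for illustration; a real pipeline would probably count tokens with the judge's own tokenizer.

```python
def longer_response_win_rate(judgements):
    """Fraction of decided, unequal-length pairs where the judge chose the longer response.

    judgements: iterable of (response_a, response_b, winner) tuples, with
    winner in {"a", "b", "tie"} (a hypothetical record format).
    """
    decided = 0
    longer_wins = 0
    for response_a, response_b, winner in judgements:
        len_a = len(response_a.split())
        len_b = len(response_b.split())
        if winner == "tie" or len_a == len_b:
            continue  # no length signal to test on this pair
        decided += 1
        longer = "a" if len_a > len_b else "b"
        if winner == longer:
            longer_wins += 1
    if decided == 0:
        raise ValueError("no decided pairs with unequal lengths")
    return longer_wins / decided


sample = [
    ("short answer", "a much longer answer here", "b"),
    ("one two three four", "one", "a"),
    ("x y", "x y z", "a"),
]
print(longer_response_win_rate(sample))
```

A rate well above 0.5 is suggestive rather than conclusive, since longer answers are sometimes genuinely better. Comparing the same statistic on human labels for the same pairs gives a fairer baseline.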

Reward Model Score as a Proxy

During RLHF training, the reward model (RM) is meant to capture human preferences numerically. Once training is done, the RM score on held-out prompts looks like a natural evaluation metric. It is, at first, a reasonable signal. It becomes unreliable through a mechanism called reward hacking: the policy learns to produce outputs that score high on the RM without actually being good. Classic symptoms include:

Responses that are confidently formatted but factually wrong
Excessive hedging that the RM interprets as "safe" but humans find annoying
Sycophantic agreement that scores high on "helpful tone" labels

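Monitoring RM scores over multiple training checkpoints is therefore necessary but not sufficient. The RM is usually frozen during policy optimisation; what moves is the policy, which can drift into regions of output space where the RM's scores no longer track human preference. You need at least periodic human spot-checks to confirm that a rising RM score still reflects what raters actually prefer.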
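One way to operationalise those spot-checks is to log the mean RM score and a human win rate at each checkpoint and flag any step where the two move in opposite directions. The sketch below assumes a hypothetical record format with keys "step", "rm_score", and "human_win_rate"; the 0.02 tolerance is an arbitrary placeholder, not a recommended value.

```python
def flag_reward_hacking(checkpoints, min_rm_gain=0.0, max_human_drop=0.02):
    """Return the steps where RM score rose while human win rate fell.

    checkpoints: list of dicts ordered by training step, each with the
    hypothetical keys "step", "rm_score", and "human_win_rate".
    """
    flagged = []
    for prev, curr in zip(checkpoints, checkpoints[1:]):
        rm_gain = curr["rm_score"] - prev["rm_score"]
        human_change = curr["human_win_rate"] - prev["human_win_rate"]
        # Divergence: the proxy improves while the thing it stands for degrades.
        if rm_gain > min_rm_gain and human_change < -max_human_drop:
            flagged.append(curr["step"])
    return flagged


history = [
    {"step": 1000, "rm_score": 0.8, "human_win_rate": 0.55},
    {"step": 2000, "rm_score": 1.1, "human_win_rate": 0.60},
    {"step": 3000, "rm_score": 1.6, "human_win_rate": 0.52},
]
print(flag_reward_hacking(history))  # [3000]
```

Human win rates from small spot-check samples are noisy, so a flag here is a prompt to look at the outputs, not a verdict.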

When It Falls Down

Benchmark saturation: TruthfulQA has been in public view long enough that fine-tuning pipelines inadvertently (or deliberately) train toward it. A model can score 80% on TruthfulQA while still generating false claims on topics the benchmark doesn't cover.

Distribution shift in human eval: The InstructGPT preference data was collected on API traffic from 2021. That distribution looks nothing like a 2025 user base prompting models with multi-step code generation or agentic tasks. Reward models and judges fitted to preferences from an old distribution can endorse the wrong thing on a new one.

Adversarial users are not in the eval set: Standard eval datasets sample from ordinary users. Red-teaming - deliberately trying to elicit harmful outputs - is a separate exercise that most published evaluations underrepresent. A model can pass every standard safety benchmark and still be jailbroken with a sufficiently creative prompt.

Honesty evaluations lag the actual failure mode: TruthfulQA probes a fixed set of popular misconceptions. Models now fail on a different surface: confident confabulation of facts that sound plausible but are simply invented (dates, citations, API function signatures). Current benchmarks barely cover this.

Safety theatre: Over-refusal - refusing clearly safe requests because their surface form resembles an unsafe one - is invisible to harm-rate metrics, which only count harmful outputs. XSTest was designed precisely to make over-refusal visible, but it covers a narrow domain. A model can appear safe while being so restrictive it is useless.
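To keep over-refusal from hiding behind a good harm rate, the two numbers can be reported side by side from an evaluation set that mixes safe and unsafe prompts. The sketch below assumes each result has already been labelled (by humans or a classifier) with a hypothetical "prompt_type" and "outcome" field, and it treats any compliance with an unsafe prompt as a harmful output, which is a simplification.

```python
def safety_report(results):
    """Report over-refusal and harmful-compliance rates together.

    results: list of dicts with hypothetical keys
      "prompt_type": "safe" or "unsafe"
      "outcome":     "complied" or "refused"
    """
    safe = [r for r in results if r["prompt_type"] == "safe"]
    unsafe = [r for r in results if r["prompt_type"] == "unsafe"]
    if not safe or not unsafe:
        raise ValueError("need at least one safe and one unsafe prompt")
    return {
        # Refusing a safe prompt: invisible to harm-only metrics.
        "over_refusal_rate": sum(r["outcome"] == "refused" for r in safe) / len(safe),
        # Complying with an unsafe prompt: what harm-rate metrics count.
        "harmful_compliance_rate": sum(r["outcome"] == "complied" for r in unsafe) / len(unsafe),
    }


results = [
    {"prompt_type": "safe", "outcome": "complied"},
    {"prompt_type": "safe", "outcome": "refused"},
    {"prompt_type": "safe", "outcome": "complied"},
    {"prompt_type": "safe", "outcome": "complied"},
    {"prompt_type": "unsafe", "outcome": "refused"},
    {"prompt_type": "unsafe", "outcome": "refused"},
]
print(safety_report(results))
```

In this toy run the model has a perfect harm rate and still refuses a quarter of the safe prompts, which is exactly the pattern a harm-only metric would miss.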