Safety & Alignment

Alignment Evaluations and Frontier-Model Risk

How frontier labs and governments measure dangerous capabilities, what an eval-gated release looks like, and where the regulatory regime sits in 2026.

intermediate · 8 min read

In 2023 a frontier model shipped when the lab decided it was ready. By 2026 a frontier model ships when (a) the lab's internal capability evals come in under threshold, (b) external evaluators (UK AISI, US AI Safety Institute Consortium) have had pre-deployment access, and (c) the deployment matches the lab's published safety policy. This section is a map of the regime: what gets measured, by whom, and what "eval-gated release" actually means.

Dangerous-capability evaluations: what gets measured

Frontier labs converged on four high-stakes capability domains:

  1. CBRN uplift. Chemical, biological, radiological, nuclear. Can the model meaningfully help a non-expert plan or execute mass-casualty harm? Measured with closed-vocabulary expert evals and benchmarked against textbook / search baselines.
  2. Cyber offence. Vulnerability discovery, exploit development, autonomous compromise of test networks. CTF-style harnesses, increasing in realism.
  3. Autonomy and agentic capability. Can the model self-replicate, acquire resources, persist across reboots, or conduct long-horizon tasks without supervision? METR-style task suites are the current standard.
  4. AI R&D acceleration. Can the model meaningfully speed up the research that produces stronger models? This one is mostly internal-lab and not yet standardised.
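
"Uplift" in item 1 is a comparison against a baseline, not an absolute score. A minimal sketch of that arithmetic, assuming expert-graded scores in [0, 1] for two groups of participants; the function name `uplift` and all numbers are hypothetical and not taken from any published eval:

```python
from statistics import mean

def uplift(assisted_scores, baseline_scores):
    """Mean score with model access minus mean score with baseline resources only."""
    return mean(assisted_scores) - mean(baseline_scores)

# Hypothetical expert-graded scores in [0, 1]; illustrative only, not real eval data.
baseline = [0.20, 0.25, 0.30, 0.25]  # participants with textbook / search access
assisted = [0.30, 0.35, 0.40, 0.35]  # participants who also had model access

delta = uplift(assisted, baseline)
print(f"uplift = {delta:.2f}")
```

A real study would also report uncertainty (sample sizes here are far too small to conclude anything), but the point is that the quantity of interest is the difference between the two groups.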

Each lab also runs misuse evals (the AILuminate-style hazard suite covered earlier) and alignment evals (sycophancy, deception, scheming).

Anthropic Responsible Scaling Policy and ASL levels

Anthropic's RSP defines AI Safety Levels (ASL), analogous to biosafety levels:

Level | Capability threshold | Required safeguards
------|----------------------|--------------------
ASL-1 | No meaningful misuse risk (small models) | Standard practices
ASL-2 | Today's frontier models. Evidence of dangerous capability but not uplifting attackers meaningfully | Current deployment + security baseline
ASL-3 | Provides meaningful uplift on CBRN or has substantial autonomy | Hardened deployment (jailbreak-resistance bar), enhanced security (insider threat, weight protection), red-teaming
ASL-4 | Substantial autonomy or near-expert CBRN uplift | Stronger versions of ASL-3; specifics being defined
ASL-5 | Defined later | Defined later

The policy is thresholds-and-commitments: cross a capability threshold, deployment is paused until corresponding safeguards are demonstrated. As of mid-2026 the RSP is at v3.3, with the headline change being mandatory publication of Frontier Safety Roadmaps and quantitative Risk Reports.
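
The thresholds-and-commitments logic can be read as a set comparison: the safeguards a level requires, minus the safeguards already demonstrated. The sketch below assumes that reading; the safeguard labels are hypothetical shorthand loosely paraphrasing the table above, not terms from the RSP itself.

```python
# Hypothetical safeguard labels, loosely paraphrasing the ASL table; not RSP wording.
REQUIRED_SAFEGUARDS = {
    "ASL-2": {"security_baseline"},
    "ASL-3": {
        "security_baseline",
        "jailbreak_resistance",
        "weight_protection",
        "red_teaming",
    },
}

def missing_safeguards(level, demonstrated):
    """Return the safeguards still owed before deployment at this level."""
    return REQUIRED_SAFEGUARDS[level] - set(demonstrated)

def may_deploy(level, demonstrated):
    """Deployment stays paused while any required safeguard is missing."""
    return not missing_safeguards(level, demonstrated)

demonstrated = {"security_baseline", "red_teaming"}
print(may_deploy("ASL-2", demonstrated))                  # True
print(sorted(missing_safeguards("ASL-3", demonstrated)))  # ['jailbreak_resistance', 'weight_protection']
```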

OpenAI Preparedness Framework

OpenAI's analogous framework, as first published in 2023, tracks four risk categories (Cybersecurity, CBRN, Persuasion, Model Autonomy) and rates each model on a Low / Medium / High / Critical scale per category, both pre- and post-mitigation. A model rated above "Medium" post-mitigation in any category cannot be deployed. A model rated above "High" post-mitigation cannot be developed further until safeguards close the gap. The framework has been updated several times since 2023, generally toward more specific eval criteria and faster cadence, and the tracked categories have changed across versions, so check the current text before relying on this list.
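
The two gating rules in the paragraph above reduce to comparing the worst post-mitigation rating against two cut-offs. A minimal sketch, assuming an ordinal encoding of the four-point scale; the `Risk` enum, the `gate` function, and the example ratings are all hypothetical and not part of OpenAI's framework:

```python
from enum import IntEnum

class Risk(IntEnum):
    LOW = 0
    MEDIUM = 1
    HIGH = 2
    CRITICAL = 3

def gate(post_mitigation):
    """Apply both rules to a dict mapping category -> post-mitigation Risk."""
    worst = max(post_mitigation.values())
    return {
        "can_deploy": worst <= Risk.MEDIUM,          # nothing above Medium
        "can_develop_further": worst <= Risk.HIGH,   # nothing above High
    }

# Hypothetical ratings for illustration only.
scores = {
    "cybersecurity": Risk.MEDIUM,
    "cbrn": Risk.HIGH,
    "persuasion": Risk.LOW,
    "model_autonomy": Risk.MEDIUM,
}
decision = gate(scores)
print(decision)  # {'can_deploy': False, 'can_develop_further': True}
```

Because the rule applies to the single worst category, one High rating is enough to block deployment even when every other category is Low.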

UK AISI, US AISIC, and pre-deployment evaluations

The UK AI Security Institute (AISI, formerly the AI Safety Institute) is a state-backed research body inside the Department for Science, Innovation and Technology. It conducts technical capability evals on frontier models, including under pre-deployment access agreements with Anthropic, OpenAI, Google DeepMind, and Meta. Its published reports cover dangerous-capability batteries on the most recent frontier releases.

The US AI Safety Institute Consortium (AISIC), housed at NIST, plays an analogous coordinating role. Together with bilateral arrangements (the UK-US MoU on AI safety) and the broader AI Safety Institute network, this is the closest thing to an international evaluation regime.

Critically: external eval access does not mean external veto. The labs retain release decisions. The institutes provide capability findings; the labs decide what to do with them.

EU AI Act and high-risk classification

The EU AI Act (Regulation (EU) 2024/1689, in force since August 2024) classifies AI systems into prohibited, high-risk, limited-risk, and minimal-risk tiers. Separately, frontier general-purpose AI models trained above a compute threshold (currently 10^25 FLOPs) are presumed to have "systemic risk" and are subject to additional obligations: model evaluations, adversarial testing, incident reporting, cybersecurity protections, and energy-use reporting.
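
To get a feel for where the 10^25 FLOPs line falls, the sketch below uses the common rule of thumb that training a dense transformer costs roughly 6 FLOPs per parameter per token. That heuristic is an assumption brought in for illustration and is not how the Act defines or measures compute; the model sizes and token counts are hypothetical.

```python
SYSTEMIC_RISK_FLOPS = 1e25  # threshold quoted in the text

def estimate_training_flops(n_params, n_tokens):
    """Rough dense-transformer estimate: about 6 FLOPs per parameter per token."""
    return 6 * n_params * n_tokens

# Hypothetical models, for illustration only.
for name, n_params, n_tokens in [
    ("hypothetical-70B", 70e9, 15e12),
    ("hypothetical-400B", 400e9, 15e12),
]:
    flops = estimate_training_flops(n_params, n_tokens)
    print(name, f"{flops:.1e}", flops > SYSTEMIC_RISK_FLOPS)
# hypothetical-70B 6.3e+24 False
# hypothetical-400B 3.6e+25 True
```

Under this estimate a 70B-parameter model trained on 15 trillion tokens lands just under the threshold, while a 400B-parameter model on the same data lands well above it.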

For most LLM deployments the practical impact is in the high-risk application categories (employment, education, law enforcement, critical infrastructure), where deployment requires conformity assessments, technical documentation, post-market monitoring, and human oversight.

What an eval-gated release looks like in practice

A simplified picture of a 2026 frontier release:

  1. Capability eval pre-training-completion. Track scaling curves; project where dangerous-capability thresholds will be hit.
  2. Internal red-team and capability eval. Jailbreak resistance, CBRN uplift, autonomous-task benchmarks, persuasion, alignment evals.
  3. External red-team. UK AISI, US AISIC, sometimes contracted academic teams. Pre-deployment access under NDA.
  4. Threshold check. Are post-mitigation scores below the policy threshold for the corresponding ASL / Preparedness level? If not, mitigate (refusal training, capability surgery, deployment restrictions) and re-test.
  5. Safety case publication. Model card with capability scores, mitigations applied, known residual risks.
  6. Deployment with monitoring. Live abuse monitoring, jailbreak telemetry, incident response. Patches via classifier updates and minor fine-tunes.
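
Step 4 is a loop rather than a single check: evaluate, mitigate whatever fails, re-test. A toy sketch of that loop, assuming each eval returns a numeric score where lower is safer; `release_gate`, the score names, the thresholds, and the fixed 0.15 reduction per mitigation round are all hypothetical and stand in for real eval harnesses and mitigations:

```python
def release_gate(run_evals, apply_mitigation, thresholds, max_rounds=3):
    """Re-test after each mitigation round until every score is under its threshold."""
    for round_no in range(max_rounds + 1):
        scores = run_evals()
        failing = {k: v for k, v in scores.items() if v >= thresholds[k]}
        if not failing:
            return {"release": True, "rounds": round_no, "scores": scores}
        if round_no < max_rounds:
            apply_mitigation(failing)
    return {"release": False, "rounds": max_rounds, "scores": scores}

# Toy stand-ins: scores live in a dict, and each mitigation round lowers a
# failing score by a fixed amount. Real mitigations are not this predictable.
state = {"cbrn_uplift": 0.5, "cyber": 0.2}
thresholds = {"cbrn_uplift": 0.3, "cyber": 0.4}

def run_evals():
    return dict(state)

def apply_mitigation(failing):
    for name in failing:
        state[name] = round(state[name] - 0.15, 2)

result = release_gate(run_evals, apply_mitigation, thresholds)
print(result["release"], result["rounds"])  # True 2
```

The `max_rounds` cap matters: without it, a model that never gets under threshold would loop forever instead of producing an explicit "do not release" outcome.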

The bar is moving. In 2023, step 3 did not exist for any lab; by 2026 it is standard for the four largest. Whether the bar moves fast enough to keep up with capability is the open question every paragraph in this section dances around.

What works, what doesn't

  • Works. Pre-deployment external eval (catches things labs missed). Published safety policies with thresholds (forces accountability). Capability ablation via unlearning-style fine-tuning that targets dangerous-knowledge data (real, measurable harm reduction).
  • Partially works. Voluntary lab commitments (no enforcement teeth beyond reputation). National AISIs (limited budgets, limited compute access, depend on lab cooperation).
  • Does not work. Self-attestation without external check. Compute-threshold-only regulation (capability per FLOP keeps improving). Assuming a passed eval generalises to off-distribution use.

Further reading