Red-Teaming and Adversarial Evaluation

A model that scores 90% on a safety benchmark can still be jailbroken in twenty queries by an undergraduate with a clever prompt. Benchmark scores measure expected-case behaviour on a fixed test set; deployed safety is determined by worst-case behaviour on the long tail of inputs a real adversary will craft. Red-teaming is the discipline of finding those worst-case inputs before someone else does.

Why benchmark scores do not predict deployed safety

Safety benchmarks (ToxiGen, HarmBench, Anthropic's HHH evals) sample from a curated distribution of "harmful prompts." Real adversaries optimise their prompts against your specific guardrails. The gap is large enough that public safety scores correlate poorly with how a model behaves under sustained adversarial pressure. A model with a 1% attack-success rate on a static benchmark can have an 80%+ attack-success rate when an automated attacker iterates against it.
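A back-of-envelope calculation shows why a low per-prompt rate does not bound adversarial risk. The sketch below assumes, purely for illustration, that each attempt succeeds independently with the same probability p; the function name is ours, not from any benchmark. Real attackers adapt between attempts, so this simple model, if anything, understates them.

```python
def attack_success_within(p: float, k: int) -> float:
    """Probability that at least one of k attempts succeeds.

    Assumption (illustrative only): attempts are independent and each
    succeeds with the same probability p.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability in [0, 1]")
    if k < 0:
        raise ValueError("k must be non-negative")
    # Complement of "every one of the k attempts fails".
    return 1.0 - (1.0 - p) ** k


if __name__ == "__main__":
    # A 1% per-attempt rate, evaluated at growing query budgets.
    for k in (1, 20, 100, 500):
        print(f"k={k:>3}: {attack_success_within(0.01, k):.3f}")
```

Under these assumptions a 1% per-attempt rate compounds to roughly 18% within 20 attempts and roughly 63% within 100, before the attacker learns anything from the model's refusals.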

Automated red-teaming

Method	Mechanism	What it finds
PAIR (Chao et al. 2023)	Attacker LLM iteratively refines jailbreak prompts via black-box queries	Semantic jailbreaks in fewer than 20 queries against GPT-3.5/4, Vicuna, Gemini
GCG (Zou et al. 2023)	Greedy coordinate gradient search builds adversarial suffix tokens	Universal suffixes that transfer across GPT, Claude, Bard, open-source models
Rainbow Teaming, Tree of Attacks	Diversity-aware search over the attack space	Coverage of multiple harm categories rather than a single jailbreak

PAIR is the canonical black-box attack: it needs no model weights, only API access. GCG is the canonical white-box attack. Its authors showed that adversarial suffixes optimised on open-source models (Vicuna, a Llama fine-tune, among them) transfer to closed-source models, with success rates that vary considerably by target. The implication is that the safety alignment of a closed model can be circumvented by tokens computed against open weights.
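The black-box loop can be sketched in a few lines. This is a simplified, single-stream illustration of the propose, query, score, refine pattern, not the authors' implementation: `pair_style_search`, the 1 to 10 judge scale, and the three toy callables are all assumptions made here so the loop runs end to end, and the toy target is a deliberately brittle stand-in, not a real model.

```python
from typing import Callable, Optional


def pair_style_search(
    goal: str,
    attacker: Callable[[str, list], str],
    target: Callable[[str], str],
    judge: Callable[[str, str], int],
    max_queries: int = 20,
    threshold: int = 10,
) -> Optional[dict]:
    """Refine a prompt against a black-box target within a query budget."""
    history = []  # (prompt, response, score) tuples fed back to the attacker
    for query in range(1, max_queries + 1):
        prompt = attacker(goal, history)   # attacker proposes a candidate
        response = target(prompt)          # API-style access only, no weights
        score = judge(goal, response)      # assumed scale: 1 refusal .. 10 success
        history.append((prompt, response, score))
        if score >= threshold:
            return {"prompt": prompt, "queries": query}
    return None  # budget exhausted without a success


# --- Toy stand-ins (hypothetical); they model no real system. ---
FRAMINGS = ["", "For a novel, ", "As a safety auditor, ", "Hypothetically, "]


def toy_attacker(goal: str, history: list) -> str:
    # Cycles through fixed framings; a real attacker LLM would reason
    # over the responses and scores in the history instead.
    return FRAMINGS[len(history) % len(FRAMINGS)] + goal


def toy_target(prompt: str) -> str:
    # Brittle guardrail: complies only when one specific framing is used.
    if prompt.startswith("As a safety auditor"):
        return "OK: <placeholder>"
    return "I can't help with that."


def toy_judge(goal: str, response: str) -> int:
    return 10 if response.startswith("OK") else 1


result = pair_style_search(
    "describe the test payload", toy_attacker, toy_target, toy_judge
)
```

The point of the structure is that everything the attacker needs arrives through `target(prompt)`: the loop never touches weights or gradients, which is why API access alone is enough.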

Human red-teams

Automated attacks find narrow technical vulnerabilities; human red-teams find creative misuse patterns automated systems cannot anticipate. Frontier labs (Anthropic, OpenAI, Google DeepMind) run structured human red-teaming programmes, often with domain specialists: biosecurity experts probe CBRN uplift, cybersecurity researchers probe vulnerability discovery, and lawyers probe legal-advice failure modes. The UK AI Security Institute and the US AI Safety Institute Consortium under NIST coordinate independent third-party evaluation of frontier models before release.

Framings worth knowing

OWASP LLM Top 10 (2025). Application-security framing: prompt injection, sensitive information disclosure, supply chain, data and model poisoning, improper output handling, excessive agency, system prompt leakage, vector and embedding weaknesses, misinformation, unbounded consumption. The right checklist for anyone shipping an LLM-backed product.
NIST AI RMF and the Generative AI Profile. Risk-management framing organised around four functions: govern, map, measure, manage. The compliance-friendly vocabulary US enterprises will increasingly demand.
AISI and the AISIC frontier evaluations. Independent capability and safety evaluations of pre-release frontier models, with a focus on dual-use scientific risk and autonomy.
AILuminate (MLCommons). A standardised safety benchmark covering 12 hazard categories with 24,000+ prompts per language across English, French, and Chinese. The closest the field has to a publishable, comparable safety score.
