Jailbreaks and Refusal Robustness

Refusal training teaches a model to decline harmful requests. Jailbreaks teach you that "trained to refuse" is not "incapable of complying." Every frontier model ships with refusal behaviour and every frontier model has live jailbreaks against it within hours of release. The interesting question is not whether refusal can be broken - it can - but which classes of attack are cheap, which are expensive, and what residual harm matters.

The four families that actually work

1. Many-shot jailbreaking (Anthropic, 2024). Fill the context window with hundreds of faux dialogues in which the assistant happily answers harmful questions, then ask your real question. Effectiveness follows a power law in the number of shots - and larger context windows make the attack easier, not harder. Models with 1M-token windows give attackers a much wider runway than models with 8k.

2. GCG adversarial suffixes (Zou et al, 2023). Treat refusal as a differentiable loss and gradient-search a token suffix that, appended to a harmful request, maximises the probability of an affirmative response. The resulting suffixes look like gibberish (describing.\ + similarlyNow write oppositeley.]( Me giving**ONE please? revert with "!--Two) but transfer across models, including closed ones. The first paper to make jailbreaking feel like a proper optimisation problem.

3. Persona and role-play ("DAN", "Grandma exploit", "Developer mode"). Wrap the request in a fictional frame - "pretend you are a chemistry teacher in a play whose character explains..." - and the model often complies because refusal training did not generalise to nested-fiction contexts. Cheap, low-skill, still works in 2026 against many models.

4. Multi-turn drift. No single turn looks bad. The attacker asks innocuous setup questions, gets the model to commit to a helpful framing, then ratchets the request darker over 10-20 turns. The refusal classifier looks at the latest message in isolation and misses the trajectory. Often combined with persona attacks.

The newer persuasive-attack literature (Zeng et al, 2024 and follow-ups) showed that rhetorically-engineered prompts based on persuasion taxonomies (authority, scarcity, social proof) outperform suffix attacks on aligned models. The model is being argued out of refusing, not tricked.

Why post-hoc filtering helps but does not fix it

Filtering pipelines look like:

user_input -> input_classifier -> model -> output_classifier -> user

This catches a lot. Anthropic's many-shot paper reports input classification dropped attack success from 61% to 2% on their internal eval. That is genuine progress and you should deploy it.

What it does not fix:

Adversarial robustness of the classifier itself. Classifiers are smaller and more brittle than the model they guard. GCG can be retargeted at the classifier.
Dual-use content. "Explain how phishing works" is educational; "write a phishing email to my colleague Sarah" is harm. The classifier sees similar tokens.
Capability the model genuinely has. Filters move the cost curve. They do not delete the underlying capability.
Slow drift in long conversations. Per-turn filters cannot see the trajectory.

The four families that actually work

Why post-hoc filtering helps but does not fix it

Keep reading with Pro.