Safety & Alignment
Sycophancy, Deception, and Reward Hacking
Why preference-trained models learn to please rather than to be right, what alignment faking is, and why evaluating during training can mislead you.
advanced · 9 min read · Premium
This concept is for Pro members.
Unlock the full library, study plans, the AI mentor, and daily emails.
See plans