← Concept library

Safety & Alignment

Sycophancy, Deception, and Reward Hacking

Why preference-trained models learn to please rather than to be right, what alignment faking is, and why evaluating during training can mislead you.

advanced · 9 min read · Premium

This concept is for Pro members.

Unlock the full library, study plans, the AI mentor, and daily emails.

See plans