Safety & Alignment

Mechanistic Interpretability Primer

How sparse autoencoders extract human-interpretable features from model activations, what circuit-level analysis buys you for safety, and where the science is still contested.

advanced · 10 min read · Premium
