Mechanistic Interpretability Primer

Mechanistic interpretability tries to reverse-engineer neural networks into human-understandable algorithms. The goal is not "the model attended to these tokens" - that is attention visualisation, which is shallow and often misleading. It is closer to "this set of neurons, organised this way, implements a Python-like operation that recognises city names and then routes to a country-lookup circuit." The hope is that if you can read the program, you can audit it for deception, hidden goals, or unsafe capabilities.

This is a young field, and several of its core claims are contested. Treat anything below labelled "finding" as well-supported and anything labelled "interpretation" as the field's current best guess.

Features as directions in activation space

A neuron is not a feature. Most neurons in a transformer are polysemantic - they fire for several unrelated concepts (DNA bases, HTML tags, French words, all in one neuron). Polysemanticity is thought to arise because a model has to represent many more features than it has dimensions, a phenomenon Anthropic call superposition.
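To make the "more features than dimensions" idea concrete, here is a minimal sketch using NumPy. It is an illustration under stated assumptions, not how any real model stores features: the directions are random unit vectors, and the sizes (512 features, 128 dimensions) are hypothetical. It shows that many directions can share a small space with only modest overlap, so a single active feature can still be read back with a dot product.

```python
import numpy as np

def random_feature_directions(n_features, n_dims, seed=0):
    """Sample n_features random unit vectors in an n_dims-dimensional space."""
    rng = np.random.default_rng(seed)
    w = rng.standard_normal((n_features, n_dims))
    return w / np.linalg.norm(w, axis=1, keepdims=True)

def interference(directions):
    """Absolute cosine similarity between every pair of distinct directions."""
    cos = directions @ directions.T
    off_diagonal = ~np.eye(len(directions), dtype=bool)
    return np.abs(cos[off_diagonal])

# Hypothetical sizes: 512 features packed into a 128-dimensional space.
directions = random_feature_directions(n_features=512, n_dims=128)
overlap = interference(directions)
print(f"mean |cos| = {overlap.mean():.3f}, max |cos| = {overlap.max():.3f}")

# Activate a single feature and read every feature back with a dot product.
hidden_state = 1.0 * directions[7]
readout = directions @ hidden_state
others = np.abs(np.delete(readout, 7))
print(f"feature 7 reads {readout[7]:.2f}; largest other readout is {others.max():.2f}")
```

The non-zero readouts on the other features are the interference that superposition trades for capacity; it stays tolerable only when few features are active at once.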

A feature in mech-interp is a direction in activation space (a linear combination of neurons) that corresponds to a single human-interpretable concept. Examples: the Eiffel Tower feature, the legal-disclaimer feature, the sycophancy feature.

feature_activation = w_feature . hidden_state

If you can find the right w_feature, you can probe for that concept's presence, monitor it during inference, and intervene on it (clamp it to zero, amplify it, swap it).
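The dot product and the interventions just described can be sketched in a few lines of NumPy. This is a toy illustration: the 4-dimensional vectors are made up, the helper names are hypothetical, and w_feature is assumed to be unit-norm so that moving along it changes the activation by exactly the amount added.

```python
import numpy as np

def feature_activation(w_feature, hidden_state):
    """Project the hidden state onto the feature direction (a dot product)."""
    return float(np.dot(w_feature, hidden_state))

def set_feature(w_feature, hidden_state, target):
    """Move hidden_state along w_feature until its activation equals target.

    Assumes w_feature is unit-norm. target=0.0 ablates the feature;
    a large target amplifies it.
    """
    current = feature_activation(w_feature, hidden_state)
    return hidden_state + (target - current) * w_feature

# Made-up 4-dimensional example; real hidden states have thousands of dimensions.
w_feature = np.array([0.6, 0.0, 0.8, 0.0])       # unit-norm direction
hidden_state = np.array([1.0, 2.0, 0.5, -1.0])

print(feature_activation(w_feature, hidden_state))   # 0.6*1.0 + 0.8*0.5

ablated = set_feature(w_feature, hidden_state, target=0.0)
amplified = set_feature(w_feature, hidden_state, target=10.0)
print(feature_activation(w_feature, ablated), feature_activation(w_feature, amplified))
```

Monitoring during inference amounts to calling feature_activation on each hidden state and comparing it against a threshold of your choosing.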

Sparse autoencoders as the feature-extraction tool

How do you find the directions? You train a sparse autoencoder on hidden states:

1. Collect millions of activation vectors from a fixed layer.
2. Train an autoencoder h -> sparse_code -> h_reconstructed with a sparsity penalty on the code.
3. Each code dimension becomes a candidate feature.
4. Inspect each feature by finding the input examples that maximally activate it.

The sparsity penalty is the load-bearing trick: it forces the autoencoder to represent each input with few active features, which empirically disentangles superposition.
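The recipe above can be sketched as follows. This is a minimal sketch under several assumptions, not the architecture from any particular paper: a ReLU encoder, a linear decoder, an L1 penalty as the sparsity term, and random data standing in for real activations. The class and function names are hypothetical, and the training loop (gradient descent on the loss, normally done with an autodiff framework) is omitted.

```python
import numpy as np

class SparseAutoencoder:
    """Minimal SAE: h -> ReLU(h W_enc + b_enc) -> code W_dec + b_dec."""

    def __init__(self, n_dims, n_features, seed=0):
        rng = np.random.default_rng(seed)
        self.w_enc = rng.standard_normal((n_dims, n_features)) * 0.1
        self.b_enc = np.zeros(n_features)
        self.w_dec = rng.standard_normal((n_features, n_dims)) * 0.1
        self.b_dec = np.zeros(n_dims)

    def encode(self, h):
        # ReLU keeps the code non-negative, so "inactive" means exactly zero.
        return np.maximum(0.0, h @ self.w_enc + self.b_enc)

    def decode(self, code):
        return code @ self.w_dec + self.b_dec

    def loss(self, h, l1_coeff=1e-3):
        code = self.encode(h)
        reconstruction_error = np.mean(np.sum((self.decode(code) - h) ** 2, axis=1))
        sparsity_penalty = np.mean(np.sum(np.abs(code), axis=1))  # L1 on the code
        return reconstruction_error + l1_coeff * sparsity_penalty

def top_activating_examples(sae, activations, feature_index, k=5):
    """Indices of the k inputs on which one code dimension fires most strongly."""
    code = sae.encode(activations)
    return np.argsort(code[:, feature_index])[::-1][:k]

# Random stand-in for activation vectors collected from a fixed layer.
rng = np.random.default_rng(1)
activations = rng.standard_normal((1000, 16))
sae = SparseAutoencoder(n_dims=16, n_features=64)

print("untrained loss:", sae.loss(activations))
print("top examples for feature 3:", top_activating_examples(sae, activations, 3))
```

Note that the code is wider than the input (64 features for 16 dimensions here); the sparsity penalty is what stops the autoencoder from using that extra width to copy its input through.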

Anthropic's "Towards Monosemanticity" (Bricken et al, 2023) demonstrated this on a one-layer transformer: SAEs extracted thousands of features that humans could label, far more interpretable than the original neurons.

"Scaling Monosemanticity" (Templeton et al, 2024) scaled the technique to Claude 3 Sonnet, a production model. They trained SAEs of several sizes, the largest with ~34 million features. Many of the extracted features were safety-relevant: features for sycophantic praise, deception and manipulation, bioweapon information, power-seeking, unsafe code. Some features could be steered - clamping the Golden Gate Bridge feature to a high value produced "Golden Gate Claude," a model that obsessively self-identified as the bridge. In principle, the same intervention works for safety-relevant features.
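One way to picture clamping is as an edit to the hidden state along the chosen feature's decoder direction. The sketch below assumes a ReLU-encoder SAE and uses a made-up 3-dimensional, 2-feature example; the function name and weights are hypothetical, and this is not a description of the exact procedure used in the paper.

```python
import numpy as np

def clamp_feature(hidden_state, w_enc, b_enc, w_dec, feature_index, value):
    """Return hidden_state with one SAE feature's activation forced to `value`.

    Only the chosen feature's contribution changes; whatever the SAE
    fails to reconstruct is left in place.
    """
    code = np.maximum(0.0, hidden_state @ w_enc + b_enc)   # SAE encoder (ReLU)
    delta = value - code[feature_index]
    return hidden_state + delta * w_dec[feature_index]

# Made-up SAE: feature 0 reads and writes dimension 0, feature 1 dimension 1.
w_enc = np.array([[1.0, 0.0],
                  [0.0, 1.0],
                  [0.0, 0.0]])
b_enc = np.zeros(2)
w_dec = np.array([[1.0, 0.0, 0.0],
                  [0.0, 1.0, 0.0]])

hidden_state = np.array([0.5, 2.0, -1.0])
steered = clamp_feature(hidden_state, w_enc, b_enc, w_dec, feature_index=0, value=10.0)
print(steered)
```

The steered vector would then replace the original hidden state at that layer, and the rest of the forward pass runs as usual.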

Induction heads and circuit-level claims

Finding (Olsson et al, "In-context Learning and Induction Heads", 2022): in small transformers with at least two layers, specific pairs of attention heads, one in an earlier layer and one in a later layer, together implement an induction pattern - "if the current token is X and earlier the sequence had X followed by Y, predict Y." The circuit:
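As a purely behavioural illustration of the quoted rule (this is not the attention-head mechanism, only the input-output pattern it is described as computing), here is a minimal sketch. The function name is hypothetical, and picking the most recent earlier occurrence of X is an assumption; real heads attend softly over all positions.

```python
def induction_prediction(tokens):
    """Predict the next token by the rule "... X Y ... X -> Y".

    Finds the most recent earlier occurrence of the current (last) token
    and returns the token that followed it, or None if there is none.
    """
    if not tokens:
        return None
    current = tokens[-1]
    # Scan earlier positions from most recent to oldest.
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == current:
            return tokens[i + 1]
    return None

sequence = ["The", "Golden", "Gate", "Bridge", "is", "near", "the", "Golden"]
print(induction_prediction(sequence))   # prints "Gate"
```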
