Residual Connections and Skip Paths

Before 2015, stacking more layers past a certain depth made networks worse, and not because they overfit. A 56-layer plain convolutional network had higher training error than a 20-layer one (He et al., "Deep Residual Learning for Image Recognition", 2015). That is a paradox: the deeper network can represent everything the shallow one can (set the extra layers to identity) and yet optimisation cannot find that solution. The fix turned out to be almost embarrassingly small. Instead of asking a block to compute an output H(x) from scratch, wire its input straight to its output and ask it to compute only the difference: y = F(x) + x. That single + x is what let networks go from tens of layers to hundreds, and it is the same wire running down the spine of every transformer today.

The degradation problem and the identity shortcut

The observation that motivated ResNet was the degradation problem: as depth increases, accuracy saturates and then degrades rapidly, and the degradation shows up in training error, so it is an optimisation failure, not an over-capacity one. In principle a deep plain network should be able to emulate a shallow one by driving the surplus layers towards the identity mapping. Empirically, stochastic gradient descent on a stack of nonlinear layers struggles to learn identity; pushing a ReLU(Wx + b) block to reproduce its input exactly is a surprisingly awkward target.

Residual learning reframes the target. Let the desired mapping be H(x). Rather than have the stacked layers fit H(x) directly, have them fit the residual F(x) = H(x) - x, and recover the output as:

y = F(x) + x

If the optimal thing a block can do is nothing, the network only has to drive the weights of F towards zero, which regularisation and initialisation already bias it towards. Learning "add a small correction to what you were given" is easier than learning "reconstruct the whole signal", and the + x shortcut carries no parameters, so it costs essentially nothing. With this reformulation, He et al. trained networks of up to 152 layers on ImageNet that improved with depth instead of degrading. As a stress test they also trained a network of over 1,000 layers on CIFAR-10: it optimised without difficulty, although its test error was slightly worse than that of their 110-layer network, which the authors attributed to overfitting.
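
The "do nothing" argument can be sketched in a few lines of plain Python. The helper names (relu, linear, plain_block, residual_block) and the two-dimensional toy input are illustrative assumptions, not code from the paper. With every weight at zero the residual block reproduces its input exactly, while the plain block cannot pass a negative component through its ReLU even when handed identity weights.

```python
def relu(v):
    return [max(0.0, a) for a in v]

def linear(W, b, v):
    # Matrix-vector product plus bias; W is a list of rows.
    return [sum(w * a for w, a in zip(row, v)) + bi for row, bi in zip(W, b)]

def plain_block(W, b, x):
    # Has to produce the whole output H(x) by itself.
    return relu(linear(W, b, x))

def residual_block(W, b, x):
    # Computes only the residual F(x); the shortcut adds x back.
    fx = relu(linear(W, b, x))
    return [f + a for f, a in zip(fx, x)]

x = [1.5, -2.0]
W_zero = [[0.0, 0.0], [0.0, 0.0]]
W_identity = [[1.0, 0.0], [0.0, 1.0]]
b_zero = [0.0, 0.0]

residual_out = residual_block(W_zero, b_zero, x)
plain_zero_out = plain_block(W_zero, b_zero, x)
plain_identity_out = plain_block(W_identity, b_zero, x)

print(residual_out)        # [1.5, -2.0]: zero weights give the identity
print(plain_zero_out)      # [0.0, 0.0]: zero weights destroy the signal
print(plain_identity_out)  # [1.5, 0.0]: the ReLU clips the negative entry
```

For the residual block the identity sits at the point that weight decay and small initialisation already pull towards, which is one way to read the claim that the residual target is easier to hit.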

The shortcut should be an identity wherever possible. When F(x) and x differ in dimension (a downsampling block, say), x has to be projected with a small linear map. The paper found that using projections on every shortcut gave only marginal gains over using them just where the shapes change, so the cheapest effective design is identity shortcuts wherever the shapes match and projections only where necessary.
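
A hedged sketch of the two shortcut types, again in dependency-free Python. The function signature, the optional W_s argument, and the weight values are assumptions made for illustration; the paper's blocks are convolutional and use batch normalisation, which is omitted here.

```python
def relu(v):
    return [max(0.0, a) for a in v]

def linear(W, v):
    # Matrix-vector product; W is a list of rows.
    return [sum(w * a for w, a in zip(row, v)) for row in W]

def residual_block(x, W_f, W_s=None):
    """y = F(x) + shortcut(x). W_s=None means a parameter-free identity shortcut."""
    fx = relu(linear(W_f, x))
    shortcut = x if W_s is None else linear(W_s, x)
    if len(fx) != len(shortcut):
        raise ValueError("F(x) and the shortcut differ in size: a projection is needed")
    return [f + s for f, s in zip(fx, shortcut)]

x = [1.0, 2.0, 3.0]

# Shapes match (3 -> 3): the identity shortcut adds no parameters.
W_same = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
y_same = residual_block(x, W_same)

# Shapes change (3 -> 2): the shortcut needs its own 2x3 projection.
W_down = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
W_proj = [[0.5, 0.5, 0.0], [0.0, 0.5, 0.5]]
y_proj = residual_block(x, W_down, W_proj)

print(y_same)  # [1.0, 2.0, 3.0]
print(y_proj)  # [2.5, 4.5]
```

Calling residual_block(x, W_down) without W_proj raises the ValueError, which is the shape mismatch the projection exists to resolve.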

The gradient highway

The deeper reason residual connections help is what they do to backpropagation. Consider a chain of blocks where block i computes x_i = x_{i-1} + F_i(x_{i-1}). Unrolling from an early activation x_l to a later activation x_L:

x_L = x_l + sum over i from l+1 to L of F_i(x_{i-1})

Differentiate the loss with respect to the early activation x_l. Because of that additive form, the gradient flowing back to x_l contains a term that is the downstream gradient multiplied by 1 (the derivative of the + x path), plus terms routed through the F_i blocks:

dLoss/dx_l = dLoss/dx_L * (1 + d/dx_l [ sum of F_i ])

The leading 1 is the point. In a plain deep network the gradient is a long product of Jacobians; if their singular values sit below one, the product shrinks geometrically and early layers receive almost no signal (vanishing gradients), and if above one it explodes. The identity path adds a term that passes the gradient straight through with a factor of one, so even when the F_i Jacobians are small the early layers still get a clean copy of the downstream gradient. The skip path acts as a highway that gradients travel along without attenuation, while the residual branches contribute their corrections on top. This is why "residual connection", "skip connection", and "shortcut" all name the same wire: it is a shortcut for the forward signal and a highway for the backward gradient at once.
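
The effect of the leading 1 can be seen numerically in the simplest possible setting. The sketch below assumes scalar, linear blocks F_i(x) = w_i * x, which is a deliberate simplification (real blocks are nonlinear and their Jacobians are matrices), and the function names are illustrative.

```python
def plain_gradient(weights):
    # Plain chain: x_i = w_i * x_{i-1}, so dx_L/dx_0 is the product of the w_i.
    g = 1.0
    for w in weights:
        g *= w
    return g

def residual_gradient(weights):
    # Residual chain: x_i = x_{i-1} + w_i * x_{i-1},
    # so dx_L/dx_0 is the product of (1 + w_i).
    g = 1.0
    for w in weights:
        g *= 1.0 + w
    return g

depth = 50
weights = [0.01] * depth  # every block has a small Jacobian

plain = plain_gradient(weights)
residual = residual_gradient(weights)

print(f"plain:    {plain:.3e}")     # roughly 1e-100, effectively no signal
print(f"residual: {residual:.3f}")  # roughly 1.64, the signal survives
```

Expanding the product of (1 + w_i) gives 1 plus terms that route through the blocks, which is the scalar version of the bracket in the formula above. The same expansion shows the skip path is not a cure for explosion: if the w_i are large, the residual product grows as well.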

Residual connections in the transformer

The transformer inherited this wholesale. Each transformer layer is not one function but two sublayers, self-attention and a position-wise feed-forward network (FFN), and each sublayer is wrapped in its own residual connection. The original formulation (Vaswani et al., 2017) is post-norm: the sublayer runs, its output is added to the input, and layer normalisation is applied to the sum.

# Post-norm (original transformer)
x = LayerNorm(x + SelfAttention(x))
x = LayerNorm(x + FFN(x))

Modern large models almost all use pre-norm instead, moving the normalisation inside the residual branch so the skip path stays a clean identity:

# Pre-norm (GPT-2 onward, most current LLMs)
x = x + SelfAttention(LayerNorm(x))
x = x + FFN(LayerNorm(x))

The difference looks cosmetic and is not. In the post-norm layout the normalisation sits on the residual path, so the identity signal is squashed and rescaled at every layer; the clean 1 in the gradient derivation above is no longer clean. Xiong et al. ("On Layer Normalization in the Transformer Architecture", 2020) showed analytically that a post-norm transformer has large, badly scaled gradients near the output layer at initialisation, which is why the original recipe needed a learning-rate warm-up to avoid diverging early in training. Put the normalisation inside the residual block, as pre-norm does, and the gradients are well behaved at initialisation; deep transformers then train stably and often without warm-up at all. As models grew past a dozen layers, pre-norm became the default precisely because it keeps the gradient highway open all the way down the stack.
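
A minimal sketch of why the pre-norm skip path counts as a clean identity, assuming a LayerNorm without learned gain or bias and a hypothetical toy_sublayer standing in for attention or the FFN. When the sublayer contributes nothing, pre-norm returns its input untouched, while post-norm still recentres and rescales it.

```python
import math

def layer_norm(v, eps=1e-5):
    # Normalise to zero mean and unit variance (no learned gain or bias).
    mean = sum(v) / len(v)
    var = sum((a - mean) ** 2 for a in v) / len(v)
    return [(a - mean) / math.sqrt(var + eps) for a in v]

def toy_sublayer(v):
    # Hypothetical stand-in for self-attention or the FFN.
    return [0.1 * math.tanh(a) for a in v]

def zero_sublayer(v):
    # A sublayer that writes nothing to the stream.
    return [0.0] * len(v)

def post_norm_layer(x, sublayer):
    # LayerNorm(x + Sublayer(x)): the skip path passes through the norm.
    return layer_norm([a + s for a, s in zip(x, sublayer(x))])

def pre_norm_layer(x, sublayer):
    # x + Sublayer(LayerNorm(x)): the skip path is an untouched identity.
    return [a + s for a, s in zip(x, sublayer(layer_norm(x)))]

x = [4.0, -1.0, 2.5, 0.5]

print(pre_norm_layer(x, zero_sublayer) == x)   # True
print(post_norm_layer(x, zero_sublayer) == x)  # False
print(pre_norm_layer(x, toy_sublayer))
print(post_norm_layer(x, toy_sublayer))
```

The post-norm output always has zero mean and unit variance regardless of what came in, so the identity signal is re-expressed at every layer rather than carried through unchanged.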

There is a second, more conceptual way to read the transformer's residual connections, from mechanistic interpretability. Anthropic's "A Mathematical Framework for Transformer Circuits" (2021) treats the running x as a residual stream: a shared communication channel that every component (token embeddings, each attention head, each MLP) reads from and writes to via its residual add. Nothing is ever overwritten; each sublayer adds its contribution into subspaces of the stream, and later layers read those contributions back out. Under this view the residual connections are not just an optimisation aid but the actual medium through which information moves through the network, which is why the framework decomposes a transformer into paths through the residual stream rather than into isolated layers.
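
The read-and-add picture can be sketched as follows. The components head_a and mlp_b, the three-dimensional stream, and the choice to give each component its own coordinate are all hypothetical, picked only to make the bookkeeping visible; this is not code from the framework.

```python
def run_stream(embedding, components):
    """Run components in order; each reads the stream and adds its write to it."""
    stream = list(embedding)
    writes = []
    for component in components:
        write = component(stream)  # read from the current stream
        writes.append(write)
        stream = [s + w for s, w in zip(stream, write)]  # add, never overwrite
    return stream, writes

def head_a(s):
    # Hypothetical attention head: reads coordinate 0, writes to coordinate 1.
    return [0.0, 2.0 * s[0], 0.0]

def mlp_b(s):
    # Hypothetical MLP: reads coordinates 0 and 1, writes to coordinate 2.
    return [0.0, 0.0, s[0] + s[1]]

embedding = [1.0, 0.0, 0.0]
final, writes = run_stream(embedding, [head_a, mlp_b])

# The final stream is the embedding plus the sum of every component's write.
reconstructed = [e + sum(w[i] for w in writes) for i, e in enumerate(embedding)]

print(final)          # [1.0, 2.0, 3.0]
print(reconstructed)  # [1.0, 2.0, 3.0]
```

Here mlp_b reads what head_a wrote, and the embedding's own coordinate is still intact at the end, which is the sense in which later layers read earlier contributions back out.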

When it falls down

Pre-norm can let the residual path dominate. Because pre-norm always adds an un-normalised identity, the magnitude of the stream tends to grow with depth while each sublayer's relative contribution shrinks. In very deep pre-norm stacks the later layers can end up making only small corrections, an effect sometimes described as representational or identity dominance, which wastes some of the depth you paid for. Careful residual scaling (for example DeepNorm-style weighting, or the scaled initialisation used in GPT-2) mitigates this.

Post-norm is not strictly worse. It regularises the signal more aggressively and can reach slightly stronger final quality when you can afford the warm-up and tuning to train it, which is why some setups still prefer it. The trade is stability and ease of training (pre-norm) against peak performance under careful tuning (post-norm).

The residual add assumes matching shapes and scales. Skip connections only pass through cleanly when input and output live in the same space at compatible magnitudes. Cross-dimension shortcuts need a projection, and mixing sublayers whose outputs differ wildly in scale from the stream can destabilise the sum.

Residual paths do not fix rank or diversity collapse on their own. Deep transformers can still drift towards representations where token vectors become increasingly similar across depth; the skip connection keeps gradients flowing but does not guarantee the representations stay expressive. Normalisation choice, attention design, and initialisation all still matter.
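
The first of these failure modes, stream growth under pre-norm, is easy to reproduce in a toy. The sketch below assumes a LayerNorm without learned parameters and a hypothetical sublayer that simply halves its normalised input, so every update points along the stream. That is a worst case chosen for clarity: real sublayers write in varying directions and the growth is slower.

```python
import math

def layer_norm(v, eps=1e-5):
    # Normalise to zero mean and unit variance (no learned gain or bias).
    mean = sum(v) / len(v)
    var = sum((a - mean) ** 2 for a in v) / len(v)
    return [(a - mean) / math.sqrt(var + eps) for a in v]

def norm(v):
    return math.sqrt(sum(a * a for a in v))

def toy_sublayer(h):
    # Hypothetical sublayer: its input is normalised, so its output
    # magnitude is roughly the same at every depth.
    return [0.5 * a for a in h]

x = [1.0, -2.0, 0.5, 0.5]
stream_norms, ratios = [], []
for _ in range(8):
    update = toy_sublayer(layer_norm(x))
    stream_norms.append(norm(x))
    ratios.append(norm(update) / norm(x))
    x = [a + u for a, u in zip(x, update)]  # pre-norm residual add

for depth, (s, r) in enumerate(zip(stream_norms, ratios)):
    print(f"layer {depth}: |stream| = {s:.2f}, |update| / |stream| = {r:.2f}")
```

The update size stays near constant while the stream keeps growing, so each later layer changes the stream by a smaller fraction. Down-weighting the residual branch, in the spirit of the scaling schemes mentioned above, is one way to slow that drift.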
