Activation Functions

Stack a hundred linear layers and you have built one linear layer. Linear maps are closed under composition: W_3 (W_2 (W_1 x)) is just (W_3 W_2 W_1) x, a single matrix. Every parameter you added collapses into that one matrix. The activation function is the one component that breaks this collapse; the pointwise nonlinearity between layers is the entire reason a deep network can represent something a shallow one cannot. Everything else (attention, normalisation, residuals) is plumbing around that fact. The history of activations is the history of finding a nonlinearity that stays expressive without strangling the gradient that has to flow back through it.
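
The collapse is easy to check by hand. The sketch below is a dependency-free illustration, assuming three small weight matrices chosen arbitrarily for the example (they come from no real model): it compares three stacked linear layers against the single product matrix, then inserts ReLU between the layers and tests a property every linear map must have, f(-x) = -f(x).

```python
def matvec(A, x):
    """Apply matrix A (a list of rows) to vector x."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def relu_vec(v):
    """Pointwise ReLU on a vector."""
    return [max(0.0, vi) for vi in v]

# Arbitrary illustrative weights (an assumption for this example only).
W1 = [[1.0, -1.0], [2.0, 0.0]]
W2 = [[0.0, 1.0], [1.0, 1.0]]
W3 = [[1.0, 2.0], [-1.0, 1.0]]
x = [1.0, 2.0]

# Three stacked linear layers...
stacked = matvec(W3, matvec(W2, matvec(W1, x)))
# ...give the same output as one layer whose weight is W3 W2 W1.
collapsed = matvec(matmul(W3, matmul(W2, W1)), x)

def relu_net(v):
    """The same three layers with ReLU between them."""
    return matvec(W3, relu_vec(matvec(W2, relu_vec(matvec(W1, v)))))

# A linear map satisfies f(-x) = -f(x), so f(x) + f(-x) would be all zeros.
pos = relu_net(x)
neg = relu_net([-xi for xi in x])
odd_gap = [p + n for p, n in zip(pos, neg)]

print(stacked, collapsed)  # identical
print(odd_gap)             # nonzero: no single matrix reproduces relu_net
```

The nonzero `odd_gap` is the whole argument in miniature: once a pointwise nonlinearity sits between the layers, the stack is no longer any single matrix.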

The saturating units and why gradients vanish

The first generation borrowed from biology and statistics. Sigmoid squashes any real input into (0, 1):

sigmoid(x) = 1 / (1 + e^-x)

Tanh is the same shape rescaled to (-1, 1) and zero-centred, which is why it is usually preferred of the two. Both share a fatal property for deep networks: they saturate. Once |x| is large the curve flattens, and the derivative goes to zero. For sigmoid the derivative is sigmoid(x)(1 - sigmoid(x)), which peaks at 0.25 (at x = 0) and collapses toward zero at both tails.

Backpropagation multiplies these derivatives layer by layer. If each sigmoid layer contributes a factor of at most 0.25, then ten layers deep the activation derivatives alone have scaled the gradient by at most 0.25^10, roughly 10^-6. (The weight matrices enter the same product, but they would need to be large to offset it, and large weights push units further into saturation.) The early layers receive almost no learning signal; they train at a crawl or not at all. This is the vanishing-gradient problem, and it is why networks stayed shallow for years. Saturating activations were a large part of the cause.
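
The arithmetic is worth running once. This is a minimal sketch that looks only at the activation derivatives and deliberately ignores the weight matrices, which also enter the real backward pass; the depth of ten and the sample input of 10.0 are illustrative choices.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    """Derivative of sigmoid: sigmoid(x) * (1 - sigmoid(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

peak = sigmoid_grad(0.0)    # the largest the derivative ever gets: 0.25
tail = sigmoid_grad(10.0)   # a saturated unit: derivative is tiny

depth = 10
best_case = peak ** depth   # every layer sitting exactly at the peak
saturated = tail ** depth   # every layer saturated

print(f"peak derivative:          {peak}")
print(f"derivative at x = 10:     {tail:.2e}")
print(f"best case over {depth} layers: {best_case:.2e}")
print(f"saturated over {depth} layers: {saturated:.2e}")
```

Even the best case, with every unit parked at x = 0, loses about six orders of magnitude across ten layers; saturated units are far worse.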

ReLU and the dying-ReLU problem

The rectified linear unit is embarrassingly simple:

ReLU(x) = max(0, x)

It changed everything. For positive inputs the derivative is exactly 1, so gradients pass through undiminished no matter how deep the stack; the vanishing-gradient factor is gone on the active path. It is cheap (a single comparison), and it produces sparse activations (with roughly zero-centred pre-activations, about half the units output zero for a given input), which is often cited as a further benefit. ReLU is what made the first wave of very deep convolutional networks trainable.

It has one characteristic failure. Because the gradient for any negative input is exactly zero, a unit that gets pushed into the negative region for every training example stops receiving gradient entirely. It can never recover, since it never updates. It is dead. A large learning rate or an unlucky initialisation can kill a sizeable fraction of the units in a layer, permanently reducing capacity. This is the dying-ReLU problem. The patches are all about giving the negative region a nonzero slope: Leaky ReLU uses a fixed small slope (say 0.01), PReLU learns that slope, and ELU curves smoothly toward a negative asymptote (its gradient fades for very negative inputs but never reaches exactly zero). Each keeps a trickle of gradient alive on the negative side.
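
A scalar sketch of the three functions and their derivatives makes the difference concrete. The slope of 0.01, the ELU alpha of 1.0, and the toy unit with weight 1.0 and bias -5.0 are illustrative assumptions; PReLU is the leaky form with `slope` treated as a learned parameter, so it is not written out separately.

```python
import math

def relu(x):
    return max(0.0, x)

def relu_grad(x):
    return 1.0 if x > 0 else 0.0

def leaky_relu(x, slope=0.01):
    return x if x > 0 else slope * x

def leaky_relu_grad(x, slope=0.01):
    return 1.0 if x > 0 else slope

def elu(x, alpha=1.0):
    return x if x > 0 else alpha * (math.exp(x) - 1.0)

def elu_grad(x, alpha=1.0):
    return 1.0 if x > 0 else alpha * math.exp(x)

# A hypothetical unit whose bias has been pushed far negative:
# its pre-activation is below zero for every input in the dataset.
inputs = [0.0, 0.25, 0.5, 0.75, 1.0]
w, b = 1.0, -5.0
pre = [w * xi + b for xi in inputs]

# Total gradient signal reaching the unit's weights, summed over the data.
relu_signal = sum(relu_grad(z) for z in pre)         # exactly zero: dead
leaky_signal = sum(leaky_relu_grad(z) for z in pre)  # small but nonzero
elu_signal = sum(elu_grad(z) for z in pre)           # small but nonzero

print(relu_signal, leaky_signal, elu_signal)
```

With plain ReLU the unit receives no gradient from any example, so nothing can ever move its bias back; the leaky and ELU versions leave it a way to recover.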

GELU: a smooth, probabilistic gate

ReLU is a hard gate: it multiplies its input by 0 or 1 depending on the sign. GELU (Gaussian Error Linear Unit; Hendrycks and Gimpel, 2016) replaces that hard switch with a smooth, probabilistic one. It weights the input by the probability that a standard normal variable is less than it:

GELU(x) = x * Phi(x)

where Phi is the standard Gaussian cumulative distribution function. For very negative x, Phi(x) is near zero and the input is suppressed; for very positive x it is near one and the input passes; around zero it interpolates smoothly rather than kinking. The intuition is a stochastic regulariser made deterministic: instead of dropping a unit with some probability, you scale it by the probability of keeping it.

The practical payoff is a curve that is differentiable everywhere, with no region of exactly zero gradient and a small negative response for mildly negative inputs (the output and its slope still fade toward zero for very negative x). In practice it tends to train more stably than ReLU on transformers. GELU became the default feed-forward activation in BERT and the original GPT models, and a tanh-based approximation is often used where evaluating the exact erf is slow. It is still everywhere, and GeGLU, one of the gated variants below, uses it directly as its gate.
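
Both forms fit in a few lines. The sketch below writes Phi in terms of the error function and uses the tanh approximation given in the GELU paper; the scan range of [-5, 5] and the step of 0.01 are arbitrary choices made for the comparison.

```python
import math

def gelu_exact(x):
    """x * Phi(x), with Phi(x) = 0.5 * (1 + erf(x / sqrt(2)))."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x):
    """Tanh-based approximation of GELU."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))

# Scan a grid and record the largest disagreement between the two forms.
grid = [i / 100.0 for i in range(-500, 501)]
max_gap = max(abs(gelu_exact(x) - gelu_tanh(x)) for x in grid)

for x in (-3.0, -1.0, 0.0, 1.0, 3.0):
    print(f"x = {x:+.1f}  exact = {gelu_exact(x):+.6f}  tanh = {gelu_tanh(x):+.6f}")
print(f"largest gap on the grid: {max_gap:.2e}")
```

Note the negative outputs for mildly negative inputs, which ReLU would clamp to zero, and that the two forms agree closely but not exactly.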

The gated family: GLU, GeGLU, SwiGLU

The current default in transformer feed-forward blocks is not a single activation applied to one projection. It is a gate. A Gated Linear Unit (GLU) splits the input across two separate linear projections and lets one of them modulate the other multiplicatively:

GLU(x) = (x W + b) * sigmoid(x V + c)

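The elementwise product is the point: one branch decides, per dimension, how much of the other branch to let through. Noam Shazeer's "GLU Variants Improve Transformer" (2020) simply swapped the sigmoid gate for other nonlinearities and measured the result. (The paper applies the nonlinearity to the xW branch rather than the xV branch; since the elementwise product commutes, which projection is called the gate is only a labelling convention.) With bias terms omitted, as in the paper, the variants are: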

Variant	Gate function	Hidden activation (before the down-projection W_2)
GLU	sigmoid	`(xW) * sigmoid(xV)`
ReGLU	ReLU	`(xW) * ReLU(xV)`
GeGLU	GELU	`(xW) * GELU(xV)`
SwiGLU	Swish/SiLU	`(xW) * Swish(xV)`

GeGLU and SwiGLU won the comparison on language-modelling perplexity and downstream tasks. Shazeer's own summary is famously deadpan: he attributes the improvement to "divine benevolence", because there is no clean theoretical reason the gated variants should be better; they simply are, consistently, across his experiments. That empirical result is why SwiGLU propagated into PaLM and then into LLaMA and most open-weight models that followed. When people say a modern transformer uses a "SwiGLU MLP", this is what they mean.
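
For concreteness, here is a dependency-free sketch of a SwiGLU feed-forward block for a single token vector, following this article's labelling (W is the value branch, V is the gate branch). The bias-free form, Swish with beta = 1 (SiLU), the toy dimensions, and the weight values are all assumptions for illustration; a real implementation would use batched tensor operations.

```python
import math

def silu(z):
    """Swish with beta = 1, also called SiLU: z * sigmoid(z)."""
    return z / (1.0 + math.exp(-z))

def vecmat(x, M):
    """Row vector x (length n) times matrix M (n rows, m columns)."""
    return [sum(xi * row[j] for xi, row in zip(x, M)) for j in range(len(M[0]))]

def swiglu_ffn(x, W, V, W2):
    """Compute ((x W) * Swish(x V)) W2 for one token vector x."""
    value = vecmat(x, W)                           # value branch, length d_ff
    gate = [silu(g) for g in vecmat(x, V)]         # gate branch, length d_ff
    hidden = [v * g for v, g in zip(value, gate)]  # elementwise product
    return vecmat(hidden, W2)                      # down-projection to length d

# Toy sizes: d = 2, d_ff = 3. Weights are arbitrary illustrative values.
x = [1.0, 2.0]
W = [[1.0, 0.0, -1.0],
     [0.0, 1.0, 1.0]]   # d x d_ff
V = [[0.0, 1.0, -1.0],
     [0.0, 0.0, 0.0]]   # d x d_ff
W2 = [[1.0, 0.0],
      [0.0, 1.0],
      [1.0, 1.0]]       # d_ff x d

out = swiglu_ffn(x, W, V, W2)
print(out)
```

Swapping `silu` for a GELU or a sigmoid in the `gate` line gives GeGLU or the original GLU; nothing else in the block changes.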

The parameter bookkeeping (do not skip this)

A gated feed-forward block has an extra weight matrix. A classic ReLU/GELU MLP has two: an up-projection W_1 (dimension d to d_ff) and a down-projection W_2 (d_ff to d). A SwiGLU block has three: two up-projections W and V (the value branch and the gate branch), plus the down-projection W_2. If you kept d_ff at the usual 4d, the gated block would carry roughly 3/2 the feed-forward parameters of the dense block, which makes any comparison unfair.

The fix is to shrink the hidden dimension so the totals match. Set d_ff to (2/3) * 4d, that is 8d/3, so that three matrices of the smaller width hold about the same parameter count as two matrices of the full width. LLaMA does exactly this; its feed-forward hidden size is (2/3) * 4d rounded to a hardware-friendly multiple. Whenever you read that a model "uses SwiGLU", check whether this 2/3 scaling is in play; where it is, it is the reason the parameter counts still line up against a dense baseline. Not every model applies it, so read the configuration rather than assuming.
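
The bookkeeping can be checked with integer arithmetic. This sketch counts weight matrices only (no biases), and the model width of 4096 and the rounding multiple of 256 are illustrative assumptions rather than values taken from the text above.

```python
def dense_ffn_params(d, d_ff):
    """Up-projection (d x d_ff) plus down-projection (d_ff x d)."""
    return 2 * d * d_ff

def gated_ffn_params(d, d_ff):
    """Two up-projections (value and gate) plus one down-projection."""
    return 3 * d * d_ff

def scaled_hidden(d, multiple_of=256):
    """(2/3) * 4d, rounded up to a multiple of `multiple_of` (assumed value)."""
    raw = (2 * 4 * d) // 3
    return multiple_of * ((raw + multiple_of - 1) // multiple_of)

d = 4096
dense = dense_ffn_params(d, 4 * d)    # classic MLP at d_ff = 4d
naive = gated_ffn_params(d, 4 * d)    # gated block with d_ff left at 4d
d_ff = scaled_hidden(d)               # gated block with the 2/3 adjustment
matched = gated_ffn_params(d, d_ff)

print(f"dense  (d_ff = {4 * d}): {dense:,}")
print(f"naive  (d_ff = {4 * d}): {naive:,}  ratio {naive / dense:.3f}")
print(f"scaled (d_ff = {d_ff}): {matched:,}  ratio {matched / dense:.3f}")
```

The scaled block lands within about one percent of the dense count; the small overshoot comes from rounding the hidden size up.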

When it falls down

Sigmoid or tanh in a deep stack. Saturation kills the gradient; reserve these for gates and output heads (a binary probability, an LSTM gate), never as the workhorse activation of a deep network.
ReLU with a large learning rate or poor init. Watch for a rising fraction of always-zero units; that is capacity dying silently. Switch to Leaky ReLU/GELU or lower the learning rate.
Assuming the gated variant is free. SwiGLU adds a third matrix and an elementwise multiply. If you forget the 2/3 hidden-dim adjustment you either bloat the parameter count or run an unfair ablation. The extra projection also means more activation memory during training.
Reaching for exotic activations to fix a training problem. The activation is rarely the bottleneck once you are past ReLU. Normalisation, initialisation, and learning-rate schedule usually matter far more; swapping GELU for the activation of the month is a common way to waste a week.
Approximate vs exact GELU mismatch. The tanh approximation and the exact erf form differ slightly. Training with one and serving with the other introduces a small distribution shift; keep them consistent.
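
The "rising fraction of always-zero units" mentioned above is cheap to measure. The helper below is a hypothetical diagnostic, not part of any library: it assumes you have collected post-ReLU outputs as one row per example and one column per unit, and the small batch shown is made up for the demonstration.

```python
def dead_fraction(activations):
    """Fraction of units whose output is zero for every example.

    activations: list of rows, one per example, each row holding the
    post-ReLU outputs of every unit in the layer.
    """
    n_units = len(activations[0])
    dead = sum(
        1 for j in range(n_units)
        if all(row[j] == 0.0 for row in activations)
    )
    return dead / n_units

# Made-up outputs for 3 examples and 4 units.
batch = [
    [0.0, 1.2, 0.0, 0.3],
    [0.0, 0.0, 0.0, 2.1],
    [0.0, 0.5, 0.0, 0.0],
]

print(dead_fraction(batch))  # units 0 and 2 never fire in this batch
```

A unit that is silent on one small batch is not necessarily dead, so a sketch like this is only meaningful over a large sample of the training data, tracked across training steps.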