Transformer Architecture
The attention-based stack that replaced recurrence and now powers nearly every modern LLM.
The 2017 paper "Attention Is All You Need" introduced an architecture built entirely on attention plus feed-forward layers. No recurrence, no convolution. Eight years later it still underlies nearly every frontier LLM.
The building block
A transformer block has:
- Multi-head self-attention that lets each position attend to every other.
- Residual connection + layer norm wrapping the attention.
- Feed-forward MLP (typically 4x the model dimension wide).
- Residual connection + layer norm wrapping the MLP.
Stack 12 of these and you have the smallest GPT-2; stack 96 and you have the 175B-parameter GPT-3. GPT-4's layer count has not been published.
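The block described above can be sketched in a few dozen lines of NumPy. This is a minimal illustration, not a reference implementation: the helper names (`init_params`, `transformer_block`), the toy dimensions, the ReLU activation and the random initialisation are all assumptions made for the example, and the learned scale and shift of layer norm are left out. It follows the ordering in the list (sub-layer, then residual, then layer norm); many current models apply the norm before each sub-layer instead.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalise each position's feature vector to zero mean and unit variance.
    # (Learned scale/shift parameters are omitted in this sketch.)
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(x, w_qkv, w_out, n_heads):
    seq_len, d_model = x.shape
    d_head = d_model // n_heads
    # One projection produces queries, keys and values: each (seq_len, d_model).
    q, k, v = np.split(x @ w_qkv, 3, axis=-1)

    def split_heads(t):
        # (seq_len, d_model) -> (n_heads, seq_len, d_head)
        return t.reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)

    q, k, v = split_heads(q), split_heads(k), split_heads(v)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)  # (heads, seq, seq)
    weights = softmax(scores, axis=-1)   # every position attends to every other
    heads = weights @ v                  # (heads, seq, d_head)
    merged = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return merged @ w_out

def transformer_block(x, params, n_heads):
    # Sub-layer 1: self-attention wrapped in residual + layer norm.
    attn = multi_head_self_attention(x, params["w_qkv"], params["w_out"], n_heads)
    x = layer_norm(x + attn)
    # Sub-layer 2: position-wise MLP (4x wide) wrapped in residual + layer norm.
    hidden = np.maximum(0.0, x @ params["w_up"] + params["b_up"])
    mlp = hidden @ params["w_down"] + params["b_down"]
    return layer_norm(x + mlp)

def init_params(d_model, rng, scale=0.02):
    # Hypothetical initialiser for the sketch; real models use tuned schemes.
    return {
        "w_qkv": rng.normal(0, scale, (d_model, 3 * d_model)),
        "w_out": rng.normal(0, scale, (d_model, d_model)),
        "w_up": rng.normal(0, scale, (d_model, 4 * d_model)),
        "b_up": np.zeros(4 * d_model),
        "w_down": rng.normal(0, scale, (4 * d_model, d_model)),
        "b_down": np.zeros(d_model),
    }

rng = np.random.default_rng(0)
d_model, n_heads, seq_len, n_layers = 64, 4, 10, 3
layers = [init_params(d_model, rng) for _ in range(n_layers)]
x = rng.normal(size=(seq_len, d_model))
for params in layers:  # "stacking" is just repeated application
    x = transformer_block(x, params, n_heads)
print(x.shape)  # (10, 64)
```

Because every block maps a `(seq_len, d_model)` array to another array of the same shape, depth is a free parameter: the loop at the bottom works unchanged for 3 layers or 96.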
Encoder vs decoder
- Encoder-only (BERT): bidirectional self-attention. Great for classification and embedding.
- Decoder-only (GPT family): causal attention mask, generates token by token. The dominant LLM design today.
- Encoder-decoder (T5): encoder reads input fully, decoder generates conditioned on encoder output. Still strong for translation and summarisation.
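In code, the difference between bidirectional and causal attention comes down to one mask applied to the score matrix before the softmax. The sketch below assumes a single head and random scores; the function name `attention_weights` is made up for illustration.

```python
import numpy as np

def attention_weights(scores, causal):
    """Turn raw (seq_len, seq_len) scores into attention weights.

    Row i holds the weights that position i places on every position j.
    """
    seq_len = scores.shape[-1]
    if causal:
        # Entries above the diagonal (j > i) are future tokens: block them.
        future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)
    scores = scores - scores.max(axis=-1, keepdims=True)
    e = np.exp(scores)  # exp(-inf) == 0, so masked entries get zero weight
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
scores = rng.normal(size=(4, 4))

w_bidirectional = attention_weights(scores, causal=False)  # encoder-style
w_causal = attention_weights(scores, causal=True)          # decoder-style

print(np.round(w_causal, 2))  # lower-triangular: no weight on future tokens
```

The first row of `w_causal` puts all of its weight on position 0, since the first token has nothing earlier to look at. An encoder-decoder model uses both kinds of self-attention, plus an unmasked cross-attention from decoder positions to encoder outputs.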
Positional information
Self-attention is permutation-equivariant by itself: shuffle the input tokens and the outputs shuffle the same way, so the layer has no idea where each token sits. Position is injected via:
- Sinusoidal positional encoding (original paper).
- Learned absolute positional embeddings (BERT, GPT-2).
- Rotary position embeddings (RoPE, used by Llama, Mistral, PaLM).
- ALiBi (Attention with Linear Biases, used by BLOOM).
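As a concrete instance of the first option in the list, here is a small sketch of the sinusoidal encoding from the original paper: even feature indices get a sine, odd indices a cosine, with wavelengths growing geometrically across the feature dimension. The function name and the toy sizes are assumptions for the example, and `d_model` is assumed to be even.

```python
import numpy as np

def sinusoidal_positions(seq_len, d_model, base=10000.0):
    """Return a (seq_len, d_model) table of fixed positional encodings."""
    positions = np.arange(seq_len)[:, None]        # (seq_len, 1)
    even_idx = np.arange(0, d_model, 2)[None, :]   # (1, d_model / 2), values 2i
    angles = positions / base ** (even_idx / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)  # even feature indices
    pe[:, 1::2] = np.cos(angles)  # odd feature indices
    return pe

rng = np.random.default_rng(0)
token_embeddings = rng.normal(size=(50, 16))  # stand-in for learned embeddings
pe = sinusoidal_positions(seq_len=50, d_model=16)

# Position is injected by simple addition before the first block.
x = token_embeddings + pe
print(x.shape)  # (50, 16)
```

Learned absolute embeddings replace the computed `pe` table with a trainable matrix of the same shape, which is why they have no row for positions past the training length.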
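RoPE is the modern default. It encodes relative offsets between tokens, so its context window can be stretched past the training length (usually with frequency scaling and some fine-tuning), whereas learned absolute embeddings have no entry at all for positions they never saw.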
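A minimal sketch of the rotary idea, under stated assumptions: each consecutive pair of features in a query or key vector is rotated by an angle proportional to the token's position. The interleaved pairing and the name `rope` are choices made for this example; real implementations differ in how they lay out the pairs. The property worth seeing is that the query-key score then depends only on the offset between the two positions.

```python
import numpy as np

def rope(x, position, base=10000.0):
    """Rotate consecutive feature pairs of a 1-D vector by position-scaled angles."""
    d = x.shape[-1]                              # assumed even
    freqs = base ** (-np.arange(0, d, 2) / d)    # one frequency per pair
    angles = position * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x_even * cos - x_odd * sin       # standard 2-D rotation
    out[1::2] = x_even * sin + x_odd * cos
    return out

rng = np.random.default_rng(0)
q, k = rng.normal(size=8), rng.normal(size=8)

# Same offset (3) between query and key, at two very different absolute positions.
score_near = rope(q, 5) @ rope(k, 2)
score_far = rope(q, 105) @ rope(k, 102)

print(np.isclose(score_near, score_far))  # True
```

Rotations also preserve vector length, so RoPE changes only the angle between queries and keys, not their magnitudes.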