Transformer Architecture

The 2017 paper "Attention Is All You Need" introduced an architecture built entirely on attention plus feed-forward layers. No recurrence, no convolution. Eight years later it still underlies nearly every frontier LLM.

The building block

A transformer block has:

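- Multi-head self-attention that lets each position attend to every other.
- Residual connection + layer norm wrapping the attention.
- Feed-forward MLP (typically 4x the model dimension wide).
- Residual connection + layer norm wrapping the MLP.

The original paper applied layer norm after each residual add (post-norm). Most current LLMs instead normalise the input to each sublayer (pre-norm), which tends to train more stably at depth.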
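The four pieces above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not a reference implementation: it uses the post-norm ordering of the original paper, a ReLU MLP, a layer norm without learnable gain or bias, and no batching or dropout. The names `init_params`, `transformer_block` and the parameter keys are hypothetical, chosen only for this example.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalise each position's feature vector (no learnable gain/bias here).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(x, w_q, w_k, w_v, w_o, n_heads):
    seq_len, d_model = x.shape
    d_head = d_model // n_heads

    def split(t):
        # (seq, d_model) -> (heads, seq, d_head)
        return t.reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)

    q, k, v = split(x @ w_q), split(x @ w_k), split(x @ w_v)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)  # (heads, seq, seq)
    heads = softmax(scores, axis=-1) @ v                 # (heads, seq, d_head)
    merged = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return merged @ w_o

def transformer_block(x, p, n_heads):
    # 1-2: attention, then residual add + layer norm (post-norm).
    attn = multi_head_self_attention(
        x, p["w_q"], p["w_k"], p["w_v"], p["w_o"], n_heads
    )
    x = layer_norm(x + attn)
    # 3-4: MLP with a 4x-wide hidden layer, then residual add + layer norm.
    hidden = np.maximum(0.0, x @ p["w_1"] + p["b_1"])
    return layer_norm(x + hidden @ p["w_2"] + p["b_2"])

def init_params(d_model, rng, scale=0.02):
    d_ff = 4 * d_model
    shapes = {
        "w_q": (d_model, d_model), "w_k": (d_model, d_model),
        "w_v": (d_model, d_model), "w_o": (d_model, d_model),
        "w_1": (d_model, d_ff), "w_2": (d_ff, d_model),
    }
    p = {name: rng.normal(scale=scale, size=s) for name, s in shapes.items()}
    p["b_1"] = np.zeros(d_ff)
    p["b_2"] = np.zeros(d_model)
    return p

rng = np.random.default_rng(0)
d_model, n_heads, seq_len = 16, 4, 5
params = init_params(d_model, rng)
x = rng.normal(size=(seq_len, d_model))
out = transformer_block(x, params, n_heads)
print(out.shape)  # (5, 16)
```

The block maps a (seq_len, d_model) array to another of the same shape, which is what makes stacking possible: the output of one block is a valid input to the next.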

Stack 12 of these and you have GPT-2 small; GPT-2 XL uses 48 and the 175B GPT-3 uses 96. OpenAI has not published GPT-4's layer count.

Encoder vs decoder

- Encoder-only (BERT): bidirectional self-attention. Great for classification and embeddings.
- Decoder-only (GPT family): causal attention mask, generates token by token. The dominant LLM design today.
- Encoder-decoder (T5): encoder reads input fully, decoder generates conditioned on encoder output. Still strong for translation and summarisation.
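The practical difference between bidirectional and causal attention is a mask applied to the scores before the softmax. The sketch below is illustrative only; `attention_weights` is a hypothetical helper, and the all-zero scores stand in for real query-key products so the effect of the mask is easy to see.

```python
import numpy as np

def attention_weights(scores, causal):
    """Softmax over the last axis, optionally hiding future positions."""
    seq_len = scores.shape[-1]
    if causal:
        # True above the diagonal: position i must not see positions j > i.
        future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)
    scores = scores - scores.max(axis=-1, keepdims=True)
    e = np.exp(scores)  # exp(-inf) is 0, so masked positions get no weight
    return e / e.sum(axis=-1, keepdims=True)

scores = np.zeros((4, 4))  # placeholder scores: every pair equally similar
bidir_w = attention_weights(scores, causal=False)
causal_w = attention_weights(scores, causal=True)

print(bidir_w[1])   # [0.25 0.25 0.25 0.25]
print(causal_w[1])  # [0.5 0.5 0.  0. ]
```

In the encoder-style case every row spreads weight over the whole sequence. In the decoder-style case row i only covers positions 0..i, which is what allows token-by-token generation without peeking ahead.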

Positional information

Self-attention by itself is permutation-equivariant - reorder the input tokens and the outputs are simply reordered the same way, so it has no idea where each token sits. Position is injected via:
- Sinusoidal positional encoding (original paper).
- Learned absolute positional embeddings (BERT, GPT-2).
- Rotary position embeddings (RoPE, used by Llama, Mistral, PaLM).
- ALiBi (Attention with Linear Biases, used by BLOOM).
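To make the problem and the first fix concrete, the sketch below checks that plain self-attention ignores token order (reordering the inputs merely reorders the outputs) and that adding the sinusoidal encoding from the original paper breaks that symmetry. It is a simplified illustration: `self_attention` is a single head with identity projections, and `sinusoidal_positions` assumes an even `d_model`. Both names are hypothetical.

```python
import numpy as np

def sinusoidal_positions(seq_len, d_model):
    # PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    # PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    positions = np.arange(seq_len)[:, None]
    even_dims = np.arange(0, d_model, 2)[None, :]
    angles = positions / np.power(10000.0, even_dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

def self_attention(x):
    # Single head, identity Q/K/V projections, to keep the sketch small.
    scores = x @ x.T / np.sqrt(x.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)
    w = np.exp(scores)
    w = w / w.sum(axis=-1, keepdims=True)
    return w @ x

rng = np.random.default_rng(0)
x = rng.normal(size=(6, 8))
perm = np.arange(6)[::-1]  # reverse the token order
pe = sinusoidal_positions(6, 8)

# Without positions: shuffling inputs just shuffles outputs.
order_blind = np.allclose(self_attention(x[perm]), self_attention(x)[perm])

# With positions: the same token at a new position gets a different output.
order_aware = not np.allclose(
    self_attention(x[perm] + pe), self_attention(x + pe)[perm]
)

print(order_blind, order_aware)  # True True
```

Learned absolute embeddings slot into the same place as `pe` here (a trainable table added to the inputs), whereas RoPE and ALiBi act inside the attention computation instead.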

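RoPE is the modern default. Because it encodes position as a rotation of the query and key vectors, attention scores depend on the relative offset between tokens, and with scaling techniques such as position interpolation a model can be extended to longer contexts than it was trained on. Learned absolute embeddings have no entries at all for positions beyond the training length.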
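A small sketch of the idea behind RoPE: each pair of features in a query or key is rotated by an angle proportional to the token's position, so the dot product between a rotated query and a rotated key depends only on how far apart they are. The function `rope` below is a hypothetical single-vector version that pairs adjacent features; real implementations differ in how they pair dimensions and operate on whole batches.

```python
import numpy as np

def rope(x, position, base=10000.0):
    """Rotate adjacent feature pairs of a 1-D vector by position-dependent angles."""
    d = x.shape[-1]  # assumed even
    freqs = base ** (-np.arange(0, d, 2) / d)  # one frequency per pair
    angles = position * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x_even * cos - x_odd * sin
    out[1::2] = x_even * sin + x_odd * cos
    return out

rng = np.random.default_rng(0)
q = rng.normal(size=8)
k = rng.normal(size=8)

# Same offset (2 tokens apart) at very different absolute positions.
score_near = rope(q, 5) @ rope(k, 3)
score_far = rope(q, 105) @ rope(k, 103)

print(np.isclose(score_near, score_far))  # True
```

The two scores match because rotating both vectors by an extra 100 positions cancels out in the dot product. Rotation also leaves vector norms unchanged, so RoPE adds position information without rescaling queries or keys.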