Background

When Vaswani and colleagues published Attention Is All You Need in 2017, the Transformer they described was a translation machine. It had two halves: an encoder that read, say, an English sentence and a decoder that wrote the French one, joined by a bridge of attention. The design was symmetric and deliberate, a direct answer to the recurrent sequence-to-sequence models it replaced. Two years later, OpenAI's GPT-2, following the recipe of its 2018 predecessor GPT-1, took that architecture, deleted the encoder, deleted the bridge, and trained the remaining half on 40 GB of Reddit-linked web text. The result was not a worse translator. It was a system that could write coherent paragraphs, answer questions, and summarize articles without being trained to do any of those things specifically.

What is interesting is not that GPT-2 worked but why removing half the architecture made it more general, not less. The answer is that the encoder and decoder were never two different ideas. They were two configurations of the same primitive, self-attention, and language modeling needs only one of them.

Why this matters: Almost every large language model shipped since 2020, including the GPT, Claude, and Llama families, is a decoder-only Transformer descended directly from GPT-2's pruning of the original design. Understanding which pieces survived and which were cut is understanding the shape of the entire field.

TL;DR

The original Transformer (Vaswani et al., 2017) is an encoder-decoder built for machine translation; the encoder reads the whole source at once, the decoder generates the target one token at a time.
Self-attention replaced recurrence. Instead of passing a hidden state along a chain, every token directly attends to every other token in a single parallel operation, which is what made training on GPUs at scale practical.
The encoder's self-attention and the decoder's self-attention differ in exactly one respect: the decoder's is masked so a token cannot see the future. That single change is the difference between "understand this text" and "predict the next word."
GPT-2 is decoder-only. It keeps masked self-attention, drops the encoder and the cross-attention that connected the two halves, and trains purely to predict the next token.
Attention cost grows quadratically with sequence length, $O(n^2 d)$. This is the Transformer's defining tradeoff and the reason GPT-2 capped context at 1,024 tokens.
Scale, not architecture, is GPT-2's headline. The four sizes (117M to 1.5B parameters) share an identical design; performance on unseen tasks rose smoothly with size, foreshadowing the scaling laws that followed.

At a Glance

The whole architecture, in the form a reader can hold in their head: the original Transformer is two stacks; GPT-2 is the right-hand stack on its own.

flowchart LR
  subgraph Original["Original Transformer (2017)"]
    direction TB
    SRC[Source tokens] --> ENC[Encoder stack<br/>self-attention]
    TGT[Target tokens so far] --> DEC[Decoder stack<br/>masked self-attention]
    ENC -- cross-attention --> DEC
    DEC --> OUT1[Next target token]
  end
  subgraph GPT2["GPT-2 (2019)"]
    direction TB
    TOK[Tokens so far] --> DEC2[Decoder stack<br/>masked self-attention]
    DEC2 --> OUT2[Next token]
  end

The deletions are the story: no separate source, no encoder, no cross-attention. What remains is a single stack that consumes its own output.

Before Attention

To see why attention mattered, look at what it displaced. Through the mid-2010s, the dominant approach to sequence tasks was the recurrent neural network, usually an LSTM (Hochreiter and Schmidhuber, 1997) wired into an encoder-decoder for translation (Sutskever et al., 2014). An RNN reads a sentence one word at a time, folding each new word into a fixed-size hidden state that it carries forward. The hidden state is the model's entire memory of everything it has read.

This design has two structural problems. First, it is inherently sequential: to compute the state at position 50 you must have already computed positions 1 through 49, so the work cannot be parallelized across the sentence. On the GPUs that were becoming the engine of deep learning, that is a crippling limitation. Second, the fixed-size hidden state is a bottleneck. A long sentence must be squeezed through the same vector as a short one, and information from early words tends to decay by the time the model reaches the end.
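Both problems show up even in a toy recurrence. The sketch below is a deliberately tiny scalar RNN, not an LSTM; the function names and the weights `w_h` and `w_x` are made up for illustration.

```python
import math

def rnn_step(h, x, w_h=0.5, w_x=1.0):
    """One step of a scalar toy RNN: new state from previous state and current input."""
    return math.tanh(w_h * h + w_x * x)

def run_rnn(inputs):
    """Fold a sequence into one fixed-size state, strictly left to right."""
    h = 0.0                      # the entire memory is this single number
    for x in inputs:             # step t cannot start until step t-1 has finished
        h = rnn_step(h, x)
    return h

# An input at the first position, followed by 20 empty steps.
print(run_rnn([1.0] + [0.0] * 20))
```

The loop cannot be parallelized across positions because each iteration consumes the previous state, and with these toy weights the trace of the first input roughly halves at every later step.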

The first crack came from attention as an add-on to RNNs. Bahdanau et al. (2014) let the decoder, at each output step, look back over all the encoder's hidden states and take a weighted average, with the weights learned. The decoder no longer relied on a single compressed summary; it could attend to the relevant source words directly. Translation quality jumped, especially on long sentences.

The Transformer's move was to ask: if attention is doing the heavy lifting, do we need the recurrence at all? The paper's title is the answer. Remove the RNN entirely, and let attention handle both the relationships within a sentence (self-attention) and those between the source sentence and the target sentence (cross-attention).

timeline
  title From recurrence to decoder-only
  1997 : LSTM eases vanishing gradients in RNNs
  2014 : Seq2seq with LSTMs; Bahdanau adds attention to the decoder
  2017 : Transformer drops recurrence entirely (Attention Is All You Need)
  2018 : GPT-1 (decoder-only) and BERT (encoder-only) split the architecture
  2019 : GPT-2 scales decoder-only to 1.5B params, shows zero-shot generality
  2020 : GPT-3 (175B) confirms scaling laws; decoder-only becomes the default

[IMAGE: Side-by-side schematic of an LSTM encoder-decoder versus the Transformer, with the sequential hidden-state chain on the left and the all-pairs attention fan-out on the right, annotated with "sequential, O(n) depth" and "parallel, O(1) depth"]

How It Actually Works

Strip away the surrounding machinery and a Transformer is a stack of identical blocks, each of which does two things to a sequence of vectors: it lets the vectors exchange information (attention), then it processes each vector independently (a feed-forward network). Everything else, the embeddings, the normalization, the residual connections, exists to make that loop trainable and deep.

Tokens become vectors

Text first becomes a sequence of integer token IDs. GPT-2 uses byte-pair encoding (Sennrich et al., 2016) with a vocabulary of 50,257 tokens, a scheme that splits common words into single tokens and rare words into pieces, so no input is ever out-of-vocabulary. Each ID indexes into an embedding matrix, producing a vector of dimension $d_{model}$ (768 in the smallest GPT-2, 1,600 in the largest).

Attention has no inherent sense of order; it treats its input as a set, not a sequence. So position must be injected explicitly. The original Transformer added fixed sinusoidal position encodings; GPT-2 instead learns a position embedding for each of its 1,024 slots and adds it to the token embedding. The model now knows both what each token is and where it sits.
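A minimal sketch of that input step, assuming toy sizes and random tables in place of trained ones. The names `token_embedding`, `position_embedding`, and `embed` are illustrative, not GPT-2's own identifiers, and positions are zero-based here.

```python
import random

# Toy sizes for illustration; GPT-2 Small uses 50,257 tokens, 1,024 positions, d_model 768.
VOCAB_SIZE, N_POSITIONS, D_MODEL = 10, 8, 4

rng = random.Random(0)

def random_table(rows, cols):
    """Stand-in for a learned embedding matrix (random here, trained in practice)."""
    return [[rng.gauss(0.0, 0.02) for _ in range(cols)] for _ in range(rows)]

token_embedding = random_table(VOCAB_SIZE, D_MODEL)       # one row per token ID
position_embedding = random_table(N_POSITIONS, D_MODEL)   # one row per position slot

def embed(token_ids):
    """Return one vector per token: its token row plus its position row."""
    if len(token_ids) > N_POSITIONS:
        raise ValueError("sequence is longer than the number of learned position slots")
    return [
        [t + p for t, p in zip(token_embedding[tok], position_embedding[pos])]
        for pos, tok in enumerate(token_ids)
    ]

x = embed([3, 1, 3])
# The same token ID (3) at positions 0 and 2 yields different vectors,
# because the position rows differ.
print(x[0] != x[2])
```

The length check makes the limitation of a learned table concrete: a position beyond the last slot has no row to look up.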

Scaled dot-product attention

This is the load-bearing operation. For each token, the model produces three vectors by multiplying its embedding by three learned weight matrices: a query $q$ (what am I looking for?), a key $k$ (what do I offer?), and a value $v$ (what will I pass on if selected?). Stacking these across all $n$ tokens gives matrices $Q$, $K$, $V$.

The attention output is:

\[\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V\]

Read it left to right. $QK^\top$ is an $n \times n$ matrix of dot products: entry $(i, j)$ scores how much token $i$'s query matches token $j$'s key. Dividing by $\sqrt{d_k}$ (where $d_k$ is the key dimension) keeps those dot products from growing large enough to push the softmax into a near-one-hot regime where gradients vanish; this is the "scaled" part, and it is not cosmetic. The softmax turns each row into a probability distribution over all tokens. Multiplying by $V$ takes, for each token, a weighted average of every token's value vector, weighted by relevance.

The result: each token's output is a blend of information pulled from wherever in the sequence it found relevant. That is the whole trick. There is no recurrence, no fixed bottleneck; the relationship between token 1 and token 1,000 is one dot product, computed in parallel with every other pair.
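The formula translates almost line for line into code. The following is a sketch in plain Python, written for readability rather than speed and without masking or multiple heads; the function names are illustrative.

```python
import math

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(s - m) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V, with Q, K, V given as lists of row vectors."""
    d_k = len(K[0])
    out = []
    for q in Q:
        # One row of QK^T, scaled: how well this query matches every key.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Weighted average of the value vectors.
        out.append([sum(w * v[c] for w, v in zip(weights, V)) for c in range(len(V[0]))])
    return out

identity = [[1.0, 0.0], [0.0, 1.0]]
result = attention(identity, identity, identity)
print(result)
```

With one-hot values, each output row is simply that token's attention weights, so every row sums to 1 and leans toward the key that matches its query.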

Why "multi-head"

A single attention operation forces every token to summarize all its relationships into one weighted average. That is lossy. Multi-head attention runs several attention operations in parallel (8 heads in the original base model, 12 to 25 in GPT-2), each with its own smaller $Q$, $K$, $V$ projections of dimension $d_k = d_{model}/h$. One head might learn to track syntactic agreement, another to link pronouns to their referents, another to attend to the immediately preceding token. The heads' outputs are concatenated and projected back to $d_{model}$. Crucially, the head dimension stays fixed at 64 across all GPT-2 sizes; bigger models get more heads and more layers, not wider heads.
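The head arithmetic is easy to check. The sketch below takes the widths and head counts quoted in this article and shows the split-and-concatenate bookkeeping; `split_heads` and `merge_heads` are illustrative helpers, and the learned output projection that follows the concatenation is omitted.

```python
# (d_model, heads) pairs as listed in this article for the four GPT-2 sizes.
GPT2_SIZES = {"Small": (768, 12), "Medium": (1024, 16), "Large": (1280, 20), "XL": (1600, 25)}

head_dims = {name: d_model // heads for name, (d_model, heads) in GPT2_SIZES.items()}
print(head_dims)  # {'Small': 64, 'Medium': 64, 'Large': 64, 'XL': 64}

def split_heads(vector, heads):
    """Cut one d_model-sized vector into `heads` contiguous chunks of size d_model / heads."""
    d_k, remainder = divmod(len(vector), heads)
    if remainder:
        raise ValueError("d_model must be divisible by the number of heads")
    return [vector[i * d_k:(i + 1) * d_k] for i in range(heads)]

def merge_heads(chunks):
    """Concatenate per-head outputs back into one d_model-sized vector."""
    return [value for chunk in chunks for value in chunk]

projected = list(range(768))           # stand-in for one token's projected query
chunks = split_heads(projected, 12)
print(len(chunks), len(chunks[0]))     # 12 64
```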

The block, assembled

Each Transformer block wraps attention and a position-wise feed-forward network (two linear layers with a GELU nonlinearity in GPT-2, Hendrycks and Gimpel, 2016) in two pieces of structural glue:

Residual connections: the input of each sub-layer is added back to its output, so gradients have a clean path through dozens of layers (He et al., 2015).
Layer normalization (Ba et al., 2016): stabilizes the scale of activations. The original Transformer placed it after each sub-layer ("post-LN"); GPT-2 moved it to the input of each sub-block and added a final normalization after the last block ("pre-LN"). This was not a detail. Pre-LN makes very deep Transformers trainable without the careful learning-rate warmup post-LN demands, a result later analyzed by Xiong et al. (2020).
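
The two placements can be written side by side. This is a sketch in notation introduced here for illustration: $x_l$ is the input to sub-layer $l$, $F_l$ stands for either attention or the feed-forward network, and $\mathrm{LN}$ is layer normalization.

\[\text{post-LN:}\quad x_{l+1} = \mathrm{LN}\big(x_l + F_l(x_l)\big) \qquad\qquad \text{pre-LN:}\quad x_{l+1} = x_l + F_l\big(\mathrm{LN}(x_l)\big)\]

In the pre-LN form the path from $x_l$ to $x_{l+1}$ is a plain addition with no normalization on it, which is one common way to explain why deep pre-LN stacks are easier to train.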

The one difference that defines everything: masking

Here is the hinge of the whole essay. The encoder and the decoder use the same self-attention operation, with one change. In the decoder, before the softmax, every score $(i, j)$ where $j > i$ is set to $-\infty$. After the softmax, those positions become exactly zero. A token can attend to itself and everything before it, never to anything after.

Why? Because the decoder's job is to generate. When predicting token $i+1$, it must only use tokens $1 \ldots i$; if it could see token $i+1$ during training, it would learn to cheat, copying the answer instead of predicting it. The mask enforces causality. The encoder has no such constraint: it is reading a complete input, so every token may see every other token in both directions. That bidirectionality is exactly what makes encoders good at understanding (BERT's domain) and useless at generation.

GPT-2's entire architecture follows from choosing the masked, generative configuration and committing to it.
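The mask itself is only a few lines. The sketch below applies it to uniform toy scores so that its effect on the softmax is easy to read; the function names are illustrative.

```python
import math

NEG_INF = float("-inf")

def causal_mask(scores):
    """Set score (i, j) to -inf wherever j > i; leave the diagonal and the past untouched."""
    return [[s if j <= i else NEG_INF for j, s in enumerate(row)] for i, row in enumerate(scores)]

def softmax(row):
    m = max(row)                             # finite, because the diagonal is never masked
    exps = [math.exp(s - m) for s in row]    # exp(-inf) is exactly 0.0
    total = sum(exps)
    return [e / total for e in exps]

scores = [[0.0] * 4 for _ in range(4)]       # uniform toy scores for four tokens
weights = [softmax(row) for row in causal_mask(scores)]
for row in weights:
    print([round(w, 2) for w in row])
# [1.0, 0.0, 0.0, 0.0]
# [0.5, 0.5, 0.0, 0.0]
# [0.33, 0.33, 0.33, 0.0]
# [0.25, 0.25, 0.25, 0.25]
```

Deleting the `causal_mask` call turns every row into a uniform 0.25, which is the bidirectional, encoder-style behavior.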

Seeing It in Motion

The generation loop is where decoder-only design shows its character. GPT-2 produces text autoregressively: it predicts one token, appends it to the input, and runs the whole stack again.

sequenceDiagram
  participant U as Prompt
  participant E as Embed + position
  participant D as Decoder stack (12-48 blocks)
  participant H as Output head (tie to embeddings)
  participant S as Sampler

  U->>E: "The cat sat on the"
  E->>D: token + position vectors
  loop Each new token
    D->>D: masked self-attention + FFN per block
    D->>H: final hidden state (last position)
    H->>S: logits over 50,257 tokens
    S->>E: pick "mat", append to sequence
  end
  Note over D,S: Stop at end-of-text token or length limit

Notice that the output projection ties its weights to the input embedding matrix, a parameter-saving trick GPT-2 inherited: the same 50,257 by $d_{model}$ matrix maps tokens in and logits out.
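The loop in the diagram can be sketched as follows. This assumes greedy decoding (always taking the highest-scoring token) and a simple truncation to the most recent tokens, neither of which is specified above. `toy_model` is a hypothetical stand-in for the decoder stack and output head, not GPT-2.

```python
def generate(next_token_logits, prompt_ids, max_new_tokens, eos_id, context_limit=1024):
    """Greedy autoregressive decoding: predict one token, append it, repeat."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        window = ids[-context_limit:]          # the model only sees its context window
        logits = next_token_logits(window)     # one score per vocabulary entry
        next_id = max(range(len(logits)), key=logits.__getitem__)   # argmax
        ids.append(next_id)
        if next_id == eos_id:                  # stop at the end-of-text token
            break
    return ids

def toy_model(window):
    """Hypothetical model over a 5-token vocabulary: favors (last token + 1) mod 5."""
    vocab = 5
    target = (window[-1] + 1) % vocab
    return [1.0 if i == target else 0.0 for i in range(vocab)]

print(generate(toy_model, [0], max_new_tokens=10, eos_id=4))  # [0, 1, 2, 3, 4]
```

Replacing the argmax with sampling from the softmax of the logits gives the more varied output usually associated with GPT-2 demos.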

The masking pattern itself is easiest to see as a flow over the attention matrix. Each row is a token's query; each column a token it might attend to.

flowchart TD
  A["Compute QKᵀ<br/>(n × n scores)"] --> B["Scale by 1/√d_k"]
  B --> C["Apply causal mask<br/>set j > i to -∞"]
  C --> D["Row-wise softmax<br/>future cols → 0"]
  D --> E["Multiply by V<br/>weighted sum of past values"]
  E --> F["Concat heads, project to d_model"]
  F --> G["Add residual<br/>(layer-norm sits at the block input in GPT-2)"]

[IMAGE: A 6x6 attention matrix rendered as a heatmap with the upper triangle blacked out, captioned "causal mask: each token sees only itself and the past," with one row highlighted to show its softmax distribution over prior tokens]

By the Numbers

The architecture's shape is set by a handful of numbers, and the GPT-2 family is a clean controlled experiment: identical design, four scales.

Model	Layers	$d_{model}$	Heads	Parameters	Context	Notes
Transformer base (2017)	6 enc + 6 dec	512	8	~65M	n/a	BLEU 27.3 EN→DE, 38.1 EN→FR on WMT 2014 (the larger "big" model reached 28.4 and 41.8)
GPT-2 Small	12	768	12	117M	1,024	Matches GPT-1 depth
GPT-2 Medium	24	1,024	16	345M	1,024
GPT-2 Large	36	1,280	20	762M	1,024
GPT-2 XL	48	1,600	25	1,542M	1,024	Held back at first release for safety

Sources: parameter counts, layer and width figures from Radford et al., 2019 and the GPT-2 model card; Transformer base figures and BLEU scores from Vaswani et al., 2017.

The numbers that define the cost are sharper. Self-attention over a sequence of length $n$ with model dimension $d$ requires building an $n \times n$ score matrix, so:

Quantity	Self-attention	RNN	Why it matters
Compute per layer	$O(n^2 \cdot d)$	$O(n \cdot d^2)$	Attention wins when $n < d$; loses for long sequences
Sequential operations	$O(1)$	$O(n)$	Attention parallelizes fully across the sequence
Memory (naive)	$O(n^2)$	$O(n)$	The $n^2$ score matrix is the bottleneck
Max path length	$O(1)$	$O(n)$	Any two tokens interact in one step

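The compute, sequential-operation, and path-length figures are from Table 1 of Vaswani et al. (2017); the memory row is the standard figure for a naive implementation rather than part of that table. The $O(1)$ path length is the deep reason attention learns long-range dependencies that RNNs struggle with: there is no chain for the signal to decay along.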
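The crossover in the compute row is easy to see numerically. The sketch below keeps only the leading-order terms from the table and ignores constant factors, so the ratios are indicative rather than measured.

```python
def attention_cost(n, d):
    """Leading-order per-layer cost of self-attention, O(n^2 * d)."""
    return n * n * d

def recurrent_cost(n, d):
    """Leading-order per-layer cost of a recurrent layer, O(n * d^2)."""
    return n * d * d

d = 768  # GPT-2 Small width
for n in (256, 768, 1024, 8192):
    ratio = attention_cost(n, d) / recurrent_cost(n, d)   # simplifies to n / d
    print(f"n={n:5d}  attention/recurrent = {ratio:.2f}")
```

The ratio is exactly $n/d$: below 1 while the sequence is shorter than the model is wide, above 1 once it is longer.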

GPT-2 was trained on WebText: 40 GB of text, roughly 8 million documents that survived de-duplication and cleaning of about 45 million scraped links. The links were the outbound links of Reddit posts with at least 3 karma, none created after December 2017, a cheap proxy for "a human found this worth sharing" (Radford et al., 2019).

A Concrete Example

Walk one attention computation by hand. Take the four-token sequence "the cat sat down" and focus on the masked self-attention for the word sat (position 3). For legibility, use a tiny $d_k = 2$ and made-up but plausible vectors.

After projection, suppose the keys and values for all four tokens are as follows (the query for sat is given just below the table):

Token	Position	Key $k$	Value $v$
the	1	(1.0, 0.0)	(0.2, 0.1)
cat	2	(0.2, 0.9)	(0.9, 0.4)
sat	3	(0.8, 0.3)	(0.5, 0.7)
down	4	(0.1, 1.0)	(0.6, 0.6)

Query for sat: $q = (0.9, 0.4)$.

Step 1, raw scores ($q \cdot k$ for each token):

the: $0.9(1.0) + 0.4(0.0) = 0.90$
cat: $0.9(0.2) + 0.4(0.9) = 0.54$
sat: $0.9(0.8) + 0.4(0.3) = 0.84$
down: $0.9(0.1) + 0.4(1.0) = 0.49$

Step 2, scale by $1/\sqrt{2} \approx 0.707$: $(0.636, 0.382, 0.594, 0.346)$.

Step 3, causal mask. sat is at position 3, so down (position 4) is in the future. Its score becomes $-\infty$. The vector is now $(0.636, 0.382, 0.594, -\infty)$.

Step 4, softmax over the three visible tokens. Exponentiate: $(1.890, 1.465, 1.811, 0)$; sum of the first three is $5.166$. Weights, rounded to three places:

the: $0.366$
cat: $0.284$
sat: $0.351$
down: $0.000$

Step 5, weighted sum of values:

\[0.366(0.2, 0.1) + 0.284(0.9, 0.4) + 0.351(0.5, 0.7) \approx (0.50, 0.40)\]

That vector $(0.50, 0.40)$ is sat's output from this attention head. Three things are visible in the trace. The future token down contributed exactly zero, as the mask guarantees. The attention spread itself across the, cat, and sat rather than collapsing onto one, which is typical of a single head. And the output is a genuine blend of context, not a lookup; this is the representation that the next layer refines, and that the output head eventually turns into a probability distribution over what word comes after sat. Run 25 such heads side by side in each block, stack 48 blocks, and you have GPT-2 XL.
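The trace can be checked mechanically. The sketch below reruns the five steps on the same made-up vectors at full floating-point precision, so any rounding in the hand calculation can be compared against it; the variable names are illustrative and positions are zero-based.

```python
import math

keys = [(1.0, 0.0), (0.2, 0.9), (0.8, 0.3), (0.1, 1.0)]     # the, cat, sat, down
values = [(0.2, 0.1), (0.9, 0.4), (0.5, 0.7), (0.6, 0.6)]
q = (0.9, 0.4)        # query for "sat"
query_index = 2       # zero-based position of "sat"

d_k = len(q)
# Steps 1 and 2: dot products, then scale by 1/sqrt(d_k).
scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys]
# Step 3: causal mask, future positions become -inf.
masked = [s if j <= query_index else float("-inf") for j, s in enumerate(scores)]
# Step 4: softmax (exp(-inf) is 0.0, so the future gets zero weight).
exps = [math.exp(s) for s in masked]
weights = [e / sum(exps) for e in exps]
# Step 5: weighted sum of the value vectors.
output = [sum(w * v[c] for w, v in zip(weights, values)) for c in range(2)]

print([round(s, 3) for s in scores])
print([round(w, 3) for w in weights])
print([round(o, 2) for o in output])
```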

Where It Breaks

The $n^2$ in the cost table is not an asymptotic curiosity; it is the wall GPT-2 lived against. At 1,024 tokens the attention matrix for a single head holds about a million entries, and it must be materialized for every layer and every head. Doubling the context quadruples that memory. This is why GPT-2's context was frozen at 1,024 tokens, and why every long-context model since has had to attack the quadratic term directly. The most consequential attack, FlashAttention (Dao et al., 2022), did not change the math; it reorganized the computation to avoid writing the full $n \times n$ matrix to slow GPU memory, cutting memory to linear in $n$ and delivering roughly a 3x speedup on GPT-2-scale models. The attention is still $O(n^2)$ in compute, but it no longer pays $O(n^2)$ in memory traffic.
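To put numbers on the wall, the sketch below counts score-matrix entries for a single sequence under naive attention, using the head and layer counts from the table above. It counts entries, not bytes, and ignores batch size.

```python
def score_matrix_entries(n_tokens, n_heads, n_layers):
    """Entries across all n x n attention score matrices for one sequence."""
    return n_tokens * n_tokens * n_heads * n_layers

per_head = score_matrix_entries(1024, 1, 1)
xl_total = score_matrix_entries(1024, 25, 48)       # GPT-2 XL: 25 heads, 48 layers
xl_doubled = score_matrix_entries(2048, 25, 48)     # same model, twice the context

print(per_head)                  # 1048576
print(xl_total)                  # 1258291200
print(xl_doubled // xl_total)    # 4
```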

A second failure mode is positional. Because attention is order-blind and GPT-2 learned a fixed embedding for each of its 1,024 positions, the model has no representation for position 1,025. It cannot extrapolate beyond its trained context at all; the architecture simply has no slot. This rigidity is what later relative-position schemes like RoPE (Su et al., 2021) and ALiBi (Press et al., 2021) were designed to fix.

Even within its window, attention is not uniform. Models attend most reliably to the beginning and end of their context and can lose information in the middle, the "lost in the middle" effect documented by Liu et al. (2023). And at the level of behavior, GPT-2 confabulates: trained only to predict plausible next tokens, it has no mechanism distinguishing a true continuation from a fluent-sounding false one. The architecture optimizes likelihood, not truth, and nothing in the design corrects that.

Alternative Designs

GPT-2's decoder-only choice was one of three forks taken from the 2017 architecture in 2018-2019. Each keeps self-attention and discards a different part.

Architecture	Attention	Trained to	Best at	Example
Encoder-decoder	Bidirectional encoder + masked decoder + cross-attention	Map input sequence to output sequence	Translation, summarization with distinct in/out	Original Transformer, T5
Encoder-only	Bidirectional self-attention	Fill masked-out tokens	Classification, retrieval, understanding	BERT (Devlin et al., 2018)
Decoder-only	Masked (causal) self-attention	Predict the next token	Open-ended generation, in-context learning	GPT-1/2/3, Llama, Claude

The encoder-only camp, led by BERT, bet that bidirectional context is worth giving up generation, and for understanding tasks in 2018 it was demonstrably right. The encoder-decoder camp argued that many tasks genuinely have separate inputs and outputs, and T5 (Raffel et al., 2019) showed you can frame nearly everything that way. Decoder-only looked like the weakest bet: no bidirectional context, the simplest possible objective. It won anyway, for a reason that was not obvious in 2018. A next-token predictor can be handed any task as a text prompt, including tasks it was never trained on, because predicting the continuation of "Translate to French: hello →" is just more next-token prediction. The objective that looked limiting turned out to be the most general. GPT-2 was the first clear demonstration of that, and GPT-3 made it undeniable.

How It Is Used in Practice

The decoder-only Transformer is now the substrate of the commercial LLM industry. GPT-3 (Brown et al., 2020) scaled GPT-2's architecture by roughly 100x to 175 billion parameters and added almost nothing structurally new, its main change being an alternation of dense and sparse attention layers; the GPT-4 generation, Anthropic's Claude, Meta's Llama, Google's Gemini, and Mistral's models are all recognizably the same skeleton. An engineer who understands GPT-2's block understands the core of all of them.

Production has bent the architecture in a few consistent ways. The KV cache is universal: during generation, the key and value vectors for past tokens never change, so they are computed once and stored. Each new-token step then computes a single query, key, and value and attends over the $n$ cached entries, turning an $O(n^2)$ recomputation of the whole sequence into $O(n)$ work. This single optimization is what makes interactive generation affordable. Serving systems then fight over the memory that cache consumes; PagedAttention, the idea behind the widely used vLLM server (Kwon et al., 2023), manages KV-cache memory like an operating system manages pages, sharply raising throughput. Context windows have grown from GPT-2's 1,024 tokens to hundreds of thousands by combining FlashAttention, relative position encodings, and training tricks, though the underlying compute cost still climbs with length.
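A minimal single-head sketch of the KV cache, assuming toy per-token query, key, and value vectors in place of learned projections. `KVCache` and `attend` are illustrative names, not the API of any serving library.

```python
import math

def attend(q, keys, values):
    """Attention output for one query over keys/values that are all in the past or present."""
    scale = math.sqrt(len(q))
    scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [sum(e / total * v[c] for e, v in zip(exps, values)) for c in range(len(values[0]))]

class KVCache:
    """Keeps keys and values of past tokens so each step processes only the new token."""
    def __init__(self):
        self.keys, self.values = [], []

    def step(self, q, k, v):
        self.keys.append(k)       # past keys and values are never recomputed
        self.values.append(v)
        return attend(q, self.keys, self.values)   # O(n) work for the one new query

# Toy (q, k, v) triples, one per token.
tokens = [((0.9, 0.4), (1.0, 0.0), (0.2, 0.1)),
          ((0.1, 0.7), (0.2, 0.9), (0.9, 0.4)),
          ((0.9, 0.4), (0.8, 0.3), (0.5, 0.7))]

cache = KVCache()
cached_outputs = [cache.step(q, k, v) for q, k, v in tokens]

# Recomputing from scratch for the last token gives the same answer.
full = attend(tokens[-1][0], [k for _, k, _ in tokens], [v for _, _, v in tokens])
print(all(math.isclose(a, b) for a, b in zip(cached_outputs[-1], full)))  # True
```

No mask appears in the cached path because the cache only ever contains the past: causality is enforced by what has been appended so far.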

The other practical shift is what happens after pretraining. GPT-2 was released as a raw next-token predictor. Modern deployments add instruction tuning and reinforcement learning from human feedback on top of the same architecture, aligning the model's fluent-but-indifferent generation with what users actually want. The Transformer block did not change; the training recipe around it did.

[IMAGE: Diagram of KV-cache growth during autoregressive generation, showing cached key/value tensors accumulating column by column while only the newest query is computed each step]

Insights Worth Remembering

The encoder and decoder were never separate inventions. They are the same self-attention block in two configurations, bidirectional and causal, and the entire 2018-2019 architectural split is a choice of which configuration to keep.
Masking is the most consequential single line of code in the architecture. Setting future scores to $-\infty$ is what converts "understand" into "generate."
Attention's superpower is constant path length. Any two tokens interact in one operation, which is why Transformers capture long-range structure that RNNs lose to gradient decay.
The same property that makes attention powerful, all-pairs interaction, is what makes it expensive: the $n^2$ score matrix is simultaneously the source of the model's reach and the ceiling on its context.
GPT-2's contribution was less an idea than a demonstration: that a next-token predictor, scaled up, acquires capabilities nobody trained into it. Architecture held still while scale did the work.
The "limiting" choice won. Decoder-only looked like the weakest of the three forks and became the default precisely because next-token prediction is the most general possible training task.

Open Questions

The quadratic cost of attention remains genuinely unsolved, not merely engineered around. FlashAttention removed the memory penalty but not the compute one; whether a sub-quadratic mechanism can match full attention's quality is still contested. State-space models like Mamba (Gu and Dao, 2023) achieve linear scaling and competitive results on some benchmarks, but whether they fully match attention on tasks demanding precise long-range recall is, as of mid-2026, an open empirical question rather than a settled one.

A second open problem is interpretability. We can write down exactly what an attention head computes, yet for a 1.5-billion-parameter model we mostly cannot say what a given head has learned to do or why the network produces a particular output. The mechanism is transparent; the learned function is not. Progress in mechanistic interpretability has identified specific circuits in small models, but a complete account of even GPT-2 Small remains out of reach.

Finally, there is the question GPT-2 first raised and no one has closed: how far does scaling go? The smooth improvement with size that GPT-2 hinted at, and Kaplan et al. (2020) formalized into power-law scaling, has held across several orders of magnitude. Whether it continues, plateaus, or requires architectural change to break through is the central empirical bet of the field, and the evidence so far is consistent with continued returns but cannot prove they are unbounded.
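
For reference, the parameter-count form of that power law can be sketched as follows, with $L$ the test loss, $N$ the number of non-embedding parameters, and $N_c$ and $\alpha_N$ constants fitted to data. Their fitted values are left out here, since they depend on the dataset and setup.

\[L(N) \approx \left(\frac{N_c}{N}\right)^{\alpha_N}\]

A power law of this shape has no built-in plateau, which is why the data alone cannot settle whether returns eventually stop.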

From Encoder-Decoder to GPT-2: How the Transformer Learned to Just Decode