The FLOPs of a Transformer Forward Pass

GPT-3 costs roughly 350 petaFLOPs to train. That number is not plucked from thin air: it follows directly from counting the multiplications and additions inside a transformer forward pass, then multiplying by the training steps. If you cannot derive that count yourself, roofline analysis, hardware selection, and cost estimation all rest on foundations you cannot see. This concept gives you the derivation from first principles.

What counts as a FLOP

One FLOP is one floating-point multiply-add. Most hardware vendors count a fused multiply-add (FMA) as two FLOPs (one multiply, one add), and most FLOPs budgets in the literature follow that convention. A matrix multiply of shape [M, K] × [K, N] costs 2MKN FLOPs: for each of the MN output elements you do K multiplications and K additions.

That single formula drives almost every number in transformer analysis.

The major contributors in one transformer layer

A standard decoder-only transformer layer (pre-norm, multi-head self-attention followed by a two-layer MLP) has the following components. Let:

B = batch size (tokens processed in parallel)
T = sequence length
d = model hidden dimension (sometimes d_model)
h = number of attention heads
d_h = d / h = per-head dimension
d_ff = 4d = MLP intermediate dimension (the conventional 4x expansion)

Self-attention projections. Each token is projected to queries, keys, and values via weight matrices W_Q, W_K, W_V ∈ R^{d×d}, and the output is projected back via W_O ∈ R^{d×d}. That is four matrix multiplies, each of shape [BT, d] × [d, d], costing 2 * BT * d^2 each. Total for the four projections:

FLOPs_QKV_proj = 4 × 2BTd² = 8BTd²

Attention score computation. For each head, computing QK^T is [BT, d_h] × [d_h, BT], costing 2BT²d_h per head, so 2BT²d total across all heads. Multiplying the softmax weights by V adds another 2BT²d. Total attention scores:

FLOPs_attn_scores = 4BT²d

Note the T² here: this is where long sequences become expensive. At T = 2048 and d = 4096, this term is smaller than the projection term. At T = 32768, it dominates.

MLP block. Two linear layers: [BT, d] × [d, 4d] and [BT, 4d] × [4d, d], each costing 2BT × d × 4d = 8BTd². Total:

FLOPs_MLP = 16BTd²

Per-layer total. Summing and ignoring small terms (layer norm, activation, bias, softmax, which are O(BTd) and negligible at scale):

FLOPs_layer ≈ 24BTd²  +  4BT²d

For most production models T << d (e.g. T = 2048, d = 8192), so the MLP and projection terms dominate and the quadratic attention term is a modest fraction. The rule of thumb that floats around in the literature is:

What counts as a FLOP

The major contributors in one transformer layer

Keep reading with Pro.