The FLOPs of a Transformer Forward Pass
A systematic derivation of how many floating-point operations a single transformer forward pass costs, and why that number dictates hardware choice, batch strategy, and scaling decisions.
GPT-3 took roughly 3.14 × 10^23 FLOPs to train, or about 3,640 petaFLOP/s-days. That number is not plucked from thin air: it follows directly from counting the multiplications and additions inside a transformer forward pass, scaling by about three to cover the backward pass, then multiplying by the number of training steps. If you cannot derive that count yourself, roofline analysis, hardware selection, and cost estimation all rest on foundations you cannot see. This concept gives you the derivation from first principles.
What counts as a FLOP
One FLOP is one floating-point operation: a single addition or a single multiplication. Most hardware vendors count a fused multiply-add (FMA) as two FLOPs (one multiply, one add), and most FLOPs budgets in the literature follow that convention. A matrix multiply of shape [M, K] × [K, N] costs 2MKN FLOPs: for each of the MN output elements you do K multiplications and K additions (strictly K − 1 additions, but the count is conventionally rounded up to K).
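The 2MKN rule can be checked with a minimal Python sketch. The function names and the example shape (2,048 tokens, hidden width 4,096) are illustrative assumptions, not values taken from this article. The closed form is compared against a counter that walks the same three loops a naive matrix multiply would.

```python
def matmul_flops(m: int, k: int, n: int) -> int:
    """FLOPs for an [m, k] x [k, n] matrix multiply.

    Counts each multiply-add pair as 2 FLOPs (the common convention).
    """
    return 2 * m * k * n


def count_naive_matmul_flops(m: int, k: int, n: int) -> int:
    """Count FLOPs by walking the loops of a naive matrix multiply."""
    flops = 0
    for _ in range(m):          # each output row
        for _ in range(n):      # each output column
            for _ in range(k):  # one multiply and one add per inner step
                flops += 2
    return flops


if __name__ == "__main__":
    # Small shape: the loop count and the closed form agree.
    print(matmul_flops(2, 3, 4), count_naive_matmul_flops(2, 3, 4))

    # Hypothetical projection: 2048 tokens through a 4096 x 4096 weight matrix.
    print(f"{matmul_flops(2048, 4096, 4096):,} FLOPs")
```

Under these assumed shapes, a single square projection already costs tens of gigaFLOPs, which is why the matrix multiplies dominate a transformer's FLOP count.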