Gradient Checkpointing, Activation Recomputation, and CPU Offload

Most engineers' mental model of training memory is "model + optimiser state." For long sequences and large batches that picture is wrong: activations dominate. Every layer's forward output has to be retained until its backward uses it, so activation memory grows linearly with depth, batch, and sequence length. Gradient checkpointing trades extra compute for activation memory by discarding activations during the forward pass and recomputing them in chunks on the backward pass. When even that is not enough, ZeRO-Offload moves optimiser state and gradients to CPU DRAM, and ZeRO-Infinity extends the idea to parameters and NVMe.

Why activations dominate

For a transformer with L layers, hidden size h, sequence length s, and batch b, the activation memory for a single forward pass is roughly:

activations ~ L * s * b * h * (constant for attention + MLP intermediates)

The constant is large (Megatron's accounting puts it at 34-ish bytes per token per hidden unit per layer in BF16, i.e. about 34 * s * b * h bytes per layer, before sequence-quadratic attention terms). For a 70B model with L=80, h=8192, s=8192, b=4, that is around 700 GB of activations - far more than the 140 GB of BF16 weights. Activations are the constraint, not parameters.
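The arithmetic behind those two figures can be checked in a few lines. This is a rough sketch, not a profiler: the function names are made up for illustration, the 34-byte constant is the approximate figure quoted above, and the sequence-quadratic attention terms are left out.

```python
def activation_bytes(num_layers, seq_len, batch, hidden, bytes_per_unit=34):
    """Linear activation term: ~34 * s * b * h bytes per layer, times L layers."""
    return bytes_per_unit * seq_len * batch * hidden * num_layers


def weight_bytes(num_params, bytes_per_param=2):
    """BF16 weights take 2 bytes per parameter."""
    return num_params * bytes_per_param


# The 70B example from the text: L=80, h=8192, s=8192, b=4.
act = activation_bytes(num_layers=80, seq_len=8192, batch=4, hidden=8192)
wts = weight_bytes(num_params=70_000_000_000)

print(f"activations ~ {act / 1e9:.0f} GB")
print(f"weights     ~ {wts / 1e9:.0f} GB")
print(f"ratio       ~ {act / wts:.1f}x")
```

The activation estimate lands a little above 700 GB, roughly five times the weight footprint, which is the gap the rest of this section is about closing.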

Checkpointing: save 50-80% memory at ~30% compute cost

The 2016 sublinear-memory paper made the idea explicit: pick a subset of layers (checkpoints) whose activations you save. For the layers between checkpoints, discard the activations during forward. On the backward pass, when you need those activations, recompute them from the nearest saved checkpoint.

without checkpointing:
    forward:  save activations for every layer
    backward: read saved activations
    memory:   O(L)   compute: 1x forward + 1x backward

with checkpointing every k layers:
    forward:  save activations only at checkpoint layers
    backward: recompute the k-1 intermediate layers, then backward through them
    memory:   O(L/k + k)   compute: 1x forward + 1x recompute-forward + 1x backward

Checkpointing every layer (most aggressive) keeps only each layer's input and gets at least a roughly 2x activation memory reduction at the cost of one extra forward pass (~33% more compute, since backward is roughly 2x forward). Selective checkpointing - recomputing only the pieces that are cheap to recompute but expensive to store - gets most of the saving for less of the overhead.
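The trade-off in the interval k can be made concrete with a toy cost model. This is a sketch under a loud simplifying assumption: every stored tensor, whether a saved checkpoint or a live activation inside the segment being recomputed, counts as one equal "unit". The function names are hypothetical. Under that assumption peak memory is about L/k + k, which is smallest near k = sqrt(L), the headline result of the sublinear-memory paper.

```python
import math


def peak_activation_units(num_layers, k):
    """Saved segment inputs plus the live activations of one recomputed segment."""
    checkpoints = math.ceil(num_layers / k)  # one saved tensor per segment
    live_segment = k                         # layers rebuilt during backward
    return checkpoints + live_segment


def best_interval(num_layers):
    """Smallest k that minimises the toy peak-memory estimate."""
    return min(range(1, num_layers + 1),
               key=lambda k: peak_activation_units(num_layers, k))


def recompute_overhead(backward_to_forward_ratio=2.0):
    """Extra compute from one additional forward pass, as a fraction of a step."""
    return 1.0 / (1.0 + backward_to_forward_ratio)


num_layers = 80
k = best_interval(num_layers)
print(f"best k = {k}, sqrt(L) = {math.sqrt(num_layers):.1f}")
print(f"peak units at best k: {peak_activation_units(num_layers, k)}")
print(f"extra compute: {recompute_overhead():.0%}")
```

Real layers are not equal-cost units, so treat the exact k as illustrative; the useful part is the shape of the curve, where both very small and very large intervals waste memory.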

PyTorch ships torch.utils.checkpoint.checkpoint (per-call) and checkpoint_sequential (for nn.Sequential blocks). FSDP and DeepSpeed both support it at the transformer-block level, though it usually has to be switched on explicitly.

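import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class Block(nn.Module):
    # self.attn and self.mlp are assumed to be created in __init__ (omitted here).

    def forward(self, x):
        # Drop this block's intermediates in forward; recompute them in backward.
        return checkpoint(self._forward, x, use_reentrant=False)

    def _forward(self, x):
        x = self.attn(x)
        x = self.mlp(x)
        return x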

Selective activation checkpointing

Not every operation has the same recompute cost. Attention's softmax(QK^T / sqrt(d)) is FLOP-cheap but memory-heavy (it materialises the full s x s attention matrix). The MLP intermediates are FLOP-heavy but smaller. Selective checkpointing keeps the tensors that are expensive to recompute and recomputes only the ones that are cheap to regenerate but costly to store.
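To see why the attention matrix is the obvious recompute target, compare one copy of the s x s scores with the whole linear term for a single layer. This is a back-of-the-envelope sketch: the head count of 64 is an assumption (it is not given above), the function names are illustrative, and only one BF16 copy of the scores is counted even though softmax outputs and dropout masks add more.

```python
def attention_score_bytes(batch, heads, seq_len, bytes_per_elem=2):
    """One BF16 copy of the s x s attention scores, for every head and sample."""
    return batch * heads * seq_len * seq_len * bytes_per_elem


def linear_activation_bytes(batch, seq_len, hidden, bytes_per_unit=34):
    """The ~34 * s * b * h linear activation term for one layer."""
    return bytes_per_unit * seq_len * batch * hidden


# Same shape as the 70B example; heads=64 is an assumed value.
scores = attention_score_bytes(batch=4, heads=64, seq_len=8192)
linear = linear_activation_bytes(batch=4, seq_len=8192, hidden=8192)

print(f"s x s scores per layer : {scores / 1e9:.1f} GB")
print(f"linear term per layer  : {linear / 1e9:.1f} GB")
print(f"scores / linear        : {scores / linear:.1f}x")
```

At this sequence length a single copy of the scores already outweighs everything else the layer stores, while regenerating it costs one batched matmul and a softmax.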

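Megatron-LM's selective recomputation policy keeps most of each layer's activations and recomputes only the attention core (the QK^T scores, the softmax and its dropout, and the weighted sum over V), whose activations are large but cheap to regenerate. The Megatron authors report roughly a 5x cut in activation memory when this is combined with sequence parallelism, at only ~5% throughput cost (rather than 30%+ for full checkpointing). FlashAttention does its own variant - the backward kernel recomputes the attention matrix on the fly, tile by tile in on-chip SRAM, so the s x s matrix never has to be materialised in HBM at all.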
