ZeRO and FSDP

Plain DDP wastes memory: every GPU holds a full copy of the parameters, gradients, and optimiser state. With Adam in mixed precision that is 16 bytes per parameter, replicated N times. The Zero Redundancy Optimiser (ZeRO) from DeepSpeed observed that the data-parallel group can shard each of those three buffers across ranks and pay only extra communication to materialise them on demand. Fully Sharded Data Parallel (FSDP) is PyTorch's native implementation of the same idea, now the recommended path for training models that do not fit on a single GPU.

The three ZeRO stages

Stage	Sharded	Memory per GPU (vs DDP)	Extra comms vs DDP
Baseline DDP	nothing	1.00x	1x AllReduce on grads
ZeRO-1	optimiser state	~0.25x for Adam	same
ZeRO-2	+ gradients	~0.13x	ReduceScatter replaces AllReduce
ZeRO-3	+ parameters	~`1/N` (asymptotic)	+ AllGather on params, forward and backward

The ratios assume Adam optimiser and mixed-precision training where optimiser state dominates. ZeRO-1 alone gets most of the memory saving for a fraction of the engineering cost; ZeRO-3 is what you reach for when even the parameters do not fit.

How FSDP implements ZeRO-3

FSDP groups parameters into "FSDP units" (typically each transformer block). For each unit, during the forward pass:

AllGather the sharded parameters from all ranks so the full unit's weights exist locally.
Compute forward for that unit.
Free the gathered parameters; keep only the local shard.

During backward, for the same unit (in reverse order):

AllGather the parameters again.
Compute backward, producing local gradients for the full unit.
ReduceScatter the gradients: each rank ends up with the summed gradient for its shard only.
Free the gathered parameters and the full gradients.

The optimiser step runs on the local shard with the local gradient and local optimiser state. No rank ever materialises the full model state.

The overlap that makes it tractable

A naive implementation would gather a unit, compute, free, gather the next unit, and so on - serialising comms and compute. FSDP overlaps them: while unit k is computing, the AllGather for unit k+1 is already in flight, and the ReduceScatter for unit k-1 is also overlapping. With well-tuned bucket sizes and a healthy NVLink / InfiniBand fabric you can hide most of the extra communication behind compute.

forward:    [gather 0][compute 0][free 0]
                      [gather 1][compute 1][free 1]
                                [gather 2][compute 2][free 2]
                                          ...

Communication that is not hidden becomes your bottleneck. Profile with nsys or PyTorch's profiler before adding more parallelism dimensions.

Why FSDP often beats tensor parallelism

Megatron-style tensor parallelism (TP) shards individual matrix multiplications across GPUs and synchronises with AllReduce inside every transformer layer. It works well within a single node where NVLink bandwidth is high, but degrades sharply over InfiniBand. FSDP's communication is at FSDP-unit granularity (whole layers), which packs into larger messages and tolerates slower interconnects much better.

The three ZeRO stages

How FSDP implements ZeRO-3

The overlap that makes it tractable

Why FSDP often beats tensor parallelism

Keep reading with Pro.