Applied LLMs
Collective Communication Primitives
The six core multi-GPU communication patterns (broadcast, reduce, all-reduce, all-gather, reduce-scatter, all-to-all) determine whether a distributed training job spends most of its time computing or waiting on the wire.
intermediate · 8 min read
Scaling a transformer from one GPU to 512 requires moving gradients, activations, and parameters across interconnects thousands of times per training step. Get the communication pattern wrong and you halve throughput; get it right and the network effectively disappears. Collective communication primitives are the small vocabulary from which all of that coordination is built.
The six primitives and what they move
Every distributed job is wired together using variants of six operations. The table below shows what each does in one sentence, using P ranks each holding a buffer of size N.
| Primitive | Input | Output | Traffic per rank |
|---|---|---|---|
| Broadcast | root holds N | all P ranks hold same N | N (receive only) |
| Reduce | P ranks each hold N | root holds element-wise sum (or max, etc.) | N send + N receive |
| All-Reduce | P ranks each hold N | all P ranks hold element-wise result | 2N(P-1)/P |
| All-Gather | P ranks each hold N | all P ranks hold P*N concatenation | N(P-1) |
| Reduce-Scatter | P ranks each hold P*N | each rank holds N-sized reduced chunk | N(P-1) |
| All-to-All | rank i holds P chunks for each other rank | each rank receives one chunk from every other rank | N(P-1) |
All-Reduce is the workhorse of data parallelism: after each backward pass, every rank holds its own gradient shard, and all-reduce synchronises them so every rank sees the identical gradient before the optimiser step. The formula 2N(P-1)/P is the theoretical minimum traffic per rank for a ring algorithm - each byte travels twice (once in reduce direction, once in broadcast direction) and the ring structure keeps this constant as P grows.
Reduce-Scatter followed by All-Gather is algebraically identical to All-Reduce, but splitting them gives ZeRO and Megatron-style optimisers a place to insert computation (the optimiser step itself) between the two halves. This is the basis for ZeRO-1 and ZeRO-2: scatter the gradients, update the shard you own, then gather the updated parameters.
All-Gather alone drives tensor parallelism's activation synchronisation. In Megatron-LM's column-parallel linear layer, each rank computes a column slice of the output; an all-gather reassembles the full activation before the next layer.
All-to-All is rarer in practice but central to Mixture-of-Experts (MoE) routing: tokens assigned to expert E on rank R must be physically shipped to rank R before the expert forward pass. One all-to-all sends them there; a second brings results back.
Ring all-reduce: the algorithm that changed distributed training
Before Horovod popularised the ring algorithm (2018), TensorFlow's parameter-server approach sent all gradients through a small set of PS nodes. With 64 GPUs the PS became a bottleneck; the ring approach eliminated it.
In a ring with P nodes, each node has a left neighbour and a right neighbour. The algorithm proceeds in two phases, each taking P-1 steps:
Phase 1 (reduce-scatter): Each node divides its buffer into P chunks. In step k, node i sends chunk (i-k) mod P to its right neighbour and receives chunk (i-k-1) mod P from its left. After P-1 steps, each node holds one fully-reduced chunk.
Phase 2 (all-gather): The roles reverse. Each node sends its fully-reduced chunk rightward. After P-1 more steps, every node has every chunk.
# pseudo-trace, 4 ranks, buffer split into 4 chunks [a,b,c,d]
# Phase 1, step 1:
# rank0 sends chunk a to rank1, receives chunk d from rank3
# rank1 sends chunk b to rank2, receives chunk a from rank0
# ...
# After 3 steps each rank owns one fully-summed chunk.
# Phase 2 mirrors this, broadcasting the summed chunks around.
Total data sent per rank: 2 * N * (P-1)/P, which converges to 2N as P grows. This is bandwidth-optimal: you cannot reduce the sum of N numbers across P nodes by moving fewer than 2N bytes per node in the worst case.
Topology matters: NVLink, InfiniBand, and the hierarchy
The ring algorithm assumes all links are symmetric. Real clusters are not.
Within a single 8-GPU server, NVIDIA NVLink provides ~600 GB/s bidirectional bandwidth between any pair of GPUs (NVLink 4.0, H100 SXM). Across servers, InfiniBand HDR provides roughly 400 Gb/s per link - about 40x narrower. A naive ring that routes a gradient across the inter-node link at every step squanders intra-node bandwidth.
NCCL (NVIDIA Collective Communications Library) solves this with a two-level hierarchy. It detects the topology at initialisation and builds a tree or ring that keeps as much traffic as possible inside the fast NVLink domain. Only the partial sums (one chunk per server) cross the slower InfiniBand fabric. This is called a hierarchical all-reduce.
The practical implication: profiling a job that feels communication-bound should begin with NCCL_DEBUG=INFO to see which algorithm NCCL selected and whether it detected the topology correctly. A misconfigured container (missing --privileged flag, wrong NUMA affinity) can cause NCCL to fall back to a slower tree or socket-based transport.
Overlap: hiding latency behind computation
Raw bandwidth is only half the story. A training step has a backward pass (compute) and a synchronisation phase (communicate). Running them sequentially wastes the network while the GPU computes, and wastes the GPU while the network transmits.
PyTorch DDP solves this with gradient bucketing. Gradients are accumulated into buckets as the backward pass proceeds layer by layer. As soon as a bucket is full, the all-reduce for that bucket launches asynchronously while the backward pass continues on earlier layers. The default bucket size is 25 MB; tuning it is one of the first levers when a job is network-bound.
import torch.distributed as dist
# DDP launches allreduce per-bucket automatically.
# To tune bucket size:
model = torch.nn.parallel.DistributedDataParallel(
model,
bucket_cap_mb=50, # increase if gradient allreduce is serialised
)
For tensor and pipeline parallelism, where compute and communication cannot be as easily interleaved, frameworks like Megatron-LM and DeepSpeed use explicit async_op=True handles and synchronise only when the result is needed.
When it falls down
Stragglers destroy ring efficiency. A ring all-reduce is as slow as its slowest link. One GPU doing extra memory allocation, a PCIe switch under thermal throttle, or a preempted process on a shared cluster can stall all 512 ranks. Solutions include timeout-based fault detection, elastic training (PyTorch Elastic), and stale-synchronous SGD, each of which trades consistency for resilience.
Small tensors kill bandwidth utilisation. A ring all-reduce over a 1 KB gradient tensor spends most of its time in kernel launch overhead and latency, not in data movement. DDP's bucketing helps, but frameworks that launch one all-reduce per parameter (older Horovod defaults) can waste 90% of available bandwidth on a fast interconnect. Always verify with torch.profiler that kernel launch time is not dominating.
NCCL and InfiniBand disagree on MTU. A mismatch between the InfiniBand network MTU and what NCCL negotiates can cause silent performance regressions rather than failures. Setting NCCL_IB_DISABLE=0 and NCCL_NET_GDR_LEVEL correctly for your fabric topology is cluster-specific and rarely documented well.
All-to-All scales quadratically in traffic volume. In an MoE model with expert parallelism across P ranks, the all-to-all moves O(batch * P) data. Doubling the expert-parallelism degree quadruples the all-to-all cost. Capacity-factor routing (dropping tokens to cap load imbalance) helps but introduces training instability at the decision boundary.
Reduce-Scatter + All-Gather pipelines amplify memory pressure. ZeRO-3 shards parameters, gradients, and optimiser state across ranks. Every forward pass triggers all-gathers to reconstruct parameters layer by layer; every backward triggers reduce-scatters. On very large models the working-set memory per GPU drops to near-minimum, but the number of communication operations per step multiplies, and any deadlock in the dependency graph stalls the entire job silently.
Further reading
- NVIDIA NCCL Collectives User Guide - canonical reference for all-reduce, reduce-scatter, all-gather, and all-to-all semantics with diagrams.
- PyTorch Distributed Communication Package (torch.distributed) - API reference covering collective operations, process groups, and backend selection (NCCL, Gloo, MPI).
- Horovod: fast and easy distributed deep learning in TensorFlow (arXiv 1802.05799) - Sergeev and Del Balso's paper that brought ring all-reduce to mainstream deep learning practice.
- ZeRO: Memory Optimizations Toward Training Trillion Parameter Models (arXiv 1910.02054) - Rajbhandari et al. on how reduce-scatter and all-gather compose to shard optimiser state, gradients, and parameters.