Applied LLMs
NVLink and Intra-Node Interconnect
NVLink is NVIDIA's proprietary GPU-to-GPU interconnect. It delivers up to 900 GB/s of aggregate bandwidth per GPU on H100, which removes PCIe as the bottleneck in multi-GPU training and makes all-reduce and tensor parallelism far cheaper.
intermediate · 8 min read
A PCIe 5.0 x16 link tops out at roughly 64 GB/s in each direction, or about 128 GB/s bidirectional. An H100's NVLink ports deliver 900 GB/s bidirectional. That sevenfold gap is not a marketing quirk; it determines whether you can run tensor parallelism across eight GPUs on a single node at all, or whether the inter-GPU traffic stalls the GPUs and collapses your hardware utilisation.
This concept unpacks how NVLink achieves those numbers, how NVSwitch turns a collection of point-to-point links into a full bisection-bandwidth fabric, and why the gap between intra-node and inter-node bandwidth shapes every multi-GPU parallelism strategy.
Why bandwidth between GPUs matters
Modern transformer training exposes two main multi-GPU communication patterns:
- Data parallelism / gradient all-reduce. Each GPU holds a full model replica and a shard of the batch. After the backward pass, gradients are summed across GPUs with an all-reduce. The volume scales linearly with parameter count: a 7 B parameter model in BF16 has ~14 GB of gradients to reduce per step.
- Tensor parallelism. A single transformer layer is split across GPUs. Every forward pass requires all-gather and reduce-scatter operations on activations, not just on gradients. Communication is on the critical path of every layer, not just at the end of a step.
For gradient all-reduce, the cost is roughly the data volume divided by the link bandwidth, plus a per-message latency term, and it is paid once per step. For tensor parallelism, the ratio of compute to communication volume in each layer determines whether GPUs stall. In both cases, more bandwidth converts fairly directly into higher GPU utilisation.
A useful proxy: a ring all-reduce over N GPUs sends (2(N-1)/N) * M bytes per GPU, where M is the gradient tensor size. Take eight A100s and a 70 B model in BF16 (~140 GB of gradients): each GPU sends about 245 GB. Over PCIe Gen 4 (~50 GB/s aggregate), that takes about 5 seconds. With NVLink 3 (600 GB/s), that same operation takes under half a second. Both figures are bandwidth-only lower bounds that ignore latency and protocol overhead.
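The proxy above is easy to turn into a small calculator. The following is a minimal sketch, not a performance model: the function names are made up for illustration, and the bandwidth values are the approximate figures quoted in the text. It ignores latency, protocol overhead, and any overlap with compute.

```python
def ring_allreduce_bytes_per_gpu(message_bytes: float, num_gpus: int) -> float:
    """Bytes each GPU sends in a ring all-reduce: 2(N-1)/N * M."""
    if num_gpus < 2:
        return 0.0
    return 2 * (num_gpus - 1) / num_gpus * message_bytes


def ring_allreduce_seconds(message_bytes: float, num_gpus: int,
                           bandwidth_bytes_per_s: float) -> float:
    """Bandwidth-only lower bound on the all-reduce time."""
    return ring_allreduce_bytes_per_gpu(message_bytes, num_gpus) / bandwidth_bytes_per_s


GB = 1e9
grad_bytes = 70e9 * 2  # 70 B parameters, 2 bytes each in BF16

# Bandwidth figures are the approximate ones used in the text.
for name, bandwidth in [("PCIe Gen 4", 50 * GB), ("NVLink 3", 600 * GB)]:
    seconds = ring_allreduce_seconds(grad_bytes, 8, bandwidth)
    print(f"{name}: {seconds:.2f} s")
```

Swapping in your own message size and a measured bandwidth gives a quick sanity check on whether a given interconnect can keep up with a training step.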
NVLink architecture: from link to fabric
NVLink is a differential serial interface. Each link is a collection of sub-links, each sub-link being a bundle of differential pairs. The key progression across generations:
| Generation | GPU | Links per GPU | Bandwidth per link (bidir.) | Total per GPU (bidir.) |
|---|---|---|---|---|
| NVLink 1.0 | P100 | 4 | 40 GB/s | 160 GB/s |
| NVLink 2.0 | V100 | 6 | 50 GB/s | 300 GB/s |
| NVLink 3.0 | A100 | 12 | 50 GB/s | 600 GB/s |
| NVLink 4.0 | H100 | 18 | 50 GB/s | 900 GB/s |
Each link on H100 carries 50 GB/s bidirectional (25 GB/s in each direction). With 18 links, the total is 900 GB/s. NVIDIA quotes this as "7x the bandwidth of PCIe Gen 5", which holds against a PCIe 5.0 x16 link at about 128 GB/s bidirectional.
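The totals in the table are simply links per GPU multiplied by per-link bandwidth. A short sketch that reproduces them, using only the figures from the table (the variable and function names are illustrative):

```python
# (generation, GPU, links per GPU, bidirectional GB/s per link), as in the table
NVLINK_GENERATIONS = [
    ("NVLink 1.0", "P100", 4, 40),
    ("NVLink 2.0", "V100", 6, 50),
    ("NVLink 3.0", "A100", 12, 50),
    ("NVLink 4.0", "H100", 18, 50),
]


def aggregate_bandwidth(links: int, per_link_gb_s: int) -> int:
    """Total bidirectional bandwidth per GPU in GB/s."""
    return links * per_link_gb_s


for generation, gpu, links, per_link in NVLINK_GENERATIONS:
    total = aggregate_bandwidth(links, per_link)
    # Half of the bidirectional figure is available in each direction.
    print(f"{generation} ({gpu}): {total} GB/s total, {total / 2:.0f} GB/s per direction")
```

Note that from NVLink 2.0 onward the per-link figure is constant in this table; the growth comes from the number of links.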
Physically, NVLink bypasses the PCIe bus entirely. The links connect directly to the GPU die (or in the case of NVSwitch-based systems, to the switch chip). This avoids contention at PCIe switches and the CPU root complex, along with PCIe's protocol overhead.
NVSwitch: full-mesh without the wiring complexity
In a two-GPU system, direct NVLink connections are straightforward. At eight GPUs, a naive all-pairs topology would need 28 point-to-point connections (one per GPU pair), with each GPU splitting its links across seven peers and a wiring harness that is impractical to build. NVSwitch solves this.
NVSwitch is a dedicated crossbar switch ASIC. Each switch exposes multiple NVLink ports. In DGX A100, six NVSwitch chips sit between the eight A100s, with each GPU connecting to each switch. The result is a non-blocking all-to-all fabric: any GPU can send to any other GPU at full link bandwidth without contention.
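To see why the switch matters, compare the bandwidth available between one pair of GPUs in a direct all-pairs mesh with the bandwidth available through the switch fabric. This is a rough sketch using the A100 figures from the table (12 links at 50 GB/s); the function names are hypothetical, and the even split of links across peers or switches is an assumption made for illustration.

```python
from math import comb


def direct_mesh(num_gpus: int, links_per_gpu: int, gb_s_per_link: int) -> dict:
    """All-pairs wiring: each GPU divides its links among its peers."""
    peers = num_gpus - 1
    links_per_peer = links_per_gpu // peers  # whole links only
    return {
        "pairs": comb(num_gpus, 2),
        "links_per_peer": links_per_peer,
        "pair_gb_s": links_per_peer * gb_s_per_link,
    }


def switched(links_per_gpu: int, gb_s_per_link: int, num_switches: int) -> dict:
    """Switch fabric: all of a GPU's links can carry traffic to any one peer."""
    return {
        "links_per_switch": links_per_gpu // num_switches,
        "pair_gb_s": links_per_gpu * gb_s_per_link,
    }


print(direct_mesh(num_gpus=8, links_per_gpu=12, gb_s_per_link=50))
print(switched(links_per_gpu=12, gb_s_per_link=50, num_switches=6))
```

With twelve links and seven peers, a direct mesh leaves only one whole link per pair, whereas the switched fabric lets a single pair use the GPU's full link budget when no one else is talking to it.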
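The effective topology from software's perspective is a fully connected graph. A ring all-reduce moves 2(N-1)/N times the message size per GPU in 2(N-1) sequential steps, so its latency grows with the number of GPUs. With the switch fabric, the same all-reduce can use a tree or a direct scatter-gather pattern with far fewer steps, and every GPU can drive all of its links at once, so achieved bandwidth can approach the link peak.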
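To make the ring baseline concrete, here is a small pure-Python simulation of a ring all-reduce (a reduce-scatter phase followed by an all-gather phase). It is a teaching sketch that runs on plain lists, not a description of how NCCL is implemented; the function name and the bookkeeping of sent elements are illustrative.

```python
def ring_allreduce(buffers):
    """Simulate a sum all-reduce over a ring; returns (results, elements_sent)."""
    n = len(buffers)
    size = len(buffers[0])
    if size % n != 0:
        raise ValueError("buffer length must be divisible by the number of ranks")
    chunk = size // n
    data = [list(b) for b in buffers]  # copy so the inputs stay untouched
    sent = [0] * n                     # elements sent by each rank

    def exchange(chunk_index_for, combine):
        # Snapshot all outgoing chunks first: every rank sends simultaneously.
        outgoing = []
        for rank in range(n):
            c = chunk_index_for(rank)
            outgoing.append((c, data[rank][c * chunk:(c + 1) * chunk]))
        # Deliver each chunk to the next rank on the ring.
        for rank, (c, payload) in enumerate(outgoing):
            dst = (rank + 1) % n
            for offset, value in enumerate(payload):
                i = c * chunk + offset
                data[dst][i] = combine(data[dst][i], value)
            sent[rank] += chunk

    # Phase 1, reduce-scatter: after n-1 steps rank r holds the full sum
    # of chunk (r + 1) % n.
    for step in range(n - 1):
        exchange(lambda r, s=step: (r - s) % n, lambda mine, theirs: mine + theirs)
    # Phase 2, all-gather: circulate the finished chunks, overwriting stale values.
    for step in range(n - 1):
        exchange(lambda r, s=step: (r + 1 - s) % n, lambda mine, theirs: theirs)
    return data, sent


inputs = [[rank * 10 + i for i in range(8)] for rank in range(4)]
results, sent = ring_allreduce(inputs)
expected = [sum(column) for column in zip(*inputs)]
print(all(result == expected for result in results), sent)
```

Each rank sends 2(N-1) chunks of M/N elements, which is where the 2(N-1)/N factor comes from. The 2(N-1) sequential steps are what a switch fabric lets collective libraries shorten.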
In DGX H100 systems, four third-generation NVSwitch chips (each with 64 NVLink 4.0 ports) are used per node, maintaining the non-blocking property with the higher aggregate bandwidth per GPU.
How collectives exploit the fabric
NCCL (NVIDIA Collective Communications Library) is the runtime layer that maps collective operations (all-reduce, all-gather, reduce-scatter, broadcast) onto the physical topology. When NCCL detects NVLink connections, it selects intra-node algorithms that maximise the fabric:
- Ring all-reduce can be replaced, depending on message size and topology, by tree or direct all-reduce patterns that use the NVSwitch crossbar directly, cutting latency and improving bandwidth utilisation.
- Tensor parallelism in frameworks like Megatron-LM issues explicit all-gather and reduce-scatter calls; NCCL routes these as P2P transfers through NVSwitch rather than staging through the CPU.
A rough mental model for latency: NVLink transfers have a latency on the order of microseconds across the switch fabric. PCIe transfers involve CPU-side DMA engines, driver overhead, and potential IOMMU translation, pushing latency to tens or hundreds of microseconds. For small tensors (attention heads in per-layer tensor parallelism), this difference dominates.
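A common way to reason about this is a latency-plus-bandwidth model, t = latency + size / bandwidth. The sketch below applies it to a small message. The latency values (2 µs and 20 µs) are placeholders picked from the rough ranges given above, not measurements, and 64 GB/s is the PCIe 5.0 figure used earlier in this article.

```python
def transfer_time_us(message_bytes: float, latency_us: float,
                     bandwidth_gb_per_s: float) -> float:
    """Latency-plus-bandwidth estimate of a single transfer, in microseconds."""
    return latency_us + message_bytes / (bandwidth_gb_per_s * 1e9) * 1e6


def latency_share(message_bytes: float, latency_us: float,
                  bandwidth_gb_per_s: float) -> float:
    """Fraction of the transfer time that is fixed latency."""
    return latency_us / transfer_time_us(message_bytes, latency_us, bandwidth_gb_per_s)


message = 64 * 1024  # a 64 KiB tensor, small enough for latency to matter

# Illustrative parameters only: (latency in microseconds, bandwidth in GB/s)
links = {"NVLink": (2.0, 900.0), "PCIe": (20.0, 64.0)}
for name, (latency, bandwidth) in links.items():
    total = transfer_time_us(message, latency, bandwidth)
    share = latency_share(message, latency, bandwidth)
    print(f"{name}: {total:.1f} us, {share:.0%} of it latency")
```

For messages this small, both links spend almost all of their time on fixed latency, so the latency term rather than peak bandwidth decides which interconnect is faster.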
The roofline view
Consider a reduce-scatter in a tensor-parallel forward pass. Each GPU must receive the partial results for its shard from all other GPUs and sum them. The arithmetic work per element is trivial (a sum). The bottleneck is therefore data movement, and specifically interconnect bandwidth.
If interconnect bandwidth B (GB/s) and the per-GPU data volume V (GB) are known, the minimum communication time is:
t_comm = V / B
With B = 900 GB/s (NVLink 4.0 aggregate) and V = 1 GB (a large intermediate activation, chosen for round numbers), t_comm ~ 1.1 ms. With B = 128 GB/s (PCIe 5.0 x16, bidirectional), t_comm ~ 7.8 ms. If the compute time for the corresponding matrix multiply is 2 ms, communication fits inside the compute window only with NVLink, so tensor parallelism is viable only there.
This is the roofline argument for intra-node interconnect: it determines whether communication time is a fraction of compute time (pipeline-overlappable) or a multiple of it (blocking).
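The same argument can be inverted to ask how much bandwidth is needed before communication stops dominating. A minimal sketch, using the symbols V and B from the formula above and the example values from the text (1 GB of data, 2 ms of compute); the function names are illustrative.

```python
def comm_time_ms(volume_gb: float, bandwidth_gb_per_s: float) -> float:
    """t_comm = V / B, expressed in milliseconds."""
    return volume_gb / bandwidth_gb_per_s * 1e3


def break_even_bandwidth_gb_per_s(volume_gb: float, compute_ms: float) -> float:
    """Bandwidth at which communication takes exactly as long as compute."""
    return volume_gb / (compute_ms / 1e3)


volume_gb = 1.0   # V from the example above
compute_ms = 2.0  # matrix-multiply time from the example above

print(f"NVLink 4.0: {comm_time_ms(volume_gb, 900.0):.1f} ms")
print(f"break-even: {break_even_bandwidth_gb_per_s(volume_gb, compute_ms):.0f} GB/s")
```

Under these assumptions, any link slower than the break-even bandwidth makes communication longer than the compute it is supposed to hide behind.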
When it falls down
Eight-GPU hard ceiling. In a standard DGX/HGX server, NVLink connects only the GPUs within a single node. Extend your job to two nodes and you cross InfiniBand (or RoCE), where bandwidth drops to 400 Gb/s (~50 GB/s) per link and latency jumps by an order of magnitude. Tensor parallelism across nodes becomes very expensive, which is why most production systems cap the tensor-parallel degree at 8 and use pipeline or data parallelism for the cross-node dimension.
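That rule of thumb can be written down as a small layout helper. This is a hypothetical sketch, not the API of any framework: it always sets the tensor-parallel degree to the node size, which is a simplification, since real deployments sometimes choose a smaller degree.

```python
def plan_parallelism(world_size: int, gpus_per_node: int = 8,
                     pipeline_stages: int = 1) -> dict:
    """Keep tensor parallelism inside one node; spread the rest across nodes."""
    tensor_parallel = min(world_size, gpus_per_node)
    model_parallel = tensor_parallel * pipeline_stages
    if world_size % model_parallel != 0:
        raise ValueError("world_size must be a multiple of tensor_parallel * pipeline_stages")
    return {
        "tensor_parallel": tensor_parallel,      # stays on NVLink
        "pipeline_parallel": pipeline_stages,    # crosses nodes
        "data_parallel": world_size // model_parallel,
    }


# Eight nodes of eight GPUs, with four pipeline stages.
print(plan_parallelism(world_size=64, gpus_per_node=8, pipeline_stages=4))
```

The point of the layout is that the most communication-heavy dimension stays on the fastest links, while the dimensions that tolerate lower bandwidth cross the node boundary.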
NVLink is not RDMA. Although NVLink physically bypasses PCIe, transfers still require the GPU driver and NCCL to coordinate buffers. Poorly written communication code (e.g., excessive synchronisation, non-contiguous memory layouts) can waste most of the available bandwidth. Profiling with Nsight Systems often reveals observed all-reduce bandwidth of only 40-60% of theoretical peak.
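To compare a measured all-reduce against the hardware peak, the timing is usually converted into a "bus bandwidth" figure, following the convention used by the nccl-tests benchmarks: algorithm bandwidth is message size divided by time, and bus bandwidth scales it by 2(N-1)/N. The sketch below applies that conversion. The measurement in the example is invented for illustration, and which peak to compare against (per direction or bidirectional) is a choice you need to make consistently.

```python
def allreduce_bus_bandwidth(message_bytes: float, seconds: float,
                            num_gpus: int) -> float:
    """Bus bandwidth in bytes/s for an all-reduce (nccl-tests convention)."""
    algorithm_bandwidth = message_bytes / seconds
    return algorithm_bandwidth * 2 * (num_gpus - 1) / num_gpus


def fraction_of_peak(bus_bandwidth: float, peak_bandwidth: float) -> float:
    """Share of the chosen peak figure that the collective achieved."""
    return bus_bandwidth / peak_bandwidth


# Hypothetical measurement: a 1 GB all-reduce across 8 GPUs taking 5 ms.
bus_bw = allreduce_bus_bandwidth(1e9, 5e-3, 8)
print(f"bus bandwidth: {bus_bw / 1e9:.0f} GB/s")
print(f"fraction of a 900 GB/s peak: {fraction_of_peak(bus_bw, 900e9):.0%}")
```

Tracking this fraction over time is a simple way to notice when a code change or a topology problem has cut into usable interconnect bandwidth.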
Thermal and power density. The NVLink links and NVSwitch chips on an HGX/DGX baseboard draw power and generate heat. High-bandwidth, sustained communication workloads stress the thermal envelope of the node differently than compute-bound workloads. Power-capped cloud instances may throttle the interconnect under sustained communication load.
NVSwitch is a single-point-of-failure topology. In a DGX node, a failed NVSwitch chip degrades or breaks all-to-all connectivity, and the entire node typically needs to be taken out of service. Unlike an InfiniBand fabric with multiple paths, there is no graceful degradation path.
Inference serving is often not interconnect-limited. For single-model inference with batch size 1, the bottleneck is usually the GPU's HBM bandwidth, not the inter-GPU interconnect. NVLink's advantage accrues mainly at training scale or at large-batch decode where tensor parallelism is genuinely necessary. Over-engineering an inference deployment around NVLink topology can add cost without benefit.