Power, Thermals, and Clock Throttling

GPU accelerators operate under hard power and thermal budgets that silently reduce clock speeds mid-workload, making sustained throughput lower than peak spec sheets advertise.

intermediate · 8 min read

An NVIDIA H100 SXM5 is rated at up to 700 W. A single training node with eight of them can draw up to 5.6 kW for the GPUs alone - more than a typical household circuit can supply. That figure is not a curiosity; it is a hard constraint that shapes every scheduling, cooling, and performance decision in a data centre. If you have ever profiled a GPU and been puzzled because measured throughput fell 15-30% below the roofline estimate for a clearly compute-bound kernel, power and thermal throttling are often the explanation, and they are invisible in naive profiling.

The Physics: Where the Watts Go

Every transistor switching state dissipates energy. Dynamic power scales as:

P_dynamic ≈ α · C · V² · f

where α is the switching activity factor, C the total switched capacitance, V the supply voltage, and f the clock frequency. Two things matter here. First, power scales with the square of voltage. Second, frequency and voltage are coupled: to run stably at a higher frequency you must raise voltage. A 10% frequency increase may require a 5% voltage increase; the V² term alone then adds roughly 10% (1.05² ≈ 1.10), and combined with the 10% from f the total dynamic power rises by roughly 21% - a modest frequency gain becomes a significant power increase.
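The coupling is easy to check numerically. The following is a minimal sketch that evaluates the ratio of new to old dynamic power from the formula above, assuming α and C stay fixed; the function name and the example scaling factors are illustrative, not taken from any vendor tool.

```python
def relative_dynamic_power(freq_scale: float, volt_scale: float) -> float:
    """Ratio P_new / P_old from P ≈ α·C·V²·f, holding α and C fixed."""
    return (volt_scale ** 2) * freq_scale


# Example: +10% frequency that needs +5% voltage to remain stable.
ratio = relative_dynamic_power(freq_scale=1.10, volt_scale=1.05)
print(f"power ratio: {ratio:.3f}")
```

Under these assumptions the ratio comes out at about 1.21, i.e. roughly 21% more dynamic power for a 10% clock gain.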

Modern GPUs also dissipate static (leakage) power that grows with temperature and transistor count, though it is typically a smaller fraction than dynamic power under full compute loads.

The thermal consequence is straightforward: power dissipated inside the chip must exit as heat through the package, TIM (thermal interface material), heatsink, and ultimately into ambient air or coolant. The junction temperature - the temperature measured at the hottest point on the die - is governed by:

T_junction = T_ambient + P_total · θ_ja

where θ_ja is the junction-to-ambient thermal resistance. Modern data centre GPUs target a maximum junction temperature of around 83-87°C. Exceeding it risks gate-oxide degradation and, in extremes, permanent damage.
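As a rough illustration of how this relation bounds power, the sketch below evaluates it in both directions: junction temperature for a given power, and the largest steady-state power for a given temperature ceiling. The θ_ja value is a hypothetical number chosen only to make the arithmetic readable; real values depend on the package, heatsink, and cooling loop.

```python
def junction_temp_c(ambient_c: float, power_w: float, theta_ja: float) -> float:
    """T_junction = T_ambient + P_total · θ_ja (θ_ja in °C per watt)."""
    return ambient_c + power_w * theta_ja


def max_power_w(ambient_c: float, t_max_c: float, theta_ja: float) -> float:
    """Largest steady-state power that keeps the junction at or below t_max_c."""
    return (t_max_c - ambient_c) / theta_ja


THETA_JA = 0.08  # °C/W, hypothetical value for illustration only

print(junction_temp_c(ambient_c=30.0, power_w=700.0, theta_ja=THETA_JA))
print(max_power_w(ambient_c=35.0, t_max_c=87.0, theta_ja=THETA_JA))
```

With these assumed numbers, 700 W at 30°C ambient lands at roughly 86°C, and raising ambient to 35°C leaves room for only about 650 W before an 87°C ceiling is hit. This is why inlet temperature matters as much as the GPU's own power setting.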

Boost Clocks, TDP Envelopes, and the Frequency-Power Trade-off

GPU vendors publish two clock figures: a base clock and a boost clock. The base clock is the guaranteed sustained frequency within the TDP envelope. The boost clock is the opportunistic maximum reached when thermal headroom and instantaneous power allow it.

In practice, a GPU entering a long training run will:

  1. Start at or above boost clock for the first few seconds while the die is still cool.
  2. Step down frequency in discrete increments, under control of the power management unit (PMU), as junction temperature climbs toward the thermal limit or power draw reaches the rated TDP.
  3. Settle at a sustained clock - somewhere between base and boost - where power dissipated and cooling capacity are in equilibrium.
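The three steps above can be mimicked with a toy model. This sketch is not how any real PMU firmware works; it assumes voltage tracks frequency so that power scales with f³, uses the steady-state thermal relation from the previous section, and steps down in a hypothetical 15 MHz increment. All numeric inputs in the example call are illustrative.

```python
def sustained_clock_mhz(boost_mhz, base_mhz, power_at_boost_w, tdp_w,
                        ambient_c, t_max_c, theta_ja, step_mhz=15):
    """Step down from boost until power and temperature are both within limits.

    Simplifying assumptions: power ∝ f³ (voltage tracks frequency) and
    T_junction = T_ambient + P · θ_ja at steady state.
    """
    clock = boost_mhz
    while clock > base_mhz:
        power = power_at_boost_w * (clock / boost_mhz) ** 3
        temp = ambient_c + power * theta_ja
        if power <= tdp_w and temp <= t_max_c:
            return clock
        clock -= step_mhz
    # Never satisfied above base: fall back to the guaranteed base clock.
    return base_mhz


clock = sustained_clock_mhz(
    boost_mhz=1980, base_mhz=1590,      # illustrative clock range
    power_at_boost_w=800.0, tdp_w=700.0,
    ambient_c=30.0, t_max_c=87.0, theta_ja=0.08,
)
print(clock)
```

In this example the power limit binds before the thermal limit does, and the model settles about 4.5% below boost, which is in the range the text describes for well-cooled data centre parts.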

The gap between boost and sustained clock can easily be 10-20% on air-cooled consumer GPUs and 5-10% on data centre GPUs with liquid cooling. A 10% clock reduction on a compute-bound kernel directly translates to 10% lower FLOP/s throughput. This is why benchmark comparisons that measure only peak (burst) performance mislead.

On NVIDIA GPUs, the nvmlDeviceGetCurrentClocksThrottleReasons() function - part of the NVML API - returns a bitmask enumerating the active throttle causes. The flags include:

| Flag name | Meaning |
|---|---|
| hwThermal | Junction temperature at limit |
| swPowerCap | Software power limit reached |
| hwSlowdown | Hardware slowdown due to thermal/power |
| syncBoost | Clocks synchronised across GPUs in a sync-boost group |
| displayClockSetting | Clock held for display output stability |

Running nvidia-smi -q -d PERFORMANCE exposes the same reasons at the command line. The swPowerCap flag is the most actionable: it indicates that the configured power limit, rather than temperature, is the binding constraint, and that limit is something an operator can change.
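For scripts that log the raw bitmask, a small decoder helps. The sketch below is plain Python with no NVML dependency; the bit values are the nvmlClocksThrottleReason* constants as I understand them from nvml.h, and should be verified against the header shipped with your driver. The function and dictionary names are hypothetical.

```python
# Bit values for the nvmlClocksThrottleReason* constants (assumed from nvml.h;
# verify against your installed header before relying on them).
THROTTLE_REASONS = {
    0x001: "GpuIdle",
    0x002: "ApplicationsClocksSetting",
    0x004: "SwPowerCap",
    0x008: "HwSlowdown",
    0x010: "SyncBoost",
    0x020: "SwThermalSlowdown",
    0x040: "HwThermalSlowdown",
    0x080: "HwPowerBrakeSlowdown",
    0x100: "DisplayClockSetting",
}


def decode_throttle_reasons(mask: int) -> list[str]:
    """Return the names of all throttle reasons whose bit is set in mask."""
    return [name for bit, name in THROTTLE_REASONS.items() if mask & bit]


# A mask with both the software power cap and hardware thermal bits set.
print(decode_throttle_reasons(0x004 | 0x040))
```

In a real monitor the mask would come from nvmlDeviceGetCurrentClocksThrottleReasons(); keeping the decoder separate makes it testable on a machine without a GPU.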

Power Capping as a Lever

Data centre operators routinely set a power cap below TDP via nvidia-smi --power-limit=<watts>. This has nuanced effects. Because the V²·f relationship is nonlinear, reducing the power budget by 20% does not uniformly cost 20% performance. Workloads with high arithmetic intensity (compute-bound, e.g., large GEMMs) are sensitive to clock frequency, and their throughput scales close to linearly with it. Workloads with low arithmetic intensity (memory-bound, e.g., small-batch inference) are usually limited by memory bandwidth, which depends only weakly on core clock, so their throughput barely changes under modest power caps.

Empirically, research benchmarking H100 and H200 GPUs under varying power caps found that "no universal optimal power cap exists, as the efficiency peak varies across application types and GPU architectures" (Mayr et al., arXiv 2603.16164, 2026). For LLM pre-training (large matrix multiplications, compute-bound), running at 80% of maximum TDP can achieve 90-95% of peak throughput while cutting energy draw measurably. For low-batch inference (memory-bound), even aggressive power caps cause negligible throughput loss, yielding large efficiency gains.
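The efficiency argument reduces to one ratio. This minimal sketch computes energy per unit of work relative to an uncapped run, assuming the GPU draws power at its cap in both cases; the function name is illustrative and the inputs are the figures quoted above, not new measurements.

```python
def energy_per_unit_work(power_frac: float, throughput_frac: float) -> float:
    """Energy per unit of work relative to the uncapped run (1.0 = unchanged).

    Energy = power × time, and time per unit of work = 1 / throughput.
    """
    return power_frac / throughput_frac


for throughput in (0.90, 0.95):
    rel = energy_per_unit_work(power_frac=0.80, throughput_frac=throughput)
    print(f"throughput {throughput:.0%}: relative energy {rel:.3f}")
```

Under that assumption, a cap at 80% of TDP that retains 90-95% of throughput works out to roughly 11-16% less energy per unit of work.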

The practical implication: if you are running heterogeneous workloads on a shared cluster, blanket power caps are blunt instruments. Per-job or per-GPU caps tuned to the workload's roofline position are far more effective.
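One way to picture a roofline-aware policy is sketched below. It is a deliberately crude, hypothetical rule: compare a job's arithmetic intensity with the device's ridge point and choose one of two cap fractions. The cap fractions, the function names, and the peak and bandwidth figures in the example are all assumptions for illustration; a real policy would be tuned from measurements.

```python
def ridge_point(peak_flops: float, mem_bandwidth_bytes: float) -> float:
    """Arithmetic intensity (FLOP/byte) above which a kernel is compute-bound."""
    return peak_flops / mem_bandwidth_bytes


def suggest_power_cap_w(arith_intensity: float, peak_flops: float,
                        mem_bandwidth_bytes: float, tdp_w: float,
                        compute_bound_frac: float = 0.8,
                        memory_bound_frac: float = 0.6) -> float:
    """Pick a cap from two hypothetical fractions based on roofline position."""
    if arith_intensity >= ridge_point(peak_flops, mem_bandwidth_bytes):
        return tdp_w * compute_bound_frac
    return tdp_w * memory_bound_frac


# Illustrative device: 1e15 FLOP/s peak, 3e12 B/s bandwidth, 700 W TDP.
print(suggest_power_cap_w(500.0, 1.0e15, 3.0e12, 700.0))  # compute-bound job
print(suggest_power_cap_w(20.0, 1.0e15, 3.0e12, 700.0))   # memory-bound job
```

The point of the sketch is the shape of the decision, not the numbers: memory-bound jobs tolerate a lower cap than compute-bound ones.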

Similarly, DVFS (dynamic voltage and frequency scaling) research - for instance DSO (Wang et al., arXiv 2407.13096, 2024) - shows that ML-guided DVFS can raise energy efficiency by ~19% at less than 5% throughput loss by predicting the optimal operating point from kernel characteristics rather than reacting post-hoc.

Reading the Thermal Budget in Practice

When diagnosing a suspected throttle:

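# Real-time monitoring at 1 s intervals (dmon's -d takes whole seconds)
nvidia-smi dmon -s pucvmet -d 1

# Sub-second sampling (200 ms) via the query interface
nvidia-smi --query-gpu=clocks.sm,clocks.mem,power.draw,temperature.gpu --format=csv -lms 200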

# Query throttle reasons on GPU 0
nvidia-smi -i 0 -q -d PERFORMANCE

# Set a power limit (requires root / admin)
nvidia-smi -i 0 --power-limit=400

Key columns to watch in nvidia-smi dmon:
- pclk - processor (SM) clock in MHz
- mclk - memory clock in MHz
- pwr - power draw in watts
- gtemp - GPU temperature in °C

The sm and mem columns in the same output are utilisation percentages, not clocks.

A sustained drop in pclk coinciding with gtemp near 83°C and pwr near the configured limit is the characteristic fingerprint of thermal/power throttling. The memory clock (mclk) usually stays stable because it is governed by a separate power domain.
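The fingerprint described above can be turned into a simple check over logged samples. This is a sketch under stated assumptions: the Sample type, the thresholds, and the rule that either temperature or power being near its limit counts (since either can be the binding constraint) are illustrative choices, not part of any NVIDIA tool.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    sm_clock_mhz: float
    temp_c: float
    power_w: float


def looks_throttled(samples, boost_mhz, power_limit_w, temp_limit_c=83.0,
                    clock_drop_frac=0.05, margin_frac=0.03):
    """True if every sample shows a reduced SM clock while temperature or
    power sits within margin_frac of its limit."""
    if not samples:
        return False
    for s in samples:
        clock_low = s.sm_clock_mhz <= boost_mhz * (1 - clock_drop_frac)
        near_temp = s.temp_c >= temp_limit_c * (1 - margin_frac)
        near_power = s.power_w >= power_limit_w * (1 - margin_frac)
        if not (clock_low and (near_temp or near_power)):
            return False
    return True


hot = [Sample(1850, 82, 695), Sample(1845, 83, 698)]
cool = [Sample(1975, 60, 450), Sample(1980, 61, 455)]
print(looks_throttled(hot, boost_mhz=1980, power_limit_w=700))
print(looks_throttled(cool, boost_mhz=1980, power_limit_w=700))
```

Requiring the condition on every sample in the window is what distinguishes a sustained throttle from a momentary dip.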

For large multi-GPU training runs, also check whether sync-boost is active. NVLink topologies sometimes lock all GPUs in a node to the clock of the slowest one, meaning a single thermal outlier (a GPU with a blocked intake vent, for example) degrades the entire node.

When It Falls Down

Sustained small-batch inference. Throttling budgets are calibrated for sustained full-load. A GPU serving intermittent inference at 30% utilisation will stay cool and run at boost clock, making per-token latency look deceptively fast in testing. Under production load at 90% utilisation, thermals build and clocks drop, increasing latency non-linearly. Latency profiling should always be done at production utilisation, not idle or low-load.

Cooling asymmetry in multi-GPU nodes. GPUs in the same chassis can have 5-10°C temperature differences depending on airflow path. GPU 0 (often first in the airflow path) runs cooler than GPU 7. If you are comparing per-GPU throughput and see systematic variation, check temperatures before debugging software.

Transient workloads masking sustained throttle. Short benchmarks (under 60 seconds) may complete entirely within the boost window before thermals stabilise. Production training runs for hours; the "real" clock is the sustained frequency, not the initial burst. Always benchmark for at least 5-10 minutes under realistic load to capture the thermal steady state.
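To see how much a short benchmark can flatter the result, the sketch below compares the mean clock over a whole trace with the mean after discarding a warm-up window. The function name, the 60-second default, and the synthetic trace are assumptions for illustration; in practice the trace would come from logged clock samples.

```python
from statistics import mean


def sustained_clock(clocks_mhz, sample_period_s, warmup_s=60.0):
    """Mean clock after discarding the first warmup_s seconds of samples."""
    skip = int(warmup_s / sample_period_s)
    steady = clocks_mhz[skip:]
    if not steady:
        raise ValueError("trace is shorter than the warm-up window")
    return mean(steady)


# Synthetic 5-minute trace at 1 s sampling: 60 s at boost, then a lower plateau.
trace = [1980] * 60 + [1860] * 240
print(mean(trace))                    # flattered by the boost window
print(sustained_clock(trace, 1.0))    # steady-state figure
```

A benchmark that stopped at 60 seconds would have reported only the boost figure, so the function raises an error rather than returning a number from a trace that never leaves the warm-up window.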

Software power cap mismatches. After a node reboot or driver update, power caps can reset to default TDP. If a cluster operator previously tuned caps for efficiency and the reset goes unnoticed, the GPU may now run hotter and throttle more aggressively than expected, or conversely use more power than the rack's PDU budget allows. Automated monitoring of nvmlDeviceGetPowerManagementLimit() against expected values is good operational hygiene.
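A check of this kind can be as simple as the sketch below. It takes the expected caps in watts and the observed limits in milliwatts, which is the unit nvmlDeviceGetPowerManagementLimit() reports; how the observed values are collected is left out, and the function name and tolerance are hypothetical.

```python
def find_cap_mismatches(expected_w: dict[int, float],
                        observed_mw: dict[int, int],
                        tolerance_w: float = 1.0) -> dict[int, tuple[float, float]]:
    """Return {gpu_index: (expected_w, observed_w)} for caps that drifted."""
    mismatches = {}
    for gpu, want_w in expected_w.items():
        got_w = observed_mw[gpu] / 1000.0  # NVML reports milliwatts
        if abs(got_w - want_w) > tolerance_w:
            mismatches[gpu] = (want_w, got_w)
    return mismatches


# GPU 1 has silently reverted to a 700 W default after a reboot.
expected = {0: 400.0, 1: 400.0}
observed = {0: 400_000, 1: 700_000}
print(find_cap_mismatches(expected, observed))
```

Running a check like this from a node health probe after every boot catches resets before they show up as rack-level power alarms.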

Liquid cooling isn't a free lunch. Direct liquid cooling raises the effective thermal budget considerably (allowing higher sustained clocks), but it introduces new failure modes: coolant leaks, pump failures, and condensation. A pump failure in a liquid-cooled H100 cluster can go undetected for minutes if monitoring is absent; the GPUs' hardware slowdown and thermal shutdown protections limit the risk of permanent damage, but the job still suffers severe throttling or an abrupt shutdown.
