GRPO: Group Relative Policy Optimisation

Training DeepSeekMath 7B to 51.7% on the MATH competition benchmark without any external tools required something that standard PPO could not deliver cleanly: a policy-gradient update that fits comfortably on the same hardware used for the forward pass. The answer was Group Relative Policy Optimisation (GRPO), introduced by Shao et al. (2024) and subsequently scaled in DeepSeek-R1 to drive the emergent reasoning behaviours that surprised the research community in early 2025.

The core tension GRPO resolves is this: PPO needs a value network (the critic) to compute baselines for variance reduction. For a 7B-parameter policy, training an equally sized critic alongside it roughly doubles the memory spent on model and optimizer state. GRPO sidesteps this entirely by sampling a group of responses from the current policy for every prompt and computing relative advantages within that group. No separate network; no billion-parameter value head.

The PPO Baseline Problem

Recall the standard policy-gradient update. For a token \(t\) in output \(o\), the gradient scales with the advantage \(A_t = Q(s_t, a_t) - V(s_t)\). The \(V(s_t)\) term is a baseline that reduces variance without introducing bias. In PPO, a learned critic approximates \(V\). Training the critic well requires:

A full forward-and-backward pass through a large network.
A carefully calibrated value loss (often clipped separately).
Memory for the critic's parameters and its optimizer state.

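For language models, the "state" is the entire token prefix, so the critic is typically initialised from a copy of the policy and fine-tuned in lockstep. At 7B parameters, holding weights, gradients, and both Adam moment buffers in bf16 costs about 8 bytes per parameter, or roughly 56 GB extra just for the critic, before activations. Mixed-precision setups that keep fp32 master weights and optimizer states need more. The memory wall is real.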
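The 56 GB figure can be reproduced with back-of-the-envelope arithmetic. The sketch below assumes one copy each of weights and gradients plus two Adam moment buffers, all stored at 2 bytes per value; the function name and its parameters are illustrative, not taken from any library.

```python
def critic_state_gb(n_params: float, bytes_per_value: int = 2, adam_moments: int = 2) -> float:
    """Rough memory for weights + gradients + Adam moments, ignoring activations."""
    # Assumption: weights, gradients, and each Adam moment are one full-size buffer.
    copies = 1 + 1 + adam_moments
    return n_params * bytes_per_value * copies / 1e9


print(critic_state_gb(7e9))  # 56.0
```

Changing `bytes_per_value` to 4 gives a rough all-fp32 figure for comparison.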

How GRPO Works

For each training prompt \(q\), GRPO samples a group of \(G\) outputs \(\{o_1, o_2, \ldots, o_G\}\) from the current (old) policy \(\pi_{\theta_{\text{old}}}\). A reward model (or rule-based verifier) scores each: \(\{r_1, r_2, \ldots, r_G\}\).

Advantage computation. Under outcome supervision, the baseline is simply the group mean. The advantage assigned to every token in output \(o_i\) is:

\[\hat{A}_{i} = \frac{r_i - \text{mean}(\mathbf{r})}{\text{std}(\mathbf{r})}\]

This is group normalisation applied to scalar rewards. No critic needed; the group itself supplies the counterfactual signal ("was this response better or worse than what else the policy would have produced?").
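A minimal sketch of the outcome-supervision advantage, in plain Python. Two details are assumptions rather than something the text specifies: the population standard deviation is used, and a small `eps` is added to the denominator so that a group with identical rewards yields zero advantages instead of a division by zero.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-8):
    """Normalise scalar rewards within one group: (r_i - mean) / std."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std over the G samples
    # eps is an assumed safeguard for groups where every reward is the same.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled answers to one prompt, scored 1 (correct) or 0 (incorrect).
print([round(a, 3) for a in group_advantages([1.0, 0.0, 0.0, 1.0])])
# [1.0, -1.0, -1.0, 1.0]
```

Every token of output \(o_i\) would then receive the same scalar \(\hat{A}_i\).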

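Under process supervision (step-level rewards \(r_i^{(j)}\) for each reasoning step \(j\)), the step rewards are normalised across the whole group in the same way, and the advantage at token \(t\) is the sum of the normalised rewards of every step that ends at or after \(t\). This gives finer credit assignment within a chain-of-thought.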
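One way to sketch the process-supervision case: normalise all step rewards across the group, then give each token the sum of the normalised rewards of the steps that end at or after it. The data layout (`step_rewards`, `step_ends`, `lengths`), the population standard deviation, and `eps` are assumptions made for illustration.

```python
from statistics import mean, pstdev


def process_advantages(step_rewards, step_ends, lengths, eps=1e-8):
    """Per-token advantages from step-level rewards.

    step_rewards[i][j]: reward of step j in output i
    step_ends[i][j]:    0-based index of the last token of that step
    lengths[i]:         number of tokens in output i
    """
    flat = [r for rewards in step_rewards for r in rewards]
    mu, sigma = mean(flat), pstdev(flat)  # statistics over every step in the group
    result = []
    for rewards, ends, n_tokens in zip(step_rewards, step_ends, lengths):
        norm = [(r - mu) / (sigma + eps) for r in rewards]
        # Token t accumulates the normalised rewards of steps ending at or after t.
        result.append(
            [sum(nr for nr, end in zip(norm, ends) if end >= t) for t in range(n_tokens)]
        )
    return result


# Two outputs of four tokens each, two steps per output.
adv = process_advantages([[1.0, 0.0], [0.0, 1.0]], [[1, 3], [1, 3]], [4, 4])
print([[round(a, 3) for a in row] for row in adv])
```

Tokens inside an early step see the rewards of all later steps, while tokens in the final step see only its own reward.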

The objective. GRPO maximises:

\[\mathcal{J}(\theta) = \frac{1}{G} \sum_{i=1}^{G} \frac{1}{|o_i|} \sum_{t=1}^{|o_i|} \left[ \min\!\left(\rho_{i,t}\, \hat{A}_{i},\; \text{clip}(\rho_{i,t}, 1-\varepsilon, 1+\varepsilon)\, \hat{A}_{i}\right) - \beta\, \mathbb{D}_{\mathrm{KL}}\!\left[\pi_\theta \,\|\, \pi_{\text{ref}}\right] \right]\]

where \(\rho_{i,t} = \pi_\theta(a_t | s_t) / \pi_{\theta_{\text{old}}}(a_t | s_t)\) is the importance-sampling ratio at token \(t\) of output \(o_i\) (the same ratio as in PPO-clip), \(\varepsilon\) is the clipping range, and \(\beta\) is the KL penalty coefficient.
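The clipped part of the objective can be sketched in a few lines of plain Python operating on log-probabilities. The KL term is left out here, the default `eps=0.2` is an assumed value rather than one stated in the text, and the function names are illustrative.

```python
import math


def clipped_term(logp_new, logp_old, advantage, eps=0.2):
    """min(rho * A, clip(rho, 1 - eps, 1 + eps) * A) for a single token."""
    ratio = math.exp(logp_new - logp_old)  # importance-sampling ratio rho
    clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
    return min(ratio * advantage, clipped * advantage)


def grpo_surrogate(logp_new, logp_old, advantages, eps=0.2):
    """Average over tokens within each output, then over the G outputs.

    logp_new[i][t], logp_old[i][t]: token log-probs; advantages[i]: scalar per output.
    """
    per_output = []
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        terms = [clipped_term(a, b, adv, eps) for a, b in zip(lp_new, lp_old)]
        per_output.append(sum(terms) / len(terms))  # the 1/|o_i| factor
    return sum(per_output) / len(per_output)  # the 1/G factor


# A ratio of 2 with a positive advantage is capped at 1 + eps.
print(clipped_term(math.log(2.0), 0.0, 1.0))
```

With a negative advantage the `min` keeps the unclipped term when the ratio grows, so the penalty for making a bad response more likely is not capped.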

KL treatment. In standard RLHF with PPO, the KL penalty is added token-by-token to the reward before computing advantages, which entangles the regulariser with the advantage signal. GRPO instead subtracts the KL term directly in the objective, keeping advantages clean. The KL is estimated per sampled token with the unbiased estimator \(\mathbb{D}_{\mathrm{KL}}[\pi_\theta \| \pi_{\text{ref}}] \approx \frac{\pi_{\text{ref}}}{\pi_\theta} - \log\frac{\pi_{\text{ref}}}{\pi_\theta} - 1\), which, unlike the plain log-ratio, is never negative. The reference model's log-probabilities for the sampled tokens are still required; what changes is where the penalty enters, not whether \(\pi_{\text{ref}}\) is evaluated.
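As a sanity check on the KL estimate, the sketch below implements the per-token form used in Shao et al. (2024), \(r - \log r - 1\) with \(r = \pi_{\text{ref}} / \pi_\theta\), and compares its exact expectation under \(\pi_\theta\) with the true KL on a toy three-token distribution. The function name and the toy probabilities are made up for illustration.

```python
import math


def kl_estimate(logp_theta, logp_ref):
    """Per-token estimate r - log(r) - 1 with r = pi_ref / pi_theta; always >= 0."""
    log_r = logp_ref - logp_theta
    return math.exp(log_r) - log_r - 1.0


# Toy next-token distributions (illustrative numbers only).
pi_theta = [0.5, 0.3, 0.2]
pi_ref = [0.4, 0.4, 0.2]

true_kl = sum(p * math.log(p / q) for p, q in zip(pi_theta, pi_ref))
# Expectation of the estimator when tokens are drawn from pi_theta.
expected = sum(
    p * kl_estimate(math.log(p), math.log(q)) for p, q in zip(pi_theta, pi_ref)
)
print(abs(true_kl - expected) < 1e-12)  # True
```

The two quantities agree because the \(r - 1\) part has zero mean under \(\pi_\theta\), while each individual sample stays non-negative.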
