

GRPO: Group Relative Policy Optimisation

GRPO removes PPO's critic network by estimating the baseline from a group of outputs sampled for each prompt, roughly halving the training memory footprint while delivering competitive reasoning improvements.


Training DeepSeekMath 7B to 51.7% on the MATH competition benchmark without any external tools required something that standard PPO could not deliver cleanly: a policy-gradient update that fits comfortably on the same hardware used for the forward pass. The answer was Group Relative Policy Optimisation (GRPO), introduced by Shao et al. (2024) and subsequently scaled in DeepSeek-R1 to drive the emergent reasoning behaviours that surprised the research community in early 2025.

The core tension GRPO resolves is this: PPO needs a value network (the critic) to compute baselines for variance reduction. For a 7B-parameter policy, training an equally sized critic alongside it roughly doubles peak memory. GRPO sidesteps this entirely by sampling a group of responses from the current policy for every prompt and computing relative advantages within that group: each response is scored against the group's own average reward, which serves as the baseline. No separate network; no billion-parameter value head.
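
The group-relative step described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes one scalar reward per sampled response and the mean-and-standard-deviation normalisation described by Shao et al. (2024). The function name `group_relative_advantages`, the `eps` constant, the choice of population standard deviation, and the example rewards are all assumptions made here for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Score each reward against the mean and spread of its own group.

    `rewards` holds one scalar reward per response sampled for a single
    prompt. `eps` is a hypothetical guard against division by zero when
    every response in the group receives the same reward.
    """
    baseline = mean(rewards)   # the group mean stands in for the critic's value estimate
    spread = pstdev(rewards)   # population standard deviation (an assumption)
    return [(r - baseline) / (spread + eps) for r in rewards]


# Hypothetical rewards for four responses to one prompt (1.0 = correct answer).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Responses that score above their group's mean receive a positive advantage and those below receive a negative one, so the group itself plays the role that the critic plays in PPO. When every response in a group earns the same reward, all advantages are zero and that prompt contributes no gradient signal.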

