Why Token-Level RL Collapses: GSPO and Sequence-Level Importance Sampling

In the summer of 2025, the Qwen team described a failure mode that anyone scaling reinforcement learning on large language models eventually meets: training proceeds, rewards climb, and then the model collapses. Not degrades. Collapses, irreversibly, in a way that rolling back to an earlier checkpoint and retuning every hyperparameter does not fix (Zheng et al., 2025, Group Sequence Policy Optimization, arXiv:2507.18071). The culprit was not a bad reward model or a poorly chosen learning rate. It was the importance ratio sitting at the center of GRPO, the algorithm behind DeepSeek-R1 and most of the open reasoning models that followed.

The fix is almost embarrassingly small. Stop computing the importance ratio per token. Compute it once per sequence. That single change of granularity is the difference between Group Relative Policy Optimization (GRPO) and Group Sequence Policy Optimization (GSPO), and it is enough to stabilize the RL training of Qwen3's Mixture-of-Experts models without the scaffolding everyone had been bolting on to keep them upright.

Why this matters: Teams that adopted GRPO for large-scale RLHF or RLVR (reinforcement learning with verifiable rewards) inherited its token-level importance ratio, usually without examining whether it is a valid importance-sampling estimator. By GSPO's argument, it is not. The error is invisible on short responses and catastrophic on long ones, which is exactly the regime reasoning models live in.

TL;DR

GRPO applies a separate importance-sampling weight at every token position, but each weight is estimated from a single sample, so it never performs the distribution correction importance sampling promises. It injects variance instead.
That variance accumulates with response length and is amplified by clipping. On long chain-of-thought rollouts it can drive irreversible model collapse.
GSPO defines one importance ratio per sequence, based on the length-normalized sequence likelihood, and clips, rewards, and optimizes at the sequence level. The unit of optimization finally matches the unit of reward.
GSPO removed the need for "Routing Replay," a memory-and-communication-heavy workaround the Qwen team had used to keep MoE RL from diverging when experts re-route between gradient steps.
Counterintuitively, GSPO clips roughly two orders of magnitude more tokens than GRPO and still trains more efficiently, evidence that GRPO's surviving token gradients were noise as much as signal.
The sequence-level ratio is far more tolerant of training-versus-inference numerical mismatch, which opens the door to simpler, disaggregated RL infrastructure.

At a Glance

GSPO keeps the familiar group-relative skeleton of GRPO and changes only the quantity that gets clipped and optimized.

flowchart LR
  Q[Query x] --> R[Old policy samples<br/>group of G responses]
  R --> V[Verifier scores<br/>each response]
  V --> A[Group-relative<br/>advantage]
  R --> S[Sequence likelihood<br/>ratio s_i]
  A --> O[Sequence-level<br/>clip and optimize]
  S --> O
  O --> U[Updated policy]
  class Q,R blue
  class V,A purple
  class S purple
  class O teal
  class U emerald
  classDef blue fill:#1e40af,stroke:#3b82f6,stroke-width:1px,color:#fff
  classDef purple fill:#6d28d9,stroke:#a78bfa,stroke-width:1px,color:#fff
  classDef teal fill:#0e7490,stroke:#22d3ee,stroke-width:1px,color:#fff
  classDef emerald fill:#047857,stroke:#34d399,stroke-width:1px,color:#fff

The reward is granted to the whole response. GSPO's argument is that the off-policy correction should be granted to the whole response too.

[IMAGE: Side-by-side schematic of one response. Left panel labels a single importance ratio over the entire token span. Right panel labels one ratio per token. Annotate the right panel with "single sample per position, no averaging."]

Before Sequence-Level Optimization

The policy-gradient methods behind language-model RL have spent a decade shrinking the machinery around the policy gradient. The throughline is the steady removal of expensive auxiliary components.

timeline
  title From TRPO to sequence-level RL
  2015 : TRPO uses a trust region to bound policy updates
  2017 : PPO replaces the trust region with a clipped ratio
  2022 : InstructGPT scales PPO to RLHF with a value model
  2024 : GRPO drops the value model using group-relative advantage
  2025 : DeepSeek-R1 popularizes GRPO for reasoning at scale
  2025 : GSPO moves the importance ratio to the sequence level

Proximal Policy Optimization framed the modern recipe (Schulman et al., 2017, Proximal Policy Optimization Algorithms, arXiv:1707.06347). PPO samples responses from an old policy, then takes several gradient steps on minibatches of that data while a clipped importance ratio keeps each update inside a proximal region of the policy that generated the samples. The clip is what makes off-policy reuse safe. The cost is a value model, typically the same size as the policy, that must estimate per-token advantages and must itself remain reliable as responses grow longer and tasks grow harder.

GRPO removed the value model (Shao et al., 2024, DeepSeekMath, arXiv:2402.03300). Instead of learning a critic, it samples a group of responses to the same query, scores each with a verifier, and uses the group's normalized reward as the advantage. Every token in a response inherits the same scalar advantage. The approach was cheap, it worked, and DeepSeekMath 7B reached 51.7% on the competition-level MATH benchmark without tools or majority voting, a result that approached far larger frontier systems at the time. When DeepSeek-R1 made long-chain reasoning the headline capability (DeepSeek-AI, 2025, DeepSeek-R1, arXiv:2501.12948), GRPO became the default.

What GRPO kept from PPO, without re-examining it, was the token-level importance ratio. PPO needed per-token ratios because it had per-token advantages from a value model. GRPO has one advantage per sequence. It carried the per-token ratio anyway, and that mismatch is the seed of the instability.

How GSPO Actually Works

The objective that breaks

Start with PPO's clipped objective, written here without the KL term for clarity, with the per-token ratio \(w_t(\theta)\):

\[\mathcal{J}_{\text{PPO}}(\theta) = \mathbb{E}\left[\frac{1}{|y|}\sum_{t=1}^{|y|} \min\left(w_t(\theta)\widehat{A}_t,\ \text{clip}(w_t(\theta), 1-\varepsilon, 1+\varepsilon)\widehat{A}_t\right)\right]\]

with

\[w_t(\theta) = \frac{\pi_\theta(y_t \mid x, y_{<t})}{\pi_{\theta_\text{old}}(y_t \mid x, y_{<t})}.\]

GRPO inherits this form. It swaps the value-model advantage for a group-relative one, shared across all tokens of response \(i\):

\[\widehat{A}_i = \frac{r(x, y_i) - \text{mean}(\{r(x, y_j)\}_{j=1}^G)}{\text{std}(\{r(x, y_j)\}_{j=1}^G)}\]

and otherwise keeps the per-token ratio \(w_{i,t}(\theta)\) inside the same clipped minimum, summed over every token of every response in the group.
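To make the token-level form concrete, here is a minimal sketch of the clipped surrogate for a single response, assuming per-token log-probabilities are already available as plain lists. The function name `grpo_response_objective` and the symmetric `eps` default are illustrative choices for this sketch, not taken from any GRPO codebase.

```python
import math


def grpo_response_objective(logp_new, logp_old, advantage, eps=0.2):
    """Token-level clipped surrogate for one response, averaged over its tokens.

    logp_new / logp_old: per-token log-probabilities under the current and
    old policy (hypothetical inputs for this sketch).
    advantage: the single group-relative advantage shared by every token.
    """
    total = 0.0
    for ln, lo in zip(logp_new, logp_old):
        w = math.exp(ln - lo)                      # per-token ratio w_{i,t}
        w_clipped = min(max(w, 1 - eps), 1 + eps)  # clip(w, 1 - eps, 1 + eps)
        total += min(w * advantage, w_clipped * advantage)
    return total / len(logp_new)


# Two tokens: one unchanged, one whose likelihood rose by a factor of 1.9.
logp_old = [-1.0, -1.0]
logp_new = [-1.0, -1.0 + math.log(1.9)]
print(grpo_response_objective(logp_new, logp_old, advantage=1.0))
print(grpo_response_objective(logp_new, logp_old, advantage=-1.0))
```

With a positive advantage the second token saturates at the upper bound 1.2, so the average is about 1.1. With a negative advantage the same token is left unclipped at 1.9, so the average is about -1.45. Each token is judged against the clip on its own, which is the behavior the rest of this section examines.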

Why the token ratio is not importance sampling

Importance sampling has a precise job. To estimate the expectation of a function under a target distribution \(\pi_\text{tar}\) using samples drawn from a behavior distribution \(\pi_\text{beh}\), you reweight:

\[\mathbb{E}_{z \sim \pi_\text{tar}}[f(z)] = \mathbb{E}_{z \sim \pi_\text{beh}}\left[\frac{\pi_\text{tar}(z)}{\pi_\text{beh}(z)} f(z)\right].\]

The identity is exact, but the estimator is only useful when you average the reweighted quantity over many samples from \(\pi_\text{beh}\). The ratio corrects the distribution mismatch in expectation, across a population. A single reweighted sample corrects nothing.

This is the heart of the GSPO critique. At each token position \(t\), GRPO draws exactly one token \(y_{i,t}\) from the old policy's next-token distribution \(\pi_{\theta_\text{old}}(\cdot \mid x, y_{i,<t})\), then applies a one-sample importance weight there. There is no averaging over the position's distribution, so the weight does not perform distribution correction. It behaves as multiplicative noise on the gradient of that token's log-likelihood. The noise has nonzero variance at every one of potentially thousands of positions, it accumulates along the sequence, and the clipping operator, which is nonlinear, amplifies rather than tames it. The Qwen team observed that this can tip a large model into collapse from which training does not recover.
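The difference between a population-level correction and a one-sample weight is easy to see on a toy problem. The sketch below uses a made-up three-outcome distribution (all numbers are hypothetical) and compares an importance-sampling average over many draws with estimates built from a single draw each.

```python
import random


def is_estimate(f, p_tar, p_beh, n, rng):
    """Average of (p_tar / p_beh) * f over n draws from the behavior distribution."""
    outcomes = list(range(len(p_beh)))
    draws = rng.choices(outcomes, weights=p_beh, k=n)
    return sum(p_tar[z] / p_beh[z] * f[z] for z in draws) / n


# Hypothetical toy problem: three outcomes, a function f, two distributions.
f = [1.0, 0.0, -1.0]
p_beh = [0.5, 0.3, 0.2]   # distribution we can sample from
p_tar = [0.2, 0.3, 0.5]   # distribution we care about

exact = sum(p * v for p, v in zip(p_tar, f))  # true expectation under p_tar

rng = random.Random(0)
many = is_estimate(f, p_tar, p_beh, 100_000, rng)                    # population average
singles = [is_estimate(f, p_tar, p_beh, 1, rng) for _ in range(10)]  # one draw each

print("exact:", exact)
print("many-sample estimate:", many)
print("single-sample estimates:", singles)
```

The many-sample average lands close to the exact value. Each single-sample estimate can only take one of three values (0.4, 0, or -2.5), none of which is the target, so any one of them on its own says little about the expectation being estimated.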

The diagnosis points straight at the fix. The unit of the optimization objective should match the unit of the reward. The reward is a single scalar over the whole response. So the importance ratio should be a single scalar over the whole response.

The sequence-level ratio

GSPO defines the importance ratio from the sequence likelihood, length-normalized:

\[s_i(\theta) = \left(\frac{\pi_\theta(y_i \mid x)}{\pi_{\theta_\text{old}}(y_i \mid x)}\right)^{\frac{1}{|y_i|}} = \exp\left(\frac{1}{|y_i|}\sum_{t=1}^{|y_i|} \log\frac{\pi_\theta(y_{i,t} \mid x, y_{i,<t})}{\pi_{\theta_\text{old}}(y_{i,t} \mid x, y_{i,<t})}\right).\]

The objective then clips and optimizes at the sequence level:

\[\mathcal{J}_{\text{GSPO}}(\theta) = \mathbb{E}\left[\frac{1}{G}\sum_{i=1}^G \min\left(s_i(\theta)\widehat{A}_i,\ \text{clip}(s_i(\theta), 1-\varepsilon, 1+\varepsilon)\widehat{A}_i\right)\right].\]

Two design choices deserve attention. First, the ratio uses the full sequence likelihood \(\pi_\theta(y_i \mid x)\), which is a genuine probability of a genuine sample from the old policy, so the sequence-level weight carries the meaning the token-level weight lacked: it measures how far the sampled response has drifted from the current policy. Second, the exponent \(\frac{1}{|y_i|}\) matters more than it looks. Without length normalization, a likelihood change on a handful of tokens would swing the product wildly, and responses of different lengths would need different clipping ranges. Length normalization compresses \(s_i(\theta)\) into a stable numerical band so a single \(\varepsilon\) works across the batch.
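The definitions above translate almost line for line into code. The following is a minimal sketch, assuming per-token log-probabilities arrive as plain Python lists. The function names are illustrative, and the asymmetric `eps_low` / `eps_high` defaults are an assumption borrowed from the clipping range reported in the comparison later in this article, whereas the formula above uses a single \(\varepsilon\).

```python
import math


def sequence_ratio(logp_new, logp_old):
    """Length-normalized sequence importance ratio s_i(theta)."""
    n = len(logp_new)
    mean_log_ratio = sum(a - b for a, b in zip(logp_new, logp_old)) / n
    return math.exp(mean_log_ratio)


def gspo_group_objective(group, advantages, eps_low=3e-4, eps_high=4e-4):
    """Sequence-level clipped surrogate averaged over a group of responses.

    group: list of (logp_new, logp_old) pairs, one per response.
    advantages: one group-relative advantage per response.
    """
    total = 0.0
    for (logp_new, logp_old), adv in zip(group, advantages):
        s = sequence_ratio(logp_new, logp_old)
        s_clipped = min(max(s, 1 - eps_low), 1 + eps_high)
        total += min(s * adv, s_clipped * adv)
    return total / len(group)


# Every token twice as likely under the new policy: the normalized ratio is
# about 2 regardless of length, while the raw product would be 2 ** length.
lp_old = [-2.0] * 4
lp_new = [v + math.log(2.0) for v in lp_old]
print(sequence_ratio(lp_new, lp_old))
```

Working in log space and averaging before exponentiating is what keeps the computation numerically tame for responses thousands of tokens long.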

[IMAGE: Histogram of importance-ratio values for one batch, two overlaid distributions. Token ratios spread wide across (0, 2+), sequence ratios bunched tightly around 1.0. Annotate the GSPO clip band as a thin sliver near 1.]

What the gradient does differently

The cleanest way to see the change is in the gradient. GRPO's gradient weights each token's score function by that token's own ratio:

\[\nabla_\theta \mathcal{J}_{\text{GRPO}} \propto \mathbb{E}\left[\frac{1}{G}\sum_i \widehat{A}_i \cdot \frac{1}{|y_i|}\sum_t \frac{\pi_\theta(y_{i,t}\mid \cdot)}{\pi_{\theta_\text{old}}(y_{i,t}\mid \cdot)}\nabla_\theta \log \pi_\theta(y_{i,t}\mid \cdot)\right].\]

Those per-token weights range over \((0, 1+\varepsilon]\) for positive advantages and \([1-\varepsilon, +\infty)\) for negative ones. They are not close to uniform, and their irregularity is what compounds. GSPO's gradient applies one weight, \(s_i(\theta)\), to the entire response and lets every token's score function enter with equal footing:

\[\nabla_\theta \mathcal{J}_{\text{GSPO}} \propto \mathbb{E}\left[\frac{1}{G}\sum_i s_i(\theta)\,\widehat{A}_i \cdot \frac{1}{|y_i|}\sum_t \nabla_\theta \log \pi_\theta(y_{i,t}\mid \cdot)\right].\]

Equal weighting across tokens is the whole point. It removes the per-token multiplicative noise that GRPO's design introduces.
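A short derivation, assuming the unclipped branch of the objective, shows where each weighting comes from. Because \(\log s_i(\theta)\) is the average token log-ratio and the old-policy terms do not depend on \(\theta\):

\[\nabla_\theta s_i(\theta) = s_i(\theta)\,\nabla_\theta \log s_i(\theta) = s_i(\theta)\cdot\frac{1}{|y_i|}\sum_{t=1}^{|y_i|}\nabla_\theta \log \pi_\theta(y_{i,t}\mid x, y_{i,<t}).\]

The same log-derivative step applied to a single token ratio gives

\[\nabla_\theta w_{i,t}(\theta) = w_{i,t}(\theta)\,\nabla_\theta \log \pi_\theta(y_{i,t}\mid x, y_{i,<t}),\]

so the token-level objective attaches a different factor \(w_{i,t}(\theta)\) to every score function, while the sequence-level objective attaches the one factor \(s_i(\theta)\) to all of them.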

[IMAGE: Two annotated gradient equations stacked, GRPO above GSPO, with the per-token weight in the GRPO line highlighted in red and the single shared weight in the GSPO line highlighted in green. Caption: "where the variance enters."]

GSPO-token, when you really need tokens

Some settings, multi-turn RL in particular, want per-token advantages, for example to credit or penalize specific spans differently. GSPO offers a token-level variant that keeps the sequence-level ratio's stability. It defines

\[s_{i,t}(\theta) = \text{sg}[s_i(\theta)] \cdot \frac{\pi_\theta(y_{i,t}\mid \cdot)}{\text{sg}[\pi_\theta(y_{i,t}\mid \cdot)]},\]

where \(\text{sg}[\cdot]\) is the stop-gradient (PyTorch detach). The second factor has numerical value exactly 1, so \(s_{i,t}(\theta)\) equals \(s_i(\theta)\) in value, but its gradient flows through the per-token likelihood, letting the advantage vary by token. When all token advantages are set equal, GSPO-token is numerically identical to GSPO in objective, clipping condition, and gradient. It is the same algorithm with an extra knob, not a different one.
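The stop-gradient trick can be checked numerically without an autodiff library. The sketch below is a toy under loud assumptions: the "policy" has one scalar parameter, each token likelihood is a hypothetical sigmoid of it, and the stop-gradient is emulated by passing a frozen copy `theta_sg` that is held fixed while differentiating by finite differences. It checks the unclipped case with equal advantages across tokens.

```python
import math


def token_prob(theta, a, b):
    """Toy per-token likelihood: a sigmoid of one scalar parameter (hypothetical)."""
    return 1.0 / (1.0 + math.exp(-(a * theta + b)))


def seq_ratio(theta, theta_old, tokens):
    """Length-normalized sequence ratio s_i(theta) for the toy policy."""
    logs = [math.log(token_prob(theta, a, b)) - math.log(token_prob(theta_old, a, b))
            for a, b in tokens]
    return math.exp(sum(logs) / len(logs))


def gspo_token_ratio(theta, theta_sg, theta_old, tokens, t):
    """s_{i,t}: theta_sg stands in for sg[.], so it is not varied when differentiating."""
    a, b = tokens[t]
    frozen_seq = seq_ratio(theta_sg, theta_old, tokens)
    return frozen_seq * token_prob(theta, a, b) / token_prob(theta_sg, a, b)


def ddtheta(fn, theta, h=1e-6):
    """Central finite-difference derivative."""
    return (fn(theta + h) - fn(theta - h)) / (2 * h)


tokens = [(0.5, 0.1), (-1.2, 0.3), (2.0, -0.4)]  # made-up (a, b) per token
theta_old, theta = 0.0, 0.3

s_i = seq_ratio(theta, theta_old, tokens)
values = [gspo_token_ratio(theta, theta, theta_old, tokens, t)
          for t in range(len(tokens))]

# Gradient of the sequence ratio vs. the token-averaged gradient of s_{i,t}.
grad_gspo = ddtheta(lambda th: seq_ratio(th, theta_old, tokens), theta)
grad_token_avg = sum(
    ddtheta(lambda th, t=t: gspo_token_ratio(th, theta, theta_old, tokens, t), theta)
    for t in range(len(tokens))
) / len(tokens)

print(s_i, values)
print(grad_gspo, grad_token_avg)
```

Every \(s_{i,t}\) has the same value as \(s_i\), and the token-averaged gradient agrees with the gradient of \(s_i\) up to finite-difference error. In a framework such as PyTorch the frozen copy would be obtained with `detach()`, as the text notes.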

Seeing It in Motion

The instability is a property of the training loop, not a single step. Large rollout batches are split into minibatches, and several gradient updates happen before fresh rollouts arrive. That off-policy gap is real and unavoidable at scale, and it is exactly what the importance ratio is meant to correct.

sequenceDiagram
  participant Inf as Inference engine
  participant Buf as Rollout buffer
  participant Optm as Optimizer
  participant Pol as Policy
  Inf->>Buf: sample G responses per query
  Buf->>Optm: minibatch 1 (off-policy)
  Optm->>Pol: gradient step with ratio correction
  Buf->>Optm: minibatch 2 (more off-policy)
  Optm->>Pol: gradient step
  Note over Buf,Optm: 2 more minibatches per batch
  Pol->>Inf: refreshed weights for next rollout

Each successive minibatch is more off-policy than the last, so the ratio matters most exactly when the gap is widest. With a token-level ratio, that growing gap is metered through thousands of noisy per-token corrections. With a sequence-level ratio, it is metered once, cleanly, per response.

The MoE case sharpens the failure into something concrete. In a sparse model, different experts can activate for the same tokens after a gradient step, so the per-token likelihoods that enter GRPO's ratio can lurch between \(\pi_{\theta_\text{old}}\) and \(\pi_\theta\) for reasons that have nothing to do with the policy actually changing.

flowchart TD
  T[Same response,<br/>next gradient step] --> E{Experts re-route?}
  E -->|Dense model| D[Token likelihood<br/>stable]
  E -->|MoE, ~10% experts flip| M[Token likelihood<br/>jumps]
  M --> W[Token ratio<br/>fluctuates wildly]
  W --> C[GRPO needs Routing Replay<br/>to converge]
  D --> G[Sequence ratio<br/>stays smooth]
  M --> G
  G --> N[GSPO converges<br/>without workarounds]
  class T,E blue
  class D,G emerald
  class M,W rose
  class C amber
  class N emerald
  classDef blue fill:#1e40af,stroke:#3b82f6,stroke-width:1px,color:#fff
  classDef emerald fill:#047857,stroke:#34d399,stroke-width:1px,color:#fff
  classDef rose fill:#be123c,stroke:#fb7185,stroke-width:1px,color:#fff
  classDef amber fill:#b45309,stroke:#fbbf24,stroke-width:1px,color:#fff

For the 48-layer Qwen3-30B-A3B model, the team measured roughly 10% of activated experts changing between the old and new policy after each gradient update for the same rollout sample, with the effect growing in deeper models. That volatility is enough to invalidate the token ratios. GSPO sidesteps it because the sequence likelihood, unlike any single token's likelihood, barely moves when a tenth of the experts re-route, since the model's overall language-modeling behavior is preserved.

By the Numbers

The reported comparison uses a cold-start model fine-tuned from Qwen3-30B-A3B-Base, with each rollout batch split into four minibatches, evaluated on AIME'24 (average Pass@1 over 32 samples), LiveCodeBench, and CodeForces Elo (Zheng et al., 2025, arXiv:2507.18071).

Quantity	GRPO	GSPO	Note
Importance ratio unit	per token	per sequence	the core change
Clipping range (left, right)	0.2, 0.27	3e-4, 4e-4	different by orders of magnitude due to ratio definition
Fraction of tokens clipped	baseline	~100x higher	yet GSPO trains more efficiently
MoE convergence	needs Routing Replay	converges without it	removes memory and comms overhead
Expert flip per step (30B-A3B)	~10% destabilizes ratios	tolerated	sequence likelihood is stable
Reward at fixed compute	lower	higher	continuous gains with more compute

Two figures are worth dwelling on. The clipping ranges differ by roughly three orders of magnitude (0.2 versus 3e-4) precisely because the two ratios live on different scales: a length-normalized geometric mean of token ratios sits in a far tighter band than any individual token ratio, so a much smaller \(\varepsilon\) is appropriate. And the clipping-fraction result is the genuinely surprising one. GSPO discards about a hundred times more tokens from gradient estimation than GRPO does, uses fewer tokens for learning as a result, and still converges faster. The most direct reading is that GRPO's extra token gradients were contributing more noise than signal.

A caution on interpretation. These are the developers' own reported curves for a specific 30B MoE setup, not an independent reproduction across architectures. The mechanism generalizes cleanly in theory, but the exact magnitudes are setup-specific and should be read as such.

[IMAGE: Training-reward curves, GSPO versus GRPO, x-axis training compute, y-axis reward, with GSPO tracking above and continuing to rise while GRPO plateaus. Annotate the divergence point.]

[IMAGE: Log-scale plot of fraction of clipped tokens over training steps, two lines separated by about two orders of magnitude, GSPO on top. Caption the counterintuitive result.]

A Concrete Example

Walk a single group through both algorithms. Take a math query with \(G = 4\) sampled responses and binary verifier rewards.

Response	Reward \(r\)	Length	Correct?
\(y_1\)	1	512	yes
\(y_2\)	0	480	no
\(y_3\)	1	620	yes
\(y_4\)	0	540	no

The group mean reward is 0.5 and the standard deviation is 0.5, so the normalized advantages are \(\widehat{A}_1 = \widehat{A}_3 = +1\) and \(\widehat{A}_2 = \widehat{A}_4 = -1\). Both algorithms agree here; the advantage estimate is identical.
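The arithmetic can be reproduced in a few lines. This sketch assumes the population standard deviation (`statistics.pstdev`), which is what yields the 0.5 quoted above. With the sample standard deviation the advantages would come out near plus or minus 0.87 instead.

```python
from statistics import mean, pstdev


def group_advantages(rewards):
    """Group-relative advantages: (r - mean) / std within one group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    if sigma == 0:
        # All rewards equal: the group carries no learning signal.
        return [0.0 for _ in rewards]
    return [(r - mu) / sigma for r in rewards]


rewards = [1, 0, 1, 0]             # y_1 .. y_4 from the table
print(mean(rewards))               # 0.5
print(pstdev(rewards))             # 0.5
print(group_advantages(rewards))   # [1.0, -1.0, 1.0, -1.0]
```

The zero-variance guard is a choice made for this sketch. How a real pipeline treats groups whose responses all receive the same reward is a separate design decision.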

Now look inside the correct response \(y_1\), length 512, partway through training so the policy has drifted slightly off the old policy. Suppose 510 of its tokens have ratios \(w_{1,t}\) hovering near 1.00, ordinary and harmless. But two tokens, perhaps a digit the new policy now finds more likely and an operator it finds less likely, have ratios of 1.9 and 0.4.

Under GRPO, those two tokens are treated very differently from their 510 neighbors, against an advantage of \(+1\). The 1.9 token sits above the upper clip bound of roughly 1.2, so the clipped objective is flat there and the token contributes no gradient at all; the 0.4 token is not clipped when the advantage is positive, so it enters the gradient at 0.4 times its neighbors' weight. Multiply that irregularity across thousands of tokens and many gradient steps and the gradient develops a persistent, structured jitter that the clip at \(\pm 0.2\) only partly contains, since clipping a fraction of outliers reshapes the estimator rather than smoothing it.

Under GSPO, the response gets one ratio. Compute it as the length-normalized geometric mean. With 510 tokens at log-ratio near 0 and two tokens at \(\log 1.9 \approx 0.64\) and \(\log 0.4 \approx -0.92\), the average log-ratio is approximately \((0.64 - 0.92)/512 \approx -0.0005\), so

\[s_1(\theta) = \exp(-0.0005) \approx 0.9995.\]

The two outlier tokens that whipsawed the GRPO gradient are absorbed into a sequence ratio that sits essentially at 1. It lands a hair below the lower edge of the clip band reported above (\(1 - 3\times10^{-4} = 0.9997\)), but with a positive advantage only the upper bound binds, so the response is not clipped. The entire response is then weighted by 0.9995 and its advantage of \(+1\), with every token contributing its score function equally. The pathological per-token weighting simply has nowhere to act.
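The geometric-mean calculation for \(y_1\) can be checked directly. The sketch below idealizes the 510 ordinary tokens as having a ratio of exactly 1.0, which is an assumption made for the arithmetic, and also prints the unnormalized product for comparison.

```python
import math

# Idealized token ratios for y_1: 510 ordinary tokens plus the two outliers.
token_ratios = [1.0] * 510 + [1.9, 0.4]

mean_log_ratio = sum(math.log(w) for w in token_ratios) / len(token_ratios)
s_1 = math.exp(mean_log_ratio)           # length-normalized sequence ratio
raw_product = math.prod(token_ratios)    # sequence ratio without normalization

print(f"mean log-ratio: {mean_log_ratio:.6f}")
print(f"s_1:            {s_1:.6f}")
print(f"raw product:    {raw_product:.2f}")
```

The raw likelihood ratio of the whole response is about 0.76, while the length-normalized value is about 0.9995. That gap is the compression that length normalization buys.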

This is the example in miniature: GRPO's danger is not the average token, it is the variance among tokens, and length-normalized sequence aggregation is what removes it.

[IMAGE: Annotated trace of response y1 as a horizontal token strip, 510 tokens shaded near-1.0 and two outlier tokens marked 1.9 and 0.4, with a callout showing the geometric-mean collapse to s1 = 0.9995.]

Where It Breaks

GSPO is not free of tradeoffs, and treating it as a universal upgrade would repeat the mistake of carrying PPO's machinery into a setting it did not fit.

Sequence-level aggregation throws away within-sequence credit assignment by construction. For tasks where a long response is mostly right but contains one decisive wrong step, a single scalar ratio and a single scalar advantage cannot localize the error. GSPO-token recovers per-token advantages, but the per-token advantages themselves still have to come from somewhere, and group-relative RLVR does not produce them. The honest framing is that GSPO fixes the off-policy correction, not the credit-assignment problem, which remains open.

The tiny clipping range is also a new sensitivity. Values like 3e-4 sit far from the 0.2 the field has internalized, and intuitions tuned on token-level clipping transfer poorly. A range that is too tight starves the update, and because the sequence ratio averages over the response, its typical spread may still depend on response length, so the range can need recalibration when the task distribution shifts.

Length normalization, the device that makes the ratio stable, can also mask genuine large divergences. A response that has truly moved far from the old policy on a few critical tokens will still show a sequence ratio near 1 if the rest of the tokens are unchanged. Most of the time that is exactly the desired robustness. Occasionally it hides a signal worth acting on.

Finally, the strongest evidence is from one organization, on one model family, with the comparison's hyperparameters chosen by the proposing team. The argument from importance-sampling first principles is convincing and the mechanism is clean, but independent replications across non-MoE architectures and different reward regimes are what will settle how universal the gains are.

Alternative Designs

Sequence-level optimization is one of several responses to GRPO's instability. Others patch the token-level objective rather than replacing its unit.

Approach	Core idea	Strengths	Weaknesses	Best when
PPO	per-token ratio plus a learned value model	well understood, per-token credit	value model cost and reliability at long horizons	dense models with a trustworthy critic
GRPO	drop the critic, group-relative advantage, token ratio	cheap, no value model	token ratio injects variance, collapses on long or MoE training	short responses, dense models, modest scale
DAPO	keep GRPO, fix the engineering	clip-higher, dynamic sampling, token-level loss, overlong shaping; strong reproducible results	still token-level ratio at heart	reproducible dense-model reasoning RL
GSPO	move ratio and clip to the sequence	stable on MoE, removes Routing Replay, infra-tolerant	no within-sequence credit, new clip-range intuitions	large or MoE models, long-CoT, scaling compute
GSPO-token	sequence ratio with per-token advantages	stability plus token-level flexibility	needs a source of per-token advantages	multi-turn or span-credit RL

DAPO is the most instructive contrast (Yu et al., 2025, DAPO: An Open-Source LLM Reinforcement Learning System at Scale, arXiv:2503.14476). It accepts GRPO's token-level ratio and stabilizes long chain-of-thought RL with four targeted techniques: decoupled "clip-higher" bounds, dynamic sampling that filters degenerate groups, a token-level policy-gradient loss, and overlong-reward shaping. With these it reaches 50 points on AIME 2024 with a Qwen2.5-32B base. It is excellent engineering on top of the existing unit of optimization. GSPO makes the opposite bet: change the unit, and several of the symptoms DAPO treats stop arising. The two are not mutually exclusive, but they represent genuinely different theories of where the problem lives.

[IMAGE: A 2x2 positioning chart, axes "token-level vs sequence-level ratio" and "patches the objective vs replaces the unit," placing PPO, GRPO, DAPO, GSPO, and GSPO-token.]

How It Is Used in Practice

GSPO's most consequential production use is the one it was built for: the large-scale RL training of the Qwen3 models, including the Instruct, Coder, and Thinking variants, where the team credits it as the algorithmic cornerstone behind their reported improvements (Qwen Team, 2025, Qwen3 Technical Report, arXiv:2505.09388). The clearest operational win is the elimination of Routing Replay for MoE models. Routing Replay cached the old policy's expert-activation pattern and forced the new policy to reuse it when computing token ratios, which restored convergence but cost extra memory and inter-device communication and effectively constrained the model's usable capacity by pinning its routing. Removing that workaround simplifies the training stack and lets the MoE use its full routing freedom.

A production RL stack has a recognizable shape, and GSPO changes where the load falls inside it.

graph TD
  subgraph Rollout
    INF[Inference engine<br/>vLLM or SGLang]
    VER[Verifier<br/>reward]
  end
  subgraph Train
    REC[Likelihood recompute<br/>training engine]
    OPT[Optimizer<br/>Megatron]
  end
  INF --> VER
  VER --> OPT
  INF -. GRPO needs .-> REC
  REC -. token-precision match .-> OPT
  INF == GSPO uses inference<br/>likelihoods directly ==> OPT
  OPT --> INF
  class INF,VER blue
  class REC amber
  class OPT teal
  classDef blue fill:#1e40af,stroke:#3b82f6,stroke-width:1px,color:#fff
  classDef amber fill:#b45309,stroke:#fbbf24,stroke-width:1px,color:#fff
  classDef teal fill:#0e7490,stroke:#22d3ee,stroke-width:1px,color:#fff

The second practical benefit is infrastructural. RL systems run a fast inference engine for rollouts (vLLM or SGLang, for example) and a separate training engine (such as Megatron) whose numerical precision differs. The standard defensive move is to recompute the old-policy likelihoods with the training engine so the ratio's numerator and denominator come from the same numerics. Because GSPO depends only on the aggregate sequence likelihood, which averages over hundreds of tokens, it tolerates the per-token precision mismatch and can use the likelihoods the inference engine already produced. That removes a recomputation pass and is especially valuable for partial rollouts, multi-turn RL, and training-inference disaggregated architectures, where the recompute is most awkward. The instability that prompted GSPO is not unique to Qwen; collapse during long RL training has been reported elsewhere in the open-model community as a recognized scaling hazard (MiniMax, 2025, MiniMax-M1, arXiv:2506.13585).
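A small simulation illustrates why averaging helps here. It is a toy under strong assumptions: the policy is held fixed, so every log-ratio is pure numerical mismatch, and that mismatch is modeled as independent zero-mean Gaussian noise per token with a made-up scale. Real engine mismatch may be correlated or biased, in which case the benefit would be smaller.

```python
import math
import random


def ratio_spread(num_tokens, noise_std, trials, rng):
    """Worst-case deviation from 1 for token ratios and for the sequence ratio.

    The policy has not changed, so every log-ratio is pure numerical noise
    (modeled as i.i.d. Gaussian, an assumption of this sketch).
    """
    worst_token = 0.0
    worst_sequence = 0.0
    for _ in range(trials):
        noise = [rng.gauss(0.0, noise_std) for _ in range(num_tokens)]
        worst_token = max(worst_token, max(abs(math.exp(e) - 1) for e in noise))
        seq = math.exp(sum(noise) / num_tokens)  # length-normalized ratio
        worst_sequence = max(worst_sequence, abs(seq - 1))
    return worst_token, worst_sequence


rng = random.Random(0)
tok, seq = ratio_spread(num_tokens=512, noise_std=0.01, trials=200, rng=rng)
print(f"worst token-ratio deviation:    {tok:.4f}")
print(f"worst sequence-ratio deviation: {seq:.4f}")
```

Under these assumptions the sequence ratio's spread shrinks roughly with the square root of the response length, while the worst token ratio does not shrink at all.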

[IMAGE: Before/after bar comparison of MoE training cost per step, GRPO-with-Routing-Replay versus GSPO, split into compute, extra memory, and cross-device communication, with the replay overhead bars vanishing under GSPO.]

Insights Worth Remembering

The bug was conceptual, not numerical. GRPO's token ratio is a one-sample importance weight, and importance sampling needs a population to mean anything. No amount of careful implementation makes a single-sample correction correct.
Match the unit of optimization to the unit of reward. When the reward is per sequence, per-token reweighting introduces a granularity mismatch that surfaces as variance.
Variance, not bias, is the killer at scale. Each token weight has expectation 1 under the old policy, but the spread of the weights accumulates over length and is amplified by the nonlinear clip until the policy collapses.
Length normalization is doing quiet, essential work. It is what keeps a product of many token ratios inside a single usable clipping band and decouples the clip range from response length.
Clipping more can train better. GSPO discarding a hundred times more tokens while learning faster is strong evidence that some token-level gradients were actively harmful.
Stability can replace infrastructure. Routing Replay existed only to prop up an unstable estimator. Fix the estimator and the scaffolding becomes unnecessary, which is a better outcome than a more elaborate scaffold.
Sparse models punish fragile estimators first. MoE expert volatility did not create the token-ratio problem; it exposed an instability that was latent in dense training too.

Open Questions

The strongest established result is mechanistic: token-level importance ratios are not valid importance-sampling estimators in the single-sample regime, and sequence-level ratios are, which the gradient analysis makes precise. What remains genuinely open is breadth of evidence. The headline comparisons come from the proposing team on the Qwen3 MoE family with self-selected baselines, so independent reproductions on dense models, on different reward structures, and at other scales are the natural next step and have not yet accumulated.

Credit assignment is the deeper unresolved question. GSPO deliberately operates at the sequence level, which is the right move for off-policy correction but says nothing about which tokens in a long reasoning trace deserve credit or blame. Whether process rewards, learned per-token advantages, or GSPO-token with a good advantage source can recover fine-grained credit without reintroducing the variance GSPO removed is, at present, an open line of work rather than a settled answer.

It is also unclear how the choice interacts with other axes of scaling. As reasoning lengths stretch toward tens of thousands of tokens and multi-turn agentic RL becomes common, the accumulation effect that motivates GSPO should grow more severe, which would widen its advantage, but agentic settings also want exactly the per-turn credit assignment sequence-level aggregation gives up. The likely outcome is hybrid objectives that keep the sequence-level ratio for stability while layering structured, lower-variance advantages on top. That is an expectation grounded in the mechanism, not a measured result.