Credit Assignment over Long Generations

A language model generating a 500-token proof receives exactly one reward signal: correct or incorrect. Every intermediate token - a poorly placed bracket, an algebra slip, a wrong branch at step 3 - is equally invisible to that signal. This is the credit assignment problem, and in long-generation settings it is considerably worse than in classic RL: the action space is a vocabulary of 50,000+ tokens, episodes are hundreds of steps long, and the "environment" is entirely internal to the model itself.

Why Long Generations Make Credit Assignment Hard

In classic RL settings (tabular gridworlds, Atari), the credit assignment horizon is short enough that Monte Carlo returns or TD(lambda) bootstrapping can distribute reward reliably. Language generation breaks three comfortable assumptions at once.

Episode length. A single response might span 512 to 8192 tokens. With a discount factor gamma = 0.99, a reward that arrives at step 1000 is weighted by 0.99^999 ≈ 0.00004 by the time it is credited back to step 1, so early tokens receive almost no signal. Setting gamma = 1 (undiscounted) is the standard compromise in language RL, but it means every token in a bad response receives the same negative signal regardless of whether it caused the failure.
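
The discounting arithmetic above is easy to check directly. The snippet below is a minimal sketch; the helper name discount_weight is hypothetical and only illustrates how quickly gamma = 0.99 shrinks a terminal reward over a 1000-step episode, compared with the undiscounted case.

```python
def discount_weight(gamma: float, steps_to_reward: int) -> float:
    """Weight applied to a reward that arrives `steps_to_reward` steps later."""
    return gamma ** steps_to_reward


# A reward at step 1000, seen from step 1, is 999 steps away.
weight_discounted = discount_weight(0.99, 999)
weight_undiscounted = discount_weight(1.0, 999)

print(f"gamma=0.99: {weight_discounted:.6f}")   # on the order of 0.00004
print(f"gamma=1.00: {weight_undiscounted:.6f}")  # exactly 1: every token sees the full reward
```

With gamma = 1 the weight is the same for every position, which is exactly why the terminal reward cannot distinguish early tokens from late ones.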

Delayed, sparse, scalar rewards. Human preference rewards are collected once per response, not per sentence or paragraph. Verifiable rewards (correctness on a maths problem) are also binary and terminal. The model must infer from a single number which of its 500 decisions was the crucial one.

No external state. In game RL the environment transitions provide implicit credit signals (dying reduces life count immediately). In text generation the "environment" is the autoregressive context; there is no state change that can localise the error.

The consequence: training with only terminal rewards is very high-variance. The gradient estimator

∇J(θ) ≈ (1/N) Σ_n [ Σ_t ∇ log π_θ(a_t | s_t) ] · R_n

has variance that grows roughly linearly with episode length, because the same scalar R_n multiplies every per-token log-probability gradient in the episode.
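
A toy simulation can make the linear growth concrete. The sketch below rests on strong simplifying assumptions that are not part of the original text: a one-parameter Bernoulli "policy" with p = 0.5 at every token, so that the per-token score is (a_t - p), and a binary reward drawn independently of the actions. That isolates the noise added by multiplying every token's score by the same scalar. All function names are hypothetical.

```python
import random
import statistics


def single_sample_gradient(num_tokens: int, rng: random.Random) -> float:
    """One-episode estimate: (sum of per-token scores) * terminal reward."""
    p = 0.5  # probability of emitting token "1" under the toy policy
    # For a Bernoulli policy with logit theta, d/dtheta log pi(a) = a - p.
    score_sum = sum((1.0 if rng.random() < p else 0.0) - p for _ in range(num_tokens))
    # Assumption: reward is 0 or 1 and independent of the sampled tokens.
    reward = 1.0 if rng.random() < 0.5 else 0.0
    return score_sum * reward


def estimator_variance(num_tokens: int, num_episodes: int = 4000, seed: int = 0) -> float:
    """Empirical variance of the single-episode gradient estimate."""
    rng = random.Random(seed)
    samples = [single_sample_gradient(num_tokens, rng) for _ in range(num_episodes)]
    return statistics.pvariance(samples)


var_short = estimator_variance(50)
var_long = estimator_variance(200)
print(f"T=50:  variance ~ {var_short:.2f}")
print(f"T=200: variance ~ {var_long:.2f}")
print(f"ratio: {var_long / var_short:.2f}")  # close to 4 for a 4x longer episode
```

Under these assumptions the variance is T/8 in expectation, so quadrupling the episode length roughly quadruples the variance. Real policies have correlated per-token gradients, so the growth there is only approximately linear.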

The KL-Regularised Objective and What It Does to Credit

The standard RLHF objective (as used in InstructGPT, Ouyang et al. 2022) adds a per-token KL penalty against the supervised fine-tuned reference policy:

J(θ) = E_x~D [ E_y~π_θ(y|x) [ r(x, y) ] - β · KL[ π_θ(·|x) || π_ref(·|x) ] ]

The sequence-level KL term decomposes as an expected sum over tokens: KL = E_y~π_θ [ Σ_t ( log π_θ(a_t | s_t) - log π_ref(a_t | s_t) ) ], where s_t is the prompt plus the tokens generated so far. This gives the optimiser a dense, per-step signal even when the scalar reward r(x, y) is terminal. The KL penalty acts as a soft anchor: any token that diverges strongly from the reference distribution pays a cost immediately, which effectively provides a coarse credit signal that says "this token distribution moved a lot; be sure the terminal reward justifies it."
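
One common way to use this decomposition in practice is to fold each token's log-ratio into a per-token reward and add the terminal reward at the final position. The sketch below assumes that pattern; it is not necessarily the exact formulation of any particular system, and the function name and example numbers are hypothetical.

```python
from typing import List


def shaped_token_rewards(
    logp_policy: List[float],
    logp_ref: List[float],
    terminal_reward: float,
    beta: float,
) -> List[float]:
    """Per-token rewards: -beta * log-ratio everywhere, plus r(x, y) on the last token."""
    if not logp_policy or len(logp_policy) != len(logp_ref):
        raise ValueError("need two non-empty log-prob lists of equal length")
    # Dense part: each sampled token pays for its divergence from the reference.
    rewards = [-beta * (lp - lr) for lp, lr in zip(logp_policy, logp_ref)]
    # Sparse part: the scalar reward only appears at the end of the response.
    rewards[-1] += terminal_reward
    return rewards


# Hypothetical 3-token response: only the middle token moved away from the reference.
rewards = shaped_token_rewards(
    logp_policy=[-1.0, -0.5, -2.0],
    logp_ref=[-1.0, -1.5, -2.0],
    terminal_reward=1.0,
    beta=0.1,
)
print(rewards)
```

In this example the middle token, whose log-probability rose by 1.0 relative to the reference, receives a penalty of about -0.1, while the unchanged tokens pay nothing and the last token carries the terminal reward.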
