
Foundations

Credit Assignment over Long Generations

Explains why distributing a single scalar reward back across hundreds of generation steps is the central unsolved tension in RL for language models, and surveys the main strategies used to address it.

advanced · 9 min read · Premium

A language model generating a 500-token proof receives exactly one reward signal: correct or incorrect. That signal does not distinguish between intermediate tokens: a poorly placed bracket, an algebra slip, and a wrong branch at step 3 are all invisible to it. This is the credit assignment problem, and in long-generation settings it is considerably worse than in classic RL: the action space is a vocabulary of 50,000+ tokens, episodes run for hundreds of steps, and the "environment" is internal to the model, since each state is simply the text generated so far.
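To make the problem concrete, here is a minimal sketch of what a single terminal reward looks like once it is turned into per-token returns. The function name `returns_to_go`, the 0/1 reward encoding, and the discount values are illustrative assumptions, not part of any particular training library.

```python
from typing import List


def returns_to_go(rewards: List[float], gamma: float = 1.0) -> List[float]:
    """Discounted return from each step to the end of the episode."""
    out = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step reuses the return of the step after it.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        out[t] = running
    return out


# Assumed setup: a 500-token generation, reward 1.0 only on the final token
# (the proof was judged correct), 0.0 everywhere else.
episode_rewards = [0.0] * 499 + [1.0]

undiscounted = returns_to_go(episode_rewards, gamma=1.0)
discounted = returns_to_go(episode_rewards, gamma=0.99)

# Without discounting, every token is credited with the same return,
# so the signal cannot tell a good token from a bad one.
print(len(set(undiscounted)))

# With discounting, early tokens are credited with almost nothing,
# regardless of how much they mattered to the outcome.
print(discounted[0] < discounted[-1])
```

Neither setting assigns credit by contribution: undiscounted returns treat all 500 tokens identically, while discounting only ranks tokens by position.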

Why Long Generations Make Credit Assignment Hard

In classic RL settings (tabular gridworlds, Atari games), the credit assignment horizon is short enough that Monte Carlo returns or TD(lambda) bootstrapping can distribute reward reliably. Language generation breaks three comfortable assumptions at once.
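For reference, the TD(lambda) target mentioned above can be sketched as a backward recursion over one episode. This is a generic illustration under stated assumptions: the name `lambda_returns`, the convention that `values` holds one more entry than `rewards` (with the last entry being the value of the terminal state), and the toy numbers are all hypothetical.

```python
from typing import List


def lambda_returns(
    rewards: List[float],
    values: List[float],
    gamma: float = 1.0,
    lam: float = 0.9,
) -> List[float]:
    """Lambda-return for every step of a single episode.

    rewards[t] is the reward received after the action at step t.
    values[t] is the estimated value of the state at step t; values has
    one extra entry for the state reached after the last action
    (0.0 if that state is terminal).
    """
    T = len(rewards)
    if len(values) != T + 1:
        raise ValueError("values must have len(rewards) + 1 entries")
    out = [0.0] * T
    next_return = values[T]
    for t in reversed(range(T)):
        # Blend the one-step bootstrap (weight 1 - lam) with the
        # longer-horizon return from the next step (weight lam).
        next_return = rewards[t] + gamma * (
            (1.0 - lam) * values[t + 1] + lam * next_return
        )
        out[t] = next_return
    return out


# Toy episode: four steps, reward only at the end, a flat value estimate.
toy_rewards = [0.0, 0.0, 0.0, 1.0]
toy_values = [0.5, 0.5, 0.5, 0.5, 0.0]

monte_carlo = lambda_returns(toy_rewards, toy_values, lam=1.0)  # pure Monte Carlo
one_step_td = lambda_returns(toy_rewards, toy_values, lam=0.0)  # pure one-step TD
print(monte_carlo, one_step_td)
```

Setting `lam=1.0` recovers the Monte Carlo return and `lam=0.0` recovers the one-step TD target; intermediate values trade variance against reliance on the value estimates.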
