
Foundations

Exploration in Language-Model RL

Language-model RL training collapses silently when the policy stops generating diverse completions, and standard RL exploration techniques must be reinterpreted to work inside a token-sequence action space.

advanced · 8 min read

A policy trained with PPO on a maths reasoning task can plateau at 60% accuracy for thousands of gradient steps, then abruptly jump to 75% when a single new solution strategy first appears in a sampled batch. That jump is not a learning event; it is a discovery event. Once exploration finally produced the right kind of trajectory, the optimiser needed only a few updates to reinforce it. This is the core tension in language-model RL: the gradient optimiser is powerful, but it is helpless unless the sampling distribution surfaces useful experience in the first place.

Why standard RL exploration intuitions do not transfer directly

In tabular or low-dimensional continuous RL, exploration means visiting states that have not been visited before. Epsilon-greedy, UCB, and intrinsic curiosity bonuses all assume that you can enumerate the states and actions, or at least measure how novel a state is.
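To make that assumption concrete, here is a minimal sketch of a count-based novelty bonus applied to two toy samplers. Everything in it is an illustrative assumption rather than part of any particular library: the helper names (`count_bonus`, `mean_bonus_after_visits`), the 5×5 grid, and the use of uniformly random tokens as a stand-in for a language-model policy.

```python
import math
import random
from collections import Counter


def count_bonus(counts, state, scale=1.0):
    """Count-based novelty bonus: scale / sqrt(N(state) + 1)."""
    return scale / math.sqrt(counts[state] + 1)


def mean_bonus_after_visits(sample_state, n_visits, seed=0):
    """Sample n_visits states; return (mean bonus paid, number of unique states)."""
    rng = random.Random(seed)
    counts = Counter()
    total = 0.0
    for _ in range(n_visits):
        state = sample_state(rng)
        total += count_bonus(counts, state)  # bonus is computed before the visit is counted
        counts[state] += 1
    return total / n_visits, len(counts)


def grid_state(rng):
    """Tabular case (hypothetical): one of 25 cells in a 5x5 grid."""
    return (rng.randrange(5), rng.randrange(5))


def token_sequence(rng, vocab_size=1000, length=20):
    """Sequence case (hypothetical): 20 tokens drawn uniformly from a 1000-token vocabulary."""
    return tuple(rng.randrange(vocab_size) for _ in range(length))


grid_mean, grid_unique = mean_bonus_after_visits(grid_state, 2000)
seq_mean, seq_unique = mean_bonus_after_visits(token_sequence, 2000)

print(f"grid:      {grid_unique} unique states, mean bonus {grid_mean:.3f}")
print(f"sequences: {seq_unique} unique states, mean bonus {seq_mean:.3f}")
```

In the grid, states repeat, so the bonus decays and carries information about where the agent has not yet been. In the sequence sampler, essentially every completion is unique when compared as an exact string, so the bonus stays at its maximum and cannot tell a new strategy from a trivial rewording. A real policy is far less uniform than this stand-in, but the counting problem is the same in kind.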
