Exploration in Language-Model RL

A policy trained with PPO on a maths reasoning task can plateau at 60% accuracy for thousands of gradient steps, then abruptly jump to 75% when a single new solution strategy first appears in a sampled batch. That jump is not a learning event; it is a discovery event. Once exploration finally produced the right kind of trajectory, the gradient updates that exploited it followed almost immediately. This is the core tension in language-model RL: the gradient optimiser is powerful, but it is helpless unless the sampling distribution surfaces useful experience in the first place.

Why standard RL exploration intuitions do not transfer directly

In tabular or low-dimensional continuous RL, exploration means visiting states that have not been visited before. Epsilon-greedy, UCB, and intrinsic curiosity bonuses all operate on the assumption that you can enumerate or measure the novelty of a state.

A language model's "state" is the sequence of tokens generated so far, and its "action space" at each step is the full vocabulary (often 32k to 128k tokens). The combined space of possible completions for a 512-token response is astronomically large. Beyond the first few tokens, a sampled sequence is almost never seen twice, so visit counts carry no useful signal. The exploration problem is therefore not "visit new states" but "produce diverse, structurally varied completions."
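To put a rough number on "astronomically large", the sketch below counts token sequences of exactly 512 tokens for the two vocabulary sizes mentioned above. It assumes round vocabulary sizes of 32,000 and 128,000 and ignores early stopping; the function name is illustrative.

```python
import math

def log10_sequence_count(vocab_size: int, length: int) -> float:
    """Return log10 of the number of distinct token sequences of exactly `length` tokens."""
    # vocab_size ** length is too large to be useful as a float, so work in log space.
    return length * math.log10(vocab_size)

# Assumed round vocabulary sizes standing in for "32k to 128k".
for vocab_size in (32_000, 128_000):
    exponent = log10_sequence_count(vocab_size, 512)
    print(f"vocab={vocab_size}: roughly 10^{exponent:.0f} sequences of 512 tokens")
```

Both counts have exponents in the thousands, which is why count-based novelty measures have nothing to count.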

The distinction matters practically. Two completions can be token-by-token different yet semantically identical. Measuring diversity at the token level is noisy. Measuring it at the semantic level requires embeddings or auxiliary models, adding computational cost and introducing new failure modes.
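A small sketch of why token-level measures mislead: the function below computes a bigram Jaccard distance between two completions. The metric, the whitespace tokenisation, and the example strings are all illustrative assumptions, not a method taken from any particular library.

```python
def ngrams(tokens, n):
    """Return the set of n-grams (as tuples) in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def token_jaccard_distance(a: str, b: str, n: int = 2) -> float:
    """1 - |shared n-grams| / |all n-grams|, using whitespace tokens."""
    grams_a, grams_b = ngrams(a.split(), n), ngrams(b.split(), n)
    union = grams_a | grams_b
    if not union:
        return 0.0  # both texts too short to contain any n-gram
    return 1.0 - len(grams_a & grams_b) / len(union)

# Two hypothetical completions that state the same conclusion in different words.
completion_a = "so x equals 4"
completion_b = "therefore we get x = 4"
print(token_jaccard_distance(completion_a, completion_b))
```

The two completions share no bigram, so the token-level metric rates them as maximally different even though they say the same thing. A semantic measure would avoid this, at the cost of the embedding model mentioned above.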

Temperature, entropy bonuses, and their limits

The cheapest exploration lever in language-model RL is sampling temperature. Higher temperature flattens the softmax, increasing per-token entropy and making the policy generate more varied outputs. This works early in training when the policy still has a broad distribution. It degrades as training progresses, because the policy concentrates probability mass on a small cluster of high-reward patterns and temperature can no longer surface genuinely different strategies without also surfacing incoherent text.
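The effect of temperature on per-token entropy can be checked directly. The sketch below applies a temperature-scaled softmax to a made-up logit vector and prints the resulting entropy; the logits and temperature values are arbitrary illustrations.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Softmax over logits / temperature, shifted by the max for numerical stability."""
    scaled = [logit / temperature for logit in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

# Hypothetical logits for a four-token vocabulary.
logits = [4.0, 2.0, 1.0, 0.5]
for temperature in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, temperature)
    print(f"T={temperature}: entropy={entropy(probs):.3f} nats")
```

Entropy rises with temperature, but the ranking of tokens never changes. Temperature only reweights what the policy already prefers, which is one way to see why it cannot surface a strategy the policy assigns negligible probability to.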

A more principled version is the entropy bonus, added directly to the reward signal:

r_total(y) = r_reward(y) + alpha * H(pi(. | x))

where H is the entropy of the policy's token-level distribution and alpha controls the exploration-exploitation tradeoff. This appears in maximum-entropy RL frameworks and is related to the temperature term in soft actor-critic. In practice, alpha is small (around 0.01 to 0.1) because large values destabilise the policy and reduce coherence faster than they improve diversity.
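A minimal sketch of the bonus, assuming H is taken as the mean per-token entropy over the generated positions (one possible reading of the formula above; other aggregations are used too). The function names and the toy distributions are hypothetical.

```python
import math

def token_entropy(probs):
    """Shannon entropy (nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def reward_with_entropy_bonus(task_reward, per_step_probs, alpha=0.01):
    """r_total = r_reward + alpha * H, with H assumed to be the mean per-token entropy."""
    mean_entropy = sum(token_entropy(p) for p in per_step_probs) / len(per_step_probs)
    return task_reward + alpha * mean_entropy

# Toy three-token completions over a four-token vocabulary.
flat = [[0.25, 0.25, 0.25, 0.25]] * 3    # high-entropy policy
peaked = [[0.97, 0.01, 0.01, 0.01]] * 3  # nearly deterministic policy

print(reward_with_entropy_bonus(1.0, flat))
print(reward_with_entropy_bonus(1.0, peaked))
```

With equal task reward, the flatter policy receives the larger total, and alpha sets how much reward the optimiser will give up to keep that entropy.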

The KL penalty relative to the reference policy (the frozen SFT model) serves a related but different function. It prevents the policy from collapsing too far toward any single high-reward completion style. Written as part of the standard KL-regularised objective:
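One common way to write a KL-regularised objective is sketched below; it is a standard textbook form, not necessarily the exact expression the author had in mind. The symbols \beta (penalty weight), \pi_{\text{ref}} (the frozen SFT model), and \mathcal{D} (the prompt distribution) are notation assumed here, while r_{\text{reward}} follows the entropy-bonus formula above.

```latex
\max_{\pi} \;
\mathbb{E}_{x \sim \mathcal{D}}
\Big[
  \mathbb{E}_{y \sim \pi(\cdot \mid x)} \big[ r_{\text{reward}}(y) \big]
  \;-\; \beta \, \mathrm{KL}\big( \pi(\cdot \mid x) \,\big\|\, \pi_{\text{ref}}(\cdot \mid x) \big)
\Big]
```

Under this form, a larger \beta keeps the policy closer to the reference model's distribution, and a smaller \beta lets it concentrate on high-reward completions.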
