Multi-Turn and Agentic RL

Single-turn RLHF treats each model response as a complete episode: one prompt, one completion, one scalar reward. That simplification made InstructGPT tractable in 2022, but it breaks down the moment your model is expected to write code, execute it, observe the result, fix the bug, and repeat for ten iterations. At that point you have a sequential decision problem, and the credit assignment question becomes non-trivial: which of the twelve tool calls in the trajectory actually caused the test suite to pass?

This concept covers what changes - technically and algorithmically - when RL is applied to agents operating across multiple turns.

What Changes in Multi-Turn Settings

In single-turn RLHF, the Markov Decision Process (MDP) is degenerate: one state (the prompt), one action (the full completion), one reward. The KL-regularised objective from prior concepts holds:

J(π) = E_{x~D}[ E_{y~π(·|x)}[ r(x, y) ] - β · KL( π(·|x) || π_ref(·|x) ) ]
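
As a rough illustration, the objective can be estimated from sampled completions. The sketch below assumes you already have, for each sample, a scalar reward and the summed log-probability of the completion under the policy and under the reference model. The function name and argument names are hypothetical, and the KL term uses the simple single-sample estimate log π(y|x) - log π_ref(y|x).

```python
from statistics import fmean
from typing import Sequence


def kl_regularised_objective(
    rewards: Sequence[float],
    logp_policy: Sequence[float],
    logp_ref: Sequence[float],
    beta: float = 0.1,
) -> float:
    """Monte Carlo estimate of J(pi) from completions sampled from the policy.

    Each entry corresponds to one (prompt, completion) pair. The per-sample
    KL estimate is log pi(y|x) - log pi_ref(y|x), which is unbiased when
    y is drawn from pi.
    """
    if not (len(rewards) == len(logp_policy) == len(logp_ref)):
        raise ValueError("all inputs must have the same length")
    per_sample = [
        r - beta * (lp - lr)
        for r, lp, lr in zip(rewards, logp_policy, logp_ref)
    ]
    return fmean(per_sample)
```

A completion that earns a high reward but drifts far from the reference model is penalised in proportion to β.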

The moment you extend to multi-turn, the episode becomes a trajectory τ = (s₀, a₀, s₁, a₁, ..., sₙ, aₙ) where each state sₜ is the conversation context so far and each action aₜ is the model's next token sequence (a message, a tool call, a scratchpad step). The reward R(τ) is typically only observed at the end - a test pass or fail, a human score on the final answer, an API response code.
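
One way to picture this trajectory is as a rollout loop in which the state is literally the growing context string. The sketch below is a minimal illustration, not a training framework: `policy` and `env` are hypothetical callables standing in for the model and the tool environment, and the reward is recorded only when the environment signals that the episode is done.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Step:
    state: str   # s_t: the conversation context so far
    action: str  # a_t: the model's next message or tool call


@dataclass
class Trajectory:
    steps: List[Step] = field(default_factory=list)
    reward: float = 0.0  # R(tau), observed only at the end


def rollout(
    policy: Callable[[str], str],
    env: Callable[[str, str], Tuple[str, bool, float]],
    prompt: str,
    max_turns: int = 20,
) -> Trajectory:
    """Collect one trajectory. `env` returns (observation, done, reward)."""
    trajectory = Trajectory()
    state = prompt
    for _ in range(max_turns):
        action = policy(state)
        trajectory.steps.append(Step(state=state, action=action))
        observation, done, reward = env(state, action)
        if done:
            trajectory.reward = reward
            break
        # The next state is the old context plus the action and its result.
        state = state + action + observation
    return trajectory
```

Every stored state contains all earlier actions and observations, which is why the context the policy conditions on keeps growing.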

Three structural difficulties emerge immediately:

Sparse reward. A trajectory of 20 turns receives a single terminal signal. The intermediate steps receive no reward of their own, so naive policy gradient has extremely high variance under these conditions.

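Exponential state space. The context grows with each turn: after ten turns of 200 tokens each, the policy is conditioning on 2,000 tokens of history. The length grows only linearly, but the number of distinct histories the policy can encounter grows exponentially with the number of turns, so it rarely sees the same state twice.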

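Credit assignment. When the agent misreads a tool output at turn 4, the consequence only shows up in the final reward at turn 20. A naive trajectory-level policy gradient then weights every turn by that same scalar, so the turn that caused the failure is blamed no more than the 19 others unless you address this explicitly.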
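
The credit assignment problem is easy to see numerically. The sketch below computes the return-to-go that a plain REINFORCE-style update would use as the weight for each turn. The function name is illustrative, and the per-turn reward list is an assumption about how the terminal signal is stored (zeros everywhere except the last turn).

```python
from typing import List, Sequence


def returns_to_go(rewards: Sequence[float], gamma: float = 1.0) -> List[float]:
    """Discounted sum of future rewards from each turn onward."""
    returns: List[float] = []
    running = 0.0
    for reward in reversed(rewards):
        running = reward + gamma * running
        returns.append(running)
    returns.reverse()
    return returns


# A 20-turn trajectory with a single terminal reward of 1.0.
rewards = [0.0] * 19 + [1.0]
weights = returns_to_go(rewards, gamma=1.0)
# With gamma = 1, every turn gets the same weight: the signal says nothing
# about which turn mattered.
assert weights == [1.0] * 20
```

Choosing γ below 1 only shrinks the weight of earlier turns geometrically. It still does not identify which turn was responsible, which is the motivation for the learned per-step estimates discussed next.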

Trajectory-Level Reward and Return Decomposition

The natural fix for sparse reward is to learn a value function or a per-step process reward model (PRM) that estimates expected future return from any intermediate state. With a value estimate V(sₜ), you can compute advantages at each step:

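Aₜ = rₜ + γ · V(sₜ₊₁) - V(sₜ)

Here rₜ is the reward observed at step t (zero at every step except the last when only a terminal reward exists) and γ is the discount factor.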

Generalised Advantage Estimation (GAE), introduced for continuous-control tasks, applies directly: it blends this one-step estimate with its multi-step versions to trade bias against variance. The main difference is that the "environment" here includes Python interpreters, web browsers, databases, or any external tool your agent can call.
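
A minimal sketch of GAE over one trajectory follows, assuming per-turn rewards and value estimates are already available as plain lists. The convention that `values` carries one extra bootstrap entry for the state after the final turn (0.0 when the episode has terminated) is a choice made here for illustration.

```python
from typing import List, Sequence


def gae_advantages(
    rewards: Sequence[float],
    values: Sequence[float],
    gamma: float = 0.99,
    lam: float = 0.95,
) -> List[float]:
    """Generalised Advantage Estimation for a single trajectory.

    `values` must have len(rewards) + 1 entries: V(s_0) ... V(s_T), where the
    last entry is the bootstrap value (0.0 for a terminated episode).
    """
    if len(values) != len(rewards) + 1:
        raise ValueError("values needs exactly one more entry than rewards")
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        # One-step TD residual: r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # Exponentially weighted sum of current and future residuals.
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages
```

Setting `lam=0` recovers the one-step estimate, while `lam=1` with `gamma=1` gives the full return minus the value baseline. Values in between trade bias from an imperfect V against variance from the sparse terminal reward.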

An alternative grounded in preference learning is the Q-function view of DPO. Rafailov et al. (COLM 2024) showed that the token-level implicit reward in DPO satisfies the Bellman equation, making it equivalent to inverse Q-learning on a token-level MDP. This means step-level credit assignment emerges naturally if you train with turn-level preference pairs instead of response-level ones - though collecting such data is expensive.
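
To make the token-level view concrete, the sketch below computes the implicit reward β · (log π - log π_ref) for each token, sums it within each turn, and evaluates the standard DPO logistic loss on a pair of turn-level totals. The helper names and the turn-level aggregation are assumptions for illustration, not a procedure taken from the paper, and the per-token log-probabilities are assumed to be precomputed.

```python
import math
from typing import List, Sequence


def implicit_token_rewards(
    logp_policy: Sequence[float],
    logp_ref: Sequence[float],
    beta: float = 0.1,
) -> List[float]:
    """Per-token implicit reward: beta * (log pi - log pi_ref)."""
    if len(logp_policy) != len(logp_ref):
        raise ValueError("log-prob sequences must have the same length")
    return [beta * (lp - lr) for lp, lr in zip(logp_policy, logp_ref)]


def sum_by_turn(token_rewards: Sequence[float], turn_lengths: Sequence[int]) -> List[float]:
    """Aggregate token rewards into one total per turn (illustrative choice)."""
    if sum(turn_lengths) != len(token_rewards):
        raise ValueError("turn_lengths must cover every token exactly once")
    totals: List[float] = []
    start = 0
    for length in turn_lengths:
        totals.append(sum(token_rewards[start:start + length]))
        start += length
    return totals


def preference_loss(chosen_reward: float, rejected_reward: float) -> float:
    """-log(sigmoid(chosen - rejected)), computed in a numerically stable way."""
    margin = chosen_reward - rejected_reward
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

With turn-level pairs, `preference_loss` would be applied to the totals of a preferred and a dispreferred turn that share the same preceding context, so the learning signal lands on the turn where the two trajectories diverge.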
