Multi-Turn and Agentic RL
Multi-turn and agentic RL extends single-response RLHF to sequences of actions across environment steps, requiring credit assignment, trajectory-level rewards, and new training algorithms suited to long-horizon tool-using agents.
Single-turn RLHF treats each model response as a complete episode: one prompt, one completion, one scalar reward. That simplification made InstructGPT tractable in 2022, but it breaks down the moment your model is expected to write code, execute it, observe the result, fix the bug, and repeat for ten iterations. At that point you have a sequential decision problem, and the credit assignment question becomes non-trivial: which of the twelve tool calls in the trajectory actually caused the test suite to pass?
This concept covers what changes, both technically and algorithmically, when RL is applied to agents operating across multiple turns.
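To make the credit assignment question concrete, here is a minimal sketch in plain Python. It assumes a toy four-step trajectory whose only non-zero reward arrives at the final step (the tests pass). The `Step` class, the action labels, and the `returns_to_go` helper are hypothetical names chosen for illustration, not part of any particular RL library.

```python
from dataclasses import dataclass


@dataclass
class Step:
    action: str    # hypothetical label for a tool call
    reward: float  # per-step reward; zero everywhere except the last step here


def returns_to_go(steps, gamma=1.0):
    """Return the discounted sum of future rewards from each step onward."""
    out = []
    running = 0.0
    # Walk backwards so each step's return reuses the one after it.
    for step in reversed(steps):
        running = step.reward + gamma * running
        out.append(running)
    out.reverse()
    return out


# Assumed toy trajectory: only the final test run is rewarded.
trajectory = [
    Step("write_code", 0.0),
    Step("run_tests", 0.0),
    Step("fix_bug", 0.0),
    Step("run_tests", 1.0),
]

undiscounted = returns_to_go(trajectory, gamma=1.0)
discounted = returns_to_go(trajectory, gamma=0.5)

print(undiscounted)  # [1.0, 1.0, 1.0, 1.0]
print(discounted)    # [0.125, 0.25, 0.5, 1.0]
```

With no discounting, every step receives identical credit, so the signal cannot separate the bug fix from the redundant steps. Discounting only reweights steps by their distance from the reward, which is still not the same as identifying which action caused the outcome.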
What Changes in Multi-Turn Settings