
Foundations

The Infrastructure of LLM RL

LLM post-training via RL requires four coordinated systems running simultaneously - a policy, a reference model, a reward model, and a value function - and the design choices for each determine both what behaviours emerge and where the training breaks.

advanced · 7 min read · Premium

Training GPT-3 on next-token prediction cost roughly $4.6 million in compute. InstructGPT, its RLHF-tuned successor, produced outputs that human raters preferred even at 1.3B parameters, more than 100 times smaller than the 175B GPT-3. The gap between the two is not scale; it is the post-training stack. Understanding why requires looking at what RL for language models actually holds in memory.

The Four-Model Problem

Standard deep RL typically trains one network. RLHF runs four simultaneously: the policy, the reference model, the reward model, and the value function.
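To make the memory cost of running four models concrete, here is a rough accounting sketch. It assumes the common PPO-style arrangement in which only the policy and the value function receive gradient updates while the reference and reward models stay frozen. The byte counts (bf16 weights, fp32 master weights, Adam moments) and the names `ModelSlot`, `rlhf_slots`, and `total_bytes` are illustrative assumptions, not figures from any particular training run, and activations and generation caches are ignored.

```python
from dataclasses import dataclass

# Assumption: weights are stored in bf16 (2 bytes per parameter).
BYTES_WEIGHTS = 2
# Assumption: a trainable model additionally holds bf16 gradients (2),
# fp32 master weights (4), and two fp32 Adam moments (8) per parameter.
BYTES_TRAIN_EXTRA = 2 + 4 + 8


@dataclass(frozen=True)
class ModelSlot:
    """One of the models resident in memory during RLHF (hypothetical type)."""
    name: str
    params: int
    trainable: bool


def slot_bytes(slot: ModelSlot) -> int:
    """Bytes for weights, plus gradient/optimizer state if the model is trained."""
    per_param = BYTES_WEIGHTS + (BYTES_TRAIN_EXTRA if slot.trainable else 0)
    return slot.params * per_param


def rlhf_slots(policy_params: int, critic_params: int) -> list[ModelSlot]:
    """The four models, assuming reference mirrors the policy in size
    and the value function mirrors the reward model in size."""
    return [
        ModelSlot("policy", policy_params, trainable=True),
        ModelSlot("reference", policy_params, trainable=False),
        ModelSlot("reward", critic_params, trainable=False),
        ModelSlot("value", critic_params, trainable=True),
    ]


def total_bytes(slots: list[ModelSlot]) -> int:
    return sum(slot_bytes(s) for s in slots)


if __name__ == "__main__":
    slots = rlhf_slots(policy_params=1_300_000_000, critic_params=1_300_000_000)
    for s in slots:
        print(f"{s.name:<10} {slot_bytes(s) / 2**30:6.1f} GiB")
    print(f"{'total':<10} {total_bytes(slots) / 2**30:6.1f} GiB")
```

Under these assumptions the two trainable models dominate the footprint, since each carries eight times the per-parameter cost of a frozen one.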
