The Infrastructure of LLM RL
LLM post-training via RL requires four coordinated systems running simultaneously: a policy, a reference model, a reward model, and a value function. The design choices for each determine both what behaviours emerge and where training breaks.
Training GPT-3 on next-token prediction cost an estimated $4.6 million in compute. Yet human labelers preferred the outputs of InstructGPT, its RLHF-tuned successor, even when InstructGPT had only 1.3B parameters against GPT-3's 175B. The gap between the two is not scale; it is the post-training stack. Understanding why requires looking at what RL for language models actually runs in memory.
The Four-Model Problem
Standard deep RL trains one network. RLHF runs four simultaneously: a policy, a reference model, a reward model, and a value function.
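A rough per-model estimate makes the memory cost of that stack concrete. The sketch below is illustrative only and rests on several assumptions not stated in the text: a PPO-style setup in which the policy and value function receive gradient updates while the reference and reward models stay frozen, all four models sharing one parameter count, 2 bytes per parameter for frozen 16-bit weights, and 16 bytes per parameter for mixed-precision Adam training. `Model` and `rlhf_stack` are hypothetical names, and activations are left out entirely.

```python
from dataclasses import dataclass

# Assumed bytes per parameter (illustrative, not taken from the article):
# a frozen model held in 16-bit precision stores weights only.
FROZEN_BYTES_PER_PARAM = 2
# A trained model with mixed-precision Adam stores 2 (weights) + 2 (grads)
# + 12 (fp32 master weights, momentum, variance) bytes.
TRAINED_BYTES_PER_PARAM = 16


@dataclass(frozen=True)
class Model:
    name: str
    n_params: float   # parameter count
    trainable: bool   # True if gradients and optimizer state are kept

    def memory_gb(self) -> float:
        """Weight and optimizer memory in GB, excluding activations."""
        per_param = (
            TRAINED_BYTES_PER_PARAM if self.trainable else FROZEN_BYTES_PER_PARAM
        )
        return self.n_params * per_param / 1e9


def rlhf_stack(n_params: float) -> list:
    """Return the four models, all assumed to be the same size."""
    return [
        Model("policy", n_params, trainable=True),      # generates responses
        Model("reference", n_params, trainable=False),  # frozen copy of the starting policy
        Model("reward", n_params, trainable=False),     # scores responses
        Model("value", n_params, trainable=True),       # estimates expected reward
    ]


stack = rlhf_stack(1.3e9)
for m in stack:
    print(f"{m.name:<10} trainable={m.trainable!s:<5} {m.memory_gb():6.1f} GB")

total_gb = sum(m.memory_gb() for m in stack)
print(f"total: {total_gb:.1f} GB (excluding activations)")
```

Under these assumptions a 1.3B-parameter stack needs about 47 GB before any activations are stored, and the two trained models account for nearly all of it. Real systems change the picture with sharding, offloading, or shared backbones, so the figures are best read as an order-of-magnitude guide.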