Reasoning RL and R1-Style Training
How DeepSeek-R1 and Kimi k1.5 showed that reinforcement learning on verifiable rewards can elicit chain-of-thought reasoning in LLMs with little or no human-labelled reasoning data.
When DeepSeek released R1 in January 2025, the first public shock was the benchmark numbers. The quieter shock, on reading the paper, was the training recipe behind the R1-Zero variant: no human-annotated reasoning trajectories, no teacher distillation for the RL phase, just a language model generating answers and a rule-based reward that was essentially binary, "correct" or "wrong." (The released R1 model added a small supervised cold-start stage before RL, but R1-Zero skipped it.) Self-reflection and chain-of-thought appeared to emerge from that signal alone. The question this concept addresses is why that works, and what it implies for post-training methodology more broadly.
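To make "binary reward on a verifiable answer" concrete, here is a minimal sketch of what such a checker could look like for math-style problems. It is illustrative only and is not DeepSeek's reward code: the function names are hypothetical, and it assumes the model is asked to put its final answer inside `\boxed{...}` and that exact string match is an acceptable test of correctness.

```python
import re
from typing import Optional


def extract_boxed_answer(completion: str) -> Optional[str]:
    """Return the contents of the last \\boxed{...} in a completion, or None.

    Assumption: answers contain no nested braces (e.g. no \\frac{1}{2}).
    """
    matches = re.findall(r"\\boxed\{([^{}]*)\}", completion)
    return matches[-1].strip() if matches else None


def binary_reward(completion: str, reference: str) -> float:
    """Score a completion 1.0 if its final answer matches the reference, else 0.0.

    Nothing in the reasoning text itself is graded; only the final answer is.
    """
    answer = extract_boxed_answer(completion)
    if answer is None:
        return 0.0  # no parseable answer counts as wrong
    return 1.0 if answer == reference.strip() else 0.0


if __name__ == "__main__":
    good = "Let me check: 6 * 7 = 42. So the answer is \\boxed{42}."
    bad = "I think it is \\boxed{41}."
    print(binary_reward(good, "42"), binary_reward(bad, "42"))
```

The point of the sketch is what it leaves out: the checker never inspects the chain of thought, so any reasoning behaviour the model develops has to be learned because it raises the chance of a correct final answer.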
RLHF Is Already RL
It is worth being precise about vocabulary before the terminology splits into RLHF, RLVR, GRPO, and R1-style training.