Reasoning RL and R1-Style Training

How DeepSeek-R1 and Kimi k1.5 demonstrated that pure reinforcement learning on verifiable rewards can elicit chain-of-thought reasoning in LLMs without any human-labelled reasoning traces.

advanced · 9 min read · Premium

When DeepSeek released R1 in January 2025, the first public shock was the benchmark numbers. The quieter shock, on reading the paper, was the training recipe behind its R1-Zero variant: no human-annotated reasoning trajectories, no teacher distillation for the RL phase, just a language model generating answers and a rule-based reward signal that essentially says "correct" or "wrong." (The released R1 model adds a small cold-start fine-tuning stage before RL, mainly to improve readability.) Self-reflection and long chain-of-thought appeared to emerge from that signal alone. This concept asks why that works, and what it implies for post-training methodology more broadly.
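To make the "correct or wrong" signal concrete, here is a minimal sketch of a binary verifiable reward. It is an illustration under stated assumptions, not the reward code from either paper: the `Answer:` output convention and the helper names `extract_final_answer` and `binary_reward` are hypothetical, and real pipelines typically use more robust answer matching (for example, math-equivalence checks or unit tests for code).

```python
import re
from typing import Optional


def extract_final_answer(completion: str) -> Optional[str]:
    """Return the text after the last 'Answer:' tag, or None if there is none.

    The 'Answer:' convention is an assumption made for this sketch.
    """
    matches = re.findall(r"Answer:\s*(.+)", completion)
    return matches[-1].strip() if matches else None


def binary_reward(completion: str, reference: str) -> float:
    """Score 1.0 if the extracted answer exactly matches the reference, else 0.0.

    The reasoning text before the answer is never inspected or graded.
    """
    answer = extract_final_answer(completion)
    if answer is None:
        return 0.0
    return 1.0 if answer == reference.strip() else 0.0


# Three sampled completions for the same prompt ("What is 6 * 7?").
samples = [
    "First, 6 * 7 = 42.\nAnswer: 42",
    "Wait, maybe 6 * 7 = 48.\nAnswer: 48",
    "The product is forty-two.",
]
rewards = [binary_reward(s, "42") for s in samples]
print(rewards)  # [1.0, 0.0, 0.0]
```

Note that the reward only checks the final answer. Nothing in it scores the intermediate reasoning, which is why the emergence of reflection and long chains of thought under this kind of signal was surprising.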

RLHF Is Already RL

It is worth being precise about vocabulary before the terminology splits into RLHF, RLVR, GRPO, and R1-style training.
