Latent Reasoning: Teaching Language Models to Think Without Tokens

A reasoning model solving a hard math problem spends most of its output budget on glue. "Let me work through this step by step." "First, I need to find." "Therefore the answer is." Strip a chain-of-thought trace down to the tokens that actually carry computation, and a surprising fraction of it is connective tissue: grammatical scaffolding that keeps the text fluent but does no arithmetic. Coconut's authors made this observation precise, noting that most word tokens "primarily ensure textual coherence and are not essential for reasoning" (Hao et al., 2024, Training Large Language Models to Reason in a Continuous Latent Space, arXiv:2412.06769).

That raises an uncomfortable question. If the model is going to reason in steps anyway, why force every step to pass through the narrow bottleneck of the vocabulary? Latent reasoning is the family of techniques that answers "it shouldn't have to." Instead of decoding each intermediate thought into a discrete token and reading it back in, the model keeps the thought as a continuous vector and feeds it forward directly. The reasoning happens in the hidden space, where it is wider, cheaper per step, and almost entirely invisible.

Why this matters: Chain-of-thought turned the hidden computation of a transformer into readable text, which is exactly what made reasoning models legible and trainable. Latent reasoning proposes to push that computation back under the surface. The payoff is efficiency and a richer search; the cost is that the most powerful part of the model's reasoning may stop being something you can read.

TL;DR

Token-space chain-of-thought (CoT) reasons by emitting words; latent reasoning reasons by iterating on continuous hidden states and only decodes words at the end.
Coconut feeds the model's last hidden state back as the next input embedding ("continuous thought"). On the ProsQA planning benchmark it reached 97.0% accuracy using about 14 latent steps, against 77.5% for token CoT using about 49 tokens (Hao et al., 2024).
A continuous thought can hold a superposition of several candidate next steps, letting the model approximate breadth-first search instead of committing to one path the way greedy token decoding does.
The recurrent-depth approach (Huginn) scales test-time compute by looping a recurrent block to arbitrary depth; its 3.5B-parameter proof of concept keeps improving on reasoning benchmarks up to a compute load equivalent to a roughly 50B-parameter model, without producing extra tokens (Geiping et al., 2025, arXiv:2502.05171).
Latent reasoning is not free accuracy. On GSM8K math, Coconut scored below a token-CoT baseline (34.1% vs 42.9%), so the gains are task-dependent, concentrated on search and planning.
The deepest tension is interpretability: filler-token work shows transformers can do hidden computation behind tokens that carry no visible reasoning, which "raises concerns about unauditable, hidden computations" (Pfau et al., 2024, arXiv:2404.15758).

At a Glance

Token reasoning and latent reasoning differ in one structural choice: whether each intermediate thought is forced through the vocabulary or kept as a vector.

flowchart LR
  Q[Question] --> A{Reasoning mode}
  A -->|token CoT| T1[Emit word] --> T2[Read word back] --> T3[Repeat per step]
  A -->|latent| L1[Hidden state] --> L2[Feed state forward] --> L3[Repeat in latent space]
  T3 --> D[Decode answer]
  L3 --> D
  class Q blue
  class A slate
  class T1,T2,T3 amber
  class L1,L2,L3 purple
  class D teal
classDef blue fill:#1e40af,stroke:#3b82f6,stroke-width:1px,color:#fff
classDef purple fill:#6d28d9,stroke:#a78bfa,stroke-width:1px,color:#fff
classDef teal fill:#0e7490,stroke:#22d3ee,stroke-width:1px,color:#fff
classDef amber fill:#b45309,stroke:#fbbf24,stroke-width:1px,color:#fff
classDef slate fill:#334155,stroke:#64748b,stroke-width:1px,color:#e2e8f0

The amber path pays a vocabulary projection and a re-embedding on every step. The purple path skips both. Everything else about latent reasoning follows from that single edit to the loop.

[IMAGE: Side-by-side schematic of one reasoning step. Left panel: hidden state to logits to argmax token to embedding lookup, with the lossy bottleneck circled in red. Right panel: hidden state fed straight back as the next embedding, no projection, annotated "full vector preserved".]

Before Latent Reasoning

The modern story starts with chain-of-thought prompting in 2022, which showed that eliciting intermediate steps, whether through worked examples or simply by asking a model to "think step by step," unlocked arithmetic and commonsense reasoning that direct answering could not reach. CoT became the default for reasoning, and by late 2024 it was the substrate that reinforcement-learning reasoning models were trained on: produce long visible traces, then reward the ones that land on correct answers.

Two threads then complicated the clean picture that "the words are the reasoning."

The first was the discovery that the content of the intermediate tokens sometimes does not matter. Pause-token work trained models to insert learnable placeholder tokens before answering, buying extra forward passes of computation without emitting any reasoning words, and found measurable gains on several tasks (Goyal et al., 2024, Think Before You Speak: Training Language Models With Pause Tokens, arXiv:2310.02226). The "Let's Think Dot by Dot" result pushed this further: transformers could solve hard algorithmic tasks using meaningless filler tokens such as "......" in place of a real chain of thought, as long as they were given dense enough supervision to learn it (Pfau et al., 2024, arXiv:2404.15758). If reasoning can ride on tokens that say nothing, the tokens are not the point. The extra compute is.

The second thread was the realization that the vocabulary is a bottleneck on the reasoning itself. Sampling a token collapses a rich probability distribution over the next thought into a single discrete choice. Anything the model was uncertain about, any branch it wanted to keep open, gets thrown away the moment it commits to a word.

timeline
  title From visible chains to silent ones
  2022 : Chain-of-thought prompting
       : reasoning becomes visible text
  2023 : Pause tokens proposed
       : extra compute without reasoning words
  2024 : Filler tokens (dot by dot)
       : hidden computation behind blank tokens
  2024 : Coconut introduces continuous thought
       : reasoning in latent space
  2025 : Recurrent depth (Huginn)
       : scale test-time compute by looping
  2025 : Hierarchical and tiny recursive models
       : latent reasoning at small parameter counts

Latent reasoning is what you get when you take both threads seriously at once: keep the compute, drop the requirement that it be spelled out, and stop discarding the model's uncertainty at every step.

[IMAGE: Annotated timeline figure rendering the milestones above as a horizontal track, with each node tagged by its arXiv ID and a one-line "what changed" caption, color-coded blue for visible-reasoning eras and purple for latent-reasoning eras.]

How Latent Reasoning Actually Works

Two designs anchor the field, and they attack the problem from opposite ends. Coconut changes what flows between steps. Recurrent-depth models change how many steps the architecture can take. Understanding both is the fastest way to build the mental model.

Continuous thought: reasoning that never becomes a word

In an ordinary autoregressive model, generating a reasoning step looks like this. The transformer produces a final-layer hidden state \(h_t\) for position \(t\). That state is projected to logits over the vocabulary, a token is selected, and that token is mapped back through the embedding matrix to produce the input vector for the next position. The hidden state, a dense vector in \(\mathbb{R}^d\) with \(d\) often 4096 or larger, is squeezed down to a choice among tens of thousands of discrete symbols and then re-expanded.

Coconut deletes the middle. In "latent mode," the last hidden state \(h_t\) is fed directly back as the next input embedding, with no projection to logits and no token sampling in between:

\[x_{t+1} = h_t\]

The model still runs a full forward pass; it just reasons on a vector that was never rounded to a word. Hao and colleagues call each such vector a continuous thought. The model emits a small number of them, then switches back to ordinary token decoding to produce the final answer.
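
The difference between the two loops can be sketched in a few lines. This is a toy illustration, not Coconut's implementation: `forward` is a hypothetical stand-in for the whole transformer stack, the sizes are tiny, and the logit projection reuses the embedding matrix (tied weights), which is an assumption made here for brevity.

```python
import math

# Toy sizes: 4 vocabulary entries, hidden width 3 (illustrative values only).
EMBED = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [0.5, 0.5, 0.0],
]
# Hypothetical stand-in for the transformer: one fixed linear map plus tanh.
W = [
    [0.2, 0.7, -0.1],
    [-0.4, 0.3, 0.6],
    [0.5, -0.2, 0.4],
]

def forward(x):
    """Map an input embedding x to a 'final hidden state' h."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in W]

def token_step(h):
    """Token mode: project h to logits, pick the argmax token, re-embed it."""
    logits = [sum(e * hi for e, hi in zip(row, h)) for row in EMBED]
    token_id = max(range(len(logits)), key=logits.__getitem__)
    return token_id, list(EMBED[token_id])

def latent_step(h):
    """Latent mode: x_{t+1} = h_t, with no projection and no token selection."""
    return list(h)

h = forward(EMBED[0])
token_id, x_token = token_step(h)   # next input is one of the 4 embedding rows
x_latent = latent_step(h)           # next input is the hidden state itself
```

In token mode the next input can only ever be one of the rows of `EMBED`; in latent mode it is whatever vector the forward pass produced, which is the whole point of the edit to the loop.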

[IMAGE: Anatomy diagram of one transformer step in latent mode, labelling the residual stream, the final-layer hidden state h_t, the deleted logit-projection path (greyed out with an X), and the feedback arrow from h_t to the next input slot.]

The conceptual payoff is that a continuous thought is not a single guess. Because it is a dense vector rather than one sampled token, it can encode several candidate next steps at once. A theoretical follow-up frames this as reasoning by superposition: the hidden state holds a weighted mixture of search-frontier nodes, so iterating in latent space approximates a breadth-first search over reasoning paths rather than the one-path-at-a-time commitment of greedy decoding (Reasoning by Superposition, 2025, arXiv:2505.12514).

flowchart TD
  S[Start node] --> B{Continuous thought}
  B -->|branch A| A1[Path A frontier]
  B -->|branch B| B1[Path B frontier]
  B -->|branch C| C1[Path C frontier]
  A1 --> M[Next continuous thought<br/>keeps viable branches]
  B1 --> M
  C1 --> M
  M --> Ans[Decode once a path resolves]
  class S blue
  class B,M purple
  class A1,B1,C1 slate
  class Ans teal
classDef blue fill:#1e40af,stroke:#3b82f6,stroke-width:1px,color:#fff
classDef purple fill:#6d28d9,stroke:#a78bfa,stroke-width:1px,color:#fff
classDef teal fill:#0e7490,stroke:#22d3ee,stroke-width:1px,color:#fff
classDef slate fill:#334155,stroke:#64748b,stroke-width:1px,color:#e2e8f0

A token chain has to pick one branch and live with it; if it picks wrong, it backtracks in text, spending tokens to undo. A continuous thought can carry the live branches forward together and let the wrong ones decay, which is exactly why the gains show up on tasks that need search.
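
A minimal sketch of what "carrying the live branches forward" means at the level of a single vector. It assumes one-hot, mutually orthogonal candidate embeddings so that the weights can be read back exactly; real hidden states are not this clean, and the names `mix`, `read_out`, and `commit` are hypothetical helpers for illustration.

```python
# One-hot "embeddings" for three candidate next steps (illustrative only).
CANDIDATES = {"A": [1.0, 0.0, 0.0], "B": [0.0, 1.0, 0.0], "C": [0.0, 0.0, 1.0]}

def mix(weights):
    """Build a continuous thought as a weighted sum of candidate embeddings."""
    dim = len(next(iter(CANDIDATES.values())))
    thought = [0.0] * dim
    for name, w in weights.items():
        for i, v in enumerate(CANDIDATES[name]):
            thought[i] += w * v
    return thought

def read_out(thought):
    """Recover each candidate's weight via a dot product with its embedding."""
    return {
        name: sum(a * b for a, b in zip(vec, thought))
        for name, vec in CANDIDATES.items()
    }

def commit(thought):
    """Token-style commitment: keep only the highest-scoring candidate."""
    scores = read_out(thought)
    best = max(scores, key=scores.get)
    return mix({best: 1.0})

beliefs = {"A": 0.5, "B": 0.25, "C": 0.25}
latent = mix(beliefs)        # all three branches survive in one vector
committed = commit(latent)   # only branch A survives
```

Reading out `latent` returns the original weights for A, B, and C, while reading out `committed` returns weight 1.0 on A and 0.0 on the others: the information about B and C is gone after the commitment.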

Training a model to think in latent space

You cannot just switch a pretrained model into latent mode and expect coherent thoughts; it has never seen its own hidden states as inputs. Coconut uses a curriculum that gradually replaces written reasoning with latent steps. The model starts with a normal language CoT for a problem. Then, stage by stage, the first reasoning step is removed from the text and replaced by a continuous thought; then the first two; and so on, until most of the chain lives in latent space and only the question and final answer remain as tokens.

stateDiagram-v2
  [*] --> FullCoT
  FullCoT --> Stage1: replace step 1 with latent
  Stage1 --> Stage2: replace step 2 with latent
  Stage2 --> StageK: continue curriculum
  StageK --> MostlyLatent: reasoning lives in vectors
  MostlyLatent --> [*]
  note right of FullCoT
    Written reasoning supervises early
  end note
  note right of MostlyLatent
    Few or no reasoning tokens remain
  end note

The curriculum matters because of a finding that recurs across this literature: latent reasoning is hard to learn. Filler-token training only converged with "specific, dense supervision" (Pfau et al., 2024), and Coconut's staged replacement plays the same role, giving the model a written scaffold to imitate before the scaffold is removed.
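
The staged replacement can be sketched as a data-preparation function. This is an illustration of the curriculum as described above, not the authors' code: the function name, the `thoughts_per_step` parameter, and the marker strings `<bot>`, `<latent>`, and `<eot>` are placeholders chosen for this sketch.

```python
def build_stage_example(question, steps, answer, stage, thoughts_per_step=1):
    """Return the training sequence for one curriculum stage.

    stage = 0 keeps the full written chain; stage = k replaces the first k
    written steps with latent placeholders that the model fills with its
    own hidden states during training.
    """
    k = min(stage, len(steps))
    latent_slots = ["<latent>"] * (k * thoughts_per_step)
    sequence = [question]
    if latent_slots:
        # Markers delimit the span where the model runs in latent mode.
        sequence += ["<bot>"] + latent_slots + ["<eot>"]
    sequence += steps[k:] + [answer]
    return sequence

steps = ["A connects to B", "B connects to D", "D connects to G"]
for stage in range(len(steps) + 1):
    print(stage, build_stage_example("Can A reach G?", steps, "yes", stage))
```

At stage 0 the sequence is the ordinary written chain; by the final stage only the question, the latent span, and the answer remain, matching the end state of the curriculum.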

Recurrent depth: more thinking, same tokens

The second design leaves the input-output interface alone and instead makes the architecture able to think longer. A standard transformer has a fixed number of layers, so it does a fixed amount of computation per token no matter how hard the token is. Recurrent-depth models break that link. They wrap a block of layers in a loop and run it a variable number of times before reading out, so depth becomes a runtime dial rather than a fixed property of the weights.

Geiping and colleagues built a proof-of-concept, Huginn, with this shape: a prelude that embeds the input, a recurrent core that is iterated to arbitrary depth at test time, and a coda that decodes (Geiping et al., 2025, arXiv:2502.05171). The reasoning is the repeated application of the core to a latent state; the more iterations, the more compute spent, all without emitting a single extra token.

graph TD
  In[Input tokens] --> P[Prelude<br/>embed to latent]
  P --> R[Recurrent core block]
  R --> Chk{Enough iterations}
  Chk -->|no| R
  Chk -->|yes| Coda[Coda decode]
  Coda --> Out[Output tokens]
  class In blue
  class P,Coda slate
  class R purple
  class Chk amber
  class Out teal
classDef blue fill:#1e40af,stroke:#3b82f6,stroke-width:1px,color:#fff
classDef purple fill:#6d28d9,stroke:#a78bfa,stroke-width:1px,color:#fff
classDef teal fill:#0e7490,stroke:#22d3ee,stroke-width:1px,color:#fff
classDef amber fill:#b45309,stroke:#fbbf24,stroke-width:1px,color:#fff
classDef slate fill:#334155,stroke:#64748b,stroke-width:1px,color:#e2e8f0

This approach has three properties the paper emphasizes. It needs no specialized chain-of-thought training data, because the recurrence is part of pretraining. It works with small context windows, since the extra compute lives in iterations rather than in a growing token sequence. And it can capture reasoning that is awkward to put into words, because nothing forces the intermediate states to be verbalizable. The team scaled the model to 3.5 billion parameters trained on 800 billion tokens and showed that its reasoning-benchmark performance keeps improving as it spends more iterations at test time, up to a computation load equivalent to that of a 50-billion-parameter model.
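
The prelude, core, and coda shape can be sketched with toy functions. Everything numeric here is an assumption for illustration (the sinusoidal embedding, the `tanh` update, the zero initial state, the scalar readout); only the structure, a shared block applied a caller-chosen number of times, reflects the design described above.

```python
import math

def prelude(token_ids, width=4):
    """Embed tokens into a latent vector (toy: averaged sinusoidal features)."""
    vec = [0.0] * width
    for t in token_ids:
        for i in range(width):
            vec[i] += math.sin((t + 1) * (i + 1)) / len(token_ids)
    return vec

def core(state, embedded):
    """One pass of the shared recurrent block; same weights on every call."""
    return [math.tanh(0.5 * s + e) for s, e in zip(state, embedded)]

def coda(state):
    """Read a scalar 'answer' out of the final latent state."""
    return sum(state)

def run(token_ids, num_iterations):
    embedded = prelude(token_ids)
    state = [0.0] * len(embedded)      # toy choice: start from zeros
    for _ in range(num_iterations):    # depth is chosen at call time
        state = core(state, embedded)
    return coda(state)

shallow = run([3, 1, 4], num_iterations=1)
deep = run([3, 1, 4], num_iterations=32)
```

The two calls use identical parameters and produce the same number of outputs; only the amount of computation between input and readout differs, which is what makes depth a runtime dial.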

The two designs are complementary. Coconut is a way to make autoregressive reasoning latent; recurrent depth is a way to make a single token's computation arbitrarily deep. Both replace "more visible tokens" with "more hidden compute."

Seeing It in Motion

The clearest way to feel the difference is to watch the generation loop. Here is what happens inside Coconut when it answers in latent mode and then decodes.

sequenceDiagram
  participant U as User
  participant M as Model
  participant H as Hidden state
  participant V as Vocabulary
  U->>M: Question tokens
  M->>H: Produce hidden state
  H->>M: Feed state back as next input
  Note over M,H: Repeat for k latent steps, no decoding
  M->>V: Switch to token mode
  V->>U: Decode final answer

Contrast that with token CoT, where every reasoning step would add an arrow down to the vocabulary and back up, \(k\) times over. The latent loop touches the vocabulary twice: once to read the question, once to write the answer.

For recurrent depth, the lifecycle of a single token's computation is the interesting object. The same weights are applied repeatedly to a latent state that is supposed to converge toward an answer.
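
A hedged sketch of that lifecycle with an explicit stopping rule. The update rule, the tolerance, and the convergence test (largest per-coordinate change) are assumptions made for this toy; the source only describes the loop as running until the state has converged or a budget is hit.

```python
import math

def iterate_until_settled(embedded, max_iterations=64, tolerance=1e-6):
    """Apply the same toy core until the state stops moving or the budget runs out.

    Returns (final_state, iterations_used, converged).
    """
    state = [0.0] * len(embedded)
    for used in range(1, max_iterations + 1):
        new_state = [math.tanh(0.5 * s + e) for s, e in zip(state, embedded)]
        change = max(abs(a - b) for a, b in zip(new_state, state))
        state = new_state
        if change < tolerance:
            return state, used, True       # converged before the budget
    return state, max_iterations, False    # budget hit first

easy_state, easy_used, easy_ok = iterate_until_settled([0.0, 0.0])
hard_state, hard_used, hard_ok = iterate_until_settled([0.9, -0.7])
capped_state, capped_used, capped_ok = iterate_until_settled([0.9, -0.7], max_iterations=3)
```

An input whose state settles immediately exits after a single pass, a harder one uses more passes, and a tight `max_iterations` cap forces a readout from a state that has not yet settled.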

flowchart LR
  E[Embed token] --> S0[Latent state v0]
  S0 --> Step[Apply core]
  Step --> S1[Updated state]
  S1 --> Test{Converged or budget hit}
  Test -->|no| Step
  Test -->|yes| Read[Read out]
  class E blue
  class S0,S1 purple
  class Step purple
  class Test amber
  class Read teal
classDef blue fill:#1e40af,stroke:#3b82f6,stroke-width:1px,color:#fff
classDef purple fill:#6d28d9,stroke:#a78bfa,stroke-width:1px,color:#fff
classDef teal fill:#0e7490,stroke:#22d3ee,stroke-width:1px,color:#fff
classDef amber fill:#b45309,stroke:#fbbf24,stroke-width:1px,color:#fff
classDef slate fill:#334155,stroke:#64748b,stroke-width:1px,color:#e2e8f0

[IMAGE: Line plot of reasoning-benchmark accuracy versus number of recurrent iterations at test time, showing the curve rising then flattening, annotated with the "compute equivalent to ~50B params" point from Geiping et al.]

By the Numbers

The headline results are concentrated on reasoning and planning tasks, and they come with an honest counterexample on arithmetic. The table below collects figures reported in the primary papers; treat them as the authors' own measurements on their setups, not as universal constants.

Benchmark	Method	Accuracy	Reasoning steps or tokens	Source
ProntoQA (logic)	Coconut (latent)	99.8%	~9 continuous thoughts	Hao et al., 2024
ProntoQA (logic)	Token CoT (GPT-2)	98.8%	~92.5 tokens	Hao et al., 2024
ProsQA (planning)	Coconut (latent)	97.0%	~14.2 continuous thoughts	Hao et al., 2024
ProsQA (planning)	Token CoT (GPT-2)	77.5%	~49.4 tokens	Hao et al., 2024
GSM8K (math)	Coconut (latent)	34.1%	~8.2 continuous thoughts	Hao et al., 2024
GSM8K (math)	Token CoT baseline	42.9%	~25 tokens	Hao et al., 2024

Three things stand out. On ProsQA, latent reasoning is both more accurate (by 19.5 points) and about 3.5 times cheaper in steps (14.2 versus 49.4), the clearest win in the set. On ProntoQA it matches token CoT at roughly a tenth of the length. On GSM8K it loses by 8.8 points, which tells you the technique is not a universal upgrade; it shines where search and backtracking dominate and lags where the answer is a fairly linear arithmetic derivation that words capture well.
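
The comparisons above are simple arithmetic on the table, reproduced here so the ratios can be checked. The numbers are copied from the table; the function name and dictionary keys are illustrative.

```python
# (benchmark, latent_accuracy, latent_steps, cot_accuracy, cot_tokens), from the table above.
ROWS = [
    ("ProntoQA", 99.8, 9.0, 98.8, 92.5),
    ("ProsQA", 97.0, 14.2, 77.5, 49.4),
    ("GSM8K", 34.1, 8.2, 42.9, 25.0),
]

def compare(rows):
    """Return accuracy gap (points) and step ratio (CoT tokens per latent step)."""
    summary = {}
    for name, lat_acc, lat_steps, cot_acc, cot_tokens in rows:
        summary[name] = {
            "accuracy_gap": round(lat_acc - cot_acc, 1),
            "step_ratio": round(cot_tokens / lat_steps, 1),
        }
    return summary

for name, stats in compare(ROWS).items():
    print(f"{name}: {stats['accuracy_gap']:+.1f} points, {stats['step_ratio']}x fewer steps")
```

Note that GSM8K also shows about a threefold step reduction, so the accuracy loss there is not because the latent model was given a larger or smaller budget than on ProsQA; the step ratio alone does not predict where the method helps.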

[IMAGE: Grouped bar chart comparing accuracy and step count side by side for ProntoQA, ProsQA, and GSM8K, latent versus token CoT, with the ProsQA accuracy gap and the GSM8K regression both annotated.]

For recurrent depth, the relevant number is the compute-to-quality ratio: a 3.5B-parameter model whose reasoning-benchmark scores keep improving as it unrolls more iterations at inference, up to a compute load equivalent to that of a roughly 50B-parameter model (Geiping et al., 2025). That is a statement about test-time scaling, not parameter count; the weights are small, the spent compute is large.

The complexity tradeoff is worth stating plainly. Let \(s\) be the number of reasoning steps and \(L\) the model depth. Token CoT adds \(O(s)\) positions to the sequence, and each new step attends over a growing context, so the attention cost grows with the chain. Coconut still spends one forward pass and one sequence position per continuous thought, roughly \(O(s \cdot L)\) layer applications in total, but it needs far fewer steps, so the sequence stays short. Recurrent depth keeps the sequence length fixed and instead pays \(O(\text{iterations} \cdot L)\) layer applications per token, with \(L\) here the depth of the recurrent core. Either way, you move cost out of a long KV cache and into repeated passes.
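
A back-of-the-envelope cost model makes the tradeoff concrete. It is a deliberate simplification: it counts positions and attention reads rather than FLOPs, treats a reasoning step of any kind as one appended position, and uses made-up values for the prompt length, iteration count, and core depth.

```python
def chain_cost(prompt_len, steps):
    """Cost of a chain that appends one position per reasoning step.

    Returns (final_sequence_length, attention_reads), where attention_reads
    sums the context length each new step attends over.
    """
    final_len = prompt_len + steps
    attention_reads = sum(prompt_len + i for i in range(1, steps + 1))
    return final_len, attention_reads

def recurrent_cost(prompt_len, iterations, core_layers):
    """Cost of looping a core block over a fixed-length input.

    Returns (final_sequence_length, core_layer_applications_per_position).
    """
    return prompt_len, iterations * core_layers

long_chain = chain_cost(prompt_len=100, steps=49)    # (149, 6125)
short_chain = chain_cost(prompt_len=100, steps=14)   # (114, 1505)
looped = recurrent_cost(prompt_len=100, iterations=32, core_layers=4)  # (100, 128)
```

Under these assumptions a 49-step chain reads about four times as much context as a 14-step one, while the looped model never lengthens the sequence at all and pays entirely in repeated layer applications.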

A Concrete Example

Take a small graph-reachability puzzle, the kind ProsQA is built from. The model is told a set of edges and asked whether node A can reach node G.

edges: A->B, A->C, B->D, C->E, D->G, E->F
query: can A reach G?

A token chain-of-thought would commit to one path at a time and write it down:

A connects to B and C. Take B. B connects to D. D connects to G. Found G. Answer: yes.

That trace got lucky by choosing B first. Had it chosen C, it would have walked A, C, E, F, hit a dead end, and spent tokens backtracking: "F has no edge to G, go back, try B." Every exploratory move and every retraction is paid for in emitted tokens, and the context grows with each one.
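
The cost of an unlucky first choice can be made visible with a depth-first search that logs each move and each retraction as a line of text. This is a sketch of the behavior described above, not a model of any real decoder; the function and flag names are hypothetical, and it assumes the graph has no cycles.

```python
EDGES = {"A": ["B", "C"], "B": ["D"], "C": ["E"], "D": ["G"], "E": ["F"]}

def dfs_trace(start, goal, prefer_last=False):
    """Depth-first search that logs every move and every retraction as text."""
    trace = []

    def visit(node):
        if node == goal:
            trace.append(f"Found {goal}.")
            return True
        children = EDGES.get(node, [])
        if prefer_last:
            children = list(reversed(children))
        for child in children:
            trace.append(f"{node} connects to {child}.")
            if visit(child):
                return True
            # Every retraction is also paid for in emitted text.
            trace.append(f"Dead end at {child}, go back to {node}.")
        return False

    found = visit(start)
    trace.append("Answer: yes." if found else "Answer: no.")
    return trace

lucky = dfs_trace("A", "G")                      # explores B first
unlucky = dfs_trace("A", "G", prefer_last=True)  # explores C first
```

On this graph the lucky ordering produces 5 lines and the unlucky one produces 11, with the same final answer; the extra six lines are pure exploration and unwinding.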

Now trace the latent version step by step, watching the continuous thought as a frontier vector rather than a single choice:

Latent step	Frontier encoded in the continuous thought	What it represents
1	{B, C}	both neighbors of A, weighted
2	{D, E}	expand B to D and C to E together
3	{G, F}	D reaches G, E reaches F, in superposition
decode	G is present	the reachable target is found

The continuous thought at step 2 is not "the model picked D." It is a vector that holds D and E at once, the superposition the theory describes (arXiv:2505.12514). By step 3, G appears in the frontier, and only then does the model decode a token: "yes." It never wrote down the dead-end path through E and F, because it never had to commit to it. The roughly 14 continuous thoughts Coconut uses on ProsQA on average buy this breadth-first behavior; the roughly 49-token CoT spends much of its length on wrong branches and the cost of unwinding them.
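
The frontier table can be reproduced with a small weighted breadth-first expansion. This is an idealized picture of the superposition view, assuming the frontier is an explicit dictionary of node weights that splits evenly across outgoing edges; a trained model would hold something like this implicitly in a dense vector, and the function names are illustrative.

```python
EDGES = {"A": ["B", "C"], "B": ["D"], "C": ["E"], "D": ["G"], "E": ["F"]}

def expand(frontier):
    """Push every weighted frontier node one edge forward, then renormalize."""
    nxt = {}
    for node, weight in frontier.items():
        children = EDGES.get(node, [])
        for child in children:
            nxt[child] = nxt.get(child, 0.0) + weight / len(children)
    total = sum(nxt.values())
    return {n: w / total for n, w in nxt.items()} if total else {}

def latent_search(start, goal, max_steps=10):
    """Expand the whole frontier each step; stop once the goal carries weight."""
    frontier = {start: 1.0}
    history = []
    for _ in range(max_steps):
        frontier = expand(frontier)
        history.append(frontier)
        if not frontier or goal in frontier:
            break
    return history

history = latent_search("A", "G")
for step, frontier in enumerate(history, start=1):
    print(step, frontier)
```

The search stops after three expansions with frontiers {B, C}, {D, E}, and {G, F}, each node carrying weight 0.5, and at no point does it need a retraction step.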

[IMAGE: Animation-style filmstrip of the graph, with the active frontier set highlighted at each latent step (step 1: A's neighbors lit; step 2: their neighbors lit; step 3: G lit), beside the token-CoT version showing a single highlighted path that has to backtrack.]

The honesty check: replace this with a GSM8K word problem about apples and prices, and the advantage shrinks or reverses. There is little to search; the reasoning is a short arithmetic line, and writing it out in tokens is both accurate and naturally supervised. That is precisely the regime where Coconut underperformed.

Where It Breaks

The failure modes of latent reasoning are not incidental; they are the flip side of its strengths.

Training instability is the first wall. A pretrained model has never consumed its own hidden states as inputs, and naive attempts to make it do so collapse. Both Coconut's staged curriculum and the dense supervision required for filler tokens exist to work around this; remove the scaffolding and the model fails to learn to use the latent steps at all (Pfau et al., 2024).

Task sensitivity is the second. The GSM8K regression is not a footnote; it is the boundary of the method. Latent reasoning helps where the bottleneck is search and planning and hurts where the reasoning is a clean linear derivation that words encode well and that written supervision teaches easily.

The third is evaluation fragility, and it is a cautionary tale about reading benchmark wins. The Hierarchical Reasoning Model reported 40.3% on ARC-AGI with only 27M parameters, beating much larger CoT models on the same evaluation (Wang et al., 2025, Hierarchical Reasoning Model, arXiv:2506.21734). When the ARC Prize team reproduced it, they found the headline architecture mattered far less than advertised: a similarly sized ordinary transformer did about as well, and the real driver of the gains was an outer refinement loop, while the puzzle-specific embeddings made the setup largely transductive, meaning test puzzles had to be present at training time (ARC Prize, 2025, The Hidden Drivers of HRM's Performance on ARC-AGI). The lesson generalizes: when reasoning moves into latent dynamics you cannot read, it gets correspondingly harder to attribute why a number went up.

That last point is the deepest one. Chain-of-thought traces are not just an output format; they are a window. They let you spot when a model reaches the right answer for the wrong reason, and they give safety and oversight work something concrete to monitor. Filler-token research already showed that models can route real computation through tokens that reveal nothing, and the authors flagged "concerns about large language models engaging in unauditable, hidden computations that are increasingly detached from the observed chain-of-thought" (Pfau et al., 2024). Latent reasoning is that concern by construction: the reasoning is, on purpose, not in the tokens.

[IMAGE: Two-panel comparison. Left: a readable CoT trace with one step circled as "caught: wrong reasoning, right answer". Right: a row of opaque latent vectors with a question mark, captioned "nothing to inspect".]

Alternative Designs

Latent reasoning sits inside a broader menu of ways to spend more compute on a hard problem. The options trade auditability, training cost, and where the compute lives.

Approach	How it adds compute	Strengths	Weaknesses	Best when
Token chain-of-thought	More emitted reasoning tokens	Readable, easy to supervise and RL-train	Long sequences, growing KV cache, lossy per-step commitment	You need auditability and the reasoning is verbalizable
Continuous thought (Coconut)	Feed hidden state forward, no decoding	Cheap steps, encodes search frontier	Hard to train, task-sensitive, opaque	Planning and search tasks with backtracking
Recurrent depth (Huginn)	Loop a core block at test time	No special data, small context, tunable depth	Custom architecture, novel pretraining	You want test-time scaling without longer outputs
Pause or filler tokens	Insert blank compute tokens	Minimal architecture change	Gains modest, needs dense supervision	Light extra compute on an existing model
Hierarchical or tiny recursive nets	Iterate small modules in latent space	Strong on puzzles at tiny parameter counts	Benchmark claims need scrutiny, often transductive	Narrow structured-reasoning domains

The honest framing is that these are not strictly ranked. Token CoT remains the workhorse precisely because it is legible and trains cleanly with reinforcement learning from verifiable rewards. Latent methods are specialists that pay off on the search-heavy tail and in compute-constrained settings, and the tiny-network results (Less is More: Recursive Reasoning with Tiny Networks, 2025, arXiv:2510.04871) suggest the design space for small latent reasoners is still wide open.

How It Is Used in Practice

As of mid-2026, latent reasoning is mostly a research frontier rather than a default in shipped frontier models, and it is worth being precise about that. The large commercial reasoning systems still primarily scale visible token chains, because that interface is the one their reinforcement-learning pipelines, evaluation harnesses, and safety monitoring are all built around. Latent reasoning's production footprint is in narrower places.

The recurrent-depth model Huginn was released openly with weights and a training recipe, which makes it the most reproducible artifact in the area and a common starting point for teams experimenting with test-time depth scaling (Geiping et al., 2025). The appeal for inference engineering is concrete: depth scaling does not lengthen the output, so it does not inflate the KV cache or the latency that comes from generating thousands of reasoning tokens. For a fixed answer length you get a compute dial that adapts to problem difficulty.

The structured-reasoning niche is where the small latent models live. HRM and its successors target puzzle domains such as ARC, Sudoku, and mazes, where the input is a grid and the reasoning is iterative refinement rather than language (Wang et al., 2025). The ARC Prize reanalysis is the necessary caveat on any production claim here: the impressive sample-efficiency came substantially from the refinement loop and a transductive setup, so these are not yet general reasoners you can drop into an open-ended product (ARC Prize, 2025).

The operational considerations are the usual ones for a new compute primitive. Latency is governed by iteration or latent-step count, which you can cap. Cost moves from token output into forward passes, which changes how you budget. And monitoring is the open problem: standard guardrails that inspect the reasoning trace have nothing to read, so teams adopting latent reasoning in any sensitive setting need a different oversight story.

[IMAGE: System diagram of an inference stack with a latent-reasoning model, showing a "compute budget" control feeding the iteration loop, a KV cache annotated as "stays small", and a greyed-out "trace monitor" box marked "no visible reasoning to inspect".]

Insights Worth Remembering

The tokens in a chain-of-thought are doing two jobs at once: carrying computation and being human-readable. Latent reasoning is the bet that you can keep the first job while dropping the second.
A sampled token is a lossy commitment. The single biggest conceptual gain of continuous thought is that it stops throwing away the model's uncertainty at every step.
Reasoning-by-superposition reframes latent steps as approximate breadth-first search, which is why the wins cluster on planning and backtracking tasks and evaporate on linear arithmetic.
Depth and tokens are two separate knobs for test-time compute. Recurrent-depth models turn the architecture into the knob, so a small model can spend large compute without saying more.
Latent reasoning is consistently hard to train. Whenever it works, look for the scaffolding (a curriculum, dense supervision) that taught the model to use its own hidden states.
A benchmark win from a latent model deserves extra scrutiny, because when the reasoning is invisible, the reasons for the number are too. The HRM reanalysis is the case study.
The efficiency story and the interpretability story are the same story told with opposite signs. The compute you save by not spelling out reasoning is exactly the oversight you lose.

Open Questions

What is firmly established: continuous thoughts can match or beat token CoT on search-heavy benchmarks at a fraction of the step count (Hao et al., 2024); recurrent depth can trade test-time iterations for effective model size (Geiping et al., 2025); and additional compute can help even when carried by tokens that say nothing (Pfau et al., 2024). Those are measured results.

Beyond that, the field is genuinely unsettled. Whether latent reasoning scales to frontier-sized models and broad task distributions, rather than the controlled benchmarks where it has been demonstrated, is an open empirical question; the published proof-of-concept models are small. It is also unclear how to combine latent reasoning with reinforcement learning from verifiable rewards, the technique behind today's strongest reasoning models, since current RL pipelines sample and score visible traces, and a deterministic latent step offers no sampled token whose probability can be reinforced.

The interpretability question may be the one that decides adoption. There is early work probing what continuous thoughts encode and trying to decode latent steps back into language post hoc, but no mature method for auditing latent reasoning the way you can read a CoT. If oversight regimes come to require legible reasoning, latent methods may stay confined to low-stakes, high-efficiency settings regardless of their raw capability. Whether someone finds a way to get the compute benefits of latent reasoning while preserving a readable trace, perhaps by periodically decoding the hidden state into a faithful summary, is, as far as the current evidence goes, an open problem rather than a solved one.