
Diffusion Language Models: Writing Text by Denoising, Not Predicting the Next Token

June 10, 2026 · 22 min read

In May 2025, Google DeepMind showed a model writing a working block of code in roughly the time it takes a person to blink. Gemini Diffusion was clocked at around 1,479 tokens per second, an order of magnitude past the speed-optimized autoregressive models of the day (Google DeepMind, 2025, Gemini Diffusion). The trick was not a faster GPU or a smaller model. It was a different way of generating: instead of committing to one token and then asking what comes next, the model produced an entire draft of mostly-garbage tokens and refined the whole thing in parallel over a few passes.

That is the diffusion recipe, borrowed from the image-generation world and bent to fit discrete text. For most of the Transformer era it was a sideline, interesting to a handful of labs and slower than the autoregressive models it hoped to replace. By the end of 2025 it had produced commercial systems and a NeurIPS oral. This piece is about what changed, how the mechanism actually works on tokens rather than pixels, and where it still falls short.

Why this matters: Autoregressive decoding is serial by construction; generating the 500th token requires the 499 before it. That serial dependency, not raw compute, is what bounds the latency of a chatbot reply. Diffusion language models break the dependency, and in doing so they change which problems are latency-bound and which are not.

TL;DR

  • Diffusion language models generate by iterative denoising: they start from a fully corrupted (usually fully masked) sequence and unmask or correct tokens over a small number of steps, all positions in parallel, rather than one left-to-right token at a time.
  • The dominant modern formulation is masked discrete diffusion. The forward process replaces tokens with a [MASK] symbol on a schedule; a Transformer learns the reverse process of predicting the originals. This is a strict generalization of BERT-style masking, trained to be a full generative model.
  • LLaDA, an 8B diffusion model trained from scratch, matched LLaMA3 8B on a broad benchmark suite and beat GPT-4o on a reversal-curse task, the first strong evidence that diffusion scales to LLM-grade quality (Nie et al., 2025, arXiv:2502.09992).
  • The headline advantage is throughput. Commercial diffusion models (Inception's Mercury, Gemini Diffusion) report 1,000 to 1,500 tokens per second, roughly 5x to 10x faster than comparable autoregressive models, because steps are decoupled from sequence length.
  • The headline cost is that naive diffusion lacked a KV cache and lost quality when unmasking many tokens at once. Fast-dLLM recovered both, reporting up to 27.6x throughput gains training-free (Wu et al., 2025, arXiv:2505.22618).
  • Diffusion is natively bidirectional and good at infilling and global revision, which autoregression handles awkwardly. It is weaker at variable-length generation and exact left-context conditioning, which autoregression gets for free.
  • Block diffusion interpolates between the two paradigms, doing diffusion within blocks and autoregression across them, and recovers the KV cache and arbitrary-length generation (Sahoo et al., 2025, arXiv:2503.09573).

At a Glance

The contrast with autoregression is the whole story, so start there. An autoregressive model builds the answer one position at a time; a diffusion model builds all positions at once and sharpens them over a few rounds.

flowchart LR
  subgraph AR["Autoregressive"]
    direction LR
    A1["the"] --> A2["cat"] --> A3["sat"] --> A4["..."]
  end
  subgraph DIFF["Diffusion"]
    direction LR
    D0["all MASK"] --> D1["partial draft"] --> D2["sharper draft"] --> D3["final text"]
  end
  AR -.serial, N steps for N tokens.-> COST1["latency grows with length"]
  DIFF -.parallel, K steps fixed.-> COST2["latency set by step count"]

  classDef blue fill:#1e40af,stroke:#3b82f6,stroke-width:1px,color:#fff
  classDef purple fill:#6d28d9,stroke:#a78bfa,stroke-width:1px,color:#fff
  classDef amber fill:#b45309,stroke:#fbbf24,stroke-width:1px,color:#fff
  classDef emerald fill:#047857,stroke:#34d399,stroke-width:1px,color:#fff
  class A1,A2,A3,A4 blue
  class D0,D1,D2,D3 purple
  class COST1 amber
  class COST2 emerald

The number of denoising steps K is a knob set by the practitioner, not by the length of the output. Generating a 50-token answer and a 500-token answer can both take the same number of model evaluations. That single fact is why the latency curves cross.

[IMAGE: Side-by-side animation frames showing an autoregressive model filling a sentence left to right versus a diffusion model with masked tokens resolving in scattered positions across four steps]

Before Parallel Text

Diffusion as a generative idea matured on images. The denoising diffusion probabilistic model treated a photo as a point in continuous space, added Gaussian noise until it was static, and trained a network to walk backward to a clean image (Ho et al., 2020, Denoising Diffusion Probabilistic Models, arXiv:2006.11239). Continuous noise is natural for pixels. Text is the awkward case: a token is a discrete symbol from a fixed vocabulary, and there is no obvious meaning to "30% of the way between cat and dog."

Two lineages tried to fix the mismatch. One kept the continuous machinery and mapped tokens into an embedding space where Gaussian noise makes sense, then rounded back to words at the end. Diffusion-LM took this route and showed it enabled fine-grained, gradient-guided control over generated text, though it was slow and fiddly to round correctly (Li et al., 2022, Diffusion-LM Improves Controllable Text Generation, arXiv:2205.14217).

The other lineage embraced the discreteness. D3PM defined diffusion directly on tokens using structured corruption matrices, including one with an absorbing state: tokens decay into a special [MASK] symbol rather than into noise. The paper noted that this absorbing-state process draws a clean line connecting diffusion to autoregressive and mask-based models (Austin et al., 2021, Structured Denoising Diffusion Models in Discrete State-Spaces, arXiv:2107.03006). That observation turned out to be the seed of everything that followed.

The breakthrough that made discrete diffusion competitive was a better training objective. Score Entropy Discrete Diffusion (SEDD) introduced a loss that extends score matching to discrete spaces, cut perplexity by 25 to 75 percent over earlier diffusion language models, and edged past a comparable GPT-2 (Lou et al., 2024, Discrete Diffusion Modeling by Estimating the Ratios of the Data Distribution, arXiv:2310.16834). It won a best-paper award at ICML 2024. Within a year the recipe had been scaled to 8 billion parameters and a commercial product line.

timeline
  title Evolution of Diffusion Language Models
  2020 : DDPM matures diffusion on images
  2021 : D3PM defines discrete diffusion, absorbing-state mask process
  2022 : Diffusion-LM does controllable text in embedding space
  2023 : SEDD's score-entropy loss closes the perplexity gap
  2024 : SEDD wins ICML best paper, masked diffusion recipe consolidates
  2025 : LLaDA 8B matches LLaMA3 8B, Gemini Diffusion and Mercury ship
  2026 : Mercury 2 reports 1000-plus tokens per second in production

How Diffusion Language Models Actually Work

Strip away the image-generation vocabulary and a masked diffusion language model is doing something a Transformer person already half-knows: it is a masked language model, like BERT, trained at every masking ratio at once and turned into a generator by repeated application.

The forward process: planned destruction

Training starts by destroying data on purpose. Take a clean token sequence \(x_0\) of length \(L\). The forward process corrupts it according to a continuous time variable \(t\) running from 0 (clean) to 1 (fully destroyed). In the masked formulation, each token is independently replaced by [MASK] with probability \(t\), and left untouched otherwise.

\[q(x_t \mid x_0) = \prod_{i=1}^{L} \left[(1-t)\,\mathbf{1}(x_t^i = x_0^i) + t\,\mathbf{1}(x_t^i = \text{MASK})\right]\]

At \(t = 0.3\), roughly 30 percent of tokens are masked; at \(t = 1\), all of them are. There is nothing to learn here. The forward process is a fixed, known corruption schedule, the discrete analogue of adding a known amount of Gaussian noise to a photo.
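As a minimal sketch of this corruption step, assume whitespace-split words stand in for tokens. The helper name `forward_mask` and the literal `"[MASK]"` string are illustrative choices, not taken from any particular codebase.

```python
import random

MASK = "[MASK]"  # illustrative mask symbol

def forward_mask(tokens, t, rng):
    """Independently replace each token with MASK with probability t."""
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must lie in [0, 1]")
    # rng.random() is uniform on [0, 1), so t=0 masks nothing and t=1 masks everything.
    return [MASK if rng.random() < t else tok for tok in tokens]

rng = random.Random(0)  # seeded only to make the demo repeatable
clean = "the cat sat on the mat".split()
for t in (0.0, 0.3, 1.0):
    print(t, forward_mask(clean, t, rng))
```

Because each position is masked independently, the masked fraction at a given \(t\) is only \(t\) on average; a short sequence can land well above or below it.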

[IMAGE: A clean sentence at t=0 progressively dissolving into MASK tokens as t increases from 0 to 1, with the masking probability annotated at t=0.3, 0.6, and 1.0]

The reverse process: the only thing the model learns

The network's job is the reverse: given a partially masked sequence \(x_t\), predict the clean tokens that were masked out. Because the Transformer is bidirectional (no causal mask), every prediction at a masked position can attend to every unmasked token on both sides. The model outputs a distribution over the vocabulary for each masked slot.

Training optimizes a likelihood lower bound. For the masked formulation it reduces to a clean, reweighted cross-entropy: sample a time \(t\), mask the sequence accordingly, and ask the model to predict the originals only at masked positions, weighting the loss by \(1/t\).

\[\mathcal{L} = \mathbb{E}_{t,\,x_0,\,x_t}\left[\frac{1}{t}\sum_{i:\,x_t^i = \text{MASK}} -\log p_\theta\!\left(x_0^i \mid x_t\right)\right]\]

[IMAGE: Diagram contrasting BERT's single fixed 15 percent masking ratio against a diffusion LM's training across the full 0 to 100 percent masking range, shown as two distributions over masking ratio]

This is the LLaDA objective, and the resemblance to BERT's masked-token loss is not a coincidence. The difference is that BERT trains at a single, fixed masking ratio (about 15 percent) and is never used to generate. A diffusion LM trains across the full range of ratios from near-zero to 100 percent, which is exactly what it needs to denoise a sequence starting from all-mask (Nie et al., 2025, arXiv:2502.09992).
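The objective can be sketched for a single sampled \(t\) and a single sequence, which gives one Monte Carlo term of the expectation. The function name and the dictionary-per-position representation of the model's output are assumptions made for illustration; a real implementation would work on logit tensors.

```python
import math

MASK = "[MASK]"  # illustrative mask symbol

def masked_diffusion_loss(x0, xt, predicted, t):
    """1/t-weighted cross-entropy, summed over masked positions only.

    predicted[i] maps candidate tokens to probabilities and stands in
    for p_theta(. | x_t) at position i.
    """
    if not 0.0 < t <= 1.0:
        raise ValueError("t must lie in (0, 1]")
    total = 0.0
    for clean_tok, noisy_tok, dist in zip(x0, xt, predicted):
        if noisy_tok == MASK:  # unmasked positions contribute nothing
            total += -math.log(dist[clean_tok])
    return total / t

x0 = ["the", "cat", "sat"]
xt = ["the", MASK, MASK]
predicted = [{}, {"cat": 0.5, "dog": 0.5}, {"sat": 0.25, "ran": 0.75}]
loss = masked_diffusion_loss(x0, xt, predicted, t=0.5)
print(loss)  # (ln 2 + ln 4) / 0.5 = 6 ln 2
```

The \(1/t\) factor offsets the fact that low-\(t\) samples contain few masked positions and would otherwise contribute very little to the sum.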

Generation: unmask, recheck, repeat

Sampling runs the reverse process. Start with \(x_1\), a sequence of all [MASK]. Pick a number of steps \(K\). At each step, the model predicts clean tokens for every masked position, you commit some of them (the confident ones), and you leave the rest masked for the next pass.

[IMAGE: A 6-row grid showing one sequence over six denoising steps, masked cells in grey filling in with words, annotated with the per-step confidence threshold that decided which cells were committed]

The architecture is a standard decoder-only Transformer stack with the causal mask removed, so attention is bidirectional.

graph TD
  IN["Masked sequence x_t"] --> EMB["Token + position embeddings"]
  EMB --> ATT["Bidirectional self-attention<br/>no causal mask"]
  ATT --> FFN["Feed-forward layers"]
  FFN --> HEAD["Per-position vocab logits"]
  HEAD --> CONF["Confidence scoring<br/>per masked position"]
  CONF --> COMMIT["Commit high-confidence tokens"]
  COMMIT --> REMASK["Re-mask the rest, next step"]
  REMASK -.feed back.-> IN

  classDef blue fill:#1e40af,stroke:#3b82f6,stroke-width:1px,color:#fff
  classDef purple fill:#6d28d9,stroke:#a78bfa,stroke-width:1px,color:#fff
  classDef teal fill:#0e7490,stroke:#22d3ee,stroke-width:1px,color:#fff
  classDef slate fill:#334155,stroke:#64748b,stroke-width:1px,color:#e2e8f0
  class IN,EMB blue
  class ATT,FFN,HEAD purple
  class CONF,COMMIT teal
  class REMASK slate

The decision of which tokens to commit at each step is the central design choice. The simplest schedule unmasks a fixed fraction per step in random order. A much better one is confidence-based: at each step, keep the predictions whose probability exceeds a threshold and re-mask the rest. Easy positions (a closing parenthesis, the obvious next word in a fixed phrase) resolve early and anchor the harder positions around them. This is where diffusion's bidirectionality pays off: a token decided late gets to condition on tokens decided early on both sides.
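A confidence-based schedule can be sketched as follows. The `toy_model` function is a hypothetical stand-in for the Transformer: it returns a fixed sentence with confidences that rise as more context is committed. The fallback that commits the single most confident position when nothing clears the threshold is an assumption added here so the loop always makes progress.

```python
MASK = None  # masked positions are represented as None in this sketch

def toy_model(seq):
    """Hypothetical model: returns {position: (token, confidence)} for masked slots."""
    target = ["the", "cat", "sat", "on", "the", "mat"]
    base = [0.40, 0.55, 0.95, 0.70, 0.45, 0.92]
    filled = sum(tok is not MASK for tok in seq)
    # Confidence grows with the amount of committed context, capped at 0.99.
    return {
        i: (target[i], min(0.99, base[i] + 0.2 * filled))
        for i, tok in enumerate(seq)
        if tok is MASK
    }

def sample(model, length, threshold=0.9, max_steps=32):
    seq = [MASK] * length
    steps = 0
    while MASK in seq and steps < max_steps:
        preds = model(seq)  # one forward pass over the whole sequence
        steps += 1
        confident = {i: tok for i, (tok, conf) in preds.items() if conf >= threshold}
        if not confident:
            # Assumed fallback: commit the best single position to guarantee progress.
            best = max(preds, key=lambda i: preds[i][1])
            confident = {best: preds[best][0]}
        for i, tok in confident.items():
            seq[i] = tok
        # Positions not committed simply stay MASK for the next pass.
    return seq, steps

text, steps = sample(toy_model, length=6)
print(steps, text)
```

With these toy confidences the six positions resolve out of order, with "sat" and "mat" committed first and the opening words last, in fewer passes than there are tokens.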

Why steps decouple from length

The reason latency stops scaling with output length is structural. One model evaluation processes the entire length-\(L\) sequence in parallel, exactly as a single Transformer forward pass always has. Autoregression needs \(L\) such passes to produce \(L\) tokens. Diffusion needs \(K\) passes regardless of \(L\), where \(K\) is the step count you chose. If \(K \ll L\), you win. The catch, addressed below, is that pushing \(K\) too low degrades quality.

Seeing It in Motion

Walk the generation loop as a conversation between the sampler and the model. The sampler holds the working sequence; the model is queried once per step and returns predictions plus confidences.

sequenceDiagram
  participant S as Sampler
  participant M as Diffusion Transformer
  S->>M: Step 1 - all MASK sequence
  M-->>S: Predictions + confidences (all positions)
  S->>S: Commit tokens above threshold, re-mask rest
  S->>M: Step 2 - partially filled sequence
  M-->>S: Refined predictions for remaining MASKs
  S->>S: Commit next batch, re-mask rest
  Note over S,M: ... a few more steps ...
  S->>M: Step K - one or two MASKs left
  M-->>S: Final predictions
  S->>S: Sequence fully committed, return text

A useful way to see the trade-off is as a state machine over a single token position. Each position is born masked and ends committed; the only question is on which step it crosses over, and that depends on how confident the model is about it relative to the threshold.

stateDiagram-v2
  [*] --> Masked
  Masked --> Predicted: model scores this position
  Predicted --> Committed: confidence above threshold
  Predicted --> Masked: confidence below threshold, wait
  Committed --> [*]

  classDef purple fill:#6d28d9,stroke:#a78bfa,stroke-width:1px,color:#fff
  classDef teal fill:#0e7490,stroke:#22d3ee,stroke-width:1px,color:#fff
  classDef slate fill:#334155,stroke:#64748b,stroke-width:1px,color:#e2e8f0
  class Masked slate
  class Predicted purple
  class Committed teal

The re-mask edge, where a low-confidence prediction is thrown away and retried next step, is what lets the model revise its tentative guesses. Autoregression cannot do this: once a token is sampled it is frozen and every later token is conditioned on it, errors and all. A diffusion model treats uncommitted drafts as provisional, although in this simple scheme a token is just as final once it crosses into Committed.

By the Numbers

The quantitative case for diffusion is throughput, and the case against the naive version is the step-count tax. Both show up in measured numbers.

| Model | Params | Type | Reported throughput | Quality anchor |
| --- | --- | --- | --- | --- |
| LLaMA3 8B | 8B | autoregressive | tens of tok/s (hardware-dependent) | baseline 8B |
| LLaDA 8B | 8B | masked diffusion | comparable to AR per-step | matches LLaMA3 8B on broad suite |
| Mercury 2 | undisclosed | diffusion | ~1,196 tok/s (Artificial Analysis) | competitive with Haiku-class |
| Gemini Diffusion | undisclosed | diffusion | ~1,479 tok/s (vendor) | rivals Google's fast AR models |

Sources: LLaDA quality and parity claims from Nie et al., 2025, arXiv:2502.09992; Mercury throughput and benchmark scores reported by Inception Labs and the independent Artificial Analysis testing service (Inception Labs, 2026); Gemini Diffusion figure from Google DeepMind's announcement. Treat the vendor throughput numbers as best-case marketing figures measured in latency-optimized regimes, not guaranteed steady-state production rates.

The complexity table explains where the speed comes from and what it costs in compute.

| Quantity | Autoregressive | Masked diffusion |
| --- | --- | --- |
| Forward passes to generate \(L\) tokens | \(L\) | \(K\) (steps), \(K\) tunable |
| Attention cost per pass | \(O(L^2)\) (with KV cache, \(O(L)\) incremental) | \(O(L^2)\) per pass |
| Total attention work | \(O(L^2)\) | \(O(K \cdot L^2)\) |
| Serial dependency | full (\(L\) steps) | \(K\) steps |
| KV cache (naive) | yes | no (bidirectional context changes each step) |

[IMAGE: Line chart of wall-clock latency versus output length, autoregressive rising linearly while diffusion stays roughly flat, with the crossover point marked]

The diffusion column does more total compute, \(K\) full forward passes over the whole sequence, yet finishes sooner in wall-clock time because those passes are not serially chained to the output length. It is trading FLOPs for latency. That trade is attractive exactly when you are latency-bound and have spare parallel compute, which describes a lot of interactive serving.
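The trade can be made concrete with a rough count, under the simplifying assumptions of the table above: only attention pair interactions are counted, the autoregressive decoder uses a KV cache, the diffusion decoder uses none, and constants and feed-forward cost are ignored. The function name is illustrative.

```python
def decoding_costs(L, K):
    """Serial passes and attention pair counts for L output tokens and K diffusion steps."""
    return {
        "ar_serial_passes": L,
        # With a KV cache, step i attends over i positions: 1 + 2 + ... + L.
        "ar_attention_pairs": L * (L + 1) // 2,
        "diffusion_serial_passes": K,
        # Without a cache, every one of the K passes is full L x L attention.
        "diffusion_attention_pairs": K * L * L,
    }

costs = decoding_costs(L=200, K=16)
for name, value in costs.items():
    print(f"{name}: {value}")
```

For these example values the diffusion decoder chains 16 passes instead of 200 but performs roughly 32 times the attention work, which is the FLOPs-for-latency exchange in miniature.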

The naive "no KV cache" entry was the practical killer until 2025. Because a diffusion model's context changes at every step (newly committed tokens become visible context), the keys and values cannot simply be cached and reused the way they are in autoregression. Fast-dLLM solved this with a block-wise approximate KV cache plus confidence-aware parallel decoding, reporting up to 27.6x throughput improvement on LLaDA and Dream with minimal accuracy loss, entirely training-free (Wu et al., 2025, arXiv:2505.22618).

A Concrete Example

Generate a short answer to "What is 17 times 4?" with a masked diffusion model, target length 6 tokens, confidence threshold 0.9, and watch the sequence evolve. Tokens are shown as words for readability; _ is [MASK].

Step 0 (initialization). The sequence is all mask.

[ _   _   _   _   _   _ ]

Step 1. The model predicts every position at once and reports a confidence for each. Suppose it returns:

| Position | Top prediction | Confidence |
| --- | --- | --- |
| 1 | "The" | 0.71 |
| 2 | "answer" | 0.62 |
| 3 | "is" | 0.95 |
| 4 | "68" | 0.93 |
| 5 | "." | 0.97 |
| 6 | (end) | 0.98 |

Positions 3, 4, 5, 6 clear the 0.9 threshold and get committed. Positions 1 and 2 do not, so they are re-masked.

[ _   _   is   68   .   <end> ]

Notice the model committed the arithmetic result "68" and the terminal punctuation before it committed the opening words. An autoregressive model could never do this; it would have had to produce "The" and "answer" first. Here the easy, high-information anchors lock in early.

Step 2. Now the model re-predicts positions 1 and 2, this time conditioning on the committed right context "is 68."

| Position | Top prediction | Confidence |
| --- | --- | --- |
| 1 | "The" | 0.96 |
| 2 | "answer" | 0.94 |

Both clear the threshold. The sequence is complete:

[ The   answer   is   68   .   <end> ]

Two model evaluations produced a six-token answer, versus six evaluations for an autoregressive decoder. The win is small here because the sequence is short. Extend the target to 200 tokens and the diffusion model might finish in 10 to 20 steps while the autoregressive model needs 200, which is where the order-of-magnitude latency gaps come from. The example also shows the failure mode in miniature: if the model had committed a wrong-but-confident "72" at position 4, no later step would revisit it, because committed tokens are frozen.
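The walkthrough can be replayed mechanically. The sketch below hard-codes the two confidence tables from the example rather than calling a model, and the `replay` helper is a hypothetical name. It applies the same rule at each step: commit anything at or above the threshold, leave the rest masked, and never touch a committed position again.

```python
THRESHOLD = 0.9

# Per-step {position: (top prediction, confidence)} copied from the example.
STEP_TABLES = [
    {1: ("The", 0.71), 2: ("answer", 0.62), 3: ("is", 0.95),
     4: ("68", 0.93), 5: (".", 0.97), 6: ("<end>", 0.98)},
    {1: ("The", 0.96), 2: ("answer", 0.94)},
]

def replay(step_tables, length, threshold):
    seq = {pos: "_" for pos in range(1, length + 1)}  # "_" marks a masked slot
    history = []
    for table in step_tables:
        for pos, (tok, conf) in table.items():
            # Only still-masked positions can change; committed ones are frozen.
            if seq[pos] == "_" and conf >= threshold:
                seq[pos] = tok
        history.append([seq[pos] for pos in range(1, length + 1)])
    return history

history = replay(STEP_TABLES, length=6, threshold=THRESHOLD)
for step, row in enumerate(history, start=1):
    print(step, row)
```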

Where It Breaks

Diffusion language models are not a free lunch, and the honest list of weaknesses is long.

The parallelism is a lie when you decode too fast. Committing many tokens in one step assumes those positions are conditionally independent given the current context. They usually are not. If you unmask two adjacent positions in the same step from independent per-position distributions, nothing stops the model from producing "New Delhi" where "New York" was intended, because each position was sampled without seeing the other's choice. Quality degrades as you commit more per step, which is precisely why confidence thresholding (commit only near-certain tokens) exists, and why it sits in tension with small step counts: a stricter threshold commits fewer tokens per pass and so needs more passes. The Fast-dLLM result is impressive partly because keeping quality while decoding in parallel is genuinely hard (Wu et al., 2025, arXiv:2505.22618).

[IMAGE: Two-step illustration of the conditional-independence failure, position 1 sampling "New" and position 2 sampling "Delhi" independently, producing the incoherent "New Delhi" when "New York" was intended]
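The size of the problem can be computed exactly for a toy case. The joint distribution below is hypothetical: it says a two-token slot should be "New York" three times out of four and "Los Angeles" otherwise. Sampling each position from its own marginal, as a single parallel step does, puts probability on pairs the joint never allows.

```python
from fractions import Fraction
from itertools import product

# Hypothetical joint distribution over a two-token slot.
joint = {
    ("New", "York"): Fraction(3, 4),
    ("Los", "Angeles"): Fraction(1, 4),
}

def marginals(joint):
    """Per-position distributions, which is all a single parallel step sees."""
    first, second = {}, {}
    for (a, b), p in joint.items():
        first[a] = first.get(a, 0) + p
        second[b] = second.get(b, 0) + p
    return first, second

def incoherent_mass(joint):
    """Probability that independent per-position sampling yields a pair with zero joint mass."""
    first, second = marginals(joint)
    return sum(
        pa * pb
        for (a, pa), (b, pb) in product(first.items(), second.items())
        if joint.get((a, b), 0) == 0
    )

print(incoherent_mass(joint))  # mass on "New Angeles" plus "Los York"
```

In this toy case three samples in eight are mixed pairs. Committing one position first and re-predicting the other removes the problem, at the price of an extra pass.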

Fixed-length generation is awkward. The cleanest masked diffusion formulation generates into a canvas of predetermined length \(L\). You must decide up front how long the answer is, then pad or truncate. Autoregression simply emits an end-of-sequence token whenever it is done. Variable-length generation in diffusion requires extra machinery, which block diffusion supplies but pure diffusion does not handle gracefully.
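One workaround, assumed here for illustration rather than taken from a specific system, is to over-allocate the canvas and let the model fill the unused tail with an end marker, then cut at the first marker. The token string `"<end>"` and both helper names are hypothetical.

```python
END = "<end>"  # illustrative end-of-answer marker

def trim_canvas(canvas, end_token=END):
    """Return the tokens before the first end marker, or the whole canvas if there is none."""
    if end_token in canvas:
        return canvas[:canvas.index(end_token)]
    return list(canvas)

def was_truncated(canvas, end_token=END):
    """No end marker means the answer may have needed more room than the canvas had."""
    return end_token not in canvas

canvas = ["The", "answer", "is", "68", ".", END, END, END]
print(trim_canvas(canvas))
print(was_truncated(canvas))
```

The cost is visible in the canvas itself: every pass still attends over the padded positions, so a generous length guess spends compute on tokens that are thrown away.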

No free KV cache. The bidirectional context that gives diffusion its revision superpower is the same property that breaks the standard KV cache, because every token's context shifts as neighbors get committed. The cache must be approximated, and approximation means a quality knob to babysit.

More total compute. A diffusion model burns \(K\) full forward passes. If you are throughput-bound on a saturated GPU rather than latency-bound, the extra FLOPs can make diffusion the worse choice. The speed advantage is specifically a latency advantage under spare parallel capacity.

The maturity gap. The autoregressive ecosystem has years of tooling: speculative decoding, paged attention, mature quantization, and battle-tested serving stacks. Diffusion starts behind, and some techniques do not port cleanly. A 2026 reality-check study found diffusion LMs still trailing in tool-use-heavy settings where exact, ordered conditioning matters (reality-check study, 2026, arXiv:2601.12979).

Alternative Designs

Diffusion is one of several non-standard ways to escape strict serial decoding. The fair comparison sets it against autoregression and against the hybrids that blend the two.

| Approach | Strengths | Weaknesses | Best when |
| --- | --- | --- | --- |
| Autoregressive | Simple, KV-cacheable, variable length, mature tooling, exact left-context | Serial latency grows with length, no built-in revision, reversal curse | General-purpose generation, agentic tool use, long open-ended output |
| Pure masked diffusion | Parallel decoding, bidirectional, strong infilling, self-correction | Fixed length, no free KV cache, parallel-decode quality loss | Latency-critical code/math, fill-in-the-blank, structured outputs |
| Block diffusion | Variable length + KV cache + parallel within blocks | More complex, block-size tuning, intermediate on both axes | Wanting diffusion's speed without losing AR's length flexibility |
| Speculative decoding (AR) | Keeps AR quality exactly, 2-3x speedups | Needs a draft model, still fundamentally serial | Accelerating an existing AR model without changing its outputs |

Block diffusion deserves the closest look because it explicitly interpolates. It decomposes a sequence into blocks, runs discrete diffusion within each block, and conditions each block autoregressively on the previous ones. Tuning the block size slides continuously from full diffusion (one big block) to full autoregression (block size one). The payoff is concrete: it restores KV caching across blocks and supports arbitrary-length generation, setting a state of the art among diffusion models on standard language-modeling benchmarks (Sahoo et al., 2025, arXiv:2503.09573). For many production settings this hybrid, not pure diffusion, is the pragmatic answer.
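The outer loop of block diffusion can be sketched without committing to any particular denoiser. Here `denoise_block` is a hypothetical callback that runs diffusion inside one block given the committed prefix, and the toy version simply reads from a fixed target sentence. Only the control flow is the point: blocks are produced left to right, and generation stops when a block contains an end marker.

```python
END = "<end>"  # illustrative end-of-answer marker

def generate_blocks(denoise_block, block_size, max_blocks=64):
    """Autoregression across blocks; diffusion happens inside denoise_block."""
    prefix = []  # committed blocks; a real system could cache their keys and values
    for _ in range(max_blocks):
        block = denoise_block(prefix, block_size)
        if END in block:
            prefix.extend(block[:block.index(END)])
            break  # length is decided by the model, not fixed up front
        prefix.extend(block)
    return prefix

def make_toy_denoiser(target):
    """Hypothetical denoiser: returns the next slice of a fixed target, padded with END."""
    def denoise_block(prefix, block_size):
        chunk = target[len(prefix):len(prefix) + block_size]
        return chunk + [END] * (block_size - len(chunk))
    return denoise_block

target = "block diffusion stops when the model says so".split()
for size in (1, 3, len(target)):
    out = generate_blocks(make_toy_denoiser(target), block_size=size)
    print(size, out)
```

With block size one the loop degenerates to token-by-token decoding, and with a block as long as the answer it is a single diffusion canvas, which mirrors the interpolation described above.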

[IMAGE: A slider diagram with "block size = 1 (pure autoregressive)" on the left and "block size = L (pure diffusion)" on the right, showing how KV-cache reuse and parallelism trade off as the slider moves]

How It Is Used in Practice

The first production-scale commercial diffusion LLM was Inception Labs' Mercury, presented as ultra-fast language models based on diffusion (Khanna et al., 2025, Mercury, arXiv:2506.17298). Its successor, Mercury 2, launched in February 2026 with vendor and third-party throughput figures above 1,000 tokens per second and benchmark scores that put its quality within range of Haiku-class and GPT-mini-class models at several times their speed. The pitch is explicit and narrow: when your bottleneck is latency and your workload is code completion, structured extraction, or short interactive replies, a diffusion model finishes a draft before an autoregressive model has emitted its first paragraph.

Gemini Diffusion is the research-lab counterpart, shipped as an experimental waitlisted model rather than a default endpoint, and positioned for coding and math where iterative revision over a draft is a natural fit (Google DeepMind, 2025).

The engineering considerations that decide whether diffusion is worth it are consistent. Latency-bound interactive serving with spare GPU parallelism favors diffusion; throughput-bound batch jobs on saturated hardware favor autoregression. Tasks with a known or bounded output length (a JSON object, a function body, a SQL query) fit the fixed-canvas model. Infilling and editing existing text, where you want to condition on both sides of a gap, is diffusion's home turf and autoregression's weak spot.

LLaDA is the open research anchor that made the case credible at scale. Trained from scratch under an ordinary pre-training and supervised-fine-tuning pipeline, it matched LLaMA3 8B on in-context learning across general, math, and code benchmarks, followed instructions after SFT, and notably beat GPT-4o on a reversal-poem-completion task, a direct hit on the reversal curse that plagues left-to-right models (Nie et al., 2025, arXiv:2502.09992). Its acceptance as a NeurIPS 2025 oral marked the point where diffusion stopped being a niche and became a recognized branch of the LLM tree.

Insights Worth Remembering

  • Masked diffusion is BERT's masking objective trained at all ratios and run in a loop. If you understand masked language modeling, you already understand 80 percent of a diffusion LM; the new part is the sampling schedule, not the architecture.
  • The defining advantage is the decoupling of step count from sequence length, not parallelism in the abstract. A diffusion model that uses \(K = L\) steps gives back its entire speed advantage.
  • Bidirectionality is the deeper structural difference. Generating tokens out of order, conditioning each on a both-sided context, is what gives diffusion its self-correction and its resistance to the reversal curse, and it is what breaks the KV cache.
  • The parallel-decoding quality loss is a conditional-independence bug, not a fundamental ceiling. Confidence thresholds, better noise schedules, and approximate caches are all ways of paying down that bug, and 2025 showed it is mostly payable.
  • Diffusion does more total compute to finish sooner. It is a latency optimization that costs FLOPs, the opposite of the usual efficiency story, which is why it helps in interactive serving and hurts in saturated batch.
  • The pragmatic production answer is often the hybrid: block diffusion keeps autoregression's length flexibility and KV cache while buying back parallelism, and may be where the paradigm actually lands.

Open Questions

Several things are genuinely unsettled, and it is worth separating what is measured from what is hoped.

Whether diffusion scales past the mid range to frontier sizes with frontier quality is not yet demonstrated in the open literature. LLaDA's 8B parity is a strong signal, but no public result yet shows a diffusion model winning at the absolute frontier. That is an open empirical question, not a settled fact in either direction.

[IMAGE: Heatmap over denoising steps and token positions showing where model uncertainty concentrates, the "confusion zones," suggesting where extra steps should be spent]

The reasoning behavior of diffusion models is being actively probed. Late 2025 work suggests their errors and uncertainty concentrate in identifiable "confusion zones" during denoising, hinting that step-allocation could be made adaptive, spending more steps where the model is unsure (confusion-zones study, 2025, arXiv:2511.15208). Whether adaptive-compute diffusion becomes the analogue of test-time reasoning in autoregressive models is plausible but unproven.

How diffusion interacts with the alignment stack is partly open. RLHF and its successors were designed around autoregressive token-by-token sampling; applying preference optimization to a denoising process that commits tokens out of order needs care, and the best recipe is still being worked out. The same goes for agentic tool use, where the 2026 reality check found diffusion lagging on workflows that demand exact, ordered conditioning.

Finally, the right step-count-versus-quality operating point is workload-specific and not yet a solved science. Practitioners currently tune \(K\) and the confidence threshold empirically; a principled theory of how few steps a given task can tolerate would turn a lot of guesswork into engineering.

