Diffusion Language Models: Writing Text by Denoising, Not Predicting the Next Token
June 10, 2026 · 22 min read
In May 2025, Google DeepMind showed a model writing a working block of code in roughly the time it takes a person to blink. Gemini Diffusion was clocked at around 1,479 tokens per second, an order of magnitude past the speed-optimized autoregressive models of the day (Google DeepMind, 2025, Gemini Diffusion). The trick was not a faster GPU or a smaller model. It was a different way of generating: instead of committing to one token and then asking what comes next, the model produced an entire draft of mostly-garbage tokens and refined the whole thing in parallel over a few passes.
That is the diffusion recipe, borrowed from the image-generation world and bent to fit discrete text. For most of the Transformer era it was a sideline, interesting to a handful of labs and slower than the autoregressive models it hoped to replace. By the end of 2025 it had produced commercial systems and a NeurIPS oral. This piece is about what changed, how the mechanism actually works on tokens rather than pixels, and where it still falls short.
Why this matters: Autoregressive decoding is serial by construction; generating the 500th token requires the 499 before it. That serial dependency, not raw compute, is what bounds the latency of a chatbot reply. Diffusion language models break the dependency, and in doing so they change which problems are latency-bound and which are not.
TL;DR
- Diffusion language models generate by iterative denoising: they start from a fully corrupted (usually fully masked) sequence and unmask or correct tokens over a small number of steps, all positions in parallel, rather than one left-to-right token at a time.
- The dominant modern formulation is masked discrete diffusion. The forward process replaces tokens with a [MASK] symbol on a schedule; a Transformer learns the reverse process of predicting the originals. This is a strict generalization of BERT-style masking, trained to be a full generative model.
- LLaDA, an 8B diffusion model trained from scratch, matched LLaMA3 8B on a broad benchmark suite and beat GPT-4o on a reversal-curse task, the first strong evidence that diffusion scales to LLM-grade quality (Nie et al., 2025, arXiv:2502.09992).
- The headline advantage is throughput. Commercial diffusion models (Inception's Mercury, Gemini Diffusion) report 1,000 to 1,500 tokens per second, roughly 5x to 10x faster than comparable autoregressive models, because steps are decoupled from sequence length.
- The headline cost is that naive diffusion lacked a KV cache and lost quality when unmasking many tokens at once. Fast-dLLM recovered both, reporting up to 27.6x throughput gains training-free (Wu et al., 2025, arXiv:2505.22618).
- Diffusion is natively bidirectional and good at infilling and global revision, which autoregression handles awkwardly. It is weaker at variable-length generation and exact left-context conditioning, which autoregression gets for free.
- Block diffusion interpolates between the two paradigms, doing diffusion within blocks and autoregression across them, and recovers the KV cache and arbitrary-length generation (Sahoo et al., 2025, arXiv:2503.09573).
At a Glance
The contrast with autoregression is the whole story, so start there. An autoregressive model builds the answer one position at a time; a diffusion model builds all positions at once and sharpens them over a few rounds.
flowchart LR
subgraph AR["Autoregressive"]
direction LR
A1["the"] --> A2["cat"] --> A3["sat"] --> A4["..."]
end
subgraph DIFF["Diffusion"]
direction LR
D0["all MASK"] --> D1["partial draft"] --> D2["sharper draft"] --> D3["final text"]
end
AR -.serial, N steps for N tokens.-> COST1["latency grows with length"]
DIFF -.parallel, K steps fixed.-> COST2["latency set by step count"]
classDef blue fill:#1e40af,stroke:#3b82f6,stroke-width:1px,color:#fff
classDef purple fill:#6d28d9,stroke:#a78bfa,stroke-width:1px,color:#fff
classDef amber fill:#b45309,stroke:#fbbf24,stroke-width:1px,color:#fff
classDef emerald fill:#047857,stroke:#34d399,stroke-width:1px,color:#fff
class A1,A2,A3,A4 blue
class D0,D1,D2,D3 purple
class COST1 amber
class COST2 emerald
The number of denoising steps K is a knob set by the practitioner, not by the length of the output. Generating a 50-token answer and a 500-token answer can both take the same number of model evaluations. That single fact is why the latency curves cross.
[IMAGE: Side-by-side animation frames showing an autoregressive model filling a sentence left to right versus a diffusion model with masked tokens resolving in scattered positions across four steps]
Before Parallel Text
Diffusion as a generative idea matured on images. The denoising diffusion probabilistic model treated a photo as a point in continuous space, added Gaussian noise until it was static, and trained a network to walk backward to a clean image (Ho et al., 2020, Denoising Diffusion Probabilistic Models, arXiv:2006.11239). Continuous noise is natural for pixels. Text is the awkward case: a token is a discrete symbol from a fixed vocabulary, and there is no obvious meaning to "30% of the way between cat and dog."
Two lineages tried to fix the mismatch. One kept the continuous machinery and mapped tokens into an embedding space where Gaussian noise makes sense, then rounded back to words at the end. Diffusion-LM took this route and showed it enabled fine-grained, gradient-guided control over generated text, though it was slow and fiddly to round correctly (Li et al., 2022, Diffusion-LM Improves Controllable Text Generation, arXiv:2205.14217).
The other lineage embraced the discreteness. D3PM defined diffusion directly on tokens using structured corruption matrices, including one with an absorbing state: tokens decay into a special [MASK] symbol rather than into noise. The paper noted that this absorbing-state process draws a clean line connecting diffusion to autoregressive and mask-based models (Austin et al., 2021, Structured Denoising Diffusion Models in Discrete State-Spaces, arXiv:2107.03006). That observation turned out to be the seed of everything that followed.
The breakthrough that made discrete diffusion competitive was a better training objective. Score Entropy Discrete Diffusion (SEDD) introduced a loss that extends score matching to discrete spaces, cut perplexity by 25 to 75 percent over earlier diffusion language models, and edged past a comparable GPT-2 (Lou et al., 2024, Discrete Diffusion Modeling by Estimating the Ratios of the Data Distribution, arXiv:2310.16834). It won a best-paper award at ICML 2024. Within a year the recipe had been scaled to 8 billion parameters and a commercial product line.
timeline
title Evolution of Diffusion Language Models
2020 : DDPM matures diffusion on images
2021 : D3PM defines discrete diffusion, absorbing-state mask process
2022 : Diffusion-LM does controllable text in embedding space
2023 : SEDD's score-entropy loss closes the perplexity gap
2024 : SEDD wins ICML best paper, masked diffusion recipe consolidates
2025 : LLaDA 8B matches LLaMA3 8B, Gemini Diffusion and Mercury ship
2026 : Mercury 2 reports 1000-plus tokens per second in production
How Diffusion Language Models Actually Work
Strip away the image-generation vocabulary and a masked diffusion language model is doing something a Transformer person already half-knows: it is a masked language model, like BERT, trained at every masking ratio at once and turned into a generator by repeated application.
The forward process: planned destruction
Training starts by destroying data on purpose. Take a clean token sequence \(x_0\) of length \(L\). The forward process corrupts it according to a continuous time variable \(t\) running from 0 (clean) to 1 (fully destroyed). In the masked formulation, each token is independently replaced by [MASK] with probability \(t\), and left untouched otherwise.
At \(t = 0.3\), roughly 30 percent of tokens are masked; at \(t = 1\), all of them are. There is nothing to learn here. The forward process is a fixed, known corruption schedule, the discrete analogue of adding a known amount of Gaussian noise to a photo.
[IMAGE: A clean sentence at t=0 progressively dissolving into MASK tokens as t increases from 0 to 1, with the masking probability annotated at t=0.3, 0.6, and 1.0]
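The forward process described above can be sketched in a few lines. This is a minimal illustration, not any particular library's implementation: the function name `forward_mask`, the word-level tokens, and the seeded random generator are all assumptions made for the example.

```python
import random

MASK = "[MASK]"

def forward_mask(tokens, t, rng):
    """Independently replace each token with MASK with probability t."""
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must lie in [0, 1]")
    # rng.random() is in [0, 1), so t = 0 masks nothing and t = 1 masks everything.
    return [MASK if rng.random() < t else tok for tok in tokens]

rng = random.Random(0)  # seeded only so repeated runs match
clean = "the cat sat on the mat".split()
for t in (0.0, 0.3, 1.0):
    print(t, forward_mask(clean, t, rng))
```

Nothing in this function has parameters to learn, which is the point: the corruption is a fixed recipe, and all of the modeling effort goes into undoing it.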
The reverse process: the only thing the model learns
The network's job is the reverse: given a partially masked sequence \(x_t\), predict the clean tokens that were masked out. Because the Transformer is bidirectional (no causal mask), every prediction at a masked position can attend to every unmasked token on both sides. The model outputs a distribution over the vocabulary for each masked slot.
Training optimizes a likelihood lower bound. For the masked formulation it reduces to a clean, reweighted cross-entropy: sample a time \(t\), mask the sequence accordingly, and ask the model to predict the originals only at masked positions, weighting the loss by \(1/t\).
\[\mathcal{L} = \mathbb{E}_{t,\,x_0,\,x_t}\left[\frac{1}{t}\sum_{i:\,x_t^i = \text{MASK}} -\log p_\theta\!\left(x_0^i \mid x_t\right)\right]\]
[IMAGE: Diagram contrasting BERT's single fixed 15 percent masking ratio against a diffusion LM's training across the full 0 to 100 percent masking range, shown as two distributions over masking ratio]
This is the LLaDA objective, and the resemblance to BERT's masked-token loss is not a coincidence. The difference is that BERT trains at a single, fixed masking ratio (about 15 percent) and is never used to generate. A diffusion LM trains across the full range of ratios from near-zero to 100 percent, which is exactly what it needs to denoise a sequence starting from all-mask (Nie et al., 2025, arXiv:2502.09992).
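To make the objective concrete, here is a single-sample estimate of the loss in plain Python. It is a sketch under stated assumptions: the model's output is represented as a hypothetical list of per-position probability dictionaries (`probs`), and the tokens and probabilities are made up for the arithmetic. A real implementation would work on batched logits.

```python
import math

MASK = "[MASK]"

def masked_diffusion_loss(x0, xt, probs, t):
    """Single-sample estimate: (1/t) * sum of -log p(x0[i] | xt) over masked i."""
    if not 0.0 < t <= 1.0:
        raise ValueError("t must lie in (0, 1]")
    total = 0.0
    for clean_tok, noisy_tok, dist in zip(x0, xt, probs):
        if noisy_tok == MASK:  # unmasked positions contribute nothing
            total += -math.log(dist[clean_tok])
    return total / t

x0 = ["the", "cat", "sat"]
xt = ["the", MASK, MASK]
# Hypothetical model outputs: one distribution over tokens per position.
probs = [
    {"the": 1.0},
    {"cat": 0.5, "dog": 0.5},
    {"sat": 0.25, "ran": 0.75},
]
loss = masked_diffusion_loss(x0, xt, probs, t=0.5)
print(round(loss, 4))  # (ln 2 + ln 4) / 0.5 = 2 ln 8
```

Only the two masked positions enter the sum, and the \(1/t\) factor doubles their contribution here because \(t = 0.5\).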
Generation: unmask, recheck, repeat
Sampling runs the reverse process. Start with \(x_1\), a sequence of all [MASK]. Pick a number of steps \(K\). At each step, the model predicts clean tokens for every masked position, you commit some of them (the confident ones), and you leave the rest masked for the next pass.
[IMAGE: A 6-row grid showing one sequence over six denoising steps, masked cells in grey filling in with words, annotated with the per-step confidence threshold that decided which cells were committed]
The architecture is a standard decoder-only Transformer stack with the causal mask removed, so attention is bidirectional.
graph TD
IN["Masked sequence x_t"] --> EMB["Token + position embeddings"]
EMB --> ATT["Bidirectional self-attention<br/>no causal mask"]
ATT --> FFN["Feed-forward layers"]
FFN --> HEAD["Per-position vocab logits"]
HEAD --> CONF["Confidence scoring<br/>per masked position"]
CONF --> COMMIT["Commit high-confidence tokens"]
COMMIT --> REMASK["Re-mask the rest, next step"]
REMASK -.feed back.-> IN
classDef blue fill:#1e40af,stroke:#3b82f6,stroke-width:1px,color:#fff
classDef purple fill:#6d28d9,stroke:#a78bfa,stroke-width:1px,color:#fff
classDef teal fill:#0e7490,stroke:#22d3ee,stroke-width:1px,color:#fff
classDef slate fill:#334155,stroke:#64748b,stroke-width:1px,color:#e2e8f0
class IN,EMB blue
class ATT,FFN,HEAD purple
class CONF,COMMIT teal
class REMASK slate
The decision of which tokens to commit at each step is the central design choice. The simplest schedule unmasks a fixed fraction per step in random order. A much better one is confidence-based: at each step, keep the predictions whose probability exceeds a threshold and re-mask the rest. Easy positions (a closing parenthesis, the obvious next word in a fixed phrase) resolve early and anchor the harder positions around them. This is where diffusion's bidirectionality pays off: a token decided late gets to condition on tokens decided early on both sides.
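A confidence-based sampler can be sketched as follows. Everything model-like here is a stand-in: `toy_model`, the `TARGET` sequence, and the `BASE` confidences are hypothetical, built so that a position becomes more confident once its neighbors are committed. The fallback that commits the single most confident position when nothing clears the threshold is an assumption of this sketch, added so the loop always makes progress.

```python
MASK = "[MASK]"
TARGET = ["def", "add", "(", "a", ",", "b", ")", ":"]
BASE = [0.95, 0.55, 0.92, 0.60, 0.93, 0.60, 0.96, 0.97]

def toy_model(seq):
    """Hypothetical denoiser: returns {position: (token, confidence)} for masked slots."""
    out = {}
    for i, tok in enumerate(seq):
        if tok != MASK:
            continue
        neighbors = [j for j in (i - 1, i + 1) if 0 <= j < len(seq)]
        committed = sum(seq[j] != MASK for j in neighbors)
        # Confidence rises by 0.2 per committed neighbor, capped at 0.99.
        out[i] = (TARGET[i], min(0.99, BASE[i] + 0.2 * committed))
    return out

def sample(length, threshold=0.9, max_steps=32):
    """Commit predictions at or above the threshold; leave the rest masked."""
    seq = [MASK] * length
    steps = 0
    while MASK in seq and steps < max_steps:
        preds = toy_model(seq)
        chosen = [i for i, (_, conf) in preds.items() if conf >= threshold]
        if not chosen:  # assumption: always commit the best guess to avoid stalling
            chosen = [max(preds, key=lambda i: preds[i][1])]
        for i in chosen:
            seq[i] = preds[i][0]
        steps += 1
    return seq, steps

seq, steps = sample(len(TARGET))
print(steps, " ".join(seq))
```

With these made-up confidences the punctuation and keywords commit in the first pass and the identifiers follow in the second, so eight tokens take two model calls.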
Why steps decouple from length
The reason latency stops scaling with output length is structural. One model evaluation processes the entire length-\(L\) sequence in parallel, exactly as a single Transformer forward pass always has. Autoregression needs \(L\) such passes to produce \(L\) tokens. Diffusion needs \(K\) passes regardless of \(L\), where \(K\) is the step count you chose. If \(K \ll L\), you win. The catch, addressed below, is that pushing \(K\) too low degrades quality.
Seeing It in Motion
Walk the generation loop as a conversation between the sampler and the model. The sampler holds the working sequence; the model is queried once per step and returns predictions plus confidences.
sequenceDiagram
participant S as Sampler
participant M as Diffusion Transformer
S->>M: Step 1 - all MASK sequence
M-->>S: Predictions + confidences (all positions)
S->>S: Commit tokens above threshold, re-mask rest
S->>M: Step 2 - partially filled sequence
M-->>S: Refined predictions for remaining MASKs
S->>S: Commit next batch, re-mask rest
Note over S,M: ... a few more steps ...
S->>M: Step K - one or two MASKs left
M-->>S: Final predictions
S->>S: Sequence fully committed, return text
A useful way to see the trade-off is as a state machine over a single token position. Each position is born masked and ends committed; the only question is on which step it crosses over, and that depends on how confident the model is about it relative to the threshold.
stateDiagram-v2
[*] --> Masked
Masked --> Predicted: model scores this position
Predicted --> Committed: confidence above threshold
Predicted --> Masked: confidence below threshold, wait
Committed --> [*]
classDef purple fill:#6d28d9,stroke:#a78bfa,stroke-width:1px,color:#fff
classDef teal fill:#0e7490,stroke:#22d3ee,stroke-width:1px,color:#fff
classDef slate fill:#334155,stroke:#64748b,stroke-width:1px,color:#e2e8f0
class Masked slate
class Predicted purple
class Committed teal
The re-mask edge, where a low-confidence prediction is thrown away and retried next step, is what lets the model revise a tentative guess before committing to it. Autoregression has no such edge: every sampled token is frozen immediately and every later token is conditioned on it, errors and all. A diffusion model treats uncommitted drafts as provisional.
By the Numbers
The quantitative case for diffusion is throughput, and the case against the naive version is the step-count tax. Both show up in measured numbers.
| Model | Params | Type | Reported throughput | Quality anchor |
|---|---|---|---|---|
| LLaMA3 8B | 8B | autoregressive | tens of tok/s (hardware-dependent) | baseline 8B |
| LLaDA 8B | 8B | masked diffusion | comparable to AR per-step | matches LLaMA3 8B on broad suite |
| Mercury 2 | undisclosed | diffusion | ~1,196 tok/s (Artificial Analysis) | competitive with Haiku-class |
| Gemini Diffusion | undisclosed | diffusion | ~1,479 tok/s (vendor) | rivals Google's fast AR models |
Sources: LLaDA quality and parity claims from Nie et al., 2025, arXiv:2502.09992; Mercury throughput and benchmark scores reported by Inception Labs and the independent Artificial Analysis testing service (Inception Labs, 2026); Gemini Diffusion figure from Google DeepMind's announcement. Treat the vendor throughput numbers as best-case marketing figures measured in latency-optimized regimes, not guaranteed steady-state production rates.
The complexity table explains where the speed comes from and what it costs in compute.
| Quantity | Autoregressive | Masked diffusion |
|---|---|---|
| Forward passes to generate \(L\) tokens | \(L\) | \(K\) (steps), \(K\) tunable |
| Attention cost per pass | \(O(L^2)\) (with KV cache, \(O(L)\) incremental) | \(O(L^2)\) per pass |
| Total attention work | \(O(L^2)\) | \(O(K \cdot L^2)\) |
| Serial dependency | full (\(L\) steps) | \(K\) steps |
| KV cache (naive) | yes | no (bidirectional context changes each step) |
[IMAGE: Line chart of wall-clock latency versus output length, autoregressive rising linearly while diffusion stays roughly flat, with the crossover point marked]
The diffusion column does more total compute, \(K\) full forward passes over the whole sequence, yet finishes sooner in wall-clock time because those passes are not serially chained to the output length. It is trading FLOPs for latency. That trade is attractive exactly when you are latency-bound and have spare parallel compute, which describes a lot of interactive serving.
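The trade can be put in numbers with a back-of-the-envelope cost model. This sketch counts attention work in query-key pairs for a single sequence and ignores layers, heads, the prompt, and feed-forward cost; the function name and the choice of 200 tokens with 16 steps are assumptions for illustration.

```python
def decode_cost(length, steps):
    """Compare serial passes and attention work for one generated sequence.

    Attention work is counted in query-key pairs, ignoring layers and heads.
    """
    # Cached autoregression: the i-th token attends to i keys, so the sum is L(L+1)/2.
    ar_attention = length * (length + 1) // 2
    # Naive masked diffusion: K full bidirectional passes over all L positions.
    diffusion_attention = steps * length * length
    return {
        "ar_passes": length,
        "diffusion_passes": steps,
        "ar_attention": ar_attention,
        "diffusion_attention": diffusion_attention,
    }

cost = decode_cost(length=200, steps=16)
print(cost)
print("serial passes saved: %.1fx" % (cost["ar_passes"] / cost["diffusion_passes"]))
print("extra attention work: %.1fx" % (cost["diffusion_attention"] / cost["ar_attention"]))
```

Under these assumptions the diffusion decoder makes 12.5x fewer serial passes while doing roughly 32x the attention work, which is the FLOPs-for-latency exchange in miniature.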
The naive "no KV cache" entry was the practical killer until 2025. Because a diffusion model's context changes at every step (newly committed tokens become visible context), the keys and values cannot simply be cached and reused the way they are in autoregression. Fast-dLLM solved this with a block-wise approximate KV cache plus confidence-aware parallel decoding, reporting up to 27.6x throughput improvement on LLaDA and Dream with minimal accuracy loss, entirely training-free (Wu et al., 2025, arXiv:2505.22618).
A Concrete Example
Generate a short answer to "What is 17 times 4?" with a masked diffusion model, target length 6 tokens, confidence threshold 0.9, and watch the sequence evolve. Tokens are shown as words for readability; _ is [MASK].
Step 0 (initialization). The sequence is all mask.
[ _ _ _ _ _ _ ]
Step 1. The model predicts every position at once and reports a confidence for each. Suppose it returns:
| Position | Top prediction | Confidence |
|---|---|---|
| 1 | "The" | 0.71 |
| 2 | "answer" | 0.62 |
| 3 | "is" | 0.95 |
| 4 | "68" | 0.93 |
| 5 | "." | 0.97 |
| 6 | (end) | 0.98 |
Positions 3, 4, 5, 6 clear the 0.9 threshold and get committed. Positions 1 and 2 do not, so they are re-masked.
[ _ _ is 68 . <end> ]
Notice the model committed the arithmetic result "68" and the terminal punctuation before it committed the opening words. An autoregressive model could never do this; it would have had to produce "The" and "answer" first. Here the easy, high-information anchors lock in early.
Step 2. Now the model re-predicts positions 1 and 2, this time conditioning on the committed right context "is 68."
| Position | Top prediction | Confidence |
|---|---|---|
| 1 | "The" | 0.96 |
| 2 | "answer" | 0.94 |
Both clear the threshold. The sequence is complete:
[ The answer is 68 . <end> ]
Two model evaluations produced a six-token answer, versus six evaluations for an autoregressive decoder. The win is small here because the sequence is short. Extend the target to 200 tokens and the diffusion model might finish in 10 to 20 steps while the autoregressive model needs 200, which is where the order-of-magnitude latency gaps come from. The example also shows the failure mode in miniature: if the model had committed a wrong-but-confident "72" at position 4, no later step would revisit it, because committed tokens are frozen.
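The walkthrough above can be replayed mechanically. In this sketch the model is replaced by a script: `SCRIPT` holds the predictions and confidences copied from the two tables, and `replay` is a hypothetical helper that applies the 0.9 threshold at each step.

```python
MASK = "_"
THRESHOLD = 0.9

# Scripted model outputs from the two tables: {position: (token, confidence)}.
SCRIPT = [
    {1: ("The", 0.71), 2: ("answer", 0.62), 3: ("is", 0.95),
     4: ("68", 0.93), 5: (".", 0.97), 6: ("<end>", 0.98)},
    {1: ("The", 0.96), 2: ("answer", 0.94)},
]

def replay(script, length, threshold):
    """Apply each step's predictions, committing only those that clear the threshold."""
    seq = [MASK] * length
    history = [list(seq)]
    for preds in script:
        for pos, (tok, conf) in preds.items():
            # Positions are 1-based in the tables; committed slots are never revisited.
            if seq[pos - 1] == MASK and conf >= threshold:
                seq[pos - 1] = tok
        history.append(list(seq))
    return history

history = replay(SCRIPT, 6, THRESHOLD)
for step, seq in enumerate(history):
    print(f"step {step}: [ {' '.join(seq)} ]")
```

The `seq[pos - 1] == MASK` guard is the frozen-commit rule in code form: once a slot is filled, later predictions for it are ignored, right or wrong.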
Where It Breaks
Diffusion language models are not a free lunch, and the honest list of weaknesses is long.
The parallelism is a lie when you decode too fast. Committing many tokens in one step assumes those positions are conditionally independent given the current context. They usually are not. If you unmask "New" and "York" in the same step from independent per-position distributions, nothing stops the model from producing "New Delhi" because each position was sampled without seeing the other's choice. Quality degrades as you commit more per step, which is precisely why confidence thresholding (commit only near-certain tokens) and small step counts are in tension. The Fast-dLLM result is impressive partly because keeping quality while decoding in parallel is genuinely hard (Wu et al., 2025, arXiv:2505.22618).
[IMAGE: Two-step illustration of the conditional-independence failure, position 1 sampling "New" and position 2 sampling "Delhi" independently, producing the incoherent "New Delhi" when "New York" was intended]
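The independence problem is easy to reproduce with a toy distribution. The sketch below uses a hypothetical two-completion joint distribution, chosen so that both positions differ between the coherent options, and compares sampling the pair jointly against sampling each position from its own marginal.

```python
import random
from collections import Counter

# Hypothetical joint distribution over a two-token completion.
JOINT = {("New", "York"): 0.5, ("San", "Francisco"): 0.5}

def marginals(joint):
    """Per-position distributions, which is all a parallel step gets to see."""
    first, second = Counter(), Counter()
    for (a, b), p in joint.items():
        first[a] += p
        second[b] += p
    return first, second

def sample_from(dist, rng):
    items = list(dist)
    return rng.choices(items, weights=[dist[x] for x in items], k=1)[0]

def incoherent_rate(joint, n, rng, parallel):
    """Fraction of sampled pairs that have zero probability under the joint."""
    first, second = marginals(joint)
    bad = 0
    for _ in range(n):
        if parallel:
            pair = (sample_from(first, rng), sample_from(second, rng))
        else:
            pair = sample_from(joint, rng)
        bad += pair not in joint
    return bad / n

rng = random.Random(0)
print("joint sampling:   ", incoherent_rate(JOINT, 10_000, rng, parallel=False))
print("parallel sampling:", incoherent_rate(JOINT, 10_000, rng, parallel=True))
```

Joint sampling never produces a mixed pair, while independent sampling produces "New Francisco" or "San York" about half the time. A confidence threshold sidesteps this by declining to commit either position while both marginals sit at 0.5.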
Fixed-length generation is awkward. The cleanest masked diffusion formulation generates into a canvas of predetermined length \(L\). You must decide up front how long the answer is, then pad or truncate. Autoregression simply emits an end-of-sequence token whenever it is done. Variable-length generation in diffusion requires extra machinery, which block diffusion supplies but pure diffusion does not handle gracefully.
No free KV cache. The bidirectional context that gives diffusion its revision superpower is the same property that breaks the standard KV cache, because every token's context shifts as neighbors get committed. The cache must be approximated, and approximation means a quality knob to babysit.
More total compute. A diffusion model burns \(K\) full forward passes. If you are throughput-bound on a saturated GPU rather than latency-bound, the extra FLOPs can make diffusion the worse choice. The speed advantage is specifically a latency advantage under spare parallel capacity.
The maturity gap. The autoregressive ecosystem has years of tooling: speculative decoding, paged attention, mature quantization, and battle-tested serving stacks. Diffusion starts behind, and some techniques do not port cleanly. A 2026 reality-check study found diffusion LMs still trailing in tool-use-heavy settings where exact, ordered conditioning matters (reality-check study, 2026, arXiv:2601.12979).
Alternative Designs
Diffusion is one of several non-standard ways to escape strict serial decoding. The fair comparison sets it against autoregression and against the hybrids that blend the two.
| Approach | Strengths | Weaknesses | Best when |
|---|---|---|---|
| Autoregressive | Simple, KV-cacheable, variable length, mature tooling, exact left-context | Serial latency grows with length, no built-in revision, reversal curse | General-purpose generation, agentic tool use, long open-ended output |
| Pure masked diffusion | Parallel decoding, bidirectional, strong infilling, self-correction | Fixed length, no free KV cache, parallel-decode quality loss | Latency-critical code/math, fill-in-the-blank, structured outputs |
| Block diffusion | Variable length + KV cache + parallel within blocks | More complex, block-size tuning, intermediate on both axes | Wanting diffusion's speed without losing AR's length flexibility |
| Speculative decoding (AR) | Keeps AR quality exactly, 2-3x speedups | Needs a draft model, still fundamentally serial | Accelerating an existing AR model without changing its outputs |
Block diffusion deserves the closest look because it explicitly interpolates. It decomposes a sequence into blocks, runs discrete diffusion within each block, and conditions each block autoregressively on the previous ones. Tuning the block size slides continuously from full diffusion (one big block) to full autoregression (block size one). The payoff is concrete: it restores KV caching across blocks and supports arbitrary-length generation, setting a state of the art among diffusion models on standard language-modeling benchmarks (Sahoo et al., 2025, arXiv:2503.09573). For many production settings this hybrid, not pure diffusion, is the pragmatic answer.
[IMAGE: A slider diagram with "block size = 1 (pure autoregressive)" on the left and "block size = L (pure diffusion)" on the right, showing how KV-cache reuse and parallelism trade off as the slider moves]
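The interpolation can be seen in a simple pass-count model. This is an illustrative assumption, not a formula from the block diffusion paper: it supposes blocks are decoded one after another and that each block needs at most `steps_per_block` denoising passes, and never more passes than it has tokens.

```python
import math

def block_diffusion_passes(length, block_size, steps_per_block):
    """Serial forward passes under a simple block-wise decoding cost model."""
    if length < 1 or block_size < 1 or steps_per_block < 1:
        raise ValueError("all arguments must be positive")
    n_blocks = math.ceil(length / block_size)
    # A block of B tokens cannot usefully take more than B unmasking steps.
    return n_blocks * min(steps_per_block, block_size)

for block_size in (1, 32, 256):
    passes = block_diffusion_passes(length=256, block_size=block_size, steps_per_block=8)
    print(f"block size {block_size:>3}: {passes} passes")
```

For 256 tokens and 8 steps per block, block size 1 costs 256 passes (the autoregressive extreme), block size 256 costs 8 (pure diffusion), and block size 32 lands at 64.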
How It Is Used in Practice
The first production-scale commercial diffusion LLM was Inception Labs' Mercury, presented as ultra-fast language models based on diffusion (Khanna et al., 2025, Mercury, arXiv:2506.17298). Its successor, Mercury 2, was launched in February 2026 with vendor and third-party throughput figures above 1,000 tokens per second and benchmark scores placing it in competitive range of Haiku-class and GPT-mini-class models on quality while delivering several times the speed. The pitch is explicit and narrow: when your bottleneck is latency and your workload is code completion, structured extraction, or short interactive replies, a diffusion model finishes a draft before an autoregressive model has emitted its first paragraph.
Gemini Diffusion is the research-lab counterpart, shipped as an experimental waitlisted model rather than a default endpoint, and positioned for coding and math where iterative revision over a draft is a natural fit (Google DeepMind, 2025).
The engineering considerations that decide whether diffusion is worth it are consistent. Latency-bound interactive serving with spare GPU parallelism favors diffusion; throughput-bound batch jobs on saturated hardware favor autoregression. Tasks with a known or bounded output length (a JSON object, a function body, a SQL query) fit the fixed-canvas model. Infilling and editing existing text, where you want to condition on both sides of a gap, is diffusion's home turf and autoregression's weak spot.
LLaDA is the open research anchor that made the case credible at scale. Trained from scratch under an ordinary pre-training and supervised-fine-tuning pipeline, it matched LLaMA3 8B on in-context learning across general, math, and code benchmarks, followed instructions after SFT, and notably beat GPT-4o on a reversal-poem-completion task, a direct hit on the reversal curse that plagues left-to-right models (Nie et al., 2025, arXiv:2502.09992). Its acceptance as a NeurIPS 2025 oral marked the point where diffusion stopped being a niche and became a recognized branch of the LLM tree.
Insights Worth Remembering
- Masked diffusion is BERT's masking objective trained at all ratios and run in a loop. If you understand masked language modeling, you already understand 80 percent of a diffusion LM; the new part is the sampling schedule, not the architecture.
- The defining advantage is the decoupling of step count from sequence length, not parallelism in the abstract. A diffusion model that uses \(K = L\) steps gives back its entire speed advantage.
- Bidirectionality is the deeper structural difference. Generating tokens out of order, conditioning each on a both-sided context, is what gives diffusion its self-correction and its resistance to the reversal curse, and it is what breaks the KV cache.
- The parallel-decoding quality loss is a conditional-independence bug, not a fundamental ceiling. Confidence thresholds, better noise schedules, and approximate caches are all ways of paying down that bug, and 2025 showed it is mostly payable.
- Diffusion does more total compute to finish sooner. It is a latency optimization that costs FLOPs, the opposite of the usual efficiency story, which is why it helps in interactive serving and hurts in saturated batch.
- The pragmatic production answer is often the hybrid: block diffusion keeps autoregression's length flexibility and KV cache while buying back parallelism, and may be where the paradigm actually lands.
Open Questions
Several things are genuinely unsettled, and it is worth separating what is measured from what is hoped.
Whether diffusion scales past the mid range to frontier sizes with frontier quality is not yet demonstrated in the open literature. LLaDA's 8B parity is a strong signal, but no public result yet shows a diffusion model winning at the absolute frontier. That is an open empirical question, not a settled fact in either direction.
[IMAGE: Heatmap over denoising steps and token positions showing where model uncertainty concentrates, the "confusion zones," suggesting where extra steps should be spent]
The reasoning behavior of diffusion models is being actively probed. Late 2025 work suggests their errors and uncertainty concentrate in identifiable "confusion zones" during denoising, hinting that step allocation could be made adaptive, spending more steps where the model is unsure (confusion-zones study, 2025, arXiv:2511.15208). Whether adaptive-compute diffusion becomes the analogue of test-time reasoning in autoregressive models is plausible but unproven.
How diffusion interacts with the alignment stack is partly open. RLHF and its successors were designed around autoregressive token-by-token sampling; applying preference optimization to a denoising process that commits tokens out of order needs care, and the best recipe is still being worked out. The same goes for agentic tool use, where the 2026 reality check found diffusion lagging on workflows that demand exact, ordered conditioning.
Finally, the right step-count-versus-quality operating point is workload-specific and not yet a solved science. Practitioners currently tune \(K\) and the confidence threshold empirically; a principled theory of how few steps a given task can tolerate would turn a lot of guesswork into engineering.
Sources and Further Reading
Foundational Papers
- Ho, Jain, Abbeel, 2020, Denoising Diffusion Probabilistic Models, arXiv:2006.11239
- Austin, Johnson, Ho, Tarlow, van den Berg, 2021, Structured Denoising Diffusion Models in Discrete State-Spaces, arXiv:2107.03006
- Li, Thickstun, Gulrajani, Liang, Hashimoto, 2022, Diffusion-LM Improves Controllable Text Generation, arXiv:2205.14217
- Lou, Meng, Ermon, 2024, Discrete Diffusion Modeling by Estimating the Ratios of the Data Distribution (SEDD), arXiv:2310.16834
Important Follow-up Work
- Nie, Zhu, You, et al., 2025, Large Language Diffusion Models (LLaDA), arXiv:2502.09992
- Sahoo, Kuleshov, et al., 2025, Block Diffusion: Interpolating Between Autoregressive and Diffusion Language Models, arXiv:2503.09573
- Wu, Zhang, et al., 2025, Fast-dLLM: Training-free Acceleration of Diffusion LLM by Enabling KV Cache and Parallel Decoding, arXiv:2505.22618
- Khanna, et al., 2025, Mercury: Ultra-Fast Language Models Based on Diffusion, arXiv:2506.17298