Trainable Sparse Attention: When the Model Learns What to Skip
June 25, 2026 · 19 min read
In September 2025, DeepSeek shipped a model update and cut its API prices by more than half on the same day (DeepSeek API Docs, 2025, Introducing DeepSeek-V3.2-Exp). The price drop was not a marketing promotion. It was a direct consequence of an architecture change: the model no longer computes attention between every pair of tokens. For a 128k-token request, most of that quadratic work simply does not happen, and the savings flow straight to the bill.
What makes this moment different from a decade of prior sparse-attention research is a single design decision. Earlier methods estimated which tokens mattered at inference time, after the model was already trained on dense attention. The newer wave, exemplified by Native Sparse Attention (NSA), Mixture of Block Attention (MoBA), and DeepSeek Sparse Attention (DSA), trains the sparsity pattern itself. The model learns what to skip, because the components that score and choose tokens are optimized during training.
Why this matters: Attention is the one part of a Transformer whose cost grows with the square of the input. As context windows stretch from thousands to millions of tokens, that quadratic term, not parameter count, becomes the thing that decides whether a long-context feature is affordable or not.
TL;DR
- Standard attention costs \(O(L^2)\) in compute and grows a key-value cache linearly with sequence length \(L\). At 64k or 128k tokens, attention dominates both latency and memory.
- Classic sparse attention (Longformer, BigBird) used fixed patterns; later KV-cache methods (H2O, StreamingLLM, Quest) picked important tokens at inference time on a model trained with dense attention.
- The shift in 2025 is making sparsity natively trainable: the selection mechanism is part of the forward pass and is optimized end to end, so the model adapts its weights to the sparse pattern instead of having sparsity imposed on it.
- NSA combines three branches (compressed coarse tokens, fine-grained selected blocks, and a local sliding window), fuses them with a learned gate, and reports up to 11.6x faster decoding and 9.0x faster forward passes at 64k length (Yuan et al., 2025, arXiv:2502.11089).
- MoBA applies a Mixture-of-Experts routing idea to attention blocks and can switch between sparse and full attention, and it runs in production behind Kimi (Lu et al., 2025, arXiv:2502.13189).
- DSA uses a lightweight "lightning indexer" to score tokens in FP8 and attend to only the top-k, cutting the main attention cost over a sequence from \(O(L^2)\) toward \(O(Lk)\) and letting DeepSeek cut its published API prices by more than half.
- The open tension: hardware-aligned sparsity that is fast in practice, not just sparse on paper, and selection that survives training rather than fighting it.
At a Glance
NSA replaces one dense attention call with three cheaper paths whose outputs are blended by a per-query gate. The compressed path gives global awareness, the selected path recovers fine detail from the few blocks that matter, and the sliding window guarantees local fidelity.
flowchart LR
Q[Query token] --> CMP[Compressed tokens<br/>coarse global]
Q --> SEL[Selected blocks<br/>fine detail]
Q --> WIN[Sliding window<br/>local recent]
CMP --> G{Learned gate}
SEL --> G
WIN --> G
G --> O[Attention output]
classDef blue fill:#1e40af,stroke:#3b82f6,stroke-width:1px,color:#fff
classDef purple fill:#6d28d9,stroke:#a78bfa,stroke-width:1px,color:#fff
classDef teal fill:#0e7490,stroke:#22d3ee,stroke-width:1px,color:#fff
class Q blue
class CMP,SEL,WIN,G purple
class O teal
The crucial property is that the gate and the compression scores that steer selection are differentiable, so training can teach each branch when to defer to the others.
[IMAGE: Side-by-side line plot of long-context benchmark accuracy versus sequence length for full attention and NSA, overlaid with a second axis showing decoding latency, illustrating matched accuracy with diverging latency curves.]
Before Native Sparsity
The quadratic cost of attention was visible in the original Transformer (Vaswani et al., 2017, Attention Is All You Need, arXiv:1706.03762), but it did not bite until people wanted to feed models whole documents. The first response was structural sparsity: fix a pattern of which tokens may attend to which, chosen by a human rather than the data.
Longformer paired a local sliding window with a few global tokens and brought attention cost down to linear in sequence length (Beltagy et al., 2020, Longformer, arXiv:2004.05150). BigBird added random connections and proved that local-plus-global-plus-random attention is still a universal sequence approximator (Zaheer et al., 2020, Big Bird, arXiv:2007.14062). These worked, but the pattern was static; it could not notice that for one particular query, the single relevant fact sat in a block the fixed mask happened to exclude.
A second wave attacked the KV cache at inference time. H2O observed that a small set of "heavy hitter" tokens accumulates most of the attention mass and evicted the rest (Zhang et al., 2023, H2O, arXiv:2306.14048). StreamingLLM found that the first few tokens act as attention sinks and that keeping them plus a recent window lets a model stream indefinitely (Xiao et al., 2023, arXiv:2309.17453). Quest made the selection query-aware, scoring KV-cache pages against the current query and loading only the top ones (Tang et al., 2024, Quest, arXiv:2406.10774).
All of these are post-hoc. The model was pretrained with full attention; sparsity was applied afterward to save memory or time. That mismatch is the gap the 2025 methods close.
timeline
title From fixed masks to trainable sparsity
2017 : Transformer : full dense attention
2020 : Longformer and BigBird : fixed sparse patterns
2023 : H2O and StreamingLLM : inference-time KV eviction
2024 : Quest : query-aware page selection
2025 : NSA, MoBA, DSA : sparsity trained end to end
[IMAGE: A two-panel schematic contrasting a fixed Longformer band-plus-global attention mask (left) against a query-dependent NSA selection mask (right), with the same query row highlighted to show the right panel reaching a distant block the left one misses.]
How Native Sparse Attention Actually Works
The full attention output for query \(q_t\) at position \(t\) is a softmax-weighted sum over all preceding keys and values:
\[o_t = \text{softmax}\!\left(\tfrac{q_t K_{1:t}^\top}{\sqrt{d_k}}\right) V_{1:t}\]
The cost of this grows with \(t\), and summed over a sequence of length \(L\) it is \(O(L^2 d_k)\). NSA keeps the softmax but shrinks the set of keys and values each query sees, splitting the work into three branches and learning how to weight them (Yuan et al., 2025, arXiv:2502.11089).
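To make the dense baseline concrete, here is a minimal pure-Python sketch of the formula above for a single query, plus a count of how many query-key pairs a causal sequence scores in total. The function names and the toy vectors are illustrative assumptions, not code from any of the cited papers.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dense_attention(q, keys, values):
    """Output for one query attending over every preceding key and value."""
    d_k = len(q)
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys]
    weights = softmax(scores)
    dim_v = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim_v)]

def dense_pairs(L):
    """Query-key pairs scored across a causal sequence of length L."""
    return L * (L + 1) // 2

# Toy inputs (hypothetical): three cached tokens, two dimensions.
q = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
out = dense_attention(q, keys, values)

pairs_64k = dense_pairs(64_000)  # 2,048,032,000 pairs for one 64k sequence
```

The pair count is the quantity every sparse method in this article tries to shrink; the softmax itself is left alone.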
Branch one: compression for global context
The preceding tokens are cut into blocks (the paper uses a block length of 32 with a stride of 16). Each block of keys is fed through a small learnable MLP that maps the whole block to a single compressed key, and likewise for values. A 64k-token context collapses to roughly 2,000 compressed keys if the blocks are laid end to end, or about 4,000 with the overlapping stride of 16. Attending over those gives the query a cheap, coarse view of the entire history, enough to tell which regions are worth a closer look.
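A small sketch can show how the block length and stride determine the number of compressed keys. Mean pooling stands in for the learnable MLP here, which is an assumption made purely to keep the example short; the helper names are hypothetical.

```python
def block_starts(t, block_len=32, stride=16):
    """Start offsets of every full block that fits in the first t tokens."""
    if t < block_len:
        return []
    return list(range(0, t - block_len + 1, stride))

def compress_keys(keys, block_len=32, stride=16):
    """Collapse each block of key vectors into a single vector.

    Mean pooling is a stand-in (an assumption); NSA learns this mapping with an MLP.
    """
    dim = len(keys[0])
    compressed = []
    for s in block_starts(len(keys), block_len, stride):
        block = keys[s:s + block_len]
        compressed.append([sum(k[j] for k in block) / block_len for j in range(dim)])
    return compressed

n_overlap = len(block_starts(64_000, block_len=32, stride=16))   # overlapping blocks
n_disjoint = len(block_starts(64_000, block_len=32, stride=32))  # blocks laid end to end

# Tiny example: 8 one-dimensional keys, blocks of 4 with stride 2.
toy = compress_keys([[float(i)] for i in range(8)], block_len=4, stride=2)
```

With a 64,000-token history this gives 3,999 compressed keys at stride 16 and 2,000 when the blocks do not overlap, so the stride roughly doubles the size of the coarse view.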
Branch two: selection for local precision
Compression loses detail, so NSA uses the compression-branch attention scores as an importance signal. It scores blocks, picks the top \(n\) (the paper selects 16 blocks of size 64), and then attends at full resolution to only those tokens. This is where the model recovers the exact wording of the few passages that matter. The selection is blockwise on purpose: contiguous blocks are what GPUs load efficiently, so the sparsity maps onto hardware instead of fighting it.
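The selection step can be sketched as a top-n over block scores followed by expanding each chosen block into its contiguous token range. The scores are assumed to be given (in NSA they are derived from the compression branch); the function names and toy numbers are hypothetical.

```python
def select_blocks(block_scores, n_select=16):
    """Indices of the n highest-scoring blocks, returned in position order."""
    ranked = sorted(range(len(block_scores)),
                    key=lambda i: block_scores[i], reverse=True)
    return sorted(ranked[:n_select])

def block_token_indices(block_ids, block_size=64, t=None):
    """Expand block ids into the contiguous token indices they cover."""
    indices = []
    for b in block_ids:
        start, end = b * block_size, (b + 1) * block_size
        if t is not None:
            end = min(end, t)  # clip a final partial block
        indices.extend(range(start, end))
    return indices

# Toy example: five blocks of four tokens, keep the best two.
scores = [0.1, 0.9, 0.3, 0.8, 0.05]
chosen = select_blocks(scores, n_select=2)
tokens = block_token_indices(chosen, block_size=4)
```

Because every selected block expands to one unbroken range of positions, the gather that follows reads a handful of contiguous slabs rather than scattered single tokens.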
Branch three: sliding window for locality
Recent tokens almost always matter, and letting the other two branches relearn that every step wastes capacity. A dedicated sliding window (512 tokens in the paper) handles local context directly. Isolating it stops the compression and selection branches from being dominated by the strong, easy local signal, a "shortcut" the authors explicitly designed around.
Fusing the branches
The three outputs are combined by a gate produced from the query through a small MLP with a sigmoid:
\[o_t = \sum_{c \in \{\text{cmp},\,\text{slc},\,\text{win}\}} g_t^{c}\,\cdot\,\text{Attn}(q_t, K_t^{c}, V_t^{c})\]
Because \(g_t^c\) and the compression MLP are differentiable, gradients reach every branch. The model does not merely tolerate sparsity; it is trained from scratch under it, so its weights co-adapt with the selection. That is the "native" in Native Sparse Attention.
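The gated sum above is simple enough to write out directly. In this sketch the gate logits are hard-coded stand-ins for what a gate MLP would produce from the query, and the branch outputs are made-up vectors; both are assumptions for illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def fuse(branch_outputs, gate_logits):
    """Weighted sum of branch outputs, one independent sigmoid gate per branch."""
    gates = {name: sigmoid(gate_logits[name]) for name in branch_outputs}
    dim = len(next(iter(branch_outputs.values())))
    fused = [sum(gates[n] * branch_outputs[n][j] for n in branch_outputs)
             for j in range(dim)]
    return fused, gates

# Hypothetical per-branch attention outputs for one query.
outs = {"cmp": [1.0, 0.0], "slc": [0.0, 1.0], "win": [1.0, 1.0]}
# Hypothetical gate logits: this query trusts selection and discounts the window.
logits = {"cmp": 0.0, "slc": 2.0, "win": -2.0}
fused, gates = fuse(outs, logits)
```

Note that sigmoid gates are independent rather than normalized, so the three weights need not sum to one; a branch can be turned up without forcing another down.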
[IMAGE: Anatomy figure of the NSA attention block, three horizontal lanes (compress, select, sliding window) each showing their key-value sets shrinking from the full sequence, converging into a gate node labeled with sigmoid, annotated with the paper's block sizes 32, 64, and 512.]
Seeing It in Motion
During autoregressive decoding, the expensive part is reading the KV cache for each new token. NSA and DSA both insert a cheap scoring step before the costly attention, so the heavy read touches only a fraction of the cache.
sequenceDiagram
participant Q as New query
participant IDX as Indexer or compression
participant SEL as Top-k selector
participant KV as KV cache
participant OUT as Output
Q->>IDX: score all prior tokens cheaply
IDX->>SEL: importance scores
SEL->>KV: request only top-k blocks
KV-->>OUT: gather selected keys and values
Q->>OUT: full attention over the small set
Note over IDX,KV: heavy read shrinks from L to k
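The same sequence can be sketched as one decoding step in plain Python: a cheap score over every cached token, a top-k cut, then ordinary softmax attention over only the survivors. The separate low-dimensional `index_q` and `index_keys` are hypothetical stand-ins for whatever cheap scorer a real system uses; this is not DeepSeek's or NSA's kernel.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def sparse_decode_step(q, index_q, index_keys, keys, values, k=4):
    """One decode step: cheap scores over all tokens, full attention over top-k."""
    scores = [dot(index_q, ik) for ik in index_keys]               # cheap pass over L
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    top.sort()                                                      # restore position order
    logits = [dot(q, keys[i]) / math.sqrt(len(q)) for i in top]     # heavy pass over k
    m = max(logits)
    weights = [math.exp(x - m) for x in logits]
    z = sum(weights)
    dim = len(values[0])
    out = [sum(w / z * values[i][j] for w, i in zip(weights, top))
           for j in range(dim)]
    return out, top

# Toy cache of 10 tokens; the cheap scorer marks tokens 2, 5 and 7 as important.
keys = [[float(i % 3), 1.0] for i in range(10)]
values = [[float(i)] for i in range(10)]
index_keys = [[1.0 if i in (2, 5, 7) else 0.0] for i in range(10)]
out, top = sparse_decode_step([0.0, 1.0], [1.0], index_keys, keys, values, k=3)
```

The cheap pass still touches all cached positions, which is why it has to stay much lighter than the attention it replaces; only the heavy read of full keys and values shrinks to `k`.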
The architecture view shows where the branches sit inside a layer. Query heads are grouped (NSA builds on Grouped-Query Attention) so that every head in a group shares one selection decision, which keeps the gather coalesced across heads.
graph TD
IN[Hidden state] --> QP[Query projection]
QP --> CB[Compression branch]
QP --> SB[Selection branch]
QP --> WB[Window branch]
CB --> GT{Gate MLP}
SB --> GT
WB --> GT
GT --> MERGE[Weighted sum]
MERGE --> OUTP[Output projection]
OUTP --> NEXT[Next layer]
classDef blue fill:#1e40af,stroke:#3b82f6,stroke-width:1px,color:#fff
classDef purple fill:#6d28d9,stroke:#a78bfa,stroke-width:1px,color:#fff
classDef teal fill:#0e7490,stroke:#22d3ee,stroke-width:1px,color:#fff
classDef slate fill:#334155,stroke:#64748b,stroke-width:1px,color:#e2e8f0
class IN,QP blue
class CB,SB,WB,GT purple
class MERGE teal
class OUTP,NEXT slate
The contrast with post-hoc methods is cleanest as a flow of gradients. In a bolt-on scheme, the selection is a hard, non-differentiable cut applied after training; in a native scheme, training sees the same sparse path inference will use.
flowchart TB
subgraph bolton["Bolt-on"]
T1[Train with full attention] --> P1[Prune KV at inference]
P1 --> M1[Train and test mismatch]
end
subgraph nativegrp["Native"]
T2[Train with sparse path] --> P2[Same path at inference]
P2 --> M2[No mismatch]
end
classDef rose fill:#be123c,stroke:#fb7185,stroke-width:1px,color:#fff
classDef emerald fill:#047857,stroke:#34d399,stroke-width:1px,color:#fff
class T1,P1,M1 rose
class T2,P2,M2 emerald
[IMAGE: Annotated trace snippet rendered as a figure, showing for a single decoding step the indexer scores as a bar strip over 64k positions, with the selected top-k blocks highlighted and a caption giving the count of tokens actually read.]
By the Numbers
The headline NSA results are speedups measured on 64k-length sequences against a tuned full-attention baseline, with model quality held at or above the dense model. NSA was pretrained as a 27B-parameter model (with Grouped-Query Attention and Mixture-of-Experts layers) on 260B tokens, then evaluated on knowledge, reasoning, and coding benchmarks including MMLU, GSM8K, MATH, and HumanEval (Yuan et al., 2025, arXiv:2502.11089).
| Stage | NSA speedup at 64k | Source |
|---|---|---|
| Decoding | up to 11.6x | NSA paper, Figure 1 |
| Forward pass | up to 9.0x | NSA paper |
| Backward pass | up to 6.0x | NSA paper |
The asymmetry is informative. Decoding gains most because it is memory-bound: the win comes from reading fewer KV entries, and selection shrinks that read the most. Forward and backward passes are more compute-bound, so the gain tracks the reduction in arithmetic rather than memory traffic.
[IMAGE: Grouped bar chart of NSA speedup factors (decode, forward, backward) at 64k, with each bar annotated by whether the stage is memory-bound or compute-bound.]
The complexity picture explains why the gap widens with length:
| Method | Per-token attention cost | KV read per step | Notes |
|---|---|---|---|
| Full attention | \(O(L)\) | \(O(L)\) | quadratic over a sequence, \(O(L^2)\) |
| Fixed sparse (Longformer) | \(O(w)\) | \(O(w)\) | \(w\) = window, but pattern is static |
| Native selection (NSA, DSA) | \(O(k)\) | \(O(k)\) | \(k \ll L\), selected per query |
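The rows of this table can be summed over a whole sequence to see how the gap opens up. The sketch below counts only query-key pairs in the main attention (it ignores the cost of any indexer or compression pass), and the budget of 2,048 is an illustrative value, not a figure taken from the sources.

```python
def pairs_full(L):
    """Causal full attention: position t scores t keys."""
    return L * (L + 1) // 2

def pairs_capped(L, cap):
    """Position t scores min(t, cap) keys: a window of size cap or a top-k budget."""
    if L <= cap:
        return pairs_full(L)
    return pairs_full(cap) + (L - cap) * cap

L = 128_000
budget = 2_048  # hypothetical window or top-k size
full = pairs_full(L)
sparse = pairs_capped(L, budget)
ratio = full / sparse
```

For short inputs the two counts are identical, because every position still fits inside the budget; the savings only appear once \(L\) is well past \(k\), and they grow roughly in proportion to \(L / k\) after that.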
DSA reduces the dominant attention term from \(O(L^2)\) toward \(O(Lk)\) with \(k \ll L\), and DeepSeek translated that into an API price cut of more than 50%, to under three cents per million cache-hit input tokens, when it shipped V3.2-Exp (VentureBeat, 2025; TechCrunch, 2025). V3.2-Exp is a 685B-parameter open-weight model supporting a 128k context, released under the MIT license.
[IMAGE: Log-log plot of attention FLOPs versus sequence length from 4k to 1M, three lines for full O(L squared), fixed-window O(Lw), and native top-k O(Lk), with a shaded band marking where the quadratic curve crosses the cost of the rest of the model.]
A Concrete Example
Take one query at position 64,000 in a 64k-token context, using NSA's published configuration: compression block length 32, selection of 16 blocks at block size 64, and a 512-token sliding window.
Full attention would have this query read all 64,000 keys and values. NSA instead does the following.
The compression branch sees the history as blocks of 32 tokens. Treating the blocks as non-overlapping for simplicity, it attends over about \(64000 / 32 = 2000\) compressed keys (a stride of 16 would roughly double that count). From the compression-branch scores it ranks the underlying selection blocks and keeps the top 16 at block size 64, which is \(16 \times 64 = 1024\) tokens read at full resolution. The sliding window adds the most recent 512 tokens directly.
The query therefore attends over roughly \(2000 + 1024 + 512 \approx 3536\) key-value entries instead of 64,000, about 5.5% of the dense work. The gate then blends the three results. Suppose for this query the relevant fact, a definition introduced near token 12,000, lies inside one selection block; the compression branch flags that region as high-scoring, the selection branch pulls those 64 tokens in at full fidelity, and the sliding window contributes nothing useful because the fact is far away, so its gate weight stays low. The output is nearly identical to full attention at a fraction of the reads.
| Branch | Keys read | What it contributes |
|---|---|---|
| Compression | ~2,000 | coarse map of all 64k tokens |
| Selection | 1,024 | exact text of the 16 best blocks |
| Sliding window | 512 | recent local context |
| Total | ~3,536 | versus 64,000 for full attention |
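The table's arithmetic is easy to reproduce and to vary. This sketch counts entries per branch for one query; the function name is hypothetical, branches are counted separately even where they might overlap, and the default compression stride of 32 matches the non-overlapping simplification used in the table.

```python
def nsa_reads(t, cmp_block=32, cmp_stride=32, n_select=16, sel_block=64, window=512):
    """Key-value entries a single query at position t touches, per branch."""
    compressed = (t - cmp_block) // cmp_stride + 1 if t >= cmp_block else 0
    selected = min(n_select * sel_block, t)
    local = min(window, t)
    return {
        "compression": compressed,
        "selection": selected,
        "window": local,
        "total": compressed + selected + local,
    }

reads = nsa_reads(64_000)             # the configuration in the table
share = reads["total"] / 64_000       # fraction of the dense read

# Same query with the overlapping stride of 16 mentioned for the compression blocks.
reads_overlap = nsa_reads(64_000, cmp_stride=16)
share_overlap = reads_overlap["total"] / 64_000
```

With overlapping compression blocks the total rises to 5,535 entries, a little under 9% of the dense read, so the qualitative conclusion holds either way.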
Replay this on paper for a different query, one whose answer is purely local, and the gate would instead lean on the sliding window and down-weight selection. The same machinery serves both because the gate is computed per query from learned weights.
[IMAGE: A heatmap over the 64k context for this worked query, columns colored by which branch read each token (compression coarse band, selection hot blocks, sliding-window tail), with the relevant block near token 12k circled.]
Where It Breaks
Native sparsity is not free of sharp edges. The first is the gap between theoretical sparsity and wall-clock speed. A method can attend to 5% of tokens and still run slowly if those tokens are scattered, because GPUs move memory in contiguous chunks. NSA's blockwise selection exists precisely to keep reads coalesced; token-level selection that ignores this can be sparse on paper and slow in practice, which is the trap the NSA authors call out when they argue many prior methods do not deliver their promised speedups in real kernels.
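One rough way to see the first problem is to count how many separate contiguous runs a selection breaks into. Treating the run count as a proxy for the number of memory reads is an assumption, and a crude one, but it separates the two cases cleanly. The positions below are made up.

```python
def contiguous_runs(indices):
    """Number of maximal runs of consecutive positions in a selection."""
    runs = 0
    prev = None
    for i in sorted(set(indices)):
        if prev is None or i != prev + 1:
            runs += 1
        prev = i
    return runs

# 256 tokens chosen as four whole blocks of 64 (hypothetical block ids).
blockwise = [i for b in (3, 40, 77, 120) for i in range(b * 64, (b + 1) * 64)]
# 256 tokens chosen individually, none adjacent to another.
scattered = list(range(0, 256 * 31, 31))

block_runs = contiguous_runs(blockwise)
scatter_runs = contiguous_runs(scattered)
```

Both selections keep the same number of tokens, so they are equally sparse on paper, yet one needs four contiguous reads and the other needs 256.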
Second, low selection budgets can miss a needle. If the one relevant token sits in a block the scorer ranks 17th and only 16 are kept, the model never sees it. Native training mitigates this because the model learns to make important content score highly, but adversarial or highly dispersed retrieval tasks can still expose the budget.
Third, short sequences see little or no benefit and can pay overhead. When \(L\) is only a few thousand tokens, the indexer, gate, and three-branch bookkeeping add cost that full attention avoids, which is why MoBA is built to fall back to full attention rather than force sparsity everywhere (Lu et al., 2025, arXiv:2502.13189).
Fourth, the indexer is an approximation trained with its own objective. DSA's lightning indexer is a lightweight scorer, and a scorer that drifts from the true attention distribution will select the wrong tokens; keeping it cheap (FP8, few heads) and still accurate is a live engineering constraint, not a solved problem.
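A simple diagnostic for this kind of drift is to compare the tokens the scorer keeps against the tokens dense attention would have weighted most. The sketch below computes that overlap as a recall figure; the metric, the function names, and the toy numbers are illustrative assumptions rather than anything the cited sources prescribe.

```python
def top_k(scores, k):
    """Set of the k highest-scoring positions."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(order[:k])

def selection_recall(indexer_scores, attention_weights, k):
    """Share of the true top-k attention tokens that the indexer's top-k also keeps."""
    truth = top_k(attention_weights, k)
    picked = top_k(indexer_scores, k)
    return len(truth & picked) / k

# Hypothetical dense attention weights for one query over six tokens.
attn = [0.02, 0.40, 0.03, 0.30, 0.05, 0.20]
faithful = [0.1, 0.9, 0.2, 0.8, 0.3, 0.7]  # ranks tokens like dense attention does
drifted = [0.9, 0.8, 0.7, 0.1, 0.2, 0.3]   # ranks mostly the wrong tokens

recall_faithful = selection_recall(faithful, attn, k=3)
recall_drifted = selection_recall(drifted, attn, k=3)
```

Tracking a number like this on held-out and out-of-distribution prompts would show a scorer degrading before end-task accuracy does, although it requires running dense attention as a reference.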
Alternative Designs
The three native methods share a goal but make different bets about where the selection lives and how coarse it is.
| Approach | Mechanism | Strengths | Weaknesses | Best when |
|---|---|---|---|---|
| NSA | Three branches (compress, select, window) with a learned gate | Hardware-aligned blockwise reads, strong long-context quality | More moving parts, tuned block sizes | Training a model natively for long context |
| MoBA | MoE-style routing of queries to key blocks | Switches between full and sparse, drop-in for existing models | Block routing granularity limits precision | Retrofitting long context onto a trained model |
| DSA (V3.2) | Lightning indexer scores tokens, top-k selection feeds MLA | Fine-grained token selection, large production cost cut | Indexer accuracy is a separate thing to train | Serving very long contexts cheaply at scale |
| Post-hoc KV (H2O, Quest) | Select or evict KV entries at inference only | No retraining, works on any dense model | Train-test mismatch, accuracy ceiling | Cutting memory on an existing dense model |
The honest comparison is that post-hoc methods remain the pragmatic choice when retraining is off the table, and they have improved a lot. The native methods win when you control pretraining and want the model's weights themselves to expect sparsity. MoBA occupies a useful middle ground, since its ability to toggle back to full attention lets teams adopt it without betting the whole training run.
[IMAGE: A 2x2 positioning chart with axes "requires retraining" and "selection granularity (coarse to fine)", placing Longformer, H2O, Quest, MoBA, NSA, and DSA as labeled points.]
How It Is Used in Practice
The clearest production signal is DeepSeek-V3.2-Exp, which ships DSA in an open-weight 685B-parameter model and passed the savings to customers as a published price cut the same day (DeepSeek API Docs, 2025). The model has been integrated into serving stacks; the vLLM team documented running V3.2-Exp's fine-grained sparse attention in production inference (vLLM Blog, 2025). The selection step runs in low precision (FP8) so that the scoring overhead stays small relative to the attention it saves.
MoBA reports deployment behind Kimi's long-context requests, where the ability to switch between sparse and full attention matters operationally: short conversations use full attention, long documents trigger the sparse path, and the same weights serve both (Lu et al., 2025, arXiv:2502.13189; MoonshotAI MoBA repository).
For teams not training their own frontier model, the practical takeaway is narrower. Native sparsity is a pretraining-time decision; you adopt it by choosing a model built with it, not by flipping a flag at inference. What you can do at inference is the post-hoc family, which is why H2O- and Quest-style KV management still ships in serving frameworks even as native methods spread.
[IMAGE: A before/after bar comparison of published API input-token price for a long-context request on a dense-attention model versus DeepSeek V3.2-Exp, annotated with the more-than-50% reduction.]
Insights Worth Remembering
The decisive change is not "less attention" but "attention the model was trained to expect." Sparsity stops being a lossy approximation of a dense model and becomes the model's own behavior.
Hardware alignment is part of the algorithm, not an afterthought. Blockwise selection beats token-wise selection in practice not because it is more accurate but because GPUs reward contiguous reads, and a method that ignores memory layout can be sparse and slow at once.
Decoding and training feel sparsity differently. Decoding is memory-bound, so it benefits most from reading fewer KV entries; the forward and backward passes are more compute-bound, so their gains track arithmetic saved.
Coarse and fine views are complementary. Compression tells the model where to look, selection tells it what is there, and neither alone is enough, which is why the strongest design keeps both plus a local window.
The price of long context is now a design variable. When attention cost is \(O(Lk)\) instead of \(O(L^2)\), a vendor can choose how much context to sell at what margin, and the September 2025 price cut is what that choice looks like from the outside.
Open Questions
Whether native sparsity generalizes to the hardest retrieval tasks remains genuinely open. The benchmark results show parity with full attention on standard suites, but they do not establish whether a fixed selection budget can match dense attention on adversarial needle-in-a-haystack tasks where the relevant token is engineered to score low. This is where the evidence is thinnest.
The right granularity is unsettled. NSA bets on blocks for hardware reasons; DSA pushes toward finer token-level selection with a cheap indexer. Which trade-off dominates likely depends on hardware that is still changing, so today's answer may not be next year's.
Indexer co-training is an active problem. A separate scorer that must stay both cheap and faithful to the true attention distribution is a moving target, and how to train it so it does not silently degrade on out-of-distribution inputs is not settled.
Finally, the interaction with other efficiency tricks (quantization, Multi-head Latent Attention, speculative decoding) is largely unexplored in the open literature. DeepSeek combines DSA with MLA in V3.2, which suggests these stack, but the general rules for composing sparse attention with the rest of the efficiency toolkit are still being written, and most public claims here are extrapolation rather than measurement.
Sources and Further Reading
Foundational Papers
- Vaswani et al., 2017, Attention Is All You Need, arXiv:1706.03762
- Beltagy et al., 2020, Longformer: The Long-Document Transformer, arXiv:2004.05150
- Zaheer et al., 2020, Big Bird: Transformers for Longer Sequences, arXiv:2007.14062
The Native Sparse Attention Wave
- Yuan et al., 2025, Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention, arXiv:2502.11089 (ACL 2025)
- Lu et al., 2025, MoBA: Mixture of Block Attention for Long-Context LLMs, arXiv:2502.13189
- DeepSeek-AI, 2025, Introducing DeepSeek-V3.2-Exp, API documentation
Inference-Time KV Selection
- Zhang et al., 2023, H2O: Heavy-Hitter Oracle for Efficient Generative Inference, arXiv:2306.14048
- Xiao et al., 2023, Efficient Streaming Language Models with Attention Sinks, arXiv:2309.17453
- Tang et al., 2024, Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference, arXiv:2406.10774
Technical Blogs and Engineering Notes