Half Mamba, Half Attention: Why Hybrid State-Space Models Took Over

In December 2023, a paper called Mamba arrived with a claim that read like a challenge to the entire field: a sequence model with no attention at all, scaling linearly with length, matching Transformers on language while running five times faster at generation (Gu and Dao, 2023, Mamba: Linear-Time Sequence Modeling with Selective State Spaces, arXiv:2312.00752). For a few months the question on everyone's mind was whether attention was finished.

It was not. Two years later the verdict is stranger and more interesting than "Mamba wins" or "Mamba loses." The strongest open-weight efficiency models of 2025, NVIDIA's Nemotron-H, AI21's Jamba, and TII's Falcon-H1, are all hybrids. They keep a small fraction of attention layers and replace the rest with state-space layers. The pure-SSM revolution did not happen. A quieter takeover did.

Why this matters: The cost of running a long-context model is dominated not by its parameters but by its KV cache, which grows with every token in the conversation. State-space layers carry a fixed-size state instead. Mixing the two is how production models in 2025 cut inference memory by an order of magnitude without losing the recall that attention provides.

TL;DR

A state-space model (SSM) processes a sequence with a fixed-size recurrent state, giving it $O(1)$ memory and $O(1)$ compute per generated token. Attention instead keeps a KV cache, so both its memory and its per-token compute grow as $O(L)$.
Mamba's contribution was selectivity: making the recurrence input-dependent so the model can choose what to remember, which is what earlier SSMs like S4 could not do. The cost is that the efficient convolution view disappears and you need a hardware-aware parallel scan.
Pure SSMs have a provable ceiling. They cannot reliably copy long strings from context (Jelassi et al., 2024), and they cannot express computations outside the complexity class $TC^0$, which rules out hard state-tracking tasks (Merrill et al., 2024). These are not engineering bugs; they follow from the fixed-size state and the linear recurrence.
Attention has the opposite profile: perfect random-access recall, quadratic cost. The hybrid bet is that you only need a few attention layers to recover recall, and Mamba layers everywhere else for efficiency.
Real ratios converged near one attention layer for every seven to eleven SSM layers. Nemotron-H reports up to 3x faster inference than comparable Transformers at matched accuracy (NVIDIA, 2025, arXiv:2504.03624).
The deciding factor was not training quality but inference economics during long-context and reasoning workloads, where the KV cache, not the weights, fills the GPU.

At a Glance

flowchart LR
  X[Token stream] --> ATT["Attention layer<br/>random-access recall"]
  X --> SSM["Mamba layer<br/>fixed-size state"]
  ATT --> KV["KV cache<br/>grows with length"]
  SSM --> ST["State vector<br/>constant size"]
  KV --> OUT[Next token]
  ST --> OUT
  class X blue
  class ATT,SSM purple
  class KV amber
  class ST emerald
  class OUT teal
  classDef blue fill:#1e40af,stroke:#3b82f6,stroke-width:1px,color:#fff
  classDef purple fill:#6d28d9,stroke:#a78bfa,stroke-width:1px,color:#fff
  classDef teal fill:#0e7490,stroke:#22d3ee,stroke-width:1px,color:#fff
  classDef amber fill:#b45309,stroke:#fbbf24,stroke-width:1px,color:#fff
  classDef emerald fill:#047857,stroke:#34d399,stroke-width:1px,color:#fff

A hybrid model routes the same token stream through both kinds of layer. The attention layers pay a memory cost that scales with the conversation; the Mamba layers do not. The art is in the ratio.

[IMAGE: Side-by-side schematic of an attention layer's growing KV cache versus an SSM layer's fixed state vector, with a token counter ticking from 1 to 100,000 and the cache bar growing while the state bar stays flat]

Before the Hybrids

The story starts with a problem attention created for itself. Self-attention compares every token to every other token, so both its compute and its memory grow quadratically with sequence length. For short inputs nobody cared. As context windows stretched from a few hundred tokens toward hundreds of thousands, the quadratic term became the whole bill.

Recurrent networks had the opposite shape. An RNN carries a fixed-size hidden state forward one step at a time, so memory is constant and compute is linear in length. The trouble was twofold: RNNs are slow to train because the recurrence is inherently sequential, and they forget. Gradients vanish, long-range dependencies wash out, and content-based recall is poor.

The structured state-space line of work tried to get the RNN's inference profile without its training and memory weaknesses. S4 (Gu, Goel, and Ré, 2021, Efficiently Modeling Long Sequences with Structured State Spaces, arXiv:2111.00396) showed that a carefully parameterized linear SSM could be unrolled into a long convolution, trained in parallel like a CNN, and still run as a recurrence at inference. On the Long Range Arena benchmark, built specifically to test dependencies over thousands of steps, S4 beat Transformers decisively and generated far faster. But S4's dynamics were fixed and input-independent, which made it weak at the content-based reasoning language demands.

[IMAGE: Two views of the same S4 layer side by side, the parallel convolution form used for training and the step-by-step recurrence form used for inference, with an arrow showing they compute the same function]

Mamba broke that ceiling by making the SSM selective. Then came the realization, formalized in Mamba-2, that a structured SSM and a form of masked attention are two algorithms for the same underlying operation (Dao and Gu, 2024, Transformers are SSMs, arXiv:2405.21060). That duality is what made hybrids feel natural rather than like bolting two unrelated machines together.

timeline
  title From recurrence to hybrids
  2017 : Transformer : attention is all you need
  2021 : S4 : structured SSM wins Long Range Arena
  2023 : Mamba : selective SSM matches Transformers
  2024 : Mamba-2 and Jamba : SSM-attention duality, first hybrid at scale
  2025 : Nemotron-H, Falcon-H1 : hybrids as default efficiency models

How Selective State Spaces Actually Work

A continuous state-space model is a pair of linear equations borrowed from control theory. A hidden state $h(t)$ evolves under a matrix $A$, is driven by the input $x(t)$ through $B$, and is read out through $C$:

\[h'(t) = A\,h(t) + B\,x(t), \qquad y(t) = C\,h(t)\]

To run this on token sequences you discretize it with a step size $\Delta$, which turns the continuous matrices into their discrete counterparts $\bar{A}$ and $\bar{B}$. The model then becomes a plain recurrence over positions:

\[h_t = \bar{A}\,h_{t-1} + \bar{B}\,x_t, \qquad y_t = C\,h_t\]

Read that recurrence carefully and the inference economics fall out immediately. To produce the output $y_t$ you need only the previous state $h_{t-1}$ and the current input. The state $h_t$ has a fixed size that does not depend on how many tokens came before. There is no cache to grow. Generation is constant time and constant memory per token, no matter whether you are 100 tokens or 100,000 tokens into the sequence.
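
The recurrence above can be sketched in a few lines of plain Python. This is a minimal illustration, assuming a diagonal $\bar{A}$ (one decay factor per state channel) and made-up parameter values; the names `ssm_step` and `generate` are hypothetical and not taken from any library.

```python
# Sketch of h_t = Abar h_{t-1} + Bbar x_t, y_t = C h_t.
# Assumptions: diagonal Abar, scalar input per step, illustrative values.

def ssm_step(h_prev, x_t, a_bar, b_bar, c):
    """Advance the state by one token and read out one output."""
    h_t = [a * h + b * x_t for a, h, b in zip(a_bar, h_prev, b_bar)]
    y_t = sum(ci * hi for ci, hi in zip(c, h_t))
    return h_t, y_t

def generate(xs, a_bar, b_bar, c):
    """Run the recurrence over a sequence, keeping only the latest state."""
    h = [0.0] * len(a_bar)
    ys = []
    for x_t in xs:
        h, y_t = ssm_step(h, x_t, a_bar, b_bar, c)
        ys.append(y_t)
    return h, ys

A_BAR = [0.9, 0.5, 0.1]
B_BAR = [1.0, 1.0, 1.0]
C = [0.2, 0.3, 0.5]

state_short, _ = generate([1.0] * 100, A_BAR, B_BAR, C)
state_long, _ = generate([1.0] * 100_000, A_BAR, B_BAR, C)
print(len(state_short), len(state_long))  # 3 3
```

The state holds three numbers whether the sequence has 100 tokens or 100,000, which is the constant-memory property in miniature.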

Why earlier SSMs underperformed

In S4 the matrices $\bar{A}$, $\bar{B}$, and $C$ are the same for every token. That makes the recurrence a linear time-invariant system, which has a beautiful property: unrolling it across the sequence is exactly a convolution, and convolutions train fast in parallel on a GPU. The price is that the model processes every token with identical dynamics. It cannot look at the current token and decide "this one matters, remember it" or "this is filler, skip it." For audio or pixels that is fine. For language, where a single name or number might need to survive thousands of tokens, it is fatal.
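
The equivalence between the recurrence and a convolution is easy to check numerically. The sketch below assumes a scalar state for readability and uses made-up values; it evaluates the convolution directly in $O(L^2)$ rather than with the FFT-based method that gives S4 its $O(L \log L)$ training cost.

```python
# Sketch: a time-invariant SSM computed two ways.
# Assumptions: scalar state, illustrative parameter values.

def run_recurrence(xs, a_bar, b_bar, c):
    """Step-by-step form, as used at inference."""
    h, ys = 0.0, []
    for x in xs:
        h = a_bar * h + b_bar * x
        ys.append(c * h)
    return ys

def run_convolution(xs, a_bar, b_bar, c):
    """Convolution form: kernel[k] = c * a_bar**k * b_bar, shared by every position."""
    kernel = [c * (a_bar ** k) * b_bar for k in range(len(xs))]
    return [
        sum(kernel[k] * xs[t - k] for k in range(t + 1))
        for t in range(len(xs))
    ]

xs = [1.0, 0.0, 2.0, -1.0, 0.5]
rec = run_recurrence(xs, a_bar=0.8, b_bar=1.0, c=0.5)
conv = run_convolution(xs, a_bar=0.8, b_bar=1.0, c=0.5)
print(max(abs(r - k) for r, k in zip(rec, conv)) < 1e-12)  # True
```

The kernel can be precomputed only because `a_bar`, `b_bar`, and `c` do not depend on the position. Once they vary per token, no single shared kernel exists.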

Selectivity, and what it costs

Mamba's move is to make $B$, $C$, and $\Delta$ functions of the input rather than fixed parameters. Now the recurrence can modulate itself token by token: a large $\Delta$ effectively resets the state to focus on new information, while a small $\Delta$ lets old state persist. This is the "selection" mechanism, and it is what closed the language gap.

[IMAGE: Heatmap of the step size delta across a sample sentence, spiking at content words like names and numbers and dropping near stopwords, illustrating what the model chooses to remember]

There is no free lunch. The moment the dynamics depend on the input, the system is no longer time-invariant, and the convolution trick that made S4 trainable evaporates. Mamba recovers parallel training a different way, with a hardware-aware parallel scan that keeps the state in fast SRAM and avoids ever writing the full set of intermediate states to slow GPU memory. This is the same IO-aware philosophy behind FlashAttention, applied to a recurrence. Mamba-2 then simplified the scan into a matrix-multiplication-friendly form, the structured state-space duality, which runs 2 to 8 times faster than the original Mamba scan on modern accelerators (Dao and Gu, 2024, arXiv:2405.21060).
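
What makes a parallel scan possible at all is that an input-dependent linear recurrence $h_t = a_t h_{t-1} + b_t$ can be written with an associative combine operator. The sketch below shows only that algebraic fact in plain Python, using a Hillis-Steele style scan; it is not Mamba's fused GPU kernel, and the function names are hypothetical.

```python
# Sketch: the recurrence h_t = a_t * h_{t-1} + b_t as an associative scan.
# Assumption: scalar state; real kernels do this per channel on the GPU.

def combine(left, right):
    """Compose two steps: apply `left` first, then `right`."""
    a_l, b_l = left
    a_r, b_r = right
    return (a_l * a_r, a_r * b_l + b_r)

def sequential_scan(a, b):
    h, out = 0.0, []
    for a_t, b_t in zip(a, b):
        h = a_t * h + b_t
        out.append(h)
    return out

def parallel_style_scan(a, b):
    """Inclusive scan in about log2(L) rounds; updates within a round are independent."""
    elems = list(zip(a, b))
    offset = 1
    while offset < len(elems):
        elems = [
            combine(elems[i - offset], elems[i]) if i >= offset else elems[i]
            for i in range(len(elems))
        ]
        offset *= 2
    return [b_t for _, b_t in elems]

a = [0.9, 0.5, 0.7, 0.2, 0.6]   # per-token decay (input-dependent in Mamba)
b = [1.0, 2.0, 3.0, 4.0, 5.0]   # per-token drive term
seq = sequential_scan(a, b)
par = parallel_style_scan(a, b)
print(max(abs(s - p) for s, p in zip(seq, par)) < 1e-12)  # True
```

Each round here is a list comprehension, but on parallel hardware every element of a round could be computed at once. This particular scheme does $O(L \log L)$ total work; work-efficient scans reduce that to $O(L)$.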

flowchart TD
  T["Token x_t"] --> P["Project to B_t, C_t, delta_t<br/>input-dependent"]
  P --> D["Discretize A, B with delta_t"]
  D --> U["Update state:<br/>h_t = Abar h_t-1 + Bbar x_t"]
  H["Previous state h_t-1"] --> U
  U --> R["Read out: y_t = C_t h_t"]
  U --> N["Carry h_t forward"]
  R --> Y["Output y_t"]
  class T blue
  class P,D,U,R purple
  class H,N slate
  class Y teal
  classDef blue fill:#1e40af,stroke:#3b82f6,stroke-width:1px,color:#fff
  classDef purple fill:#6d28d9,stroke:#a78bfa,stroke-width:1px,color:#fff
  classDef teal fill:#0e7490,stroke:#22d3ee,stroke-width:1px,color:#fff
  classDef slate fill:#334155,stroke:#64748b,stroke-width:1px,color:#e2e8f0

The diagram shows the per-token computation. Notice that nothing in the loop references earlier tokens directly: the past reaches the current step only through the previous state $h_{t-1}$. That locality is the source of both the efficiency and, as we will see, the fundamental limitation.
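
The steps in the diagram can be sketched as a single function. This is a toy under stated assumptions: the input is a scalar, the projections are plain scalar weights rather than learned linear layers, $A$ is diagonal and negative, and discretization uses the exact zero-order-hold formula for a diagonal $A$. All weights and the name `selective_step` are hypothetical.

```python
import math

def softplus(z):
    """Keeps the step size positive."""
    return math.log1p(math.exp(z))

def selective_step(h_prev, x_t, a, w_b, w_c, w_delta, delta_bias):
    # Project the input to per-token B_t, C_t, delta_t (input-dependent).
    delta_t = softplus(w_delta * x_t + delta_bias)
    b_t = [w * x_t for w in w_b]
    c_t = [w * x_t for w in w_c]
    # Discretize with zero-order hold for diagonal A (every a_i < 0).
    a_bar = [math.exp(delta_t * a_i) for a_i in a]
    b_bar = [(ab - 1.0) / a_i * b_i for ab, a_i, b_i in zip(a_bar, a, b_t)]
    # Update the state and read out.
    h_t = [ab * h + bb * x_t for ab, h, bb in zip(a_bar, h_prev, b_bar)]
    y_t = sum(c * h for c, h in zip(c_t, h_t))
    return h_t, y_t, delta_t

A = [-1.0, -0.5]                      # diagonal continuous-time A
W_B, W_C = [1.0, 0.5], [0.3, 0.7]     # toy projection weights

for x_t in (3.0, -3.0):
    from_full, _, delta_t = selective_step([10.0, 10.0], x_t, A, W_B, W_C, 2.0, 0.0)
    from_empty, _, _ = selective_step([0.0, 0.0], x_t, A, W_B, W_C, 2.0, 0.0)
    retained = from_full[0] - from_empty[0]  # share of the old state that survived
    print(f"x={x_t:+.1f}  delta={delta_t:.4f}  old state retained={retained:.4f}")
```

With these toy weights, the input that produces a large step size nearly erases the previous state, while the input that produces a small step size leaves it almost untouched.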

Seeing It in Motion

The cleanest way to feel the difference between the two layer types is to watch them generate a token deep into a long sequence. Attention reaches back across the entire history; the SSM consults only its compressed state.

sequenceDiagram
  participant U as Decoder step
  participant A as Attention layer
  participant K as KV cache (length L)
  participant M as Mamba layer
  participant S as State (fixed)
  U->>A: query for token L+1
  A->>K: read all L keys/values
  K-->>A: O(L) work, O(L) memory
  A-->>U: attended output
  U->>M: same token L+1
  M->>S: read fixed state
  S-->>M: O(1) work, O(1) memory
  M->>S: write updated state
  M-->>U: recurrent output
  Note over K,S: At L = 100k the gap is enormous
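
The two decode paths in the sequence diagram can be imitated with a toy single-head attention step and a toy diagonal SSM step. This is a sketch with made-up inputs and hypothetical function names, meant only to show which data structure grows.

```python
import math

def attention_decode_step(q, k_new, v_new, cache):
    """One new token, single head: every cached key and value is read."""
    cache.append((k_new, v_new))
    scale = math.sqrt(len(q))
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k, _ in cache]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]   # numerically stable softmax
    total = sum(weights)
    return [
        sum(w * v[d] for w, (_, v) in zip(weights, cache)) / total
        for d in range(len(v_new))
    ]

def ssm_decode_step(x_t, state, a_bar, b_bar, c):
    """One new token, diagonal SSM: the state is overwritten in place."""
    for i in range(len(state)):
        state[i] = a_bar[i] * state[i] + b_bar[i] * x_t
    return sum(ci * hi for ci, hi in zip(c, state))

cache = []
state = [0.0] * 4
for t in range(500):
    x_t = float(t % 3)
    attention_decode_step([1.0, 0.0], [x_t, 1.0], [x_t, -x_t], cache)
    ssm_decode_step(x_t, state, [0.9] * 4, [1.0] * 4, [0.25] * 4)

print(len(cache), len(state))  # 500 4
```

After 500 steps the attention path holds 500 key-value pairs and reads all of them for the next token, while the SSM path still holds four numbers.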

Now consider the decision a model architect faces: given a fixed budget of layers, which should be attention and which should be Mamba? The answer depends on what each layer is being asked to do.

flowchart TD
  Q{Does this layer need<br/>exact long-range recall?}
  Q -->|Yes, must copy or retrieve| ATT[Use attention]
  Q -->|No, local mixing is enough| SSM[Use Mamba]
  ATT --> C1["Pay O(L) KV cache"]
  SSM --> C2["Pay O(1) state"]
  C1 --> B["Most layers: Mamba<br/>A few layers: attention"]
  C2 --> B
  class Q amber
  class ATT rose
  class SSM emerald
  class C1 amber
  class C2 emerald
  class B teal
  classDef amber fill:#b45309,stroke:#fbbf24,stroke-width:1px,color:#fff
  classDef rose fill:#be123c,stroke:#fb7185,stroke-width:1px,color:#fff
  classDef emerald fill:#047857,stroke:#34d399,stroke-width:1px,color:#fff
  classDef teal fill:#0e7490,stroke:#22d3ee,stroke-width:1px,color:#fff

This is the entire design philosophy of the hybrids in one picture. You do not need every layer to have perfect recall. You need some layers to, and the rest can be cheap.

By the Numbers

The complexity table below is the reason the field moved. The asymmetry between training and decoding is what matters: during autoregressive generation, attention pays for its cache on every single token.

Property	Attention	S4 (fixed SSM)	Mamba (selective)
Training compute	$O(L^2)$	$O(L \log L)$ via convolution	$O(L)$ via parallel scan
Decode compute / token	$O(L)$	$O(1)$	$O(1)$
Decode memory (total)	$O(L)$ KV cache	$O(1)$ state	$O(1)$ state
Input-dependent dynamics	yes	no	yes
Exact long-range recall	yes	weak	weak

The reported speedups track that table. Mamba claimed roughly 5x higher generation throughput than a similarly sized Transformer (Gu and Dao, 2023, arXiv:2312.00752). Mamba-2's duality kernel is 2 to 8x faster than Mamba's original scan (Dao and Gu, 2024, arXiv:2405.21060). At the model level, NVIDIA reports Nemotron-H being up to 3x faster at inference than Qwen-2.5 and Llama-3.1 models of comparable size while matching or beating their accuracy (NVIDIA, 2025, arXiv:2504.03624).

The hybrid ratios that production teams settled on are themselves data. Jamba interleaves blocks containing roughly one attention layer for every seven other layers, mixing in mixture-of-experts to grow capacity without growing active parameters, and fits a 256K-token context with a far smaller KV footprint than a pure Transformer (Lieber et al., 2024, Jamba, arXiv:2403.19887; AI21, 2024, Jamba-1.5, arXiv:2408.12570). Nemotron-H replaces the large majority of self-attention layers with Mamba, keeping only a small fraction as attention. Falcon-H1 takes a different geometry entirely, running attention and Mamba-2 heads in parallel inside one mixer block and tuning the channel ratio between them, with a 34B model competitive against 70B-class Transformers (Falcon-LLM Team, 2025, arXiv:2507.22448).

[IMAGE: Grouped bar chart of decode throughput (tokens/sec) versus context length at 4K, 32K, 128K for a pure Transformer, pure Mamba, and a hybrid, showing the Transformer curve collapsing as length grows while Mamba and hybrid stay nearly flat]

A Concrete Example

Numbers make the KV cache problem visceral. Take a Transformer decoder with 32 layers, hidden size 4096, using grouped-query attention with 8 key-value heads of dimension 128, serving in fp16 (2 bytes per value). The KV cache stores a key and a value per layer per head per token:

\[\text{bytes/token} = 2 \times n_{\text{layers}} \times n_{\text{kv heads}} \times d_{\text{head}} \times 2\]

Plugging in: $2 \times 32 \times 8 \times 128 \times 2 = 131{,}072$ bytes, about 128 KB for every token in context. These are illustrative figures for a representative configuration, not a measurement of a specific released model, but the arithmetic is exact.
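
The same arithmetic as a small helper, for readers who want to plug in their own configuration. The function name is hypothetical, and the second call (full multi-head attention with 32 key-value heads, which is hidden size 4096 divided by head dimension 128) is an added comparison rather than a figure from the example above.

```python
def kv_bytes_per_token(n_layers, n_kv_heads, d_head, bytes_per_value=2):
    """KV cache bytes added per token; the leading 2 counts one key plus one value."""
    return 2 * n_layers * n_kv_heads * d_head * bytes_per_value

gqa = kv_bytes_per_token(n_layers=32, n_kv_heads=8, d_head=128)
mha = kv_bytes_per_token(n_layers=32, n_kv_heads=32, d_head=128)
print(gqa, gqa // 1024)   # 131072 128
print(mha // gqa)         # 4
```

Grouped-query attention already cuts the cache fourfold relative to one key-value head per query head in this configuration; the hybrid savings discussed here stack on top of that.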

Now play the conversation forward:

Context length	Transformer KV cache	Hybrid (4 attention layers)	Mamba state
8K tokens	~1.0 GB	~0.13 GB	fixed, ~MBs
32K tokens	~4.0 GB	~0.5 GB	fixed, ~MBs
128K tokens	~16 GB	~2.0 GB	fixed, ~MBs
256K tokens	~32 GB	~4.0 GB	fixed, ~MBs

At 128K tokens the full-attention model is spending 16 GB on cache alone, before a single weight is loaded, and that cost is per concurrent request. Replace 28 of the 32 attention layers with Mamba and the cache shrinks by a factor of eight, because only 4 layers still maintain one. The Mamba layers contribute a fixed-size recurrent state regardless of length, measured in megabytes, which rounds to noise on this chart.
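
The table rows can be reproduced directly from the per-token formula. This sketch assumes "K" means 1024 tokens and reports binary gigabytes (GiB), which is how the rounded figures in the table work out; the Mamba state is left out because the example does not specify an SSM configuration.

```python
def kv_cache_gib(context_tokens, attention_layers,
                 n_kv_heads=8, d_head=128, bytes_per_value=2):
    """KV cache size in GiB for one request at the given context length."""
    per_token = 2 * attention_layers * n_kv_heads * d_head * bytes_per_value
    return context_tokens * per_token / 2**30

rows = [("8K", 8 * 1024), ("32K", 32 * 1024), ("128K", 128 * 1024), ("256K", 256 * 1024)]
for label, tokens in rows:
    full = kv_cache_gib(tokens, attention_layers=32)
    hybrid = kv_cache_gib(tokens, attention_layers=4)
    print(f"{label:>5}  full={full:7.3f} GiB  hybrid={hybrid:6.3f} GiB  ratio={full / hybrid:.0f}x")
```

The ratio stays at 8x on every row because the cache scales linearly with the number of attention layers, independent of context length.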

This is the whole argument for hybrids in one table. The weights might be 16 GB for a model this size; at long context the KV cache of a pure Transformer matches or exceeds the entire weight memory. A serving cluster that can hold eight Transformer requests in memory can hold dozens of hybrid requests, which is the difference between a viable product and an unviable one. The savings show up exactly where modern workloads live: long documents, long agent transcripts, and reasoning chains that emit tens of thousands of tokens before answering.

Where It Breaks

If pure SSMs were strictly better, hybrids would contain no attention layers at all. The reason they always do comes down to two proven limitations of the fixed-size linear recurrence.

The first is copying and retrieval. Jelassi and colleagues proved that a two-layer Transformer can copy strings of length exponential in its size, while any generalized state-space model is fundamentally bounded by its fixed state: once the string to be copied exceeds what the state can hold, recall degrades (Jelassi et al., 2024, Repeat After Me, arXiv:2402.01032). Empirically, Transformers dramatically outperform SSMs at copying and at retrieving a specific fact from a long context. This is intuitive: attention keeps every token addressable, whereas an SSM has compressed the past into a vector and cannot un-compress an arbitrary detail on demand.
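
A simple counting argument gives a feel for why the state size caps copying. This is a pigeonhole illustration, not the theorem from the paper: a state that can take at most $2^{b}$ distinct values cannot distinguish more than $2^{b}$ strings, so it cannot exactly store an arbitrary string needing more than $b$ bits. The state size, precision, and vocabulary below are hypothetical numbers chosen for round arithmetic.

```python
import math

def max_copyable_tokens(state_dims, bits_per_dim, vocab_size):
    """Loose upper bound on the length of an arbitrary string a fixed state can hold exactly."""
    state_bits = state_dims * bits_per_dim
    bits_per_token = math.log2(vocab_size)
    return math.floor(state_bits / bits_per_token)

# Hypothetical: 1024 state values at 16 bits each, 32768-token vocabulary.
print(max_copyable_tokens(state_dims=1024, bits_per_dim=16, vocab_size=32768))  # 1092
```

The bound does not move when the context gets longer, which is the point: attention's addressable memory grows with the input, while this ceiling is set once by the architecture.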

The second is state tracking, and it is deeper. Merrill, Petty, and Sabharwal showed that despite their recurrent appearance, SSMs cannot express computations outside the complexity class $TC^0$, the same ceiling that limits Transformers (Merrill et al., 2024, The Illusion of State in State-Space Models, arXiv:2404.08819). Tasks like tracking a chess game move by move, evaluating code, or following which entity a pronoun refers to across a long narrative are provably out of reach for a single linear SSM pass. The paper's title is the thesis: the "state" in a linear state-space model is, in the formal sense that matters, an illusion. A nonlinear RNN can track state; a linear SSM trades that ability away for the parallel scan that makes it fast.
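
To make "state tracking" concrete, the sketch below composes permutations of five items, assumed here as a representative example of the kind of task this line of work treats as hard. A sequential loop solves it trivially; the difficulty is that order matters at every step, so the updates cannot be reordered or averaged. Function names are hypothetical.

```python
def compose(arrangement, perm):
    """Apply one permutation (a tuple of source indices) to the current arrangement."""
    return tuple(arrangement[i] for i in perm)

def track(perms, n=5):
    """Follow the arrangement of n items through a sequence of permutations."""
    arrangement = tuple(range(n))
    for perm in perms:
        arrangement = compose(arrangement, perm)
    return arrangement

SWAP_01 = (1, 0, 2, 3, 4)   # exchange the first two items
ROTATE = (1, 2, 3, 4, 0)    # shift every item one place to the left

print(track([SWAP_01, ROTATE]))   # (0, 2, 3, 4, 1)
print(track([ROTATE, SWAP_01]))   # (2, 1, 3, 4, 0)
```

The two orders give different results, so the answer after many steps depends on the exact sequence, not just on how many swaps and rotations occurred.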

There are softer failure modes too. The selection mechanism adds overhead, so at short sequence lengths a Mamba layer can actually be slower than well-optimized attention: attention's quadratic cost is cheap when $L$ is small, and constant factors dominate. Hybrids inherit the engineering complexity of both worlds: two kinds of kernel to optimize, two memory profiles to reason about, and a mixing ratio that is itself a hyperparameter nobody can derive from first principles. And the duality that makes Mamba-2 elegant comes with a constraint, a scalar-times-identity structure on the state matrix, that trades some expressiveness for speed.

[IMAGE: A "needle in a haystack" retrieval heatmap comparing a pure Mamba model and a hybrid across context positions, showing the pure model's recall fading at depth while the hybrid stays uniform]

Alternative Designs

The hybrid is not the only response to attention's quadratic cost. It is the one that has shipped most broadly, but the alternatives clarify why.

Approach	Strengths	Weaknesses	Best when
Pure Transformer	Perfect recall, mature tooling, best raw quality per token	$O(L^2)$ training, $O(L)$ KV cache at decode	Context is short or memory is abundant
Pure Mamba / SSM	$O(1)$ decode memory and compute, linear training	Weak copying, no exact state tracking	Streaming, audio, genomics, very long low-recall inputs
Sequential hybrid (Jamba, Nemotron-H)	Most efficiency of Mamba, recall of a few attention layers	Two kernels to maintain, ratio is empirical	General-purpose long-context language models
Parallel hybrid (Falcon-H1)	Both heads see every token, tunable channel ratio	More complex mixer block, newer and less battle-tested	Squeezing maximum quality from a given parameter budget
Sparse / linear attention	Stays within the attention framework, drop-in for some stacks	Approximation can miss dependencies, uneven quality	Retrofitting existing Transformer training pipelines

The split between sequential and parallel hybrids is worth dwelling on. Jamba and Nemotron-H stack whole attention layers and whole Mamba layers in sequence, like alternating floors in a building. Falcon-H1 instead puts attention heads and Mamba-2 heads side by side within the same layer and concatenates their outputs, so every layer sees both views of the sequence. The sequential design is simpler to reason about and reuse; the parallel design gives finer control over how much of each mechanism the model spends its budget on, at the cost of a more intricate block.

How It Is Used in Practice

By 2025 the hybrid had moved from research curiosity to the default choice for open efficiency-oriented models. AI21 ships Jamba as a commercial long-context model. NVIDIA released the Nemotron-H family at 8B and 56B, then used a pruning-and-distillation technique they call MiniPuzzle to compress the 56B into a 47B variant that fits more comfortably for inference, and extended the line into reasoning-tuned models like Nemotron Nano 2 (NVIDIA, 2025, arXiv:2508.14444). TII's Falcon-H1 spans 0.5B to 34B with the parallel-head design and has been integrated into mainstream training frameworks.

The sequential hybrids share a recognizable block layout: a run of Mamba layers, a single attention layer to restore recall, and a feedforward or mixture-of-experts module, repeated up the stack.

graph TD
  IN[Token embeddings] --> B1
  subgraph Block["Repeated hybrid block"]
    B1["Mamba layer x N"] --> B2["Attention layer x 1"]
    B2 --> B3["MoE / FFN"]
  end
  B3 --> B1b["next block ..."]
  B1b --> NORM[Final norm]
  NORM --> LM[LM head]
  class IN blue
  class B1,B3 purple
  class B2 amber
  class B1b slate
  class NORM,LM teal
  classDef blue fill:#1e40af,stroke:#3b82f6,stroke-width:1px,color:#fff
  classDef purple fill:#6d28d9,stroke:#a78bfa,stroke-width:1px,color:#fff
  classDef teal fill:#0e7490,stroke:#22d3ee,stroke-width:1px,color:#fff
  classDef amber fill:#b45309,stroke:#fbbf24,stroke-width:1px,color:#fff
  classDef slate fill:#334155,stroke:#64748b,stroke-width:1px,color:#e2e8f0

The single amber attention layer per block is the one paying the KV-cache cost; everything else carries a fixed state. Falcon-H1 collapses this picture by fusing the attention and Mamba paths into one parallel mixer rather than separate layers.
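
The sequential layout is simple enough to write down as a list. The sketch below follows the diagram's ordering (Mamba layers first, then one attention layer) and omits the feedforward or mixture-of-experts modules; the exact placement of attention inside a block is an assumption, and `hybrid_layout` is a hypothetical helper.

```python
def hybrid_layout(n_blocks, mamba_per_block):
    """Sequential hybrid: each block is N Mamba layers followed by one attention layer."""
    layout = []
    for _ in range(n_blocks):
        layout.extend(["mamba"] * mamba_per_block)
        layout.append("attention")
    return layout

layout = hybrid_layout(n_blocks=4, mamba_per_block=7)
attention_layers = layout.count("attention")
print(len(layout), attention_layers, attention_layers / len(layout))  # 32 4 0.125
```

Four blocks of seven Mamba layers plus one attention layer give 32 sequence-mixing layers with 4 paying for a KV cache, the same one-in-eight split used in the worked memory example.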

The common thread is the workload. These are not models chasing the top of a short-context leaderboard; they are built for long documents, long agent sessions, and reasoning traces. That is precisely where the KV cache dominates the memory budget and where constant-state layers pay off most. An agent that accumulates a 200K-token scratchpad of tool calls and observations is the worst case for a pure Transformer and the best case for a hybrid.

Operationally, the constant-memory layers also make serving more predictable. A pure-Transformer serving system has to plan for cache memory that grows with every active conversation, which complicates batching and admission control. Hybrids flatten most of that curve, so a fixed pool of GPU memory holds far more concurrent long-context requests, and tail latency is less sensitive to how long any single conversation has run.

[IMAGE: Stacked bar comparing total GPU memory at 128K context for a pure Transformer versus a hybrid, broken into weights, KV cache, and SSM state, showing the KV cache band shrinking dramatically in the hybrid]

Insights Worth Remembering

The KV cache, not the parameter count, is the hidden variable in long-context economics. Once you internalize that decode memory grows with conversation length in a Transformer and stays flat in an SSM, the entire hybrid movement reads as inevitable.

Selectivity was the unlock, and it was a deliberate trade. Making the recurrence input-dependent is what let SSMs handle language, and it is also what cost them the convolution shortcut and forced the parallel-scan kernel engineering.

The duality between SSMs and attention is more than a theoretical curiosity. It is why a single model can interleave both kinds of layer without friction; they are computing variations on the same structured operation, one in linear form and one in quadratic form.

Pure architectures lost to mixtures because the two mechanisms fail in opposite directions. Attention has perfect recall and bad asymptotics; SSMs have great asymptotics and a hard recall ceiling. Making roughly one layer in ten an attention layer buys back most of the recall for a small fraction of the cost.

The limitations of SSMs are theorems, not bugs. Copying is bounded by state size and state tracking is bounded by $TC^0$. No amount of scaling makes a linear SSM track a chess game; you need attention, or nonlinearity, in the loop.

The winning ratio is empirical and surprisingly lopsided. Roughly one attention layer in eight to twelve is enough in practice, which tells you how little exact recall most layers actually need.

Open Questions

How much attention is the minimum? The ratios in shipped models cluster in a range, but whether the optimal fraction is stable across scale, or shrinks as models grow, is not settled. Current evidence suggests a small fraction suffices; whether it can approach zero for some workloads is an open empirical question.

Can the state grow smarter rather than larger? The copying limit comes from a fixed-size state. Whether mechanisms that adaptively expand or compress state, or that combine SSMs with explicit retrieval, can lift the ceiling without paying attention's full cost is active research. This is speculation on the field's trajectory, not a settled result.

Will reasoning workloads change the calculus? Models that emit very long chains of thought stress both recall (did the model remember step 12 at step 400?) and efficiency (the chain is the KV cache). Hybrids look well suited to this regime, and the reasoning-tuned Nemotron variants are an early bet on it, but whether the recall ceiling bites during long reasoning is not yet well characterized.

Does the parallel-head design generalize? Falcon-H1's choice to run attention and Mamba heads side by side is newer than the sequential stack. Whether it consistently outperforms sequential hybrids at a given budget, or simply offers a different point on the same tradeoff curve, will take more independent replication to know.