Orthogonalizing Memory Reads: A Muon Trick for Noisy Recurrent Recall
July 01, 2026
A transformer recalls almost for free. Attention lets every token look back at every earlier token, so retrieving a value you saw ten thousand positions ago is a lookup, not a feat of memory. A recurrent model has no such luxury. It has to fold the entire past into a fixed-size state and hope the thing it needs later survived the compression. That is the recall bottleneck, and it is the single clearest reason recurrent architectures keep losing to attention on tasks that hinge on remembering.
A recent experimental note by Ayush Tambde makes a narrow, testable claim about that bottleneck. Take the strongest recurrent recall architecture available, the mLSTM from the xLSTM family, and at the moment it reads from its matrix memory, orthogonalize that memory first, using the same Newton-Schulz iteration that the Muon optimizer uses to orthogonalize gradient updates. Do not write the orthogonalized matrix back into the state; only reads see it. The report finds that this sharpens noisy associative recall, and that the gains are largest exactly where the plain mLSTM is falling apart (Tambde, 2026, Matrix Orthogonalization Improves Memory in Recurrent Models).
This piece does two things. It explains the mechanism: why an orthogonalization step at read time could help a memory matrix retrieve better. Then it checks the claim: every citation, benchmark, and code artifact the note leans on, against the primary literature and the public repository. The short version is that the sourcing is clean and the result is real, but narrow, and the author is refreshingly upfront about the narrowness.
Why this matters: If a cheap tweak confined to the read path can close part of the recall gap between recurrent models and attention, it matters most in the regimes where attention is unaffordable: very long sequences, on-device inference, and long-horizon reinforcement learning where a world model has to remember across thousands of steps. That is a big "if," and the honest scope of this result is exactly what decides whether the "if" holds.
TL;DR
- The mLSTM stores its memory as a full matrix, updated by an outer-product ("covariance") rule, and reads from it by multiplying the matrix with a query. It is currently the strongest recurrent variant on associative recall.
- The intervention: before the read, replace the memory matrix \(C\) with its nearest orthogonal matrix \(\mathrm{Ortho}(C)\), computed by five Newton-Schulz iterations. Gradients flow through the step, but the raw \(C\) is what gets carried to the next timestep. Only reads are orthogonalized.
- The motivation is an analogy to Muon: orthogonalization flattens the singular-value spectrum, so weak directions stop being drowned out by dominant ones. In a memory matrix that means rare stored associations get a fairer hearing at retrieval.
- On the MAD noisy in-context recall task at frac_noise = 0.8, the orthogonalized read wins in every regime tested, and the margin widens as the task gets harder. In the two hardest settings the baseline solves 4 of 24 seeds while the orthogonalized variant solves 14 to 16.
- Every external claim checks out: the xLSTM description, the MQAR and MAD benchmarks, the Muon "tail-end associative memory" result, and a runnable public repo with matching CLI flags.
- The limits are real and stated: models around 78 thousand to 81 thousand parameters, one synthetic benchmark family, extra read-time compute the note does not quantify, and writing the orthogonalized matrix back into the state actually hurt.
At a Glance
The whole idea is a single detour inserted into the read path. The write path is the ordinary mLSTM update. When the model answers a query, it orthogonalizes a scratch copy of the memory, reads from that copy, and discards it.
flowchart LR
K[Key and value] --> W[Outer-product write<br/>update matrix memory C]
W --> C[Raw memory C carried<br/>to next timestep]
Q[Query q] --> R{Read time}
C -. scratch copy .-> O[Orthogonalize<br/>5 Newton-Schulz steps]
O --> RD[Readout uses Ortho C]
R --> RD
RD --> H[Output h]
class K,Q blue
class W,O purple
class C slate
class RD,H teal
classDef blue fill:#1e40af,stroke:#3b82f6,stroke-width:1px,color:#fff
classDef purple fill:#6d28d9,stroke:#a78bfa,stroke-width:1px,color:#fff
classDef teal fill:#0e7490,stroke:#22d3ee,stroke-width:1px,color:#fff
classDef slate fill:#334155,stroke:#64748b,stroke-width:1px,color:#e2e8f0
The asymmetry is the point. The state that persists is untouched, so the recurrence is unchanged; only the lens through which each read looks at that state is orthogonalized.
[IMAGE: Two-panel schematic of one timestep. Left panel: the write path, a key-value outer product added to the matrix C. Right panel: the read path, C copied aside, passed through a Newton-Schulz block, then multiplied by the query. Annotate that the arrow from the orthogonalized copy back to C is deliberately absent.]
Before: why recurrent models struggle to recall
Associative recall is the task of storing key-value pairs and later retrieving the value for a given key. It sounds trivial, and for attention it nearly is. The difficulty for recurrent models is structural: a fixed-size state is a fixed budget, and every new pair written into it competes for the same capacity.
The benchmark that made this precise is MQAR, multi-query associative recall, introduced in the "Zoology" paper (Arora et al., 2023, Zoology: Measuring and Improving Recall in Efficient Language Models, arXiv:2312.04927). Zoology found that most of the quality gap between attention and gated-convolution models, about 82 percent of it, traced to associative recall, and that a small attention model could beat a much larger recurrent one on this one capability. MQAR formalized the task as retrieving multiple values at multiple positions in a sequence, which is closer to what real language demands than a single lookup.
timeline
title From scalar cells to orthogonalized reads
1997 : LSTM introduces gated scalar memory
2020 : Longformer and BigBird approximate attention for long context
2023 : Zoology defines MQAR and quantifies the recall gap
2024 : xLSTM introduces mLSTM matrix memory and covariance update
2024 : MAD proposes noisy in-context recall as a scaling proxy
2024 : Muon orthogonalizes optimizer updates via Newton-Schulz
2026 : Read-time orthogonalization applied to the mLSTM memory
The mLSTM, from the xLSTM paper, is the response that matters here (Beck, Poppel, et al., 2024, xLSTM: Extended Long Short-Term Memory, arXiv:2405.04517). Instead of a classic LSTM's scalar memory cell, the mLSTM keeps a full matrix as its memory and updates it with a covariance rule inspired by Bidirectional Associative Memories, storing key-value bindings as outer products. That larger, structured memory is why the mLSTM is currently the strongest recurrent architecture on associative recall, and why it is the natural target for anyone trying to push recurrent recall further.
But clean recall is not the whole story. In a setting with noisy transitions, distractor tokens interleaved among the real pairs, a model also has to ignore what does not matter while still retrieving what does. That is a distinct skill from raw capacity, and it is where the note focuses.
How orthogonalized reads actually work
The mLSTM memory, briefly
The mLSTM's memory at timestep \(t\) is a matrix \(C_t\). Each new key-value pair is written by an outer product, gated by an input gate \(i_t\) and a forget gate \(f_t\):
\[C_t = f_t\, C_{t-1} + i_t\, v_t k_t^\top\]
A separate normalizer state tracks the accumulated key mass:
\[n_t = f_t\, n_{t-1} + i_t\, k_t\]
To read with a query \(q_t\), the model multiplies the memory by the query and divides by the normalizer to keep the scale bounded:
\[h_t = \frac{C_t\, q_t}{\max\!\left(\lvert n_t^\top q_t \rvert,\ 1\right)}\]
The matrix \(C_t\) is where every association lives, superimposed. Retrieval is a matrix-vector product, and the quality of a read depends on how cleanly the query can pull its intended value out of that superposition without the other stored pairs bleeding in.
The cell has a clean anatomy: two gated inputs build the state, and two paths read it. The orthogonalization taps only the read path.
graph TD
KV[Key and value inputs] --> WR[Outer-product write]
IG[Input and forget gates] --> WR
WR --> MEM[Matrix memory C]
IG --> NRM[Normalizer state n]
MEM --> TAP[Orthogonalize tap<br/>read path only]
QRY[Query q] --> TAP
NRM --> DIV[Scale by normalizer]
TAP --> DIV
DIV --> OUT[Output h]
MEM --> NEXT[Carried to next step<br/>raw, untapped]
class KV,QRY,IG blue
class WR,TAP purple
class MEM,NRM,NEXT slate
class DIV,OUT teal
classDef blue fill:#1e40af,stroke:#3b82f6,stroke-width:1px,color:#fff
classDef purple fill:#6d28d9,stroke:#a78bfa,stroke-width:1px,color:#fff
classDef teal fill:#0e7490,stroke:#22d3ee,stroke-width:1px,color:#fff
classDef slate fill:#334155,stroke:#64748b,stroke-width:1px,color:#e2e8f0
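The write and read equations above can be sketched in a few lines of NumPy. This is a minimal illustration, not the note's code: the function names are hypothetical, the gates are plain scalars fixed to 1, and the exponential gating, stabilizer state, and key scaling from the xLSTM paper are left out.

```python
import numpy as np

def mlstm_write(C, n, k, v, i_gate, f_gate):
    """Covariance-rule write: C_t = f*C_{t-1} + i*v k^T and n_t = f*n_{t-1} + i*k."""
    C_new = f_gate * C + i_gate * np.outer(v, k)
    n_new = f_gate * n + i_gate * k
    return C_new, n_new

def mlstm_read(C, n, q):
    """Read: h = C q / max(|n^T q|, 1)."""
    return (C @ q) / max(abs(n @ q), 1.0)

d = 4
C = np.zeros((d, d))
n = np.zeros(d)

# Two orthonormal keys bound to two arbitrary values (illustrative numbers).
k1, k2 = np.eye(d)[0], np.eye(d)[1]
v1 = np.array([1.0, 2.0, 3.0, 4.0])
v2 = np.array([-1.0, 0.0, 1.0, 0.0])

C, n = mlstm_write(C, n, k1, v1, i_gate=1.0, f_gate=1.0)
C, n = mlstm_write(C, n, k2, v2, i_gate=1.0, f_gate=1.0)

# Querying with k1 returns v1 exactly because the keys do not overlap.
h = mlstm_read(C, n, k1)
```

With orthonormal keys the superposition separates cleanly. The interesting cases are the ones where keys overlap or some pairs are written much more strongly than others.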
The intervention
The note changes exactly one thing. At read time, it replaces \(C_t\) with its nearest orthogonal matrix before the query multiply:
\[\tilde h_t = \frac{\mathrm{Ortho}(C_t)\, q_t}{\max\!\left(\lvert n_t^\top q_t \rvert,\ 1\right)}\]
where, for a singular value decomposition \(C_t = U S V^\top\), the nearest semi-orthogonal matrix is
\[\mathrm{Ortho}(C_t) = U V^\top.\]
Three details make this an intervention rather than a rewrite of the architecture. First, gradients flow through the orthogonalization, so the model trains with the operation in the loop and adapts its weights to it. Second, the orthogonalized matrix is never written back: \(C_t\), not \(\mathrm{Ortho}(C_t)\), is what enters the recurrence for \(C_{t+1}\). Third, the note reports that writing it back actually hurt performance, which is why the read-only design is not arbitrary.
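As a reference point, the orthogonalized read can be written directly from the definition \(\mathrm{Ortho}(C) = U V^\top\). The sketch below uses an explicit SVD, which is the slow path the note avoids, and the helper names are hypothetical. Its only purpose is to show that the read sees a transformed copy while the stored matrix is left alone.

```python
import numpy as np

def ortho_svd(C):
    """Nearest semi-orthogonal matrix U V^T, computed with an explicit SVD."""
    U, _, Vt = np.linalg.svd(C, full_matrices=False)
    return U @ Vt

def read_orthogonalized(C, n, q):
    """Read through an orthogonalized scratch copy; C itself is never modified."""
    C_read = ortho_svd(C)
    return (C_read @ q) / max(abs(n @ q), 1.0)

# Random stand-ins for a memory, normalizer, and query (illustrative only).
rng = np.random.default_rng(0)
C = rng.normal(size=(4, 4))
n = rng.normal(size=4)
q = rng.normal(size=4)

C_before = C.copy()
h_tilde = read_orthogonalized(C, n, q)
```

After the call, `C` is bit-for-bit what it was before, which is the read-only property in code form.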
Why not compute the SVD
Computing an SVD at every read for every layer would be far too slow on a GPU, and unstable to differentiate through. The Muon literature already offers a workaround: you do not need \(U\) and \(V\) separately, only their product \(U V^\top\), and a Newton-Schulz iteration approximates that product with matrix multiplications alone. The note normalizes \(C_t\) by its Frobenius norm (with eps = 1e-6) so its singular values start inside a convergent range, then applies five iterations of a quintic map that drives every singular value toward 1 while leaving the singular vectors fixed. Five steps is the same default the Muon optimizer settled on for transformer training (Jordan et al., 2024, Muon: An optimizer for hidden layers). If you want the full derivation of that iteration and its coefficients, our companion piece on the Muon optimizer walks through it line by line.
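A minimal sketch of that iteration follows. It assumes the note reuses the quintic coefficients published in the Muon writeup (3.4445, -4.7750, 2.0315), which this piece has not confirmed against the repository, and the function name and test matrix are hypothetical.

```python
import numpy as np

# Quintic coefficients from the Muon writeup; assumed (not confirmed) to match the note.
A_COEF, B_COEF, C_COEF = 3.4445, -4.7750, 2.0315

def newton_schulz_ortho(C, steps=5, eps=1e-6):
    """Approximate U V^T for C = U S V^T using only matrix multiplications."""
    X = C / (np.linalg.norm(C) + eps)   # Frobenius norm puts singular values in [0, 1]
    for _ in range(steps):
        A = X @ X.T
        B = B_COEF * A + C_COEF * (A @ A)
        X = A_COEF * X + B @ X          # maps each singular value s to a*s + b*s^3 + c*s^5
    return X

# Hypothetical 4x4 memory with a known, steeply decaying spectrum.
rng = np.random.default_rng(0)
U, _ = np.linalg.qr(rng.normal(size=(4, 4)))
V, _ = np.linalg.qr(rng.normal(size=(4, 4)))
C = U @ np.diag([4.0, 0.9, 0.15, 0.05]) @ V.T

X = newton_schulz_ortho(C)
flattened = np.linalg.svd(X, compute_uv=False)   # singular values after five steps
in_basis = U.T @ X @ V                           # diagonal if singular vectors are unchanged
print(np.round(flattened, 2))
```

With these coefficients the result is approximately, not exactly, orthogonal: the singular values end up in a band around 1 rather than at 1.0, a trade the Muon writeup makes deliberately in exchange for fast growth of small singular values.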
Why flattening the spectrum could help a memory
Orthogonalization sets every singular value of the matrix to 1. In an optimizer, that stops a handful of dominant gradient directions from swamping the weaker ones. The analogy the note draws is that a memory matrix has the same pathology: after many outer-product writes, a few associations, the ones written with large gates or repeated often, dominate the spectrum, and a query aimed at a rare association has to fight through them. Equalizing the spectrum gives the rare directions the same footing at read time, so the intended value comes out cleaner.
The note grounds this analogy in a specific, recent theoretical result: Muon's advantage over Adam is concentrated in the associative-memory-like parameters of a transformer, precisely because a more isotropic (flatter) singular spectrum helps rare, "tail" associations get learned rather than crowded out (Wang, Zhang, et al., 2025, Muon Outperforms Adam in Tail-End Associative Memory Learning, arXiv:2509.26030). Applying that same equalizing logic to an actual memory matrix, rather than to an optimizer's update, is a clean extrapolation. It is worth being clear that it is an extrapolation: the note motivates the mechanism this way but does not, on its own, prove that this is why the gains appear.
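A toy calculation makes the "fairer hearing" intuition concrete. It is not the note's experiment: the keys, values, write strengths, and the 10 percent query leakage are all invented for illustration, and the exact SVD stands in for Newton-Schulz.

```python
import numpy as np

def ortho_svd(C):
    """Exact U V^T via SVD, standing in for the Newton-Schulz approximation."""
    U, _, Vt = np.linalg.svd(C, full_matrices=False)
    return U @ Vt

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

d = 4
keys = np.eye(d)             # one orthonormal key per association
values = np.eye(d)[::-1]     # one orthonormal value per association
strengths = np.array([4.0, 0.9, 0.15, 0.05])   # how strongly each pair was written

# Memory as a weighted superposition of outer products.
C = sum(s * np.outer(v, k) for s, v, k in zip(strengths, values, keys))

# Query for the rarest key, contaminated by a little of the dominant key.
q = keys[3] + 0.1 * keys[0]

cos_raw = cosine(C @ q, values[3])               # read from the raw memory
cos_flat = cosine(ortho_svd(C) @ q, values[3])   # read from the flattened memory
```

In the raw read the small leakage onto the dominant key is multiplied by 4.0 and swamps the 0.05 signal, so the output points mostly at the wrong value. After flattening, the leakage stays at 10 percent of the signal. The same equalization would also amplify any weak direction that happens to be noise, which is one reason the mechanism needs measuring rather than assuming.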
[IMAGE: Two side-by-side bar charts of a memory matrix's singular-value spectrum. Left, the raw memory, a steep decay dominated by two or three tall bars. Right, after orthogonalization, a flat row of bars at 1.0. Annotate the rare-association bars on the left as "buried" and on the right as "recovered."]
Seeing It in Motion
Two views make the read-time detour concrete. The first is the sequence of operations within a single timestep, showing where the orthogonalization slots in and where it deliberately does not.
sequenceDiagram
participant In as Input token
participant Mem as Matrix memory C
participant Ortho as Newton-Schulz block
participant Out as Readout
In->>Mem: write key-value outer product
Note over Mem: C updated by covariance rule
In->>Ortho: read query q arrives
Mem->>Ortho: copy C for this read only
Ortho->>Ortho: normalize then 5 iterations
Ortho->>Out: Ortho(C) times q
Out->>In: emit output h
Note over Mem: raw C carried forward, untouched
The second view is the decision the note actually tested: read-only orthogonalization helps, but the tempting extension of writing the clean matrix back into the state does not.
flowchart TD
S[Orthogonalized matrix<br/>available at read] --> D{Write it back<br/>into the state?}
D -->|Read only| A[Recurrence unchanged<br/>reads see a flat spectrum]
D -->|Write back| B[State overwritten each step<br/>outer-product history lost]
A --> G[Reported gains on<br/>noisy recall]
B --> R[Reported to hurt<br/>performance]
class S blue
class D slate
class A,G emerald
class B,R rose
classDef blue fill:#1e40af,stroke:#3b82f6,stroke-width:1px,color:#fff
classDef slate fill:#334155,stroke:#64748b,stroke-width:1px,color:#e2e8f0
classDef emerald fill:#047857,stroke:#34d399,stroke-width:1px,color:#fff
classDef rose fill:#be123c,stroke:#fb7185,stroke-width:1px,color:#fff
A plausible intuition for why write-back hurts, offered here as a reading rather than something the note tests: the raw matrix is a running superposition of every association ever stored, and overwriting it with an orthogonalized version each step throws away the magnitude information that encodes how much of each pair is present. Orthogonalization is useful as a lens, not as a replacement for the record itself.
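The difference between the two branches shows up in what the state looks like after a single step. The sketch below is an illustration under simplifying assumptions (gates set to 1, normalizer omitted, exact SVD instead of Newton-Schulz, hypothetical function names), not a reproduction of the note's write-back experiment.

```python
import numpy as np

def ortho_svd(C):
    U, _, Vt = np.linalg.svd(C, full_matrices=False)
    return U @ Vt

def step(C, k, v, q, write_back):
    """Write one pair, read through an orthogonalized copy, return (h, next state)."""
    C = C + np.outer(v, k)              # gates omitted (both treated as 1)
    C_read = ortho_svd(C)
    h = C_read @ q                      # normalizer omitted for brevity
    C_next = C_read if write_back else C
    return h, C_next

# Hypothetical starting memory with an uneven spectrum, plus one new pair.
C0 = np.diag([4.0, 0.9, 0.15, 0.05])
k = v = np.array([0.0, 0.0, 0.0, 1.0])
q = k

h_read_only, C_read_only = step(C0, k, v, q, write_back=False)
h_written, C_written = step(C0, k, v, q, write_back=True)

sv_read_only = np.linalg.svd(C_read_only, compute_uv=False)
sv_written = np.linalg.svd(C_written, compute_uv=False)
```

Both variants emit the same output at this timestep. What differs is the state handed to the next one: the read-only state keeps its uneven spectrum (4.0, 1.05, 0.9, 0.15), while the written-back state has every singular value at 1, so the record of how strongly each pair was stored is gone.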
By the Numbers
The test bed is the noisy in-context recall task from MAD (Mechanistic Architecture Design), a suite of cheap synthetic tasks built to proxy compute-optimal scaling behavior (Poli et al., 2024, Mechanistic Design and Scaling of Hybrid Architectures, arXiv:2403.17844). MAD's noisy recall inserts distractor tokens from a separate vocabulary among the key-value pairs, so the model must both retrieve and ignore. The reported setup:
- Task: MAD noisy recall at frac_noise = 0.8, over a grid of vocabulary sizes (80, 96) and sequence lengths (512 to 1024).
- Training: AdamW (\(\beta = 0.9/0.999\), weight decay 0.01), 2,000 steps, batch size 64, learning rate swept over {3e-4, 1e-3, 3e-3, 1e-2}.
- Orthogonalization: Frobenius normalization (eps = 1e-6), 5 Newton-Schulz iterations, gradients flowing through, memory not written back.
- Evaluation: 24 seeds per regime, final validation accuracy at step 2,000, paired deltas by seed.
- Model size: 77,716 parameters at vocab 80, 80,740 at vocab 96. Genuinely tiny.
| Regime | Orthogonalized | Baseline | Delta |
|---|---|---|---|
| vocab 80, len 512 | 87.5 (20/24 seeds) | 69.1 (17/24) | +18.4 |
| vocab 80, len 768 | 91.7 (22/24) | 75.9 (13/24) | +15.7 |
| vocab 80, len 1024 | 98.5 (23/24) | 83.3 (19/24) | +15.2 |
| vocab 96, len 768 | 62.4 (14/24) | 22.0 (4/24) | +40.4 |
| vocab 96, len 1024 | 68.5 (16/24) | 23.1 (4/24) | +45.4 |
(Accuracies are mean final validation accuracy; parentheses show seeds clearing 80 percent. The source reports 95 percent confidence intervals over 24 seeds; they are wide, on the order of plus or minus 12 to 18 points, which is expected at this model size.)
The pattern is the headline. Orthogonalization wins in every regime, and the margin widens as the task gets harder. At the easy end (vocab 80), both variants mostly work and the gap is a solid but modest 15 to 18 points. At the hard end (vocab 96), the baseline mLSTM has essentially collapsed, 4 of 24 seeds clearing 80 percent, while the orthogonalized variant is still solving 14 to 16 of 24, roughly 58 to 67 percent of seeds. That is consistent with the "equalize the weak directions" story: the intervention helps most exactly where raw capacity is running out.
Two cautions on reading this table. The confidence intervals are wide enough that individual cell values should be treated as directional, not precise; the robust signal is the sign and the trend, not the second digit. And the deltas are paired by seed, which is the right way to measure a per-seed intervention, but it is still one architecture on one task family from one author.
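The Delta column and the seed counts can be re-derived from the table as printed. The snippet below only re-does that arithmetic on the rounded figures shown above. It does not touch the per-seed data, which is not reproduced in this piece.

```python
SEEDS_PER_REGIME = 24

# (regime, orthogonalized mean, orthogonalized seeds >= 80%, baseline mean, baseline seeds >= 80%)
REPORTED = [
    ("vocab 80, len 512", 87.5, 20, 69.1, 17),
    ("vocab 80, len 768", 91.7, 22, 75.9, 13),
    ("vocab 80, len 1024", 98.5, 23, 83.3, 19),
    ("vocab 96, len 768", 62.4, 14, 22.0, 4),
    ("vocab 96, len 1024", 68.5, 16, 23.1, 4),
]

def summarize(rows, seeds=SEEDS_PER_REGIME):
    """Recompute the accuracy delta and the fraction of seeds solved per regime."""
    out = []
    for regime, ortho_acc, ortho_solved, base_acc, base_solved in rows:
        out.append({
            "regime": regime,
            "delta": round(ortho_acc - base_acc, 1),
            "ortho_solve_rate": ortho_solved / seeds,
            "base_solve_rate": base_solved / seeds,
        })
    return out

summary = summarize(REPORTED)
deltas = [row["delta"] for row in summary]

for row in summary:
    print(f'{row["regime"]}: delta {row["delta"]:+.1f}, '
          f'solved {row["ortho_solve_rate"]:.0%} vs {row["base_solve_rate"]:.0%}')
```

Four of the five deltas match the table exactly. The vocab 80, len 768 row comes out as 15.8 from the rounded means against the reported +15.7, which is what you would expect if the source computed its deltas before rounding.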
[IMAGE: Grouped bar chart, five regimes on the x-axis, mean accuracy on the y-axis, orthogonalized versus baseline bars side by side with 95 percent CI whiskers. Annotate the vocab-96 pair where the baseline bars collapse toward 20 percent while the orthogonalized bars hold near 65 percent.]
[IMAGE: Scatter of per-seed paired deltas (orthogonalized minus baseline) across all 24 seeds for the vocab-96 len-1024 regime, with the mean +45.4 line drawn and the handful of negative-delta seeds highlighted to show the gain is not universal across seeds.]
Claim-by-claim validation
Because this is a claim about a claim, the sourcing deserves an explicit audit. Each row below was checked against the primary source, not the note's paraphrase of it.
| Claim in the note | Check performed | Result |
|---|---|---|
| mLSTM keeps a matrix memory with a covariance (outer-product) update and is a strong recurrent recall model | xLSTM paper, arXiv:2405.04517 | Confirmed. The paper describes mLSTM's matrix memory and covariance update rule, inspired by associative memories. |
| MQAR is the standard multi-query associative recall benchmark | Zoology paper, arXiv:2312.04927 | Confirmed. MQAR is defined there and is accurately described as a multi-key-value recall formalism. |
| MAD includes a noisy in-context recall task with distractor tokens | MAD paper, arXiv:2403.17844 | Confirmed. Noisy in-context recall is a standard MAD task that inserts tokens from a separate vocabulary. |
| Muon's edge is concentrated in tail-end associative memory via a flatter spectrum | Wang et al., arXiv:2509.26030 | Confirmed. That is the paper's thesis, not a stretch of it (VO and FFN blocks, heavy-tailed classes). The paper was accepted to ICLR 2026. |
| Newton-Schulz gives the nearest orthogonal matrix cheaply, as in Muon | Muon writeup, kellerjordan.github.io | Confirmed. Five iterations is Muon's own default; the method computes \(UV^\top\) without an SVD. |
| Fully reproducible code exists | GitHub repo at2005/mlstm-orthogonalize | Confirmed. The repo has a baseline runner (mad_mlstm_baseline.py) and an NS5 orthogonalized runner (xlstm_ns_mad.py), with CLI flags (--frac-noise, --steps, --batch-size, --lr) matching the stated setup. |
One minor correction worth noting: the note's prose rounds the model size to "roughly 78 thousand to 81 thousand" parameters; the exact figures from the code are 77,716 and 80,740. That is a rounding, not a misrepresentation.
No citation in the note was found to misrepresent its source. The one place to apply judgment is the causal story, that orthogonalization works because it equalizes weak memory directions. That is a plausible, well-motivated hypothesis, but the note does not run the ablations that would confirm the mechanism rather than the outcome: varying the Newton-Schulz iteration count, or directly measuring the memory matrix's singular spectrum before and after the step. The empirical result stands regardless of whether that specific explanation is the right one.
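The missing ablation is cheap to sketch. The snippet below sweeps the Newton-Schulz iteration count on a single hand-built matrix and records how uneven the spectrum remains, using the ratio of largest to smallest singular value. The coefficients are the ones published in the Muon writeup, assumed here rather than confirmed from the repository, and a real ablation would apply the same measurement to memory matrices captured from a trained model.

```python
import numpy as np

COEF = (3.4445, -4.7750, 2.0315)   # Muon's published quintic coefficients (assumed here)

def newton_schulz(C, steps, eps=1e-6):
    a, b, c = COEF
    X = C / (np.linalg.norm(C) + eps)          # Frobenius normalization
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * (A @ A)) @ X
    return X

def spectrum_spread(M):
    """Largest singular value divided by the smallest: 1.0 means perfectly flat."""
    s = np.linalg.svd(M, compute_uv=False)
    return float(s.max() / s.min())

# Hypothetical memory with an 80x spread between strongest and weakest direction.
rng = np.random.default_rng(0)
U, _ = np.linalg.qr(rng.normal(size=(4, 4)))
V, _ = np.linalg.qr(rng.normal(size=(4, 4)))
C = U @ np.diag([4.0, 0.9, 0.15, 0.05]) @ V.T

spread_by_steps = {k: spectrum_spread(newton_schulz(C, k)) for k in range(9)}
for k, spread in spread_by_steps.items():
    print(f"{k} steps: spread {spread:.2f}")
```

If recall accuracy tracked this spread as the iteration count varied, that would be direct evidence for the spectral explanation. If accuracy moved while the spread stayed put, the explanation would need rethinking.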
A Concrete Example
Walk one read through the orthogonalization by hand. Take a memory matrix \(C\) whose singular-value spectrum, after many writes, is dominated by one association:
\[S = \mathrm{diag}(4.0,\ 0.9,\ 0.15,\ 0.05).\]
The largest stored direction is 80 times the smallest. A query aimed at that smallest, rarest association, say a key-value pair seen once, early, among noise, produces a readout in which the intended value contributes about \(0.05\) worth of signal against a \(4.0\) competitor. The right value is technically present but effectively buried.
Now orthogonalize. First normalize by the Frobenius norm, \(\lVert C \rVert_F = \sqrt{4.0^2 + 0.9^2 + 0.15^2 + 0.05^2} \approx 4.10\), giving scaled singular values \((0.976,\ 0.220,\ 0.037,\ 0.012)\), all inside \([0,1]\). Then push each through the quintic Newton-Schulz map five times. The dominant value is already near 1 and stays there; the tiny values climb steeply toward 1:
| Iteration | s = 0.012 | s = 0.220 | s = 0.976 |
|---|---|---|---|
| start | 0.012 | 0.220 | 0.976 |
| after 1 | 0.041 | 0.700 | 0.991 |
| after 2 | 0.140 | 0.980 | 1.000 |
| after 3 | 0.440 | 0.999 | 1.000 |
| after 5 | ~0.98 | ~1.00 | ~1.00 |
After five steps all four singular values sit near 1, so the readout approximates \(\mathrm{Ortho}(C) = U V^\top\). The table is illustrative: Muon's published coefficients are tuned for speed rather than exact convergence, so in practice the values settle in a band around 1 rather than at exactly 1.0. The rare association that contributed \(0.05\) against \(4.0\) now contributes on roughly equal footing with the dominant one. The query can pull its intended value out of the superposition instead of losing it under the loudest stored pair. Then the crucial bookkeeping: this flattened matrix is used only to compute this timestep's output. The next write applies to the original \(C\) with its \(4.0\)-to-\(0.05\) spectrum intact, so no magnitude information is destroyed.
That is the mechanism in miniature. The danger for recall is not the average association but the spread across the spectrum, and flattening the spectrum at read time is what removes it, at least on this task and at this scale.
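The same walk-through can be run on the singular values alone, since Newton-Schulz acts on each one independently. The sketch below assumes the quintic coefficients from the Muon writeup, which may not be the exact ones behind the table above, so its iterates should be read as one concrete instance of the map rather than a reproduction of those numbers.

```python
import math

# Quintic coefficients from the Muon writeup (an assumption about the note's setup).
A, B, C5 = 3.4445, -4.7750, 2.0315

def quintic(s):
    """One Newton-Schulz step as it acts on a single singular value."""
    return A * s + B * s**3 + C5 * s**5

singular_values = [4.0, 0.9, 0.15, 0.05]

# Frobenius norm of a matrix equals the root-sum-of-squares of its singular values.
fro = math.sqrt(sum(s * s for s in singular_values))
scaled = [s / fro for s in singular_values]

# Track each scaled singular value through five iterations.
trajectories = []
for s in scaled:
    path = [s]
    for _ in range(5):
        path.append(quintic(path[-1]))
    trajectories.append(path)

for s0, path in zip(singular_values, trajectories):
    print(s0, [round(x, 3) for x in path])
```

Under these coefficients small values grow by a factor of about 3.4 per step, which is why the rare direction catches up within five iterations. Values near 1 overshoot and oscillate inside a band around 1 instead of converging to it exactly.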
[IMAGE: Line plot of the three tracked singular values across five Newton-Schulz iterations, all converging to 1.0, with the smallest value (0.012) shown climbing steeply from the floor and annotated as "the rare association, recovered."]
Where It Breaks
The note is candid about its limits, and they are worth stating plainly rather than glossing.
Small-model regime only. The models are around 78 thousand to 81 thousand parameters, orders of magnitude below anything deployed. Whether the effect holds, shrinks, or grows at the scale of real language or RL models is simply untested here. Interventions that help tiny models on synthetic tasks have a long history of washing out at scale, and nothing in the note rules that out.
Purely synthetic task. MAD noisy recall is a proxy designed to correlate with downstream performance, not a downstream benchmark. The author explicitly flags that it is unclear whether the gains translate to real tasks. MAD's own justification is that its synthetics correlate with compute-optimal perplexity, which supports using it as a probe but does not promise transfer of any single intervention.
Unquantified compute cost. Five Newton-Schulz iterations per read add roughly fifteen matrix multiplications on the memory matrix (about three per quintic iteration in Muon's formulation), at every timestep, on top of the normal recurrence. The note does not report the wall-clock or FLOP overhead, so a reader cannot yet weigh the accuracy gain against the throughput cost. For an idea whose main appeal is efficiency in long-sequence regimes, that missing number is the one that matters most.
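A back-of-envelope estimate gives a sense of scale, though it is no substitute for a measurement. The sketch assumes a square \(d \times d\) memory, three matrix multiplications per quintic iteration as in Muon's formulation, and \(2d^3\) FLOPs per multiplication. The function names and the head dimension of 64 are hypothetical, and real kernels, batching, and chunked training would all change the picture.

```python
def ns_flops(d, steps=5, matmuls_per_step=3):
    """Approximate FLOPs for Newton-Schulz on a d x d matrix (2*d^3 per matmul)."""
    return steps * matmuls_per_step * 2 * d**3

def read_flops(d):
    """Approximate FLOPs for the plain read C @ q (normalizer ignored)."""
    return 2 * d**2

def overhead_ratio(d, steps=5):
    """How many plain reads' worth of arithmetic one orthogonalization costs."""
    return ns_flops(d, steps) / read_flops(d)

d = 64   # hypothetical head dimension, not taken from the note
print(ns_flops(d), read_flops(d), overhead_ratio(d))
```

Under these assumptions the ratio works out to \(15d\), so the orthogonalization dominates the arithmetic of a read by a wide margin at any realistic width. Whether that matters in wall-clock terms is exactly what the note leaves unmeasured.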
[IMAGE: A line plot with Newton-Schulz iteration count (0 to 8) on the x-axis and two y-axes, one for recall accuracy and one for per-read FLOP overhead. Sketch accuracy rising and saturating by about 5 iterations while overhead climbs linearly, with the "5 iterations" default marked and the crossover region shaded as the unmeasured tradeoff the note leaves open.]
One architecture, one intervention point. Only the mLSTM is tested, and only its reads are orthogonalized. Write-back was tried and hurt, but the note does not explore why, nor whether the same trick transfers to other matrix-memory or linear-attention architectures where it might behave differently.
Wide error bars. With 24 seeds and confidence intervals spanning 12 to 18 points, the per-cell numbers are noisy. The trend across regimes is the trustworthy signal; any single accuracy figure is not.
Alternative Designs
Read-time orthogonalization is one of several ways to attack recurrent recall. It is useful to place it against the neighbours rather than judge it in isolation.
| Approach | Core idea | Strengths | Weaknesses | Best when |
|---|---|---|---|---|
| Plain mLSTM read | Matrix-vector product against raw memory | Cheap, no extra ops, well characterized | Dominant associations bury rare ones under noise | Clean recall, modest sequence lengths |
| Orthogonalized read (this note) | Flatten the memory spectrum at read time only | Large gains where baseline fails, recurrence left unchanged, gradients flow through | Unquantified read cost, synthetic evidence, tiny models | Noisy recall where rare associations matter |
| Write-back orthogonalization | Store the orthogonalized matrix as the new state | Conceptually tidy | Reported to hurt, destroys magnitude history | Not recommended by the note |
| Query-key normalization | Normalize queries and keys, not the memory | Standard, cheap, stabilizes retrieval | Does not equalize the stored spectrum itself | General retrieval stabilization |
| More attention-like memory (DeltaNet, RWKV-7) | Richer state update rules | Strong recall, actively developed | Different architecture, not a drop-in read tweak | Building a new recurrent model from scratch |
The closest conceptual cousin is not another memory trick but the Muon optimizer itself, which is the source of both the mathematics and the intuition. Muon orthogonalizes gradient updates to stop a few directions from dominating optimization; this note orthogonalizes a memory matrix to stop a few associations from dominating retrieval. Whether the analogy is load-bearing or merely suggestive is exactly the open question the ablations would settle.
[IMAGE: A two-axis positioning chart. X-axis "where the fix acts" from optimizer to memory state; y-axis "evidence maturity" from synthetic probe to trillion-parameter production. Place Muon/MuonClip top-left, read-time orthogonalization bottom-right, query-key normalization mid-left, and DeltaNet/RWKV-7 mid-right, so the maturity gap of the read-time trick is visible at a glance.]
How It Is Used in Practice
It is not, yet, and the note does not pretend otherwise. This is a research probe, not a production technique, and treating it as more would be the error the piece is careful to avoid.
The honest read on adoption is about lineage, not deployment. The Newton-Schulz orthogonalization at the centre of this idea is the same operation that has already crossed from a hobbyist speedrun leaderboard into frontier-scale training: Muon, and its MuonClip variant, trained a trillion-parameter model end to end (Kimi Team, 2025, Kimi K2: Open Agentic Intelligence, arXiv:2507.20534). So the primitive is battle-tested even though this particular use of it is not. The engineering considerations that would decide a real deployment are the ones the note leaves open: the per-read overhead, whether the gains survive at parameter counts four or five orders of magnitude larger, and whether a non-synthetic recall task shows the same widening-margin pattern.
The most defensible practical statement is the one the author makes: this is a starting point for follow-up, strongest as a signal that spectral structure in a recurrent memory is worth measuring and manipulating, not as a component to ship.
[IMAGE: A simple lineage diagram, the Newton-Schulz orthogonalization primitive in the centre, with one branch labeled "Muon / MuonClip, proven at trillion-parameter scale" and a second, thinner branch labeled "mLSTM read-time memory, small-scale probe," making the maturity gap visual.]
Insights Worth Remembering
- The recall gap between recurrent models and attention is a capacity problem, and capacity problems show up first as the strong associations crowding out the weak ones. Anything that equalizes them is worth a look.
- Orthogonalization is a spectrum flattener, whether it is applied to a gradient update or to a memory matrix. The same one-line operation carries the same intuition across both settings.
- The read-only design is the interesting choice. Using a transformed matrix as a lens while preserving the raw matrix as the record is a pattern that generalizes beyond this specific result.
- Write-back hurting is a clue, not a footnote. It says the magnitude information in the memory matrix is doing real work, and that orthogonalization is useful precisely because it is temporary.
- Gains that widen as the task gets harder are more believable than uniform gains. A trick that only helps where the baseline is failing is behaving like a targeted fix, not a lucky hyperparameter.
- Checking a claim is not the same as endorsing its explanation. Here the citations, benchmarks, and code all hold, and the mechanism is still a hypothesis; both statements are true at once.
- The missing number is the compute overhead. For an efficiency-motivated idea, not reporting the read-time cost is the single gap most likely to change the verdict once filled.
Open Questions
The strongest established fact is narrow and empirical: on MAD noisy recall, at tiny scale, orthogonalizing the mLSTM memory at read time improves accuracy, most where the baseline is weakest, and the sourcing behind that claim is clean. Everything past that is open.
Does it survive scale-up? This is the question the author names first, and it is unresolved. The mechanism, equalizing a spectrum, has no obvious reason to vanish with size, but small-model synthetic gains have a poor track record of transferring, so the honest answer is that nobody knows yet.
Is the causal story correct? The "flatten the weak directions" explanation is plausible and grounded in the Muon tail-memory result, but confirming it needs ablations the note does not run: sweeping the Newton-Schulz iteration count, and measuring the memory's singular spectrum directly before and after the step. Until then, the outcome is established and the mechanism is a well-motivated guess.
What does it cost, and does it transfer? The per-read overhead is unmeasured, and the intervention has been tried on exactly one architecture at one insertion point. Whether the same read-time orthogonalization helps DeltaNet, RWKV-7, or other matrix-memory models, and whether it holds on a non-synthetic recall task, are the concrete next experiments. They are also exactly the experiments that would move this from an interesting probe to a technique worth adopting.
Sources and Further Reading
Primary source under review
- Tambde, A., 2026, Matrix Orthogonalization Improves Memory in Recurrent Models (the experimental note this piece explains and checks)
- Source code: github.com/at2005/mlstm-orthogonalize (baseline and NS5 orthogonalized runners)
Foundational Papers
- Beck, M., Poppel, K., et al., 2024, xLSTM: Extended Long Short-Term Memory, arXiv:2405.04517 (introduces the mLSTM matrix memory and covariance update)
- Arora, S., Eyuboglu, S., et al., 2023, Zoology: Measuring and Improving Recall in Efficient Language Models, arXiv:2312.04927 (introduces MQAR)
- Poli, M., et al., 2024, Mechanistic Design and Scaling of Hybrid Architectures, arXiv:2403.17844 (introduces MAD and noisy in-context recall)
Important Follow-up Work
- Wang, S., Zhang, F., et al., 2025, Muon Outperforms Adam in Tail-End Associative Memory Learning, arXiv:2509.26030 (the tail-memory result that motivates the analogy)
- Jordan, K., et al., 2024, Muon: An optimizer for hidden layers in neural networks (the Newton-Schulz orthogonalization primitive)
- Kimi Team, 2025, Kimi K2: Open Agentic Intelligence, arXiv:2507.20534 (MuonClip at trillion-parameter scale)