
Muon and MuonClip: The Optimizer That Broke Adam's Monopoly on LLM Pretraining

June 30, 2026 · 23 min read

When Kimi K2, a one-trillion-parameter mixture-of-experts model, finished pretraining on 15.5 trillion tokens without a single loss spike, the headline feature was not the data or the architecture. It was the optimizer. K2 was trained with MuonClip, a descendant of an optimizer named Muon that began life in late 2024 as the fastest known way to train a 124-million-parameter GPT on a single node (Kimi Team, 2025, Kimi K2: Open Agentic Intelligence, arXiv:2507.20534). In roughly nine months an idea proven on a hobbyist speedrun leaderboard became the optimizer of record for one of the largest open-weight models ever released.

That trajectory is unusual. The graveyard of optimizers claiming to beat AdamW is large and well populated, and almost none of them survive contact with a real training run. Muon is the rare exception, and the reason is worth understanding in detail: it treats a weight matrix as a matrix, not as a bag of independent scalars.

Why this matters: Adam and its variant AdamW have been the default for nearly every large language model since GPT-2. An optimizer that reaches the same loss in roughly half the compute is not a marginal tuning win; at frontier scale it is the difference between one training run and two.

TL;DR

  • Adam optimizes each weight independently as a scalar. Muon instead orthogonalizes the momentum matrix before each step, replacing the update with the nearest semi-orthogonal matrix using a cheap Newton-Schulz iteration (Jordan et al., 2024, Muon: An optimizer for hidden layers).
  • The orthogonalization costs almost nothing. For typical LLM batch sizes the extra FLOP overhead is below 1%, because the cost scales as \(Tm/B\) (iterations times model width over batch tokens).
  • Moonshot's scaling-law study found Muon reaches AdamW-equivalent loss with roughly half the training FLOPs, a 2x compute-efficiency gain at compute-optimal settings (Liu et al., 2025, Muon is Scalable for LLM Training, arXiv:2502.16982).
  • Naively scaling Muon breaks. Attention logits explode past magnitude 1000, a failure mode that hits Muon harder than AdamW. MuonClip fixes this by rescaling the query and key weights at the source (QK-Clip).
  • Two unglamorous additions made Muon work out of the box at scale: decoupled weight decay, and an RMS rescale so its update magnitude matches AdamW's, letting teams reuse learning rates.
  • Muon only applies to 2D hidden weights. Embeddings, the output head, and all scalar/vector parameters still use AdamW. It is a hybrid optimizer, not a replacement.

At a Glance

The whole method is one extra step bolted onto SGD with momentum: take the usual momentum matrix, orthogonalize it, then apply it.

flowchart LR
  G["Gradient G"] --> M["Momentum buffer B"]
  M --> NS["Newton-Schulz<br/>orthogonalize"]
  NS --> O["Update approx U Vt"]
  O --> RMS["RMS rescale"]
  RMS --> W["Weight matrix W"]
  classDef blue fill:#1e40af,stroke:#3b82f6,stroke-width:1px,color:#fff
  classDef purple fill:#6d28d9,stroke:#a78bfa,stroke-width:1px,color:#fff
  classDef teal fill:#0e7490,stroke:#22d3ee,stroke-width:1px,color:#fff
  class G,M blue
  class NS,RMS purple
  class O,W teal

The orthogonalization is the entire idea. Everything else, the momentum, the weight decay, the rescale, is there to make that one operation behave at scale.

Before Muon: Nine Years of Adam

Adam arrived in 2015 and never left (Kingma and Ba, 2015, Adam: A Method for Stochastic Optimization, arXiv:1412.6980). Its appeal was practical: per-parameter adaptive learning rates derived from running estimates of the first and second moments of the gradient, robust enough that you could throw it at almost any architecture and get a usable result. AdamW later decoupled weight decay from the gradient update, fixing a subtle regularization bug, and became the standard for transformers (Loshchilov and Hutter, 2019, Decoupled Weight Decay Regularization, arXiv:1711.05101).

The catch is that Adam is element-wise. It looks at each scalar weight in isolation and asks how large that scalar's recent gradients have been. It is blind to the fact that a weight matrix has geometric structure: rows and columns that interact, a spectrum of singular values, directions that matter more than others. Second-order methods like Shampoo tried to exploit that structure with full preconditioners, but the matrix roots they required were expensive and numerically delicate, so they stayed niche (Gupta et al., 2018, Shampoo: Preconditioned Stochastic Tensor Optimization, arXiv:1802.09568).

[IMAGE: Side-by-side schematic of an Adam update (a grid of independent scalars each scaled separately) versus a Muon update (a single matrix transformed as a whole), annotated to show Adam ignoring cross-weight structure that Muon preserves]

The conceptual unlock came from a quieter line of work on what the "right" norm for a weight update should be. Bernstein and Newhouse showed that several seemingly different optimizers fall out of choosing a norm under which to take steepest descent, and that for the spectral norm the optimal update is the orthogonalized gradient (Bernstein and Newhouse, 2024, Old Optimizer, New Norm: An Anthology, arXiv:2409.20325). Muon is the engineering realization of that insight, made fast enough to use.

timeline
  title From Adam to MuonClip
  2015 : Adam introduced
  2018 : Shampoo full-matrix preconditioning
  2019 : AdamW decouples weight decay
  2024 : Bernstein-Newhouse spectral norm theory
  2024 : Muon sets NanoGPT speed records
  2025 : Moonlight scales Muon to 16B MoE
  2025 : MuonClip trains trillion-param Kimi K2

The leap that matters here is the last three rows. Muon was demonstrated on a competitive task with a well-tuned baseline, then stress-tested at 16 billion parameters, then hardened for a trillion. Each step exposed a problem the previous scale had hidden.

How Muon Actually Works

Start with the familiar part. SGD with momentum keeps a running buffer of past gradients and steps in that direction:

\[B_t = \mu B_{t-1} + G_t\]

where \(G_t\) is the gradient of the loss with respect to a weight matrix \(W\), and \(\mu\) is the momentum coefficient. Plain SGD would now update \(W_t = W_{t-1} - \eta B_t\). Muon inserts one operation first: it orthogonalizes \(B_t\).
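As a minimal sketch of that recurrence, the snippet below runs SGD with momentum on a single scalar weight under a constant gradient. The values of `mu`, `lr`, and the gradient are illustrative assumptions, not settings from any of the cited runs. It shows the buffer approaching \(g / (1 - \mu)\), which is the sense in which momentum "remembers" past gradients.

```python
def momentum_step(w, buf, grad, mu=0.95, lr=0.01):
    """One SGD-with-momentum step for a scalar: B = mu * B + G, then W = W - lr * B."""
    buf = mu * buf + grad
    w = w - lr * buf
    return w, buf


w, buf = 1.0, 0.0
for _ in range(200):
    w, buf = momentum_step(w, buf, grad=0.5)  # constant gradient, purely for illustration

steady_state = 0.5 / (1 - 0.95)  # geometric-series limit of the buffer
```

Muon keeps this buffer exactly as is; the only change is what happens to \(B_t\) before it is subtracted from the weights.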

What orthogonalization means here

Take the singular value decomposition \(B_t = U S V^\top\). The orthonormal factors \(U\) and \(V\) carry the directions of the update, while the diagonal matrix \(S\) of singular values tells you how strongly each direction is weighted: some carry enormous weight and others almost none. Empirically, the momentum matrices in transformer layers are nearly low-rank: a handful of singular values dominate, and the update is effectively pointing in just a few directions. Orthogonalization replaces \(B_t\) with the nearest semi-orthogonal matrix, which works out to be exactly \(U V^\top\):

\[\mathrm{Ortho}(B_t) = \arg\min_{O} \{ \|O - B_t\|_F : O^\top O = I \text{ or } O O^\top = I \} = U V^\top\]

In words, it flattens the spectrum. Every singular value is set to 1, so the rare directions that the raw update would have ignored get the same scale as the dominant ones. Keller Jordan's hypothesis is that those rare directions carry real learning signal, and that Adam and SGD waste capacity by letting a few directions swamp them (Jordan et al., 2024).

[IMAGE: Side-by-side heatmaps of a momentum matrix's singular-value spectrum before and after orthogonalization, the "before" panel showing a steep decay dominated by 3-4 values and the "after" panel a flat line at 1.0, annotated to highlight the amplified tail directions]

Why Newton-Schulz instead of SVD

Computing a true SVD every step for every weight matrix is far too slow on a GPU. The trick is that you never need \(U\) and \(V\) explicitly; you only need the product \(U V^\top\). A Newton-Schulz iteration approximates that product directly by repeatedly applying a fixed quintic polynomial to the matrix. One step looks like:

\[X \leftarrow a X + b (X X^\top) X + c (X X^\top)^2 X\]

Because \(X = U S V^\top\), each step acts only on the singular values: it maps \(S\) through the scalar polynomial \(\varphi(s) = a s + b s^3 + c s^5\) while leaving \(U\) and \(V\) untouched. Choose coefficients so that \(\varphi\) drives every singular value toward 1, normalize \(B_t\) by its Frobenius norm first so the values start in \([0,1]\), and after a handful of iterations \(X \approx U V^\top\). Jordan tuned the coefficients to \((a, b, c) = (3.4445, -4.7750, 2.0315)\), which pushes convergence fast enough that five iterations suffice for transformer training. The price of that speed is that the tuned polynomial has no fixed point at exactly 1 (\(\varphi(1) \approx 0.70\)): singular values end up inside a band around 1, roughly 0.7 to 1.2, rather than converging to it, which turns out to be close enough in practice. Crucially, the iteration is numerically stable in bf16, unlike the float32-only coupled iterations that Shampoo needs (Jordan et al., 2024).
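The claim that the iteration touches only the singular values follows from the orthonormality of the SVD factors (\(U^\top U = I\) and \(V^\top V = I\)). A short derivation, using only the symbols already defined above:

\[X X^\top = U S V^\top V S U^\top = U S^2 U^\top, \qquad (X X^\top) X = U S^3 V^\top, \qquad (X X^\top)^2 X = U S^5 V^\top\]

\[a X + b (X X^\top) X + c (X X^\top)^2 X = U \left(a S + b S^3 + c S^5\right) V^\top = U\, \varphi(S)\, V^\top\]

Since \(S\) is diagonal, \(\varphi(S)\) just applies the scalar polynomial to each singular value in turn, which is why the behavior of the whole matrix iteration can be read off a one-dimensional plot of \(\varphi\).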

Here is the actual kernel, which is short enough to read in full:

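import torch

def newton_schulz5(G: torch.Tensor, steps: int = 5, eps: float = 1e-7) -> torch.Tensor:
    """Approximately orthogonalize a 2D tensor G, returning roughly U V^T in bf16."""
    assert G.ndim == 2
    a, b, c = (3.4445, -4.7750, 2.0315)
    X = G.bfloat16()
    X = X / (X.norm() + eps)       # out of place, so a bf16 input is not mutated; singular values now in [0,1]
    transposed = G.size(0) > G.size(1)
    if transposed:                 # iterate on the wide orientation so X @ X.T is the smaller Gram matrix
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        B = b * A + c * A @ A
        X = a * X + B @ X          # phi applied to the spectrum
    if transposed:
        X = X.T
    return X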
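To see what the kernel does without a GPU or PyTorch, here is a dependency-free sketch of the same iteration on plain Python lists. It runs in float64, so it illustrates the arithmetic rather than the bf16 behavior, and it only handles matrices with no more rows than columns. The helper names and the 2x3 test matrix (built to have singular values 3.0 and 0.2) are illustrative assumptions.

```python
import math

COEFFS = (3.4445, -4.7750, 2.0315)  # coefficients quoted in the text


def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def newton_schulz_lists(G, steps=5, eps=1e-7):
    """List-of-lists analogue of the quintic iteration; expects rows <= columns."""
    a, b, c = COEFFS
    fro = math.sqrt(sum(x * x for row in G for x in row))
    X = [[x / (fro + eps) for x in row] for row in G]
    for _ in range(steps):
        A = matmul(X, transpose(X))
        A2 = matmul(A, A)
        B = [[b * p + c * q for p, q in zip(ra, rb)] for ra, rb in zip(A, A2)]
        BX = matmul(B, X)
        X = [[a * x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, BX)]
    return X


def singular_values_2xn(X):
    """Singular values of a 2 x n matrix from the eigenvalues of its 2x2 Gram matrix."""
    (p, q), (_, r) = matmul(X, transpose(X))
    mean, gap = (p + r) / 2, math.hypot((p - r) / 2, q)
    return math.sqrt(mean + gap), math.sqrt(max(mean - gap, 0.0))


# Rotate diag(3.0, 0.2) so the test matrix is not trivially diagonal.
t = 0.6
R = [[math.cos(t), -math.sin(t)], [math.sin(t), math.cos(t)]]
G = matmul(R, [[3.0, 0.0, 0.0], [0.0, 0.2, 0.0]])

before = singular_values_2xn(G)
after = singular_values_2xn(newton_schulz_lists(G))
```

Here `before` holds a 15x spread between the two directions, while both entries of `after` should sit within a few tenths of 1: the spectrum has been flattened without ever computing \(U\) or \(V\).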

[IMAGE: Annotated plot of the polynomial phi(x) over x in 0 to 1, showing the baseline coefficients (2, -1.5, 0.5) as a gentle curve and the tuned coefficients as a steeper one, with the convergence band 0.7 to 1.3 shaded and the fixed point at x=1 marked]

The flow of one Muon step

flowchart TD
  A["Compute gradient G_t"] --> B["Update momentum<br/>B = mu B + G"]
  B --> C["Normalize by<br/>Frobenius norm"]
  C --> D["5 Newton-Schulz<br/>iterations (fixed count)"]
  D --> F["Scale by RMS<br/>constant 0.2 root dim"]
  F --> G["Decoupled<br/>weight decay"]
  G --> H["Apply update to W"]
  classDef blue fill:#1e40af,stroke:#3b82f6,stroke-width:1px,color:#fff
  classDef purple fill:#6d28d9,stroke:#a78bfa,stroke-width:1px,color:#fff
  classDef amber fill:#b45309,stroke:#fbbf24,stroke-width:1px,color:#fff
  classDef teal fill:#0e7490,stroke:#22d3ee,stroke-width:1px,color:#fff
  class A,B blue
  class C,D purple
  class F,G amber
  class H teal

What had to change for scale

The blog-era Muon worked on small models out of the box but stumbled when Moonshot pushed it to billions of parameters. Two fixes mattered (Liu et al., 2025, arXiv:2502.16982). First, decoupled weight decay, the same AdamW lesson: without it, weights and their updates grew unbounded over a long run, eventually hurting performance in bf16. Second, consistent RMS matching. A semi-orthogonal \(n \times m\) matrix has every singular value equal to 1 whatever its shape, so its Frobenius norm is \(\sqrt{\min(n, m)}\) and its per-element root-mean-square is \(1/\sqrt{\max(n, m)}\); the size of a raw Muon update therefore depends on the matrix dimensions. Moonshot rescaled each update by a constant (0.2) times a shape factor so that the update RMS lands near AdamW's typical value of about 0.2 to 0.3. The payoff is purely practical: teams can port AdamW learning rates directly instead of re-tuning from scratch.
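The shape dependence is easy to check numerically. The sketch below assumes the shape factor is \(\sqrt{\max(n, m)}\), which is what "0.2 root dim" in the flow diagram suggests; the function names and the layer shapes are hypothetical.

```python
import math


def semi_orthogonal_rms(n, m):
    """Per-element RMS of an n x m matrix whose singular values are all 1."""
    frobenius = math.sqrt(min(n, m))     # square root of the number of unit singular values
    return frobenius / math.sqrt(n * m)  # simplifies to 1 / sqrt(max(n, m))


def rms_match_scale(n, m, target_rms=0.2):
    """Multiplier that brings an orthogonalized update to target_rms (assumed form)."""
    return target_rms * math.sqrt(max(n, m))


shapes = [(1024, 1024), (1024, 4096), (8192, 1024)]  # hypothetical layer shapes
raw = [semi_orthogonal_rms(n, m) for n, m in shapes]
matched = [semi_orthogonal_rms(n, m) * rms_match_scale(n, m) for n, m in shapes]
```

The `raw` values differ from shape to shape, while every entry of `matched` comes out at the 0.2 target, which is the property that lets one learning rate serve all layers.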

[IMAGE: Grouped bar chart of per-update RMS across several weight-matrix shapes, comparing raw orthogonal Muon (varying with shape) against RMS-matched Muon and AdamW (both flat near 0.2 to 0.3), annotated to show the rescale collapsing the shape dependence]

MuonClip and the attention-logit explosion

At the trillion-parameter scale a new failure surfaced. The maximum attention logit, the raw dot product of a query and key before softmax, climbed past 1000 and kept going, eventually destabilizing training. This happens under Muon more readily than under AdamW, and the usual patches (logit soft-capping, query-key normalization) were judged inadequate by the Kimi team (Kimi Team, 2025, arXiv:2507.20534).

QK-Clip attacks the cause rather than the symptom. After each optimizer step, if the maximum logit in an attention head exceeds a threshold \(\tau\) (Kimi used \(\tau = 100\)), the query and key projection weights for that head are rescaled down by a factor derived from the overshoot:

\[W_q \leftarrow \gamma\, W_q, \quad W_k \leftarrow \gamma\, W_k, \quad \gamma = \sqrt{\tau / S_{\max}}\]

where \(S_{\max}\) is the observed maximum logit. The rescale is applied to the weights themselves, so it does not perturb the current step's forward or backward pass; the optimization dynamics are preserved, and only future logits are bounded. Combined with Muon, weight decay, and RMS matching, this package is MuonClip. Kimi K2 reached the capped logit value early in training and decayed back to a stable range after roughly the first 30% of steps, with zero loss spikes across the full 15.5-trillion-token run (Kimi Team, 2025, arXiv:2507.20534).
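A toy version of the rule makes the square root easy to see: the logit is bilinear in \(W_q\) and \(W_k\), so scaling both by \(\gamma\) scales the logit by \(\gamma^2 = \tau / S_{\max}\). The sketch below is a single-head illustration on plain Python lists with made-up weights and inputs; it omits the usual \(1/\sqrt{d}\) attention scaling and is not the Kimi implementation.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def matvec(W, x):
    return [dot(row, x) for row in W]


def scaled(W, factor):
    return [[factor * w for w in row] for row in W]


def max_logit(Wq, Wk, tokens):
    """Largest raw query-key dot product over all token pairs for one head."""
    qs = [matvec(Wq, x) for x in tokens]
    ks = [matvec(Wk, x) for x in tokens]
    return max(dot(q, k) for q in qs for k in ks)


def qk_clip(Wq, Wk, s_max, tau=100.0):
    """Rescale one head's query and key weights only if its max logit exceeded tau."""
    if s_max <= tau:
        return Wq, Wk
    gamma = math.sqrt(tau / s_max)
    return scaled(Wq, gamma), scaled(Wk, gamma)


# Hypothetical head whose logits overshoot the threshold.
Wq = [[9.0, 0.0], [0.0, 4.0]]
Wk = [[5.0, 0.0], [0.0, 2.0]]
tokens = [[2.0, 1.0], [1.0, -1.0]]

s_before = max_logit(Wq, Wk, tokens)
Wq_clipped, Wk_clipped = qk_clip(Wq, Wk, s_before)
s_after = max_logit(Wq_clipped, Wk_clipped, tokens)
```

Recomputing the logits with the clipped weights puts the maximum back at the threshold, and a head that never crosses \(\tau\) is returned unchanged.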

Seeing It in Motion

Two views clarify what the prose leaves implicit: where Muon sits inside a transformer, and how QK-Clip slots into a training step.

The first is architectural. Muon does not optimize everything. It governs only the 2D hidden weights; the embedding table, the output classifier head, and every scalar or vector parameter stay on AdamW, because empirically those layers train better under Adam's per-element adaptivity.

graph TD
  subgraph Transformer
    EMB["Token embeddings"]
    QKV["Q K V projections"]
    ATTN["Attention output proj"]
    FFN["FFN weight matrices"]
    HEAD["Output head"]
    NORM["LayerNorm gains/bias"]
  end
  MUON["Muon optimizer"]
  ADAM["AdamW optimizer"]
  QKV --> MUON
  ATTN --> MUON
  FFN --> MUON
  EMB --> ADAM
  HEAD --> ADAM
  NORM --> ADAM
  classDef purple fill:#6d28d9,stroke:#a78bfa,stroke-width:1px,color:#fff
  classDef blue fill:#1e40af,stroke:#3b82f6,stroke-width:1px,color:#fff
  classDef slate fill:#334155,stroke:#64748b,stroke-width:1px,color:#e2e8f0
  class QKV,ATTN,FFN purple
  class EMB,HEAD,NORM blue
  class MUON,ADAM slate

A subtle implementation detail: Muon performs best when applied to the Q, K, and V projections separately rather than as one fused QKV matrix, because fusing them mixes three different subspaces into a single orthogonalization.
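The split described above amounts to a small routing rule. Here is a minimal sketch, assuming parameters can be told apart by name and shape; the substring checks and the parameter names are hypothetical conventions, not part of any real library.

```python
def pick_optimizer(name, shape):
    """Route a parameter to 'muon' or 'adamw' following the hybrid split.

    The name checks below are hypothetical conventions for this example.
    """
    is_matrix = len(shape) == 2
    is_embedding_or_head = "embed" in name or "lm_head" in name
    return "muon" if is_matrix and not is_embedding_or_head else "adamw"


# Hypothetical parameter names and shapes for one transformer block.
params = {
    "embed.weight": (50304, 768),
    "blocks.0.attn.q_proj.weight": (768, 768),
    "blocks.0.attn.k_proj.weight": (768, 768),
    "blocks.0.attn.v_proj.weight": (768, 768),
    "blocks.0.attn.out_proj.weight": (768, 768),
    "blocks.0.ffn.up.weight": (3072, 768),
    "blocks.0.norm.gain": (768,),
    "lm_head.weight": (50304, 768),
}

groups = {"muon": [], "adamw": []}
for name, shape in params.items():
    groups[pick_optimizer(name, shape)].append(name)
```

Keeping the Q, K, and V projections as separate entries, as in this example, means each one is orthogonalized on its own rather than as a fused matrix.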

The second view is temporal. QK-Clip is a post-step guard that watches the realized logits and corrects the weights only when they breach the threshold.

sequenceDiagram
  participant Train as Training loop
  participant Muon as Muon update
  participant Attn as Attention monitor
  participant Clipr as QK-Clip
  Train->>Muon: backward gives gradients
  Muon->>Muon: orthogonalize and step
  Muon->>Attn: weights updated
  Attn->>Attn: measure max logit S
  alt S over threshold tau
    Attn->>Clipr: report overshoot
    Clipr->>Clipr: gamma = sqrt(tau / S)
    Clipr->>Muon: rescale Wq and Wk
  else S within bound
    Attn->>Train: no action
  end
  Train->>Train: next step

Notice that the clip touches only the weights, never the activations of the step that just ran. That is what makes it cheap and side-effect-free: it changes the future without rewriting the present.

By the Numbers

The headline claim is compute efficiency, and it is backed by a scaling-law study rather than a single lucky run. Moonshot trained matched Muon and AdamW models across sizes and fit the loss-versus-FLOPs curves; Muon's curve sits below AdamW's, reaching equal loss at roughly half the FLOPs.

| Metric | Muon | AdamW | Source |
| --- | --- | --- | --- |
| Compute to match loss (compute-optimal) | ~52% of FLOPs | 100% (baseline) | Liu et al., 2025, arXiv:2502.16982 |
| Optimizer state memory per 2D weight | 1 buffer (momentum) | 2 buffers (m and v) | Jordan et al., 2024 |
| Extra FLOP overhead per step | under 1% | 0% | Jordan et al., 2024 |
| NanoGPT speedrun improvement on adoption | ~35% faster | baseline record | Jordan et al., 2024 |
| 1.5B model to GPT-2 XL HellaSwag level | ~10 hours on 8xH100 | ~13.3 hours on 8xH100 | Jordan et al., 2024 |

The memory line deserves emphasis. AdamW stores two state tensors per parameter, the first and second moments. Muon stores one momentum buffer, so for the 2D weights it governs it roughly halves optimizer-state memory, which at frontier scale frees real device memory for batch or activation budget.

The overhead figure comes from a clean piece of analysis. Each Newton-Schulz step on an \(n \times m\) matrix costs at most \(6 n m^2\) matmul FLOPs (taking \(m \le n\)), so \(T\) steps cost at most \(6 T n m^2\), while the layer's own forward-backward pass costs \(6 n m B\) for a batch of \(B\) tokens. The ratio works out to:

\[\text{overhead} \approx \frac{T m}{B}\]

For the NanoGPT record (\(m = 768\), \(B = 524288\) tokens, \(T = 5\)) this is about 0.7%. For a Llama-405B-scale run (\(m = 16384\), \(B \approx 1.6 \times 10^7\)) it is about 0.5% (Jordan et al., 2024). The orthogonalization is essentially free; you pay for it in code complexity, not compute.
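The two figures above are quick to reproduce. This is a direct transcription of the \(Tm/B\) ratio; the function and argument names are just labels chosen for the example.

```python
def muon_flop_overhead(width, batch_tokens, ns_steps=5):
    """Ratio of Newton-Schulz FLOPs to forward-backward FLOPs: T * m / B."""
    return ns_steps * width / batch_tokens


nanogpt = muon_flop_overhead(width=768, batch_tokens=524_288)
llama_scale = muon_flop_overhead(width=16_384, batch_tokens=16_000_000)
print(f"NanoGPT: {nanogpt:.2%}, 405B-scale: {llama_scale:.2%}")
```

Because the width sits in the numerator and the batch in the denominator, the ratio stays small as long as the token batch grows at least as fast as the model width.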

[IMAGE: Log-log scaling-law plot, training loss on the y-axis versus training FLOPs on the x-axis, two fitted lines for Muon and AdamW, with a horizontal guide showing Muon hitting a target loss at roughly half the FLOPs of AdamW]

A Concrete Example

Walk a single weight matrix through one Muon step. Take a tiny FFN weight \(W\) of shape \(4 \times 4\), and suppose after accumulating momentum the buffer \(B\) has the singular-value spectrum \(S = \mathrm{diag}(3.0,\ 0.8,\ 0.2,\ 0.05)\). The largest direction is 60 times the smallest. Under plain SGD-momentum, the update is dominated almost entirely by that first direction; the fourth barely moves.

Now orthogonalize. First normalize by the Frobenius norm \(\|B\|_F = \sqrt{3.0^2 + 0.8^2 + 0.2^2 + 0.05^2} \approx 3.11\), giving scaled singular values \((0.964,\ 0.257,\ 0.064,\ 0.016)\), all inside \([0,1]\). Then push each through \(\varphi(s) = 3.4445 s - 4.7750 s^3 + 2.0315 s^5\) five times. Tracing three of the four values:

| Iteration | s = 0.016 | s = 0.257 | s = 0.964 |
| --- | --- | --- | --- |
| start | 0.016 | 0.257 | 0.964 |
| after 1 | 0.055 | 0.807 | 0.734 |
| after 2 | 0.190 | 0.966 | 1.073 |
| after 3 | 0.622 | 0.732 | 0.686 |
| after 4 | 1.183 | 1.076 | 1.130 |
| after 5 | 0.876 | 0.688 | 0.745 |

[IMAGE: Line plot of the three tracked singular values across five Newton-Schulz iterations, all three entering the band around 1.0 and oscillating inside it, with the smallest (0.016) shown climbing steeply over the first four iterations]

(The values do not settle at exactly 1: the tuned polynomial has \(\varphi(1) \approx 0.70\), so once a singular value gets close to 1 it bounces around inside a band of roughly 0.7 to 1.2, which is exactly the \(\varepsilon\) tolerance the coefficients were tuned to allow.) After five steps all four singular values sit between about 0.69 and 1.10, so the effective update is approximately \(U V^\top\): the spread between the largest and smallest direction has dropped from 60x to about 1.6x, and the fourth direction, which raw momentum would have nearly ignored, now contributes about as strongly as the first. The matrix has been told to move in all its informative directions, not just the loudest one.
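The trace can be reproduced with a few lines of scalar Python, since the iteration acts on each singular value independently. The sketch below uses the spectrum and coefficients from this example; the variable names are arbitrary.

```python
import math

A, B, C = 3.4445, -4.7750, 2.0315  # tuned coefficients quoted earlier


def phi(s):
    """Scalar map that one Newton-Schulz step applies to each singular value."""
    return A * s + B * s**3 + C * s**5


spectrum = [3.0, 0.8, 0.2, 0.05]
frobenius = math.sqrt(sum(s * s for s in spectrum))
values = [s / frobenius for s in spectrum]  # normalized into [0, 1]

history = [values]
for _ in range(5):
    values = [phi(s) for s in values]
    history.append(values)

for step, row in enumerate(history):
    print(step, " ".join(f"{s:.3f}" for s in row))

final_ratio = max(history[-1]) / min(history[-1])
```

Each printed row is one iteration, with columns in the order of the original spectrum, and `final_ratio` measures how much spread is left between the strongest and weakest direction after five steps.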

Finally, rescale by the RMS constant so the update magnitude matches what an AdamW step would have produced, apply decoupled weight decay, and write the result to \(W\). If this were a query or key projection and the post-step max logit came out at, say, 180 against a threshold of 100, QK-Clip would multiply \(W_q\) and \(W_k\) by \(\sqrt{100/180} \approx 0.745\) before the next step. That is the entire mechanism, reproducible on paper.

Where It Breaks

Muon is not a drop-in replacement for AdamW everywhere, and treating it as one is the most common way to get burned.

It is defined only for 2D parameters. Embeddings and the output head are technically 2D but train worse under Muon, so they need AdamW anyway; every scalar gain, bias, and normalization parameter is 1D and has no orthogonalization to apply. A real training stack therefore runs two optimizers side by side, with the bookkeeping that implies.

The attention-logit explosion is the sharpest failure. It is not hypothetical; it is what forced QK-Clip into existence. Because orthogonalization keeps pushing energy into directions that Adam would have damped, Muon is more prone to the runaway dot-product growth that destabilizes softmax attention. Without an intervention like QK-Clip, a large run can diverge.

Distribution is genuinely harder than for AdamW. Adam's update is element-wise and trivially shardable; you can split a weight tensor across devices and each shard updates independently. Muon's Newton-Schulz iteration involves matrix products over the whole weight matrix, so a tensor-parallel layout has to gather or communicate partial products. Moonshot's implementation includes a distributed scheme to make this practical, and getting it efficient is real engineering, not a config flag (Liu et al., 2025, arXiv:2502.16982).

There are open empirical edges too. The orthogonalization runs in bf16, which is fast but means the approximation quality of five Newton-Schulz steps depends on the conditioning of each matrix; pathological matrices may need more steps. And the evidence base, while strong, is concentrated in pretraining. Whether Muon's advantage holds for supervised fine-tuning and reinforcement-learning post-training is less settled.

[IMAGE: Line chart of maximum attention logit versus training step, one curve for plain Muon climbing past 1000 and diverging, a second curve for MuonClip pinned at the cap of 100 then decaying after roughly 30% of steps, with the divergence point on the first curve annotated]

Alternative Designs

Muon sits in a small family of matrix-aware and structure-aware optimizers. The honest comparison is not "Muon versus everything" but where each method's tradeoff lands.

| Approach | Strengths | Weaknesses | Best when |
| --- | --- | --- | --- |
| AdamW | Universal, trivially shardable, battle-tested | Element-wise, blind to matrix structure, 2x state memory | Default for any architecture, post-training, small runs |
| Muon / MuonClip | ~2x compute efficiency, half optimizer memory, under 1% overhead | 2D-only, harder to distribute, needs QK-Clip at scale | Large transformer pretraining where compute dominates |
| Shampoo (distributed) | Full-matrix preconditioning, strong per-step progress | Expensive matrix roots, float32 needed, complex to run | Settings that can absorb heavy per-step optimizer cost |
| Orthogonal-SGDM | Earlier orthogonalization idea via SVD | SVD too slow, momentum placed after orthogonalization, beaten by tuned SGD | Mainly of historical interest |
| Lion | Cheap, sign-based, one state tensor | Sensitive to LR and batch size, less matrix structure | Memory-constrained runs tolerant of tuning |

[IMAGE: Two-axis positioning chart placing AdamW, Muon/MuonClip, Shampoo, Orthogonal-SGDM, and Lion by compute efficiency versus distribution difficulty, annotated to show Muon high on efficiency but costlier to distribute than AdamW]

Two relationships are worth drawing out. Muon is closely related to Shampoo: if you strip Shampoo's preconditioner accumulation, its update reduces to the same orthogonalized gradient Muon computes, but via expensive inverse-fourth roots instead of cheap Newton-Schulz iterations (Bernstein and Newhouse, 2024, arXiv:2409.20325). And the idea of orthogonalizing gradients is older than Muon: Tuddenham and colleagues proposed Orthogonal-SGDM in 2022 using SVD, but placed momentum after orthogonalization and found their method beaten by well-tuned SGD, which is partly why it went unnoticed (Tuddenham et al., 2022, Orthogonalising gradients to speed up neural network optimisation, arXiv:2202.07052). Muon's contribution was making the operation cheap, putting momentum first, and proving it on a task with a well-tuned baseline.
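The Shampoo connection can be checked in two lines. As a sketch, assume the usual Shampoo update form \(L^{-1/4} G R^{-1/4}\) with the accumulated preconditioners replaced by their current-step values \(L = G G^\top\) and \(R = G^\top G\), and take \(G = U S V^\top\) square and full rank for simplicity:

\[(G G^\top)^{-1/4} = U S^{-1/2} U^\top, \qquad (G^\top G)^{-1/4} = V S^{-1/2} V^\top\]

\[(G G^\top)^{-1/4}\, G\, (G^\top G)^{-1/4} = U S^{-1/2}\, S\, S^{-1/2} V^\top = U V^\top\]

The singular values cancel exactly, leaving the same orthogonalized gradient that Newton-Schulz approximates without ever forming a matrix root.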

How It Is Used in Practice

The clearest production signal is Kimi K2: a 1-trillion-parameter MoE with 32 billion activated parameters per token, 384 experts, trained end to end on 15.5 trillion tokens with MuonClip and no loss spikes (Kimi Team, 2025, arXiv:2507.20534). That is the existence proof the field was waiting for, and it directly answers the question Jordan posed in his original writeup: whether Muon would scale to 20B-plus parameters and trillion-token runs.

Before K2 there was Moonlight, the proving ground: a 16-billion-parameter MoE with 3 billion activated parameters, trained on 5.7 trillion tokens, released with open weights and an open Muon implementation. Moonlight is where the weight-decay and RMS-matching fixes were validated, and where the scaling-law claim of roughly 2x compute efficiency was measured (Liu et al., 2025, arXiv:2502.16982). The code lives in the public Moonlight repository, and a standalone Muon implementation is maintained alongside the NanoGPT speedrun records.

The engineering considerations at scale are concrete. You need the two-optimizer split wired correctly so embeddings and the head route to AdamW. You need a distributed Newton-Schulz that does not bottleneck on communication. You need QK-Clip or an equivalent logit guard if you are training large enough to hit the explosion. None of these is exotic, but all three are load-bearing; skip the last one on a big run and the loss curve can come apart.

[IMAGE: Annotated training-loss curve for Kimi K2 over 15.5T tokens, flat and spike-free, with a callout box reading "zero loss spikes" and a small inset showing the max-logit trace capped at 100 and decaying]

Insights Worth Remembering

  • The core move is one sentence: replace the update matrix with the nearest orthogonal matrix. Everything else is making that cheap and stable.
  • Orthogonalization is a spectrum flattener. It amplifies the rare, low-magnitude directions in a weight update that element-wise optimizers let drown.
  • You never compute the SVD. Newton-Schulz gives you \(U V^\top\) directly through a polynomial that acts only on the singular values, and it runs in bf16.
  • The overhead formula \(Tm/B\) is the whole economic argument: as batch size grows with scale, the relative cost of orthogonalization shrinks, so Muon gets cheaper in relative terms as models get bigger.
  • Muon's strengths and its instabilities share a root cause. Pushing energy into neglected directions is why it learns faster and why attention logits explode without a guard.
  • The unglamorous fixes carried the day. Weight decay and an RMS rescale, not a new theory, are what let teams reuse AdamW hyperparameters and made adoption painless.
  • Competitive benchmarks did the validation that peer review could not. Muon earned trust by holding the NanoGPT speed record across a dozen runs set by independent researchers, not by a single paper's claim.

Open Questions

The strongest evidence is for pretraining. Whether Muon's compute advantage transfers cleanly to supervised fine-tuning and RL post-training is not yet well established; Jordan flagged this himself, and it remains an active question rather than a settled result.

Distribution at the largest scales is still being worked out. Moonshot's distributed Muon makes trillion-parameter training feasible, but whether the Newton-Schulz communication pattern stays efficient on the next generation of clusters, and whether better-conditioned variants reduce the iteration count, are open engineering problems. Recent work on faster orthogonalization via Chebyshev-type polynomials suggests the five-iteration default is not the last word, though those results are early and not yet standard.

A deeper theoretical question is why orthogonalization helps as much as it does. The spectral-norm steepest-descent framing gives a principled motivation, and the low-rank-update observation gives an empirical one, but a full account of when and how much Muon beats AdamW across architectures is still being assembled. The most likely near-term development is not a single successor but a settling-out: weight decay, RMS matching, and a logit guard becoming the assumed baseline for any orthogonalizing optimizer, the way decoupled weight decay became standard for Adam.

Sources and Further Reading

Foundational Papers

Important Follow-up Work

Technical Blogs and Code
