DoRA: Weight-Decomposed LoRA

Standard LoRA with rank 16 on LLaMA-7B reaches 74.7% average accuracy across eight commonsense reasoning benchmarks. Full fine-tuning reaches roughly 79%. DoRA, with the same rank and the same parameter budget, reaches 78.4% - recovering most of that gap by changing not what parameters are trained, but how the weight update is structured.

That gap of roughly four points between LoRA and full fine-tuning is not a minor tuning artefact. It traces to a fundamental constraint in LoRA's update geometry: magnitude and direction are forced to move together, whereas full fine-tuning moves them independently. DoRA was designed to remove that constraint.

What LoRA Gets Wrong About Weight Updates

LoRA freezes the pre-trained weight matrix W and adds a low-rank perturbation:

W' = W + BA      (B ∈ ℝ^{d×r}, A ∈ ℝ^{r×k}, r ≪ min(d,k))
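As a rough illustration of the update above, here is a minimal NumPy sketch. The sizes d, k and r, the random seed and the initialisation scales are assumptions chosen for the example, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r = 64, 48, 4                  # hypothetical sizes, for illustration only

W = rng.normal(size=(d, k))          # frozen pre-trained weight
B = np.zeros((d, r))                 # B starts at zero, so BA = 0 initially
A = rng.normal(size=(r, k)) * 0.01   # A gets a small random init (assumed scale)

W_adapted = W + B @ A                # identical to W before any training

# Pretend a few training steps have moved B away from zero.
B = rng.normal(size=(d, r)) * 0.01
delta = B @ A                        # the only thing LoRA can add to W

trainable = B.size + A.size          # r * (d + k) parameters
full = W.size                        # d * k parameters
update_rank = np.linalg.matrix_rank(delta)  # never exceeds r
```

Whatever values B and A take, the whole adaptation is the single matrix `delta`, whose rank is at most r.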

The key empirical observation in the DoRA paper (Liu et al., ICML 2024) is that, when you measure the Pearson correlation between magnitude changes and directional changes across weight matrices during LoRA training, you get a correlation of +0.83. Magnitude and direction are strongly coupled: when one grows, the other tends to grow with it.

Full fine-tuning shows a correlation of -0.62 - a moderate negative relationship, meaning the two quantities are not locked together and often trade off: a large change in direction tends to come with a small change in magnitude, and vice versa. Pre-trained weights already encode useful structure; effective adaptation often calls for refining direction while holding magnitude roughly steady, or the reverse.
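One way to make "magnitude change" and "directional change" concrete is sketched below. It assumes per-column definitions in the spirit of the paper's analysis: the mean absolute change in column norm, and the mean of one minus the cosine similarity between corresponding columns. The function names and the synthetic matrices are hypothetical.

```python
import numpy as np

def column_norms(W):
    """L2 norm of every column, returned with shape (1, k)."""
    return np.linalg.norm(W, axis=0, keepdims=True)

def magnitude_change(W_tuned, W0):
    """Mean absolute change in column norm (assumed definition)."""
    return float(np.mean(np.abs(column_norms(W_tuned) - column_norms(W0))))

def direction_change(W_tuned, W0):
    """Mean of (1 - cosine similarity) over corresponding columns."""
    dots = np.sum(W_tuned * W0, axis=0)
    cos = dots / (column_norms(W_tuned) * column_norms(W0)).ravel()
    return float(np.mean(1.0 - cos))

rng = np.random.default_rng(0)
W0 = rng.normal(size=(32, 16))       # synthetic stand-in for a pre-trained weight

# Case 1: rescale every column. Magnitude moves, direction does not.
scaled = 2.0 * W0

# Case 2: perturb W0, then restore the original column norms.
# Direction moves, magnitude does not.
noisy = W0 + 0.1 * rng.normal(size=W0.shape)
redirected = noisy / column_norms(noisy) * column_norms(W0)

dM_scaled, dD_scaled = magnitude_change(scaled, W0), direction_change(scaled, W0)
dM_redir, dD_redir = magnitude_change(redirected, W0), direction_change(redirected, W0)
```

Collecting one (directional change, magnitude change) pair per layer and checkpoint and passing the two lists to `np.corrcoef` would give a Pearson coefficient of the kind quoted above. The matrices here are synthetic and do not reproduce the paper's values.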

LoRA has no separate handle for this. Its low-rank factors parameterise a single additive delta BA, so the change in magnitude and the change in direction are both by-products of the same update and cannot be tuned independently. DoRA addresses this at the decomposition level.

The Magnitude-Direction Decomposition

Any matrix W ∈ ℝ^{d×k} with no all-zero columns can be written as:

W = m · (V / ‖V‖_c)

where:

m ∈ ℝ^{1×k} is the magnitude vector (one scalar per output column, capturing the column-wise L2 norms of W).
V ∈ ℝ^{d×k} is the directional matrix (W rescaled so each column has unit norm).
‖·‖_c denotes column-wise vector norms.

This decomposition is not an approximation - it is exact for any matrix whose columns are all non-zero. Its value is that it makes the two axes of variation explicit and separately addressable.
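The exactness is easy to check numerically. The sketch below uses a small random matrix (the shape and seed are arbitrary assumptions) and rebuilds W from its magnitude vector and directional matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(6, 4))                    # hypothetical d = 6, k = 4

# Magnitude: one L2 norm per column, shape (1, k).
# Assumes no column of W is all zeros, otherwise the division below fails.
m = np.linalg.norm(W, axis=0, keepdims=True)

# Directional matrix: W with every column rescaled to unit norm.
V = W / m

# Rebuild W as m * (V / ||V||_c); broadcasting scales each column by its m entry.
W_rebuilt = m * (V / np.linalg.norm(V, axis=0, keepdims=True))

max_error = float(np.max(np.abs(W_rebuilt - W)))  # floating-point round-off only
```

Because V / ‖V‖_c is unchanged by any positive rescaling of V's columns, only the direction of each column matters in the second factor; all scale information lives in m.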

DoRA initialises from the pre-trained weight W₀ by computing m₀ = ‖W₀‖_c and V₀ = W₀ column-normalised. It then:

Makes m a learnable parameter (a single vector of size k, so the added overhead is negligible).
Applies LoRA to the directional component, not to W directly.

The fine-tuned weight becomes:

W' = m̄ · (W₀ + BA) / ‖W₀ + BA‖_c

where m̄, B, and A are all trainable; W₀ is frozen. The magnitude vector scales the final result; the LoRA matrices B and A steer direction.

At initialisation, B = 0 (standard LoRA practice), so W' = m̄ · W₀/‖W₀‖_c = m₀ · W₀/‖W₀‖_c = W₀. No-op at start, just as in LoRA.
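The formula and its no-op initialisation can be sketched in NumPy as follows. This is an illustrative reconstruction of the weight computation only, not the authors' training code; the helper name `dora_weight`, the sizes and the initialisation scales are assumptions.

```python
import numpy as np

def column_norms(W):
    """L2 norm of every column, returned with shape (1, k)."""
    return np.linalg.norm(W, axis=0, keepdims=True)

def dora_weight(W0, m, B, A):
    """W' = m * (W0 + BA) / ||W0 + BA||_c, with W0 frozen (hypothetical helper)."""
    V = W0 + B @ A                   # LoRA is applied to the directional part
    return m * V / column_norms(V)   # m rescales each unit-norm column

rng = np.random.default_rng(0)
d, k, r = 64, 48, 4                  # hypothetical sizes
W0 = rng.normal(size=(d, k))         # frozen pre-trained weight

m = column_norms(W0)                 # trainable magnitude, initialised to m0
B = np.zeros((d, r))                 # B = 0 at initialisation
A = rng.normal(size=(r, k)) * 0.01   # assumed init scale

W_init = dora_weight(W0, m, B, A)    # reproduces W0: no-op at start

# Changing only B and A steers direction; the column norms stay equal to m.
B_moved = rng.normal(size=(d, r)) * 0.01
W_dir_only = dora_weight(W0, m, B_moved, A)

# Changing only m rescales columns; their directions stay those of W0.
W_mag_only = dora_weight(W0, 1.5 * m, B, A)

extra_params = m.size                # k additional trainable values over LoRA
```

The two perturbations show the decoupling: B and A cannot alter a column's norm, and m cannot alter its direction.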
