Choosing LoRA Rank and Alpha
LoRA rank r controls how much task-specific capacity the adapter has, and alpha controls the scaling of that update; choosing them poorly wastes parameters or destabilises training.
Most practitioners copy r=8, lora_alpha=16 from a tutorial and move on. That default works surprisingly often, but it also quietly underperforms on tasks with broad vocabulary shift, multi-domain instruction tuning, or continued pretraining. Understanding why those numbers were chosen, and when to change them, converts LoRA from a black-box trick into a tunable tool.
What rank r actually controls
Rank \(r\) is the number of dimensions in the low-rank subspace that the adapter is allowed to occupy. Given a pretrained weight \(W_0 \in \mathbb{R}^{d \times k}\), the adapter update is:
\[\Delta W = B A, \quad B \in \mathbb{R}^{d \times r},\ A \in \mathbb{R}^{r \times k}\]
The total number of trainable parameters for that one layer is \(r(d + k)\). For a typical LLaMA-2 attention projection (\(d = k = 4096\)), going from \(r=4\) to \(r=64\) multiplies the adapter's parameter count by 16, from about 0.2% to about 3.1% of the layer's full \(d \times k\) budget.
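The parameter arithmetic is easy to check directly. The following is a minimal sketch in plain Python; the helper name `lora_param_count` is hypothetical, and the shape \(d = k = 4096\) is the one used in the text. It prints the adapter's parameter count and its share of the full \(d \times k\) matrix for a few ranks.

```python
def lora_param_count(d: int, k: int, r: int) -> int:
    """Trainable parameters in one adapter: B is d x r, A is r x k."""
    return r * (d + k)


d = k = 4096          # projection shape used in the text
full = d * k          # parameters in the frozen weight matrix

for r in (4, 8, 16, 64):
    n = lora_param_count(d, k, r)
    print(f"r={r:>3}  adapter params={n:>9,}  share of d*k={n / full:.3%}")
```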
The practical question is: how much rank does the task actually require? Aghajanyan et al. (2021) showed that fine-tuning tasks have low intrinsic dimensionality: a RoBERTa checkpoint reaches 90% of peak performance on MRPC when optimised in a randomly projected 200-dimensional subspace of the full parameter space. That result motivates small ranks for narrow tasks.
A useful rough guide by task type:
| Task type | Suggested starting rank | Reason |
|---|---|---|
| Single classification / NER | 4-8 | Very low intrinsic dimension; base model already "knows" the structure |
| Domain adaptation (e.g., medical text) | 8-16 | Moderate vocabulary and style shift |
| Broad instruction tuning | 16-64 | Many diverse behaviours to steer simultaneously |
| Continued pretraining on new domain | 64-256 (or full FT) | May require genuinely new knowledge; LoRA struggles here |
Note that "LoRA Learns Less and Forgets Less" (Biderman et al., 2024) showed that full fine-tuning learns perturbations with an effective rank that is roughly 10 to 100x higher than what typical LoRA configurations use. For tasks where learning new knowledge matters more than preserving old knowledge, small-rank LoRA leaves significant performance on the table.
What alpha controls, and why it is not just "set equal to r"
The LoRA forward pass scales the adapter's contribution by \(\alpha / r\):
\[h = W_0 x + \frac{\alpha}{r} \cdot B A x\]
This scalar multiplier sets the effective learning rate of the adapter relative to the frozen base. If you double \(r\) while keeping \(\alpha\) fixed, the multiplier halves, the adapter's gradient signal weakens, and training slows. Conversely, keeping \(\alpha = r\) (the ratio equals 1) holds the multiplier constant as you sweep \(r\).
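To make the forward pass concrete, here is a small framework-free sketch. The helper names `matvec` and `lora_forward` are hypothetical, and the tiny matrices are hand-picked purely for illustration. It assumes the initialisation from Hu et al., in which \(B\) starts at zero so that the adapter initially leaves the base output unchanged.

```python
from typing import List

Matrix = List[List[float]]
Vector = List[float]


def matvec(M: Matrix, v: Vector) -> Vector:
    """Plain matrix-vector product."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def lora_forward(W0: Matrix, A: Matrix, B: Matrix, x: Vector, alpha: float) -> Vector:
    """h = W0 x + (alpha / r) * B (A x), with r read from A's row count."""
    r = len(A)
    scale = alpha / r
    base = matvec(W0, x)
    update = matvec(B, matvec(A, x))
    return [b + scale * u for b, u in zip(base, update)]


W0 = [[1.0, 0.0, 2.0],
      [0.0, 1.0, 0.0]]               # frozen weight, d=2, k=3
A = [[1.0, 1.0, 1.0],
     [0.0, 1.0, 0.0]]                # r=2, k=3
B_zero = [[0.0, 0.0], [0.0, 0.0]]    # B at initialisation
B_trained = [[0.5, 0.0], [0.0, 1.0]] # illustrative values after training
x = [1.0, 2.0, 3.0]

h_init = lora_forward(W0, A, B_zero, x, alpha=4.0)      # equals W0 x
h_tuned = lora_forward(W0, A, B_trained, x, alpha=4.0)  # base plus scaled update
print(h_init, h_tuned)
```

With `alpha=4.0` and `r=2` the multiplier is 2.0, so the adapter's raw output is doubled before it is added to the base projection.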
A common convention:
- alpha = r: adapter scale is 1.0; stable starting point, easy to interpret.
- alpha = 2r (e.g., r=8, alpha=16): adapter is scaled up 2x. Many tutorials use this as default because it marginally accelerates convergence on common benchmarks, but the extra scale can cause instability with aggressive learning rates.
- alpha = 16 (fixed): freeze alpha, sweep only r. This is clean for ablations because decreasing r with fixed alpha increases the scale, which can compensate for the lost capacity, up to a point.
The key insight is that \(\alpha / r\) acts as a per-adapter learning rate multiplier. If your global learning rate is already tuned, keeping \(\alpha / r\) constant across rank experiments reduces one confound.
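The three conventions can be compared numerically. This is a minimal sketch; `lora_scale` is a hypothetical helper that simply evaluates \(\alpha / r\), and the rank grid is arbitrary.

```python
def lora_scale(alpha: float, r: int) -> float:
    """Standard LoRA multiplier applied to B A x."""
    return alpha / r


ranks = (4, 8, 16, 32, 64)

tied_alpha = {r: lora_scale(r, r) for r in ranks}        # alpha = r
double_alpha = {r: lora_scale(2 * r, r) for r in ranks}  # alpha = 2r
fixed_alpha = {r: lora_scale(16, r) for r in ranks}      # alpha frozen at 16

for r in ranks:
    print(f"r={r:>2}  alpha=r: {tied_alpha[r]:.2f}  "
          f"alpha=2r: {double_alpha[r]:.2f}  alpha=16: {fixed_alpha[r]:.4f}")
```

Only the fixed-alpha column changes with rank, which is why sweeping \(r\) with a frozen alpha also sweeps the effective scale.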
Rank-stabilised LoRA: when vanilla scaling breaks
The standard \(\alpha / r\) scaling has a known failure mode at high ranks. As \(r\) grows, the gradient flow through the adapter changes in a way that is not fully corrected by \(\alpha / r\), causing the effective update magnitude to shrink. Kalajdzievski (2023) demonstrated this formally and proposed rsLoRA, which changes the scaling to:
\[h = W_0 x + \frac{\alpha}{\sqrt{r}} \cdot B A x\]
The \(1/\sqrt{r}\) factor is consistent with how neural network initialisations scale with width in the mean-field / NTK literature. In practice:
- rsLoRA is a one-line toggle in HuggingFace PEFT: LoraConfig(..., use_rslora=True).
- With standard LoRA, raising rank from 8 to 64 often yields diminishing or negative returns on quality metrics unless alpha is carefully re-tuned. With rsLoRA, Kalajdzievski reports that higher ranks improve or at least maintain quality.
- The extra cost of a higher rank is paid in training compute only; once the adapter is merged into the base weights, inference cost is unchanged.
```python
from peft import LoraConfig, get_peft_model

config = LoraConfig(
    r=64,
    lora_alpha=16,
    use_rslora=True,  # scaling becomes alpha / sqrt(r) = 16/8 = 2.0
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
)
# base_model is assumed to be an already-loaded transformers model
model = get_peft_model(base_model, config)
```
With r=8 and standard LoRA, alpha=16 gives scale 2.0. With r=64 and rsLoRA, alpha=16 gives scale \(16/\sqrt{64} = 2.0\) as well. Because the \(1/\sqrt{r}\) factor keeps the adapter's update magnitude roughly stable as rank grows, alpha becomes a clean dial for overall adapter strength rather than a rank-dependent fudge factor.
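A quick way to see how the two rules diverge is to tabulate both multipliers for one alpha. This is a sketch only; `standard_scale` and `rslora_scale` are hypothetical helper names that evaluate the two formulas above, and nothing here calls PEFT.

```python
import math


def standard_scale(alpha: float, r: int) -> float:
    """Vanilla LoRA multiplier: alpha / r."""
    return alpha / r


def rslora_scale(alpha: float, r: int) -> float:
    """Rank-stabilised multiplier: alpha / sqrt(r)."""
    return alpha / math.sqrt(r)


alpha = 16
for r in (1, 4, 8, 16, 64, 256):
    print(f"r={r:>3}  alpha/r={standard_scale(alpha, r):>7.4f}  "
          f"alpha/sqrt(r)={rslora_scale(alpha, r):>7.4f}")
```

The two rules agree at r=1 and drift apart as rank grows: the standard multiplier falls linearly in \(1/r\), while the rsLoRA multiplier falls only as \(1/\sqrt{r}\).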
A practical search strategy
Grid search over both r and alpha is expensive. A more efficient approach:
- Fix alpha = r (or enable rsLoRA and fix alpha independently). This decouples rank from effective scale.
- Start at r = 8 and double: evaluate 8, 16, 32 on a held-out subset. The quality curve usually plateaus well before 128 for most fine-tuning tasks.
- Once a rank plateau is found, tune alpha by varying the ratio \(\alpha / r\) (or \(\alpha / \sqrt{r}\) with rsLoRA) from 0.5 to 4.0. This adjusts how aggressively the adapter steers the model without touching the capacity budget.
- Monitor the adapter norm during training: torch.linalg.norm(lora_B @ lora_A). If it grows unboundedly, the scale is too high; if it barely moves, scale or learning rate may be too low.
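The rank-doubling step can be sketched as a small loop. Everything named here is an assumption for illustration: `find_rank_plateau` and its `min_gain` threshold are hypothetical, `evaluate` stands for whatever routine trains an adapter at rank `r` (with alpha = r) and returns a higher-is-better held-out metric, and the scores in the usage example are made-up placeholders, not measured results.

```python
from typing import Callable, Dict, Iterable, Tuple


def find_rank_plateau(
    evaluate: Callable[[int], float],
    ranks: Iterable[int] = (8, 16, 32, 64, 128),
    min_gain: float = 0.005,
) -> Tuple[int, Dict[int, float]]:
    """Try ranks in increasing order and stop once the gain over the
    best rank so far drops below min_gain. Returns that best rank and
    every score that was actually evaluated."""
    scores: Dict[int, float] = {}
    best_rank = None
    for r in ranks:
        scores[r] = evaluate(r)
        if best_rank is None:
            best_rank = r
            continue
        if scores[r] - scores[best_rank] < min_gain:
            break  # plateau: the larger rank did not pay for itself
        best_rank = r
    return best_rank, scores


# Placeholder metric values standing in for real training runs.
fake_scores = {8: 0.70, 16: 0.74, 32: 0.742, 64: 0.743, 128: 0.743}
best_rank, tried = find_rank_plateau(lambda r: fake_scores[r])
print(best_rank, sorted(tried))
```

Stopping at the first sub-threshold gain keeps the number of training runs small; a noisy metric may justify evaluating one extra rank before concluding that the curve has flattened.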
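AdaLoRA (Zhang et al., 2023) removes manual rank search by allocating a global rank budget and pruning low-importance singular values during training. In the paper's experiments it matches or beats fixed-rank LoRA at comparable parameter budgets. It avoids an exact SVD at each step by parameterising the update directly in SVD form with an orthogonality regulariser, but the importance scoring and regularisation still add training overhead compared with plain LoRA.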
When it falls down
Rank too low for knowledge-intensive tasks. If a model has never seen a domain (legal French, genomic sequences, specialised mathematics notation), a rank-4 adapter cannot store the necessary new associations. The result is a model that mimics the style of the target domain but makes factual errors that full fine-tuning would avoid. The Biderman et al. finding, that full fine-tuning uses 10 to 100x the effective rank of typical LoRA configs, is the quantitative statement of this problem.
Alpha too high with aggressive learning rates. alpha/r = 4 combined with a learning rate above 3e-4 on a sensitive layer often causes the adapter norm to explode in the first few hundred steps. The fix is to lower alpha, lower the learning rate, or both. Watching the loss curve for a sharp spike in the first epoch is usually sufficient to catch this.
Copying hyperparameters across model scales. An r=8, alpha=16 configuration tuned on LLaMA-2 7B does not transfer cleanly to 70B. The weight matrices are larger (\(d\) increases), so the same rank covers a smaller fraction of the representational space. A rough heuristic: scale r proportionally to the square root of the increase in hidden dimension.
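As a worked instance of that heuristic, the sketch below scales a base rank by the square root of the hidden-size ratio. The helper `scale_rank` is hypothetical, and the hidden sizes (4096 for LLaMA-2 7B, 8192 for LLaMA-2 70B) are taken from the published model configurations rather than from this article. Treat the result as a starting point for a sweep, not a tuned value.

```python
import math


def scale_rank(base_rank: int, base_hidden: int, target_hidden: int) -> int:
    """r_target = r_base * sqrt(target_hidden / base_hidden), rounded,
    and never below 1."""
    return max(1, round(base_rank * math.sqrt(target_hidden / base_hidden)))


# LLaMA-2 7B (hidden size 4096) to LLaMA-2 70B (hidden size 8192)
suggested = scale_rank(base_rank=8, base_hidden=4096, target_hidden=8192)
print(suggested)  # 8 * sqrt(2) is about 11.3, which rounds to 11
```

In practice a value like 11 would usually be rounded up to a nearby convenient rank such as 12 or 16.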
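rsLoRA with very small ranks. The \(\alpha / \sqrt{r}\) multiplier falls more slowly with rank than \(\alpha / r\), so an alpha chosen for a high-rank run produces a large scale at low rank: at r=4, alpha=16, the rsLoRA scale is 8.0 versus 4.0 for standard scaling, and at r=1 both rules give 16.0, which is often too aggressive. rsLoRA is most useful at r >= 8; for very low ranks it offers little benefit, so stick to standard scaling and tune alpha directly.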
Over-interpreting singular value spectra. AdaLoRA and similar methods prune low singular values, implying the dropped dimensions carry no useful signal. This holds for most tasks, but for tasks with sharp distributional shift (e.g., a base model fine-tuned on adversarial prompts), many singular values may carry small but consistent signal, and aggressive pruning degrades robustness rather than just wasting capacity.
Further reading
- Hu et al., "LoRA: Low-Rank Adaptation of Large Language Models" (2021): https://arxiv.org/abs/2106.09685
- Aghajanyan et al., "Intrinsic Dimensionality Explains the Effectiveness of Language Model Fine-Tuning" (2021): https://arxiv.org/abs/2012.13255
- Kalajdzievski, "A Rank Stabilization Scaling Factor for Fine-Tuning with LoRA" (2023): https://arxiv.org/abs/2312.03732
- Biderman et al., "LoRA Learns Less and Forgets Less" (2024): https://arxiv.org/abs/2405.09673