LoRA Training Pitfalls

You spend three hours fine-tuning a 7B model with LoRA, eval loss curves look healthy, and then the model hallucinates on every domain-specific prompt you care about. The loss was fine. The checkpoint was wrong. This is not unusual. LoRA's elegance as a training method hides a handful of failure modes that are easy to miss precisely because they don't surface as obvious training crashes.

This concept unpacks the most consequential pitfalls: rank misconfiguration, the hidden learning-rate asymmetry between adapter matrices, wrong target modules, quantisation-induced initialisation errors, and catastrophic forgetting vs. under-adaptation. Knowing where each one hides lets you debug faster and design better fine-tuning runs.

How LoRA Works (the part that creates the pitfalls)

Recall the mechanics briefly, because each pitfall connects to a specific design choice. LoRA freezes the pre-trained weight matrix W_0 ∈ R^{d×k} and injects a low-rank bypass:

W = W_0 + BA,   where B ∈ R^{d×r},  A ∈ R^{r×k},  r << min(d, k)

A is initialised with random noise (Kaiming-uniform in Hugging Face PEFT; the original paper used a Gaussian); B is initialised to zero. This ensures the adapter contributes nothing at the start of training (ΔW = BA = 0), preserving the base model's behaviour at step zero. In practice the adapter's contribution is scaled by alpha / r before being added to the frozen weight, so the effective update is (alpha / r) · BA.
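The zero-initialisation argument above can be checked with a minimal, pure-Python sketch. The helper names (matvec, lora_forward, demo_zero_init) are hypothetical and not part of any library, the shapes are toy-sized, and plain uniform noise stands in for the Kaiming-uniform initialisation of A.

```python
import random


def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]


def lora_forward(W0, A, B, x, alpha, r):
    """Compute y = W0 x + (alpha / r) * B (A x) without forming BA explicitly."""
    base = matvec(W0, x)
    delta = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * d for b, d in zip(base, delta)]


def demo_zero_init(d=4, k=3, r=2, alpha=16, seed=0):
    """Return (base output, adapted output) for a freshly initialised adapter."""
    rng = random.Random(seed)
    W0 = [[rng.uniform(-1, 1) for _ in range(k)] for _ in range(d)]  # frozen weight
    # Assumption: uniform noise as a stand-in for Kaiming-uniform init of A.
    A = [[rng.uniform(-1, 1) for _ in range(k)] for _ in range(r)]
    B = [[0.0] * r for _ in range(d)]  # zero init, so BA = 0
    x = [rng.uniform(-1, 1) for _ in range(k)]
    return matvec(W0, x), lora_forward(W0, A, B, x, alpha, r)


if __name__ == "__main__":
    base, adapted = demo_zero_init()
    print("identical at step zero:", base == adapted)
```

Because B is all zeros, the adapter term vanishes regardless of A or the alpha / r scale, which is why a freshly initialised adapter reproduces the base model's outputs.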

Three variables control almost everything downstream: the rank r, the scaling ratio alpha / r, and which weight matrices receive adapters (target modules). Getting any one of them wrong produces a distinct class of problem.

Pitfall 1: Rank That Is Too Low (or Too High)

The most common mistake is treating rank as a dial that only affects parameter count, when it actually bounds the expressivity of the adaptation.

Hu et al. (2021) found that for GPT-3 adaptation, rank 4 to 8 was sufficient for many NLP tasks because the task-specific signal lies in a low-dimensional subspace. But "many NLP tasks" is not "your task." Code generation, long-form reasoning, and domain shift from a narrow corpus routinely need higher rank. Biderman et al. (2024) showed that full fine-tuning implicitly learns perturbations whose effective rank is 10 to 100 times larger than typical LoRA configurations, and that LoRA substantially underperforms full fine-tuning on these harder adaptation targets.

The trap: loss curves converge at any rank because the model just learns what it can within the subspace you gave it. There is no "rank too low" warning in your training logs.
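To make the expressivity bound concrete, here is a small sketch (pure Python, exact arithmetic) showing that the update BA can never exceed rank r, alongside the parameter count that rank is often mistaken for. The function names and the toy dimensions are illustrative assumptions, not taken from any library.

```python
from fractions import Fraction
import random


def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def matrix_rank(M):
    """Exact rank via Gaussian elimination over rationals."""
    rows = [[Fraction(v) for v in row] for row in M]
    rank = 0
    for col in range(len(rows[0])):
        # Find a row at or below the current pivot row with a non-zero entry.
        pivot = next((i for i in range(rank, len(rows)) if rows[i][col] != 0), None)
        if pivot is None:
            continue
        rows[rank], rows[pivot] = rows[pivot], rows[rank]
        for i in range(rank + 1, len(rows)):
            factor = rows[i][col] / rows[rank][col]
            rows[i] = [a - factor * b for a, b in zip(rows[i], rows[rank])]
        rank += 1
    return rank


def lora_update_rank(d, k, r, seed=0):
    """Rank of a random update BA with B in R^{d x r} and A in R^{r x k}."""
    rng = random.Random(seed)
    B = [[rng.randint(-5, 5) for _ in range(r)] for _ in range(d)]
    A = [[rng.randint(-5, 5) for _ in range(k)] for _ in range(r)]
    return matrix_rank(matmul(B, A))


def lora_param_count(d, k, r):
    """Trainable parameters in one adapter: B has d*r entries, A has r*k."""
    return r * (d + k)


if __name__ == "__main__":
    for r in (1, 2, 4):
        print(f"r={r}: rank(BA) = {lora_update_rank(8, 6, r)} (bound: {r})")
    print("params at d=k=4096, r=16:", lora_param_count(4096, 4096, 16))
```

However long training runs, the optimiser can only move the weights within this rank-r family of updates, which is why the loss still converges when the rank is too small for the task.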

Practical guidance: start at r = 16 for general instruction following, and r = 32 or higher for domain-heavy or code tasks. If your validation loss plateaus early compared to a full fine-tuning baseline, suspect rank before any other hyperparameter.

The opposite failure is also real. With the default scaling alpha / r and alpha held fixed, setting rank very high (r = 128, r = 256) shrinks the adapter's contribution, and with it the effective learning rate, in proportion to 1 / r. Kalajdzievski (2023) showed that this stunts learning with higher-rank adapters. The fix is rank-stabilised LoRA (use_rslora=True in Hugging Face PEFT's LoraConfig), which changes the scaling to alpha / sqrt(r), keeping the effective contribution stable as rank grows.
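The two scaling rules are easy to compare numerically. The sketch below uses a hypothetical helper, lora_scale, whose use_rslora argument only mirrors the flag named above; it is not the PEFT API, and alpha = 16 is an arbitrary choice for illustration.

```python
import math


def lora_scale(alpha, r, use_rslora=False):
    """Multiplier applied to BA: alpha / r by default, alpha / sqrt(r) for rsLoRA."""
    return alpha / math.sqrt(r) if use_rslora else alpha / r


if __name__ == "__main__":
    alpha = 16  # held fixed while rank varies (illustrative value)
    print(f"{'r':>4} {'alpha/r':>10} {'alpha/sqrt(r)':>14}")
    for r in (8, 16, 64, 128, 256):
        default = lora_scale(alpha, r)
        stabilised = lora_scale(alpha, r, use_rslora=True)
        print(f"{r:>4} {default:>10.4f} {stabilised:>14.4f}")
```

With alpha = 16, the default multiplier falls from 2.0 at r = 8 to 0.0625 at r = 256, a 32-fold drop, while the rsLoRA multiplier at r = 256 is 1.0.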
