QLoRA: 4-bit Base with LoRA
QLoRA reduces the GPU memory needed to fine-tune a 65-billion-parameter model from hundreds of gigabytes to a single 48 GB card by storing the frozen base model in 4-bit precision and training only small LoRA adapters at full 16-bit precision.
Fine-tuning a 65B-parameter LLaMA model in standard 16-bit precision requires more than 780 GB of GPU memory once weights, gradients, and optimiser states are all counted - well beyond any single card available in 2023. QLoRA (Dettmers et al., May 2023) brought that requirement down to under 48 GB. The trick is not a new training algorithm but a carefully engineered storage format: freeze the base model in 4-bit, dequantise on the fly whenever a layer is computed, and update only the small LoRA matrices sitting on top.
LoRA Recap: What QLoRA Builds On
LoRA (Hu et al., 2021) freezes all pre-trained weights and injects trainable low-rank matrices into the linear projections of each transformer layer. For a weight matrix \(W \in \mathbb{R}^{d \times k}\), the update is:
\[W' = W + \Delta W = W + BA\]
where \(B \in \mathbb{R}^{d \times r}\), \(A \in \mathbb{R}^{r \times k}\), and \(r \ll \min(d, k)\). Training \(B\) and \(A\) instead of \(W\) reduces the trainable parameters of that matrix by a factor of roughly \(dk / (r(d+k))\). Across a whole model the reduction is larger still when only some matrices receive adapters; the LoRA paper reports up to 10,000x fewer trainable parameters for GPT-3 175B.
The base model never moves: its weights are stored once and shared. This is the property QLoRA exploits - if you can store those frozen weights more compactly, you win memory without touching the training logic.
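As a rough illustration of the per-matrix reduction factor, the sketch below counts parameters for a single projection. The sizes (`d = k = 4096`, `r = 16`) and the helper name `lora_param_counts` are assumptions chosen for the example, not values from either paper.

```python
def lora_param_counts(d: int, k: int, r: int) -> tuple[int, int, float]:
    """Return (full, lora, reduction) parameter counts for one d x k matrix."""
    full = d * k        # parameters in the frozen matrix W
    lora = r * (d + k)  # parameters in B (d x r) plus A (r x k)
    return full, lora, full / lora


full, lora, reduction = lora_param_counts(d=4096, k=4096, r=16)
print(f"full={full:,}  lora={lora:,}  reduction={reduction:.0f}x")
# full=16,777,216  lora=131,072  reduction=128x
```

Halving `r` doubles the reduction factor, which is why the rank is the main knob for adapter size.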
The Three Innovations in QLoRA
QLoRA combines three largely independent ideas, and their memory savings add up.
1. NF4: A Better 4-bit Data Type
Standard INT4 assigns equal spacing between quantisation levels, which is wasteful for neural network weights. Empirically, pre-trained transformer weights follow a roughly zero-centred normal distribution. NF4 (Normal Float 4) allocates its 16 possible values at the quantiles of a standard normal distribution, so each bin covers an equal probability mass:
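\[q_i = \frac{1}{2}\left(\Phi^{-1}\!\left(\frac{i}{2^k + 1}\right) + \Phi^{-1}\!\left(\frac{i + 1}{2^k + 1}\right)\right)\]
where \(\Phi^{-1}\) is the normal quantile function and \(k = 4\); only interior quantiles are used, since \(\Phi^{-1}(0)\) and \(\Phi^{-1}(1)\) are infinite. The paper builds the 16 levels asymmetrically (\(2^{k-1}\) levels for the negative half and \(2^{k-1} + 1\) for the positive half, with the duplicate zero removed) so that zero is represented exactly, then rescales them to \([-1, 1]\). It describes the resulting data type as information-theoretically optimal for zero-centred, normally distributed weights. Each weight block is divided by its absolute maximum to bring it into \([-1, 1]\) before quantisation, so a single 32-bit scaling constant is stored per block.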
In the paper's experiments NF4 outperforms FP4 and INT4 on downstream benchmarks, plausibly because it wastes fewer code-points on the tails.
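The idea of quantile-spaced levels plus per-block absmax scaling can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the NF4 table shipped in bitsandbytes: it places levels at the midpoints of 16 equal-probability bins, so it has no exact zero, and all function names here are hypothetical.

```python
from statistics import NormalDist


def normal_quantile_levels(bits: int = 4) -> list[float]:
    """Simplified NF4-like levels: normal quantiles at bin midpoints, scaled to [-1, 1]."""
    n = 2 ** bits
    std_normal = NormalDist()  # mean 0, standard deviation 1
    # Midpoints (i + 0.5) / n stay strictly inside (0, 1), so inv_cdf is always finite.
    raw = [std_normal.inv_cdf((i + 0.5) / n) for i in range(n)]
    peak = max(abs(v) for v in raw)
    return [v / peak for v in raw]


def quantise_block(block: list[float], levels: list[float]) -> tuple[list[int], float]:
    """Absmax-scale one block into [-1, 1] and snap each weight to the nearest level."""
    absmax = max(abs(w) for w in block) or 1.0  # guard against an all-zero block
    codes = [
        min(range(len(levels)), key=lambda j: abs(w / absmax - levels[j]))
        for w in block
    ]
    return codes, absmax


def dequantise_block(codes: list[int], absmax: float, levels: list[float]) -> list[float]:
    """Map 4-bit codes back to approximate weights using the block's scaling constant."""
    return [levels[c] * absmax for c in codes]


levels = normal_quantile_levels()
block = [0.02, -0.31, 0.11, 0.47, -0.05, 0.00, 0.26, -0.18]
codes, absmax = quantise_block(block, levels)
restored = dequantise_block(codes, absmax, levels)
print(codes)
print([round(w, 3) for w in restored])
```

The levels crowd together near zero and spread out towards the ends, so small weights, which are the most common under a normal distribution, are reconstructed with the least error.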
2. Double Quantisation
The 32-bit scaling constants themselves cost memory. For a block size of 64, each constant covers 64 weights, adding \(32/64 = 0.5\) bits per parameter. Double quantisation quantises those constants a second time to 8-bit floats in blocks of 256, each with its own 32-bit constant, reducing the overhead to roughly \(8/64 + 32/(64 \times 256) \approx 0.127\) bits per parameter. That is a saving of about 0.37 bits per parameter, which across a 65B-parameter model amounts to roughly 3 GB.
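The overhead arithmetic can be checked in a few lines. The function names are hypothetical; the block sizes (64 and 256) and bit widths are the ones used in the calculation above.

```python
def constant_overhead_bits(block_size: int = 64, constant_bits: int = 32) -> float:
    """Bits per parameter spent on one scaling constant per block."""
    return constant_bits / block_size


def double_quant_overhead_bits(
    block_size: int = 64,
    second_block_size: int = 256,
    first_bits: int = 8,
    second_bits: int = 32,
) -> float:
    """Overhead after quantising the constants themselves in blocks of second_block_size."""
    return first_bits / block_size + second_bits / (block_size * second_block_size)


before = constant_overhead_bits()
after = double_quant_overhead_bits()
saving_bits = before - after
saving_gb = saving_bits * 65e9 / 8 / 1e9  # bits -> bytes -> decimal GB for 65B parameters
print(f"{before:.3f} -> {after:.3f} bits/param, saving {saving_bits:.3f} bits (~{saving_gb:.2f} GB at 65B)")
# 0.500 -> 0.127 bits/param, saving 0.373 bits (~3.03 GB at 65B)
```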
3. Paged Optimisers
Even with compressed weights, brief GPU-memory spikes occur, typically when gradient checkpointing processes a mini-batch with a long sequence length. QLoRA uses NVIDIA's unified memory to allocate the optimiser states (momentum and variance tensors for Adam) as paged memory: they are evicted to CPU RAM when the GPU runs out of memory, then paged back in when the optimiser update step needs them. This prevents out-of-memory crashes that would otherwise force a shorter effective batch size.
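Assuming training goes through the Hugging Face `Trainer` (an assumption; nothing above prescribes a particular training loop), a paged optimiser can be requested by name through `TrainingArguments`. The output path, batch size, accumulation steps, and learning rate below are illustrative placeholders, and the fragment needs a CUDA machine with bitsandbytes installed.

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="qlora-out",          # hypothetical output directory
    per_device_train_batch_size=1,   # placeholder value
    gradient_accumulation_steps=16,  # placeholder value
    learning_rate=2e-4,              # placeholder value
    bf16=True,                       # train the adapters in BF16
    optim="paged_adamw_32bit",       # paged AdamW backed by bitsandbytes
)
```

An 8-bit variant, `optim="paged_adamw_8bit"`, trades optimiser-state precision for further memory savings.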
How the Three Pieces Fit Together
| Component | Stored precision | Updated during training |
|---|---|---|
| Base model weights | NF4 (4-bit) | No |
| Dequantised weights (per block) | BF16 (16-bit, temporary) | No |
| LoRA matrices A, B | BF16 (16-bit) | Yes |
| Optimiser states | FP32, paged to CPU | Yes |
During the forward pass, each NF4 weight block is dequantised to BF16 on the fly, used for matrix multiplication, then discarded. In the backward pass, gradients are computed only for the LoRA matrices; the dequantised base weights take part in backpropagation but receive no gradient and are never updated.
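In the notation of the LoRA recap, and ignoring LoRA's scaling factor, the forward and backward pass for one layer can be written out as follows. The symbols \(\hat{W}\) (the temporary BF16 copy of a dequantised block), \(\mathcal{L}\) (the loss), and \(g = \partial \mathcal{L} / \partial y\) (the incoming gradient) are introduced here for illustration.

```latex
\begin{aligned}
\hat{W} &= \operatorname{dequant}(W_{\text{NF4}}) && \text{(temporary BF16 copy)} \\
y &= \hat{W}x + BAx \\
\frac{\partial \mathcal{L}}{\partial B} &= g\,(Ax)^{\top}, \qquad
\frac{\partial \mathcal{L}}{\partial A} = B^{\top} g\, x^{\top} \\
\frac{\partial \mathcal{L}}{\partial x} &= (\hat{W} + BA)^{\top} g
\end{aligned}
```

No gradient is formed for \(W_{\text{NF4}}\), but \(\hat{W}\) still appears in \(\partial \mathcal{L} / \partial x\), so an implementation has to dequantise again (or keep the copy) to pass gradients on to earlier layers.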
What the Numbers Actually Look Like
A concrete comparison for a 65B-parameter model:
| Method | Memory (weights) | Trainable params | GPU needed |
|---|---|---|---|
| Full fine-tune (BF16) | ~130 GB | 65B | 10+ x A100 80GB (>780 GB in total) |
| LoRA (BF16 base) | ~130 GB | ~300M | 4x A100 80GB |
| QLoRA (NF4 base) | ~33 GB | ~300M | 1x 48 GB GPU |
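The weight-memory column follows from simple arithmetic, sketched below. The helper `weight_memory_gb` is hypothetical, gigabytes are decimal, and the figures cover weights only: activations, adapter weights, gradients, and optimiser states are excluded.

```python
def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Memory for the weights alone, in decimal gigabytes."""
    return n_params * bits_per_param / 8 / 1e9


N_PARAMS = 65e9
bf16_gb = weight_memory_gb(N_PARAMS, 16)
# 4-bit codes plus ~0.127 bits/param of doubly quantised scaling constants
nf4_gb = weight_memory_gb(N_PARAMS, 4 + 0.127)
print(f"BF16: {bf16_gb:.1f} GB, NF4: {nf4_gb:.1f} GB")
# BF16: 130.0 GB, NF4: 33.5 GB
```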
The Guanaco 65B model, trained with QLoRA on instruction-following data, reached 99.3% of ChatGPT's performance on the Vicuna benchmark after roughly 24 hours of fine-tuning on a single GPU.
Practical Setup
The minimal code path with Hugging Face PEFT and bitsandbytes:
```python
import torch
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Store the frozen base weights as NF4 and run matrix multiplications in BF16.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NF4 data type
    bnb_4bit_use_double_quant=True,         # double quantisation
    bnb_4bit_compute_dtype=torch.bfloat16,  # dtype used after dequantisation
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-13b-hf",
    quantization_config=bnb_config,
    device_map="auto",
)

# Prepares the quantised model for training; enables gradient checkpointing by default.
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules="all-linear",  # QLoRA-style: target every linear layer
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

# Prints trainable vs. total parameter counts; the exact figures depend on
# r and on which modules are targeted, but the adapters are a small fraction of the base.
model.print_trainable_parameters()
```
Two configuration choices matter most:
- bnb_4bit_compute_dtype=torch.bfloat16: dequantisation happens in BF16. Using FP16 here risks overflow on large-magnitude activations.
- target_modules="all-linear": the original QLoRA paper applies LoRA to every linear projection, not just attention. This is important because skipping layers leaves quantisation error uncorrected in those layers.
When It Falls Down
Quantisation error compounds on unusual weight distributions. NF4's optimality relies on weights being normally distributed. After heavy pre-training, most transformer weights satisfy this loosely, but specialised layers (embedding tables, output projection, some MoE gates) sometimes do not. Applying NF4 uniformly can degrade these layers more than expected.
Merged-weight inference reverts to BF16 size. The 4-bit savings do not survive merging: if you fold the LoRA adapters back into the base weights, the result is a full BF16 model. Serving a QLoRA-trained model at compressed size requires either a separate inference quantisation step (GPTQ, AWQ) or keeping the bitsandbytes 4-bit base loaded with the adapters left unmerged.
Throughput is lower than BF16 LoRA. Dequantisation kernels add overhead to every forward and backward pass. On an A100 you typically see 20-30% slower training throughput compared to running the same model unquantised with LoRA. The tradeoff is memory against speed, not a free lunch.
CPU fallback for paged optimisers adds latency. When optimiser states page to CPU RAM and back, PCIe bandwidth becomes a bottleneck. On models with heavy Adam state usage and long sequences, this can cause irregular per-step timing.
Very small ranks under-adapt on hard tasks. QLoRA does not change how LoRA adapts the model; a rank that is too low for the task complexity will still underfit, 4-bit base or not. The memory savings allow you to train larger models, but they do not substitute for rank selection.
bitsandbytes is GPU-only. As of mid-2025, the NF4 quantisation kernels require CUDA. CPU-only or MPS-only environments (some macOS setups) cannot use QLoRA directly.
Further Reading
- Dettmers, T. et al. (2023). "QLoRA: Efficient Finetuning of Quantized LLMs." arXiv:2305.14314. https://arxiv.org/abs/2305.14314
- Hu, E. et al. (2021). "LoRA: Low-Rank Adaptation of Large Language Models." arXiv:2106.09685. https://arxiv.org/abs/2106.09685
- Hugging Face PEFT documentation on quantisation-based training. https://huggingface.co/docs/peft/developer_guides/quantization
- Hugging Face blog: "Making LLMs even more accessible with bitsandbytes, 4-bit quantization and QLoRA." https://huggingface.co/blog/4bit-transformers-bitsandbytes