QLoRA: 4-bit Base with LoRA

Fine-tuning a 65B-parameter LLaMA model in standard 16-bit precision requires more than 780 GB of GPU memory once weights, gradients and optimiser states are counted - well beyond any single card available in 2023. QLoRA (Dettmers et al., May 2023) brought that requirement down to a single 48 GB GPU. The trick is not a new training algorithm but a carefully engineered storage format: freeze the base model in 4-bit, dequantise on the fly for each forward and backward pass, and route gradients only into the small LoRA matrices sitting on top.

LoRA Recap: What QLoRA Builds On

LoRA (Hu et al., 2021) freezes all pre-trained weights and injects trainable low-rank matrices into the linear projections of each transformer layer. For a weight matrix $W \in \mathbb{R}^{d \times k}$, the update is:

\[W' = W + \Delta W = W + BA\]

where $B \in \mathbb{R}^{d \times r}$, $A \in \mathbb{R}^{r \times k}$, and $r \ll \min(d, k)$. Training $B$ and $A$ instead of $W$ reduces the trainable parameters of each adapted matrix by a factor of $dk / (r(d+k))$. Because only a subset of matrices is usually adapted, the whole-model reduction is larger still: the LoRA paper reports up to 10,000x for GPT-3 175B.
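As a rough check on these ratios, the sketch below evaluates the per-matrix factor and a whole-model ratio. The GPT-3 175B shape (hidden size 12288, 96 layers), the rank of 4, and the choice to adapt only the query and value projections are assumptions made for illustration; none of them come from the text above.

```python
def lora_params(d: int, k: int, r: int) -> int:
    """Trainable parameters in B (d x r) plus A (r x k)."""
    return r * (d + k)


def reduction_factor(d: int, k: int, r: int) -> float:
    """Full-matrix parameters divided by LoRA parameters: dk / (r(d + k))."""
    return (d * k) / lora_params(d, k, r)


# Assumed GPT-3 175B shape: square 12288 x 12288 attention projections, 96 layers.
d_model, n_layers, rank = 12288, 96, 4

per_matrix = reduction_factor(d_model, d_model, rank)

# Assumption: only the query and value projections of each layer get adapters.
adapted_matrices = 2 * n_layers
trainable = adapted_matrices * lora_params(d_model, d_model, rank)
whole_model = 175e9 / trainable

print(f"per-matrix reduction:  {per_matrix:,.0f}x")
print(f"LoRA trainable params: {trainable:,}")
print(f"whole-model reduction: {whole_model:,.0f}x")
```

Under these assumptions the per-matrix factor is in the low thousands, and the whole-model ratio approaches 10,000x only because most matrices receive no adapter at all.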

The base model never moves: its weights are stored once and shared. This is the property QLoRA exploits - if you can store those frozen weights more compactly, you win memory without touching the training logic.

The Three Innovations in QLoRA

QLoRA combines three independent ideas: the first two shrink the stored base model, and the third absorbs transient memory spikes during training.

1. NF4: A Better 4-bit Data Type

Standard INT4 spaces its quantisation levels evenly, which is wasteful for neural network weights. Empirically, pre-trained transformer weights follow a roughly zero-centred normal distribution. NF4 (4-bit NormalFloat) instead places its 16 values according to the quantiles of a standard normal distribution, so that each bin covers roughly equal probability mass. In idealised form, taking the median of each of 16 equal-mass bins:

\[q_i = \Phi^{-1}\!\left(\frac{i + \tfrac{1}{2}}{16}\right), \quad i = 0, 1, \ldots, 15\]

where $\Phi^{-1}$ is the normal quantile function, and the $q_i$ are then rescaled to span $[-1, 1]$. The actual NF4 code book departs slightly from this idealised form: its negative and positive halves are built separately so that zero is represented exactly. The paper describes the resulting data type as "information-theoretically optimal" for zero-centred, normally distributed weights. Each weight block is normalised to $[-1, 1]$ by its absolute maximum before quantisation, so a single 32-bit scaling constant is stored per block.

In the paper's experiments NF4 outperforms both FP4 and INT4 on downstream benchmarks, because its levels are concentrated where most weights lie and it wastes fewer code-points on the tails.
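The block-wise scheme can be sketched in a few lines of standard-library Python. This is an illustration under stated assumptions, not the bitsandbytes implementation: the code book below uses the median of 16 equal-mass bins, so its values differ from the real NF4 table, and the helper names (`quantile_codebook`, `quantise_block`, `dequantise_block`) are hypothetical.

```python
from statistics import NormalDist


def quantile_codebook(bits: int = 4) -> list[float]:
    """Equal-probability-mass levels for N(0, 1), rescaled into [-1, 1].

    Illustrative only: the real NF4 table is built asymmetrically so that
    0.0 is exactly representable, and its values differ from these.
    """
    n = 2 ** bits
    normal = NormalDist()
    # Median of each of the n equal-mass bins.
    levels = [normal.inv_cdf((i + 0.5) / n) for i in range(n)]
    scale = max(abs(v) for v in levels)
    return [v / scale for v in levels]


def quantise_block(block: list[float], codebook: list[float]) -> tuple[list[int], float]:
    """Absmax-normalise one block and map each weight to its nearest level."""
    absmax = max(abs(w) for w in block) or 1.0  # avoid dividing by zero
    codes = [
        min(range(len(codebook)), key=lambda j: abs(codebook[j] - w / absmax))
        for w in block
    ]
    return codes, absmax


def dequantise_block(codes: list[int], absmax: float, codebook: list[float]) -> list[float]:
    """Look up each 4-bit code and undo the absmax scaling."""
    return [codebook[c] * absmax for c in codes]


codebook = quantile_codebook()
weights = [0.02, -0.15, 0.07, 0.31, -0.04, 0.0, -0.22, 0.11]
codes, absmax = quantise_block(weights, codebook)
restored = dequantise_block(codes, absmax, codebook)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(codes, absmax, round(max_error, 4))
```

Note that this simplified code book has no level at exactly zero, so a weight of 0.0 comes back slightly non-zero. That is one reason the real NF4 construction treats zero specially.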

2. Double Quantisation

The 32-bit scaling constants themselves cost memory. For a block size of 64, each constant covers 64 weights, adding $32/64 = 0.5$ bits per parameter. Double quantisation quantises those constants a second time to 8-bit floats in blocks of 256, with one 32-bit constant per second-level block, reducing the overhead to $8/64 + 32/(64 \times 256) \approx 0.127$ bits per parameter. That is a saving of about 0.373 bits per parameter, which across a 65B-parameter model amounts to roughly 3 GB.
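The overhead arithmetic is easy to reproduce. The snippet below is a sketch that assumes a first-level block of 64 weights, a second-level block of 256 constants, and a 65B parameter count; the variable names are illustrative.

```python
BLOCK_SIZE = 64            # weights per first-level quantisation block
SECOND_LEVEL_BLOCK = 256   # first-level constants per second-level block
N_PARAMS = 65e9            # assumed model size

# One FP32 constant per block of 64 weights.
single_overhead = 32 / BLOCK_SIZE

# 8-bit constants, plus one FP32 constant per 256 of them.
double_overhead = 8 / BLOCK_SIZE + 32 / (BLOCK_SIZE * SECOND_LEVEL_BLOCK)

saving_bits = single_overhead - double_overhead
saving_gb = N_PARAMS * saving_bits / 8 / 1e9  # bits -> bytes -> GB

print(f"{single_overhead:.3f} -> {double_overhead:.3f} bits/param")
print(f"saving: {saving_bits:.3f} bits/param, about {saving_gb:.2f} GB")
```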

3. Paged Optimisers

Even with compressed weights, brief GPU-memory spikes occur when gradient checkpointing processes a mini-batch with long sequences. QLoRA uses NVIDIA's unified memory to page optimiser states (the momentum and variance tensors for Adam) out to CPU RAM when the GPU runs out of memory, then pages them back in when the optimiser update needs them. This prevents out-of-memory crashes that would otherwise force a smaller batch size or shorter sequences.

How the Three Pieces Fit Together

Component	Stored precision	Updated during training
Base model weights	NF4 (4-bit)	No
Dequantised weights	BF16 (16-bit, temporary)	N/A
LoRA matrices A, B	BF16 (16-bit)	Yes
Optimiser states	FP32, paged to CPU	Yes

During the forward pass, each NF4 weight block is dequantised to BF16 on the fly, used for matrix multiplication, then discarded. Gradients flow back through this dequantisation step into the LoRA matrices only; the base weights receive no gradient and are never updated.
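A toy version of that data flow, in plain Python, may make the roles clearer. Everything here is hypothetical and chosen for readability: the four-level code book stands in for NF4, the shapes are tiny, and the "optimiser step" is simply a hand-edited `lora_b`. It is a sketch of the idea, not of the actual kernels.

```python
# Toy 2-bit code book standing in for NF4 (hypothetical values, illustration only).
CODEBOOK = [-1.0, -0.3, 0.3, 1.0]


def dequantise(codes, absmax):
    """Rebuild a temporary full-precision weight matrix from codes and one scale."""
    return [[CODEBOOK[c] * absmax for c in row] for row in codes]


def matvec(matrix, vector):
    """Plain matrix-vector product on nested lists."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def qlora_forward(x, codes, absmax, lora_a, lora_b, scaling):
    """y = dequant(W) x + scaling * B (A x); the dequantised W is not kept."""
    base = matvec(dequantise(codes, absmax), x)
    update = matvec(lora_b, matvec(lora_a, x))
    return [b + scaling * u for b, u in zip(base, update)]


codes = [[0, 3, 1], [2, 2, 0]]   # frozen 2 x 3 base weight, stored as codes
absmax = 0.5                      # scaling constant for this block
lora_a = [[0.1, -0.2, 0.3]]       # A: r x k with r = 1
lora_b = [[0.0], [0.0]]           # B: d x r, zero-initialised
x = [1.0, 2.0, 3.0]

y_before = qlora_forward(x, codes, absmax, lora_a, lora_b, scaling=2.0)
lora_b = [[0.5], [-0.5]]          # pretend one optimiser step changed B
y_after = qlora_forward(x, codes, absmax, lora_a, lora_b, scaling=2.0)
print(y_before, y_after)
```

With `lora_b` at zero the output equals the base model's output; after the update the output changes while `codes` and `absmax` are untouched, which is the sense in which the base model stays frozen.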

What the Numbers Actually Look Like

A concrete comparison for a 65B-parameter model:

Method	Memory (weights)	Trainable params	GPU needed
Full fine-tune (BF16)	~130 GB	65B	more than 8x A100 80GB
LoRA (BF16 base)	~130 GB	~300M	4x A100 80GB
QLoRA (NF4 base)	~33 GB	~300M	1x A100 80GB
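The weight-memory column follows from bits per parameter alone. The sketch below assumes 65B parameters, a block size of 64 with double quantisation, and decimal gigabytes. It counts stored weights only, not activations, gradients or optimiser states.

```python
PARAMS = 65e9  # assumed parameter count


def weights_gb(bits_per_param: float, n_params: float = PARAMS) -> float:
    """Storage for the weights alone, in decimal gigabytes."""
    return n_params * bits_per_param / 8 / 1e9


bf16_gb = weights_gb(16)

# 4-bit codes plus the double-quantised scaling-constant overhead (block size 64).
nf4_bits = 4 + 8 / 64 + 32 / (64 * 256)
nf4_gb = weights_gb(nf4_bits)

print(f"BF16 weights: {bf16_gb:.0f} GB")
print(f"NF4 weights:  {nf4_gb:.1f} GB")
```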

The Guanaco models were trained with QLoRA on instruction-following data. The largest, Guanaco 65B, reached 99.3% of ChatGPT's performance level on the Vicuna benchmark after roughly 24 hours of fine-tuning on a single GPU.

Practical Setup

The minimal code path with Hugging Face PEFT and bitsandbytes:

from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import prepare_model_for_kbit_training, get_peft_model, LoraConfig
import torch

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",          # NF4 data type
    bnb_4bit_use_double_quant=True,      # double quantisation
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-13b-hf",
    quantization_config=bnb_config,
    device_map="auto",
)

# Freezes the base weights, upcasts norm layers to FP32, and enables gradient checkpointing.
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules="all-linear",   # QLoRA-style: target every linear layer
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Prints trainable vs. total parameter counts. With r=16 on every linear projection,
# expect roughly 63M trainable parameters, well under 1% of the model.

Two configuration choices matter most:

bnb_4bit_compute_dtype=torch.bfloat16: weights are dequantised to BF16 and the matrix multiplications run in BF16. Using FP16 here risks overflow on large-magnitude activations.
target_modules="all-linear": the original QLoRA paper applies LoRA to every linear projection in the transformer blocks, not just attention, and reports that this is what it takes to match full 16-bit fine-tuning. Adapters restricted to a few attention matrices fall short.
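The trainable-parameter count for a given choice of target modules can be estimated by hand. The sketch below assumes the Llama-2-13B shape (hidden size 5120, MLP size 13824, 40 layers), a rank of 16, and that the output head receives no adapter; these are assumptions for illustration, so treat the totals as estimates rather than library output.

```python
# Assumed Llama-2-13B shape (not stated in the text above).
HIDDEN, INTERMEDIATE, LAYERS = 5120, 13824, 40
RANK = 16

# (in_features, out_features) for the seven linear projections in each block.
projections = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, HIDDEN),
    "v_proj": (HIDDEN, HIDDEN),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(in_features: int, out_features: int, rank: int) -> int:
    """A is rank x in_features, B is out_features x rank."""
    return rank * (in_features + out_features)


per_layer = sum(lora_params(i, o, RANK) for i, o in projections.values())
all_linear = per_layer * LAYERS
attention_only = LAYERS * sum(
    lora_params(*projections[name], RANK) for name in ("q_proj", "v_proj")
)

print(f"all linear layers: {all_linear:,}")
print(f"q_proj and v_proj: {attention_only:,}")
```

Targeting every linear layer costs several times more adapter parameters than the query-and-value-only setup, yet the total remains a small fraction of the 13B base model.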

When It Falls Down

Quantisation error compounds on unusual weight distributions. NF4's optimality relies on weights being normally distributed. After heavy pre-training, most transformer weights satisfy this loosely, but specialised layers (embedding tables, output projection, some MoE gates) sometimes do not. Applying NF4 uniformly can degrade these layers more than expected.

Merged-weight inference reverts to BF16 size. The 4-bit savings do not survive a merge: if you merge the LoRA adapters back into the base weights, the result is a full BF16 model. Serving a QLoRA-trained model at compressed size requires either keeping the adapters separate on top of the bitsandbytes 4-bit base or a separate inference quantisation step (GPTQ or AWQ).

Throughput is lower than BF16 LoRA. Dequantisation kernels add overhead to every forward and backward pass. On an A100 you typically see 20-30% slower training throughput compared to running the same model unquantised with LoRA. The tradeoff is memory vs. speed, not a free lunch.

CPU fallback for paged optimisers adds latency. When optimiser states page to CPU RAM and back, PCIe bandwidth becomes a bottleneck. On models with heavy Adam state usage and long sequences, this can cause irregular per-step timing.

Very small ranks under-adapt on hard tasks. QLoRA does not change how LoRA adapts the model; a rank that is too low for the task complexity will still underfit, 4-bit base or not. The memory savings allow you to train larger models, but they do not substitute for rank selection.

bitsandbytes is GPU-only. As of mid-2025, the NF4 quantisation kernels require CUDA. CPU-only or MPS-only environments (some macOS setups) cannot use QLoRA directly.