IA3 and Scaling-Vector Methods

LoRA already cuts trainable parameters to a fraction of a percent. (IA)^3 cuts them by another order of magnitude, to roughly 0.01% of T0's weights. Used as part of the T-Few recipe, it still beats few-shot in-context learning, and on the RAFT benchmark it outperformed the prior state of the art by 6 percentage points absolute. That gap demands an explanation.

The core idea: scale activations, not weight matrices

LoRA injects a low-rank matrix pair into each targeted weight: a left factor of shape [d, r] and a right factor [r, d], so even at rank 1 you carry two vectors totalling 2d parameters per targeted matrix. (IA)^3 goes further - it learns a single vector per targeted site and uses it purely as an element-wise multiplier on the existing activations.

Concretely, let h be an intermediate activation in a transformer block. (IA)^3 replaces h with:

h' = l ⊙ h

where l is a learned vector of the same dimension as h, initialised to all-ones (so the model starts as a faithful copy of the pre-trained base), and ⊙ is element-wise multiplication. During training only l receives gradients; the underlying weight matrices remain frozen.

Three sites per transformer block are targeted, following the original paper:

Site	What is scaled	Why
Key projections in self-attention	Output activations	Modulates which tokens attend to which
Value projections in self-attention	Output activations	Controls what information gets aggregated
Second FFN layer (down-projection)	Input activations	Gates the bottleneck expansion before projection back

The distinction between "output" and "input" scaling matters in practice. For the feedforward case, the vector multiplies the inputs before they hit the weight matrix, not the outputs afterwards. The HuggingFace PEFT library exposes this via the feedforward_modules argument in IA3Config.
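As a rough illustration of the three sites, here is a minimal NumPy sketch of a toy single-head block. It is not the paper's or PEFT's implementation: residual connections, layer norm and multiple heads are left out, and every name (ia3_block, l_k, l_v, l_ff, the toy sizes) is an assumption made for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def frozen_block(x, W_q, W_k, W_v, W_1, W_2):
    """Toy single-head attention + FFN with no adaptation at all."""
    q, k, v = x @ W_q, x @ W_k, x @ W_v
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v
    return np.maximum(attn @ W_1, 0.0) @ W_2

def ia3_block(x, W_q, W_k, W_v, W_1, W_2, l_k, l_v, l_ff):
    """Same toy block with the three (IA)^3 scaling vectors inserted."""
    q = x @ W_q
    k = (x @ W_k) * l_k                    # scale the OUTPUT of the key projection
    v = (x @ W_v) * l_v                    # scale the OUTPUT of the value projection
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v
    hidden = np.maximum(attn @ W_1, 0.0)   # first FFN layer + ReLU
    return (hidden * l_ff) @ W_2           # scale the INPUT of the second FFN layer

rng = np.random.default_rng(0)
seq, d, d_ff = 5, 8, 32                    # toy sizes, chosen arbitrarily
x = rng.normal(size=(seq, d))
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))
W_1 = rng.normal(size=(d, d_ff))
W_2 = rng.normal(size=(d_ff, d))

# All-ones vectors reproduce the frozen block exactly.
ones = dict(l_k=np.ones(d), l_v=np.ones(d), l_ff=np.ones(d_ff))
frozen_out = frozen_block(x, W_q, W_k, W_v, W_1, W_2)
base_out = ia3_block(x, W_q, W_k, W_v, W_1, W_2, **ones)

# Moving only l_k away from ones already changes the output.
tuned = dict(ones, l_k=rng.uniform(0.5, 1.5, size=d))
tuned_out = ia3_block(x, W_q, W_k, W_v, W_1, W_2, **tuned)
```

In a real training loop only l_k, l_v and l_ff would receive gradients; the W matrices stay fixed.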

Why element-wise multiplication works

Scaling activations by a learned vector is equivalent to rescaling the rows (or columns) of the frozen weight matrix from the outside. Suppose the key projection computes K = X W_K. Inserting l_k ⊙ K is the same as computing X (W_K diag(l_k)), which means the effective weight is a column-rescaled version of W_K. The model can freely suppress irrelevant heads or amplify task-relevant directions without touching the stored weights - a clean factorisation of "what the model knows" from "what the task needs".
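The equivalence is easy to check numerically. The sketch below uses the row-vector convention from the text (K = X W_K) with made-up sizes and random matrices. It also covers the feedforward case, where scaling the inputs corresponds to rescaling rows instead of columns. Frameworks that store weights transposed (PyTorch's nn.Linear, for instance) swap the two.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, d_ff = 4, 6, 24   # toy sizes, chosen arbitrarily

# Output scaling (keys): l_k ⊙ (X W_K) == X (W_K diag(l_k)), a column rescaling.
X = rng.normal(size=(n, d))
W_K = rng.normal(size=(d, d))
l_k = rng.uniform(0.5, 1.5, size=d)

scaled_keys = (X @ W_K) * l_k
W_K_folded = W_K * l_k                      # broadcasting scales column j by l_k[j]
columns_match = np.allclose(W_K_folded, W_K @ np.diag(l_k))
keys_match = np.allclose(scaled_keys, X @ W_K_folded)

# Input scaling (FFN): (l_ff ⊙ H) W_2 == H (diag(l_ff) W_2), a row rescaling.
H = rng.normal(size=(n, d_ff))
W_2 = rng.normal(size=(d_ff, d))
l_ff = rng.uniform(0.5, 1.5, size=d_ff)

scaled_ffn = (H * l_ff) @ W_2
W_2_folded = l_ff[:, None] * W_2            # scales row i by l_ff[i]
rows_match = np.allclose(W_2_folded, np.diag(l_ff) @ W_2)
ffn_match = np.allclose(scaled_ffn, H @ W_2_folded)

print(columns_match, keys_match, rows_match, ffn_match)
```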

This is why the initialisation to ones matters: the model's pretrained representations are left intact at the start of training, and the vectors drift only as far as the task gradient requires.

Parameter count in practice

For a model like T0 (roughly 11 billion parameters), (IA)^3 introduces approximately 0.01% additional trainable parameters. LoRA at a moderate rank sits around 0.1% or above. Full fine-tuning is 100%.

A rough estimate for one transformer block with hidden size d and intermediate FFN size d_ff:

parameters per block (IA3) = d_k + d_v + d_ff

Compare to LoRA at rank r applied to the four attention projections (Q, K, V, O):

parameters per block (LoRA, rank r) = 4 * (d * r + r * d) = 8 * d * r

At r = 8 and d = 1024, LoRA adds 65,536 parameters per block. With d_k = d_v = 1,024 and d_ff = 4,096, (IA)^3 adds 1,024 + 1,024 + 4,096 = 6,144 - roughly 10x fewer.
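The two formulas above can be turned into a small calculator. The dimensions used here are assumptions for the sake of the arithmetic: d_k = d_v = d and d_ff = 4d, a common but not universal choice, so other architectures will give other totals. The function names are made up for this example.

```python
def ia3_params_per_block(d_k: int, d_v: int, d_ff: int) -> int:
    """One scaling vector each for keys, values and the second FFN layer's input."""
    return d_k + d_v + d_ff

def lora_params_per_block(d: int, r: int, n_matrices: int = 4) -> int:
    """A [d, r] and an [r, d] factor for each adapted d x d projection."""
    return n_matrices * (d * r + r * d)

d, r = 1024, 8
d_ff = 4 * d  # assumption: FFN width is four times the hidden size

lora = lora_params_per_block(d, r)        # 65,536
ia3 = ia3_params_per_block(d, d, d_ff)    # 6,144
ratio = lora / ia3

print(f"LoRA: {lora:,}  (IA)^3: {ia3:,}  ratio: {ratio:.1f}x")
```

Changing r or n_matrices shows how quickly the LoRA count grows, while the (IA)^3 count depends only on the model's widths.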

Serving multiple tasks without weight copies

Because the scaling vectors are tiny compared to the base model, multi-task serving becomes practical. Each task gets its own set of vectors (stored separately, on the order of kilobytes), and swapping tasks at inference time means replacing three learned vectors per layer rather than loading a separate model copy. When latency matters more than memory, the vectors can be merged back into the frozen weights before serving - folding diag(l_k) into W_K - so inference carries zero overhead.

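from peft import IA3Config, TaskType, get_peft_model
from transformers import AutoModelForSeq2SeqLM

# T5 names its attention projections "k" and "v" and its FFN down-projection "wo".
config = IA3Config(
    task_type=TaskType.SEQ_2_SEQ_LM,
    target_modules=["k", "v", "wo"],
    feedforward_modules=["wo"],  # subset of target_modules, scaled on the input side
)

# Load the frozen base model and wrap it so that only the (IA)^3 vectors train.
base = AutoModelForSeq2SeqLM.from_pretrained("google-t5/t5-base")
model = get_peft_model(base, config)
model.print_trainable_parameters()
# Prints trainable vs. total parameter counts; the trainable share should be
# a small fraction of a percent of the roughly 220M total.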

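Calling merge_and_unload() on a PeftModel folds the vectors into the base weights and returns a standard transformers model with no PEFT overhead at inference. merge_adapter() performs the same fold in place but keeps the PEFT wrapper, so the merge can later be undone with unmerge_adapter().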
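The trade-off between keeping the vectors separate and merging them can be sketched in plain NumPy. This is an illustration under toy assumptions, not how PEFT stores adapters. The task names, the task_vectors dict and the helper merge are all hypothetical, and only the key-projection vector is shown. Kept separate, each example in a batch can be scaled by its own task's vector. Merged, one task is baked into a copy of the weights and the extra multiply disappears.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
W_K = rng.normal(size=(d, d))        # frozen key projection, shared by all tasks
x = rng.normal(size=(3, d))          # a batch of three examples, one row each

# Hypothetical per-task store: one key-scaling vector per task.
task_vectors = {
    "sentiment": rng.uniform(0.5, 1.5, size=d),
    "topic": rng.uniform(0.5, 1.5, size=d),
}

# Unmerged: a mixed-task batch, each row scaled by its own task's vector.
task_ids = ["sentiment", "topic", "sentiment"]
per_example = np.stack([task_vectors[t] for t in task_ids])   # shape (batch, d)
K_mixed = (x @ W_K) * per_example

def merge(task: str) -> np.ndarray:
    """Fold one task's vector into a copy of W_K (column rescaling)."""
    return W_K * task_vectors[task]

# Merged: no extra multiply at inference, but the weights now serve one task only.
W_sentiment = merge("sentiment")
K_merged_row0 = x[0] @ W_sentiment

# Storage per task versus storage for one weight-matrix copy.
vector_bytes = task_vectors["sentiment"].nbytes
matrix_bytes = W_K.nbytes
print(vector_bytes, matrix_bytes)
```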

When it falls down

Task complexity. Scaling vectors are a diagonal rescaling per site: they can amplify or suppress features the frozen model already computes, but they cannot mix dimensions or create new directions. For tasks that require genuinely new compositional behaviours - complex multi-step reasoning, code synthesis, instruction following in an entirely new style - the representational capacity of a single scaling vector per activation stream is insufficient. LoRA or full fine-tuning will outperform (IA)^3 in those regimes.

Small base models. The method relies on the pre-trained representations already encoding task-relevant structure; the scaling vectors only need to route and amplify. On smaller models (under roughly 1 billion parameters) this assumption breaks down more often, and the parameter budget saved rarely compensates for the expressivity lost.

Sensitivity to which layers are targeted. Because the method touches only three sites per block, wrong choices (e.g. targeting the output projection instead of keys/values, or skipping the FFN site) degrade performance noticeably. The defaults from the paper work well for encoder-decoder models; for decoder-only models like LLaMA, target_modules and feedforward_modules must be set to that architecture's own module names and may need tuning.

Slow start from the identity initialisation. Initialising to ones means the adapted model starts out computing exactly what the frozen model computes, and the first updates can only rescale features that are already there. For tasks that require moving far from the pretrained distribution, this can result in slow initial convergence relative to LoRA, whose update also starts at zero but can grow in more directions.

Not a drop-in replacement for instruction fine-tuning at scale. Production instruction-tuning runs (e.g. for RLHF) have generally favoured LoRA over (IA)^3 because the additional rank provides headroom when the reward model keeps pushing the policy. Scaling vectors tend to saturate earlier.
