Which Layers to Adapt

GPT-3 has 96 transformer layers. Each layer holds six large weight matrices: four attention projections (query, key, value, output) and two feed-forward projections (up and down). Applying LoRA to every one of them at the same rank yields about 4.5 times as many trainable parameters as adapting only the query and value projections, because the feed-forward matrices are four times wider on one side. Whether that extra cost pays off depends on the task, and the answer from careful ablation studies is often "no."
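To make the cost comparison concrete, the sketch below counts LoRA parameters for a GPT-3-sized stack. It assumes a hidden size of 12288, a plain two-matrix feed-forward block with width 4x the hidden size, and the same rank on every matrix; the helper names are illustrative, not taken from any library.

```python
# Sketch: LoRA trainable-parameter counts for a GPT-3-sized model.
# Assumptions: d_model=12288, d_ff=4*d_model, 96 layers, one rank everywhere.

D_MODEL = 12288
D_FF = 4 * D_MODEL
N_LAYERS = 96

# (input dim, output dim) of each adaptable matrix in one layer
SHAPES = {
    "q": (D_MODEL, D_MODEL),
    "k": (D_MODEL, D_MODEL),
    "v": (D_MODEL, D_MODEL),
    "o": (D_MODEL, D_MODEL),
    "mlp_up": (D_MODEL, D_FF),
    "mlp_down": (D_FF, D_MODEL),
}


def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """A rank-r LoRA pair holds r*d_in + d_out*r weights."""
    return rank * (d_in + d_out)


def total_lora_params(targets, rank: int) -> int:
    """Sum the LoRA parameters over the chosen matrices and all layers."""
    return N_LAYERS * sum(lora_params(*SHAPES[t], rank) for t in targets)


qv_only = total_lora_params(["q", "v"], rank=4)
everything = total_lora_params(list(SHAPES), rank=4)
print(qv_only, everything, everything / qv_only)
```

Under these assumptions the ratio does not depend on the rank, since every term scales linearly with it.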

This concept walks through the structural logic of where to adapt, what the empirical evidence actually says, and why seemingly obvious choices (adapt everything, or just the biggest matrices) frequently underperform more targeted strategies.

The transformer as a map of specialised subspaces

Before choosing which layers to touch, it helps to have a mental model of what each component does.

Attention projections route information between token positions. The query (Wq) and key (Wk) projections determine which tokens attend to which; the value (Wv) projection governs what information flows through once attention weights are fixed; the output projection (Wo) re-mixes the attended values before the residual add.
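The division of labour between the four projections is easier to see in code. The following is a minimal single-head sketch in NumPy, assuming row-vector inputs, no causal mask, and random weights; it is an illustration of the roles described above, not a faithful GPT-3 layer.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def single_head_attention(x, Wq, Wk, Wv, Wo):
    """Return the block output and the attention weights."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    # Wq and Wk alone decide which tokens attend to which
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)
    # Wv decides what content flows along those attention weights
    attended = weights @ v
    # Wo re-mixes the attended values before the residual add
    return x + attended @ Wo, weights


rng = np.random.default_rng(0)
seq_len, d = 5, 8
x = rng.normal(size=(seq_len, d))
Wq, Wk, Wv, Wo = (rng.normal(size=(d, d)) * 0.1 for _ in range(4))

out, weights = single_head_attention(x, Wq, Wk, Wv, Wo)
# Changing Wv leaves the routing pattern untouched
_, weights_scaled_v = single_head_attention(x, Wq, Wk, 2.0 * Wv, Wo)
```

The last two lines show the separation: scaling Wv changes what is carried, but the attention weights are identical.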

Feed-forward (MLP) blocks are often described as "memory" for factual associations. The up-projection expands the hidden dimension (typically by 4x), the non-linearity fires or suppresses individual "neurons," and the down-projection contracts back. Memorisation-style fine-tuning tasks (injecting new facts) tend to require changes here more than attention changes do.

Layer normalisation and bias terms are comparatively small and are usually left frozen; modifying them rarely provides proportionate benefit.

This decomposition matters because task requirements map unevenly onto these components. A domain-adaptation task that mostly needs the model to change its vocabulary register differs from a task that demands new factual recall, which differs again from a task requiring structural reformatting of outputs.

What the LoRA ablations show

Hu et al. (2021) ran systematic ablations on GPT-3 175B fine-tuned on the WikiSQL and MultiNLI benchmarks, restricting their sweep to attention matrices only (MLP adaptation was explicitly left for future work). Their Table 5 finding is worth reading carefully:

Adapting both Wq and Wv gives the best performance overall. Adapting all four attention matrices (Wq, Wk, Wv, Wo) does not consistently improve on Wq+Wv.

The more striking result is the rank-versus-breadth trade-off. Concentrating the whole budget in a single matrix (say, Wq alone at rank 8) underperformed spreading it across Wq and Wv at rank 4 each. The same number of trainable parameters, distributed more broadly, captured the necessary adaptation better. The paper's explanation is that even a rank-4 update captures enough of the required subspace; the bottleneck is coverage (how many matrices are adapted), not depth (the rank of each update).

Target matrices                        Trainable params (relative)   WikiSQL accuracy
-------------------------------------------------------------------------------------
Wq only (r=8)                          1x                            lower
Wq + Wv (r=4 each)                     1x                            higher
Wq + Wk + Wv + Wo (r=2 each)           1x                            roughly equivalent to Wq+Wv
All four + MLP (r=4 each, estimated)   ~4.5x                         not tested in paper

The practical implication: when you have a fixed parameter budget, prefer horizontal spread (more matrices, lower rank) over vertical concentration (one matrix, higher rank).
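The equal-budget comparison comes down to simple arithmetic. The sketch below assumes GPT-3 175B dimensions (hidden size 12288, 96 layers) and a budget of about 18.9M parameters, chosen here so that one square matrix per layer gets rank 8; the function name is illustrative.

```python
# Sketch: the rank each matrix gets when a fixed LoRA budget is spread
# over more attention matrices. Dimensions assume GPT-3 175B.

D_MODEL = 12288
N_LAYERS = 96
BUDGET = 2 * D_MODEL * 8 * N_LAYERS  # 18,874,368: one matrix per layer at r=8


def rank_for_budget(n_matrices: int, budget: int = BUDGET) -> int:
    """Largest uniform rank that fits n square matrices per layer in the budget."""
    # One unit of rank on one d x d matrix costs 2 * d parameters
    cost_per_rank = N_LAYERS * n_matrices * 2 * D_MODEL
    return budget // cost_per_rank


for n in (1, 2, 4):
    print(f"{n} matrices per layer -> rank {rank_for_budget(n)}")
```

This reproduces the r=8, r=4, r=2 progression in the table: each doubling of breadth halves the rank available per matrix.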

Attention versus MLP: the task-dependence question

The LoRA paper's decision to freeze MLP layers was pragmatic, not principled. Follow-on work has clarified when each component matters more.

For instruction-following and style adaptation tasks (summarisation, tone, format), attention weight updates are usually sufficient. These tasks alter how the model attends and re-weights existing knowledge; they do not require storing new facts.

For knowledge-injection tasks (domain-specific QA, medical or legal concepts absent from pre-training), the MLP blocks become important. The feed-forward sublayers are thought to act as key-value stores encoding factual associations; the knowledge-editing literature (e.g., ROME, Meng et al., 2022) specifically targets MLP layers for this reason.

A rough heuristic (not a hard rule):

Task needs  -> Target these
------------------------------------
Style / tone       -> Wq, Wv (attention)
New factual recall -> MLP down-projection
Both               -> Wq, Wv + MLP (higher budget)
Minimal compute    -> Wq + Wv only, low rank

The LLM-Adapters study (Hu et al., 2023, arXiv:2304.01933), which evaluates on arithmetic and commonsense reasoning benchmarks, offers partial support: in its placement ablations, series and parallel adapters worked best when attached to the MLP sublayer, while LoRA worked best when applied to both the attention and MLP sublayers.

Layer depth: early, middle, or late?

Transformer layers are not all equivalent. There is a well-known gradient from syntactic processing in early layers to semantic and task-specific processing in later layers. Several studies have probed whether layer depth affects adaptation efficiency:

Early layers encode low-level linguistic structure. Adapting them can be useful when the target domain has a significantly different surface form (e.g., code, scientific notation, non-standard tokenisation). For most NLP tasks, early layers are stable across domains and do not need updating.

Middle layers handle abstract semantic composition. Most task-specific information lives here. When resources are constrained to adapting only a subset of layers, middle layers (roughly 30-70% of total depth) tend to give the best return.

Late layers govern output formatting and task-specific decoding patterns. Adapting these matters for instruction-following and output-structure tasks. QLoRA (Dettmers et al., 2023) reported that applying LoRA to every linear layer of a 4-bit quantised model, rather than to the query and value projections alone, was important for matching full fine-tuning; this is affordable partly because the quantised base model frees memory for the extra adapters.

A practical approach for layer selection when compute is tight:

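# Pseudo-config for selective LoRA application.
# In Hugging Face PEFT these map to LoraConfig's target_modules
# and layers_to_transform arguments.
lora_target_modules = [
    "q_proj", "v_proj",  # query and value projections (LLaMA-style module names)
]
# Layers 16-31: the upper half (middle-to-late layers) of a 32-layer model
lora_layers_to_transform = list(range(16, 32))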

This is not a recipe to copy blindly; it is a starting point for ablating against your task.
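When ablating depth bands, it helps to derive the layer indices from fractions of total depth rather than hard-coding them per model. A small sketch follows; the function name and the rounding choices (floor at the start, ceiling at the end) are assumptions made for illustration.

```python
import math


def layers_in_depth_band(n_layers: int, start_frac: float, end_frac: float) -> list:
    """Layer indices covering [start_frac, end_frac) of the model's depth."""
    if not 0.0 <= start_frac < end_frac <= 1.0:
        raise ValueError("need 0 <= start_frac < end_frac <= 1")
    start = int(n_layers * start_frac)       # round the start down
    end = math.ceil(n_layers * end_frac)     # round the end up
    return list(range(start, min(end, n_layers)))


# Upper half of a 32-layer model, matching the config above
upper_half = layers_in_depth_band(32, 0.5, 1.0)
# The "middle" band (roughly 30-70% of depth) discussed earlier
middle_band = layers_in_depth_band(32, 0.3, 0.7)
print(upper_half[0], upper_half[-1], middle_band[0], middle_band[-1])
```

The resulting list can be passed wherever a layer-index list is expected, so the same band definition carries over to models of different depth.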

Adapter insertion geometry: sequential versus parallel

Houlsby-style adapters (Houlsby et al., 2019, arXiv:1902.00751) insert a bottleneck module sequentially inside the residual stream, once after the attention sublayer and once after the FFN sublayer. This double-injection was chosen for flexibility; each adapter sees the full post-sublayer representation.

Later work introduced parallel adapters, where the adapter branch runs alongside the main sublayer rather than after it. The result is the sum of both paths. This has two advantages: the adapter does not add to the sequential depth of the forward pass, so the two branches can be computed concurrently (and a purely linear branch, as in LoRA, can be merged into the frozen weights for no added inference latency); and the adapter reads the sublayer's input rather than its output, which sometimes gives faster convergence.

Sequential (Houlsby):            Parallel:
  x -> Attn -> Adapter -> + x     x -> Attn --------> +
                                  x -> Adapter ------> +
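The two geometries in the diagram can be sketched in a few lines of NumPy. This assumes a ReLU bottleneck adapter with its own internal skip connection in the sequential case, and a stand-in `sublayer` function in place of real attention; all names are illustrative.

```python
import numpy as np


def bottleneck_adapter(h, W_down, W_up):
    """Project down, apply ReLU, project back up."""
    return np.maximum(h @ W_down, 0.0) @ W_up


def sequential_block(x, sublayer, W_down, W_up):
    """Houlsby-style: the adapter transforms the sublayer's output."""
    h = sublayer(x)
    h = h + bottleneck_adapter(h, W_down, W_up)  # adapter's internal skip
    return x + h                                 # residual add


def parallel_block(x, sublayer, W_down, W_up):
    """Parallel: the adapter reads the sublayer's input; outputs are summed."""
    return x + sublayer(x) + bottleneck_adapter(x, W_down, W_up)


rng = np.random.default_rng(0)
d, bottleneck = 16, 4
W_sub = rng.normal(size=(d, d)) * 0.1
sublayer = lambda h: np.tanh(h @ W_sub)  # stand-in for the frozen sublayer
x = rng.normal(size=(3, d))
W_down = rng.normal(size=(d, bottleneck))
W_up = np.zeros((bottleneck, d))         # zero init: adapter starts as a no-op

seq_out = sequential_block(x, sublayer, W_down, W_up)
par_out = parallel_block(x, sublayer, W_down, W_up)
```

With the up-projection initialised to zero, both blocks start out identical to the frozen model; they diverge only once the adapter weights are trained.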

The parallel design reduces to LoRA once the adapter's non-linearity is removed: a linear parallel adapter on Wv with bottleneck dimension r adds a rank-r correction to Wv, which is exactly the LoRA update. LoRA can therefore be seen as a special case of parallel adaptation in which the rank is the explicit bottleneck.
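The link between a linear parallel branch and a low-rank weight correction can be checked numerically. The sketch below uses the row-vector convention (x @ W) with random matrices; the names A and B for the down- and up-projections are assumptions of this example.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 16, 4
Wv = rng.normal(size=(d, d))   # frozen value projection
A = rng.normal(size=(d, r))    # adapter down-projection
B = rng.normal(size=(r, d))    # adapter up-projection
x = rng.normal(size=(3, d))

# Parallel branch with no non-linearity, summed with the frozen path
parallel_out = x @ Wv + (x @ A) @ B

# The same computation with the branch folded into the weight
merged_out = x @ (Wv + A @ B)

delta_rank = np.linalg.matrix_rank(A @ B)
print(np.allclose(parallel_out, merged_out), delta_rank)
```

Because the two outputs agree, the branch can be folded into Wv after training. A branch with a non-linearity between A and B cannot be folded this way.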

When it falls down

Uniform application is expensive without proportionate gains. Applying LoRA to every projection in every layer of a 70B-parameter model can add tens to hundreds of millions of trainable parameters, depending on the rank. Unless the task genuinely requires a broad distributional shift, this overshoots the necessary budget.

Freezing MLP layers on knowledge-injection tasks causes hallucination to persist. The attention-only default works well on instruction-tuning benchmarks (which reward format compliance) and can mask poor factual accuracy. If your evaluation is on factual recall, measure it directly.

Layer selection that worked for LLaMA may not transfer to a different architecture. Models with different MLP-to-attention parameter ratios (e.g., mixture-of-experts models, or models whose feed-forward block is not gated the way LLaMA's is) have different sensitivity profiles. Always sanity-check with a small sweep on a held-out validation set.

Very low rank (r=1 or r=2) can underfit when the adaptation is demanding. If the task requires several behaviours to be learned at once (instruction following + new vocabulary + style shift), a rank too low to span all the required directions leads to partial adaptation and erratic outputs.
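The rank limit can be illustrated with a toy target update. The sketch below models "three required behaviours" as a random rank-3 weight delta (a deliberate simplification) and uses the SVD to measure the best relative error any rank-r update could achieve.

```python
import numpy as np


def best_rank_r_error(delta, r: int) -> float:
    """Relative Frobenius error of the best rank-r approximation of delta."""
    s = np.linalg.svd(delta, compute_uv=False)
    # The error of the optimal rank-r fit is the energy in the discarded singular values
    return float(np.sqrt(np.sum(s[r:] ** 2) / np.sum(s ** 2)))


rng = np.random.default_rng(0)
d, n_behaviours = 64, 3
# Toy assumption: each behaviour contributes one independent direction
target_delta = rng.normal(size=(d, n_behaviours)) @ rng.normal(size=(n_behaviours, d))

errors = {r: best_rank_r_error(target_delta, r) for r in (1, 2, 3)}
for r, err in errors.items():
    print(f"rank {r}: best achievable relative error {err:.3f}")
```

In this toy setting the error reaches zero only at rank 3. Real target updates are not exactly low rank, so the curve is a guide to the shape of the trade-off rather than a way to pick r.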

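Merging strategies break down with conflicting adapter targets. LoRA merge (adding the learned delta back into the base weights) is exact for a single adapter: the merged model computes the same function as the base model plus the adapter, up to numerical precision. The trouble starts when several independently trained adapters target the same matrices. Each delta was learned against the unmodified base weights, so summing them produces a model that none of the adapters was trained for, and the interference grows with the size of the deltas. Adapters trained with a high rank and a large learning rate move the weights furthest from the pre-trained basin, so they are the most likely to leave the merged model worse than any single adapter alone.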
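Two quick diagnostics can be run before merging: how large each delta is relative to the frozen weight, and how much two deltas on the same matrix agree. The sketch below uses random matrices and a deliberately adversarial second adapter; the function names are illustrative and no threshold is implied.

```python
import numpy as np


def relative_update_norm(W, delta) -> float:
    """Frobenius norm of the delta relative to the frozen weight."""
    return float(np.linalg.norm(delta) / np.linalg.norm(W))


def delta_cosine(delta_a, delta_b) -> float:
    """Cosine similarity of two flattened deltas; negative means they oppose."""
    a, b = delta_a.ravel(), delta_b.ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


rng = np.random.default_rng(0)
d, r = 32, 4
W = rng.normal(size=(d, d))                                  # frozen weight
delta_a = 0.01 * rng.normal(size=(d, r)) @ rng.normal(size=(r, d))
delta_b = -delta_a                                           # worst case: exact opposite

size_a = relative_update_norm(W, delta_a)
agreement = delta_cosine(delta_a, delta_b)
merged = W + delta_a + delta_b                               # naive sum of both adapters
print(size_a, agreement, np.allclose(merged, W))
```

In this extreme case the two adapters cancel and the merged weight is just the base weight again. Real conflicts are partial, but a strongly negative cosine on a shared matrix is a warning sign worth checking before summing deltas.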