Do Transformers Need Three Projections? Rethinking Q, K, and V

Open any transformer implementation and the first thing you see inside an attention block is three linear layers: one that turns the input into queries, one into keys, one into values. They have been there since 2017, copied into every GPT, Llama, and vision transformer since, and almost nobody asks whether all three are pulling their weight. The default carries the authority of a billion downloads.

A paper from a Brainchip research team, Do Transformers Need Three Projections? Systematic Study of QKV Variants (Kayyam, Madan Gopal & Lewis, 2026, arXiv:2606.04032), takes the question literally. The authors tie pairs of projections together, delete one entirely, and retrain from scratch across synthetic tasks, vision, and language modeling up to 1.2B parameters on 10B tokens. The headline result is narrow but useful: keys and values can share a single projection, cutting the inference-time KV cache in half, while perplexity moves only about 3%.

Why this matters: The KV cache, not the model weights, is what fills your GPU during long-context serving. A 32k-token session for a 1.2B model can need almost 6 GB of cache per request. Halving that is the difference between 15 and 30 concurrent users on the same A100, and the saving compounds with the head-sharing tricks (GQA, MQA) you already use.

TL;DR

Standard attention learns three projection matrices ($W_Q$, $W_K$, $W_V$). The paper asks which sharings survive retraining and which break the model.
Q-K=V (keys and values share one projection, query stays separate) is the winner: it keeps attention asymmetric, cuts the KV cache by 50%, and costs roughly 3.1% perplexity at 300M and 2.48% at 1.2B.
Q=K-V (query and key share, value separate) looks fine in training but is a trap: it forces a symmetric attention matrix, breaks causal language modeling, and saves zero cache because you still store K and V separately.
Q=K=V (one projection for all three) collapses quality badly in language (about +25% perplexity), yet is competitive or better on non-causal vision tasks.
The reason Q-K=V works is empirical and concrete: across layers, learned K and V projections have high cosine similarity (~0.73) and nearly identical effective rank, while Q stays distinct. Keys and values occupy similar representational space; queries do not.
Projection sharing is orthogonal to head sharing. Stacking Q-K=V on top of GQA or MQA multiplies the savings: Q-MQA reaches 96.9% cache reduction at about 4.8% perplexity cost.
Perplexity overstates the damage. The 1.2B Q-K=V model trails the baseline by 2.48% in perplexity but only 0.41 points in average 5-shot downstream accuracy.

At a Glance

The four configurations the paper studies differ only in which projection weights are tied. Everything downstream of the projections (softmax, the value mixing, the output projection) is unchanged.

flowchart LR
  X[Input tokens X] --> WQ[W_Q]
  X --> WK[W_K]
  X --> WV[W_V]
  WQ --> Q[Queries]
  WK --> K[Keys]
  WV --> V[Values]
  Q --> S["Scores QKᵀ"]
  K --> S
  S --> SM[softmax] --> O["Output = A·V"]
  V --> O
  classDef blue fill:#1e40af,stroke:#3b82f6,stroke-width:1px,color:#fff
  classDef purple fill:#6d28d9,stroke:#a78bfa,stroke-width:1px,color:#fff
  classDef teal fill:#0e7490,stroke:#22d3ee,stroke-width:1px,color:#fff
  class X blue
  class WQ,WK,WV,S,SM purple
  class Q,K,V,O teal

The variants ask: what if some of those three weight matrices are the same matrix? Tying $W_K = W_V$ gives Q-K=V; tying $W_Q = W_K$ gives Q=K-V; tying all three gives Q=K=V.

Figure 1 from the paper: the four projection-shared attention variants and their formulations. The (X)+ notation marks variants augmented with 2D positional encoding.

Before the Question Was Asked

The three-projection design was never argued for; it was simply the shape that worked first. When Vaswani et al. (2017, Attention Is All You Need, arXiv:1706.03762) wrote down scaled dot-product attention, queries, keys, and values were three distinct learned views of the input, and the field inherited that as bedrock. The efficiency work that followed spent nearly a decade attacking almost everything except the projection count.

timeline
  title Where attention efficiency work has focused
  2017 : Attention Is All You Need : three projections, full O(n²) attention
  2019 : Multi-Query Attention : share K,V across all heads
  2020 : Linformer, Performer : low-rank and kernel approximations of softmax
  2023 : Grouped-Query Attention : tunable middle ground between MHA and MQA
  2024 : Multi-head Latent Attention : compress K,V into a latent for the cache
  2026 : QKV variants study : tie the projections themselves

Two families dominate that history. The first attacks the quadratic softmax: Linformer projects the sequence length down (Wang et al., 2020, arXiv:2006.04768), and Performer replaces softmax with a kernel feature map (Choromanski et al., 2020, arXiv:2009.14794). The second attacks the KV cache by sharing key and value heads: Multi-Query Attention collapses all heads to one shared K and V (Shazeer, 2019, arXiv:1911.02150), and Grouped-Query Attention interpolates between full multi-head and MQA with a tunable number of groups (Ainslie et al., 2023, arXiv:2305.13245). DeepSeek's Multi-head Latent Attention pushes furthest, compressing keys and values into a shared low-rank latent and reconstructing them on demand (DeepSeek-AI, 2024, arXiv:2405.04434).

Every one of those methods keeps three separate projection matrices. They share heads, approximate softmax, or compress the cache after projection. The question of whether the input needs three independent linear views in the first place stayed open, and that gap is exactly what this paper steps into.

How Projection Sharing Actually Works

Standard attention computes three projections of the same input $X$, then mixes them:

\[Q = XW_Q, \quad K = XW_K, \quad V = XW_V\] \[\text{Attention}(Q,K,V) = \text{softmax}\!\left(\tfrac{QK^\top}{\sqrt{d_k}}\right)V\]

The scores matrix $QK^\top$ decides where each token looks; the value matrix $V$ decides what it pulls back. Each projection plays a different role, and the question is how separable those roles really are. Tying two projections is a hard constraint, not a soft regularizer: the two matrices become literally the same parameters, and the model has to retrain around the loss of expressivity.
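
As a minimal sketch of the formula above, here is single-head attention in plain Python with no mask and no batching. The helper names and the toy matrices are illustrative assumptions, not code from the paper; real implementations use a tensor library.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def softmax_rows(m):
    """Row-wise softmax, subtracting the row max for numerical stability."""
    out = []
    for row in m:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def attention(x, w_q, w_k, w_v):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    d_k = len(k[0])
    scores = [[s / math.sqrt(d_k) for s in row] for row in matmul(q, transpose(k))]
    weights = softmax_rows(scores)       # where each token looks
    return matmul(weights, v), weights   # what it pulls back

# Toy input: 3 tokens, width 2. All values are arbitrary.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W_Q = [[0.5, -0.2], [0.1, 0.3]]
W_K = [[0.4, 0.1], [-0.3, 0.2]]
W_V = [[1.0, 0.0], [0.0, 1.0]]

out, weights = attention(X, W_Q, W_K, W_V)

# Tying two projections is just passing the same matrix twice.
tied_out, tied_weights = attention(X, W_Q, W_K, W_K)
```

Passing the same matrix for `w_k` and `w_v` is all a hard tie amounts to in this sketch: the attention weights are unchanged, and only the tensor being mixed differs.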

The three constraints

Read the variant names as a binding diagram. The = ties projections together; the - keeps them apart.

Q-K=V (the practical one). Set $W_K = W_V = W$. Now keys and values are the same tensor, $K = V = XW$, while the query keeps its own matrix $W_Q$. The scores stay asymmetric because $Q \ne K$:

\[\text{Attention} = \text{softmax}\!\left(\tfrac{QK^\top}{\sqrt{d_k}}\right)K\]

This is the one that pays off at inference. During autoregressive decoding you cache one tensor per layer instead of two, because the value you need is just the key you already stored.

Q=K-V (the trap). Set $W_Q = W_K = W$, leaving the value separate. Queries and keys become identical, so the score matrix is $KK^\top$, which is symmetric: before the softmax, the score for token $i$ attending to token $j$ necessarily equals the score for $j$ attending to $i$.

\[\text{Attention} = \text{softmax}\!\left(\tfrac{\alpha\, KK^\top}{\sqrt{d_k}}\right)V\]

Symmetric scores are fatal for causal language modeling, where "the cat sat" must be read differently from "sat cat the." And because $V$ is still a distinct tensor, you still cache both K and V, so there is no inference saving to compensate for the quality hit.
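
The symmetry is easy to check numerically. The sketch below uses arbitrary toy matrices (an assumption for illustration only) and compares raw, pre-softmax scores; after the row-wise softmax the weights are no longer exactly symmetric, but they are still built from symmetric inputs.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def is_symmetric(m, tol=1e-12):
    n = len(m)
    return all(abs(m[i][j] - m[j][i]) <= tol for i in range(n) for j in range(n))

# Toy input: 3 tokens, width 2. All values are arbitrary.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W_Q = [[0.5, -0.2], [0.1, 0.3]]
W = [[0.4, 0.1], [-0.3, 0.2]]   # shared matrix when W_Q is tied to W_K

K = matmul(X, W)
Q = matmul(X, W_Q)

tied_scores = matmul(K, transpose(K))    # Q=K-V: scores are K K^T
untied_scores = matmul(Q, transpose(K))  # separate query: scores are Q K^T

print("tied symmetric:  ", is_symmetric(tied_scores))
print("untied symmetric:", is_symmetric(untied_scores))
```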

Q=K=V (the extreme). One matrix for everything: $A = \text{softmax}\!\left(\tfrac{\alpha\, KK^\top}{\sqrt{d_k}}\right)K$. This inherits the symmetry pathology and squeezes all three roles through a single bottleneck. It is the most aggressive constraint and, in language, the most damaging.

graph TD
  subgraph QKV [Standard QKV]
    A1[W_Q] --- A2[W_K] --- A3[W_V]
  end
  subgraph QKV2 [Q-K=V: share K,V]
    B1[W_Q separate] -.-> B2["W_K = W_V (one matrix)"]
  end
  subgraph QKV3 [Q=K-V: share Q,K]
    C1["W_Q = W_K (one matrix)"] -.-> C2[W_V separate]
  end
  subgraph QKV4 [Q=K=V: one matrix]
    D1["W_Q = W_K = W_V"]
  end
  classDef blue fill:#1e40af,stroke:#3b82f6,stroke-width:1px,color:#fff
  classDef emerald fill:#047857,stroke:#34d399,stroke-width:1px,color:#fff
  classDef rose fill:#be123c,stroke:#fb7185,stroke-width:1px,color:#fff
  classDef amber fill:#b45309,stroke:#fbbf24,stroke-width:1px,color:#fff
  class A1,A2,A3 blue
  class B2 emerald
  class C1 rose
  class D1 amber

Why keys and values are redundant but queries are not

The paper does not just observe that Q-K=V works; it measures why. After training a standard transformer, the authors inspect the learned projection matrices and find a clean asymmetry. Keys and values are close neighbors in weight space: their projection matrices show cosine similarity around 0.73 across layers, with nearly identical effective rank (about 687 versus 702 out of 1024 dimensions). Queries sit apart, with cosine similarity near 0.42 to keys and 0.31 to values.
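
For readers who want to run a similar check on their own weights, here is one plausible way to compute the similarity, treating each projection matrix as a flat vector. This reading of "cosine similarity between projection matrices" is an assumption; the paper's exact definition is not reproduced here, and the toy matrices are made up. Effective rank needs a singular value decomposition and is left out of this sketch.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two same-shaped matrices, flattened to vectors."""
    flat_a = [v for row in a for v in row]
    flat_b = [v for row in b for v in row]
    dot = sum(x * y for x, y in zip(flat_a, flat_b))
    norm_a = math.sqrt(sum(x * x for x in flat_a))
    norm_b = math.sqrt(sum(y * y for y in flat_b))
    return dot / (norm_a * norm_b)

# Made-up 2x2 "projections": W_K and W_V are near-duplicates, W_Q is not.
W_K = [[1.0, 2.0], [3.0, 4.0]]
W_V = [[1.0, 2.0], [3.0, 5.0]]
W_Q = [[4.0, -3.0], [2.0, -1.0]]

sim_kv = cosine_similarity(W_K, W_V)
sim_qk = cosine_similarity(W_Q, W_K)
print(f"K vs V: {sim_kv:.2f}   Q vs K: {sim_qk:.2f}")
```

On trained weights you would run this per layer and look at the spread across layers, not just a single number.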

That is the whole story in one sentence: keys and values are learning to look at the input in almost the same way, so forcing them to share one matrix throws away little; queries are doing something genuinely different, so collapsing the query is what hurts.

The effective-rank numbers also reinforce a long-standing observation that attention lives in a low-rank regime, where the score and value subspaces use far fewer dimensions than the model's width. Earlier work argued multi-head attention can suffer a low-rank bottleneck when head dimension is small (Bhojanapalli et al., 2020, arXiv:2002.07028); here the same low-rank character is what makes the redundancy harvestable.

The "+" augmentation

For non-causal settings (vision, some synthetic tasks) the symmetric variants are less doomed, because position is not strictly ordered. The paper adds a 2D positional encoding to the symmetric variants, written (X)+, to restore directional asymmetry without restoring the projection. On vision tasks the augmented variants close most of the small gap to the baseline, which tells you the symmetry problem is partly about missing directional signal rather than missing parameters.

Seeing It in Motion

The payoff lives at decode time. During autoregressive generation, every previous token's K and V sit in the cache so the new token can attend to them. The sequence diagram below contrasts what gets stored per step.

sequenceDiagram
  participant U as New token
  participant C as KV cache
  participant A as Attention
  Note over C: Standard QKV stores 2 tensors/token
  U->>A: compute Q, K, V
  A->>C: append K and V
  C-->>A: all past K, all past V
  A-->>U: attention output
  Note over C: Q-K=V stores 1 tensor/token
  U->>A: compute Q and K (K reused as V)
  A->>C: append K only
  C-->>A: all past K (serves as V too)
  A-->>U: attention output

The cache is the bottleneck, not the matmul. A standard model writes both K and V for every token at every layer; Q-K=V writes only K. The decision of which variant to ship then comes down to how much quality you can trade for how much memory, which is a clean flowchart rather than a research question.

flowchart TD
  Start{Deployment target?} --> Cloud[Cloud, quality first]
  Start --> Edge[Edge, balanced]
  Start --> IoT[Mobile / IoT, memory first]
  Cloud --> G["GQA-4: 75% cache cut, +0.7% ppl"]
  Edge --> Q["Q-K=V: 50% cut, +3.1% ppl"]
  Edge --> QG["Q-GQA-4: 87.5% cut, +3.9% ppl"]
  IoT --> QM["Q-MQA: 96.9% cut, +4.8% ppl"]
  classDef blue fill:#1e40af,stroke:#3b82f6,stroke-width:1px,color:#fff
  classDef emerald fill:#047857,stroke:#34d399,stroke-width:1px,color:#fff
  classDef amber fill:#b45309,stroke:#fbbf24,stroke-width:1px,color:#fff
  class Start blue
  class G,Q emerald
  class QG,QM amber

By the Numbers

The language-modeling results are the ones that matter, since that is where the symmetric variants fail and where the cache savings pay rent. All language numbers below come from the paper's SlimPajama runs (a cleaned RedPajama corpus), 10B tokens, with 5-shot downstream evaluation.

300M parameters, validation perplexity:

Model	Val PPL	vs QKV	KV cache cut
QKV (baseline)	5.11	reference	none
Q-K=V	5.27	+3.1%	50%
Q=K-V	5.36	+4.9%	0%
Q=K=V	6.41	+25.4%	50%
GQA-4	5.15	+0.7%	75%
MQA	5.19	+1.5%	93.8%
Q-GQA-4	5.32	+3.9%	87.5%
Q-MQA	5.36	+4.8%	96.9%

Two things jump out. Q=K-V is strictly worse than Q-K=V on quality and saves nothing, so it is dominated. And the combined rows (Q-GQA, Q-MQA) show projection sharing stacking on head sharing: MQA alone leaves about 6.2% of the cache, Q-K=V halves what remains, and the roughly 3.1% left over is Q-MQA's 96.9% reduction.
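
The cache-cut column can be reproduced with a few lines of arithmetic. The sketch below assumes 16 attention heads in the 300M model (inferred from MQA's 93.8% figure, not stated in the table) and reads "GQA-4" as four key/value heads; both are assumptions.

```python
def cache_reduction(n_heads, n_kv_heads, share_kv_projection):
    """Fraction of the baseline KV cache that is removed.

    The baseline stores K and V for every head: 2 * n_heads tensors per token.
    Head sharing lowers the head count; projection sharing stores K only.
    """
    tensors_per_kv_head = 1 if share_kv_projection else 2
    remaining = (n_kv_heads * tensors_per_kv_head) / (n_heads * 2)
    return 1.0 - remaining

N_HEADS = 16  # assumption, inferred from the table

variants = {
    "Q-K=V": cache_reduction(N_HEADS, 16, True),
    "GQA-4": cache_reduction(N_HEADS, 4, False),
    "MQA": cache_reduction(N_HEADS, 1, False),
    "Q-GQA-4": cache_reduction(N_HEADS, 4, True),
    "Q-MQA": cache_reduction(N_HEADS, 1, True),
}

for name, cut in variants.items():
    print(f"{name:8s} {cut:.1%}")
```

The remaining fractions multiply: head sharing scales the cache by `n_kv_heads / n_heads`, and projection sharing scales it by one half.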

1.2B parameters, the same picture at scale:

Model	Val PPL	vs QKV	Params (M)	Cache @32k
QKV	5.004	reference	1,215	5,900 MB
Q-K=V	5.128	+2.48%	1,123	2,950 MB
GQA-8	5.030	+0.52%	1,077	1,408 MB
MQA	5.057	+1.06%	1,036	176 MB
Q-GQA-8	5.158	+3.08%	1,054	704 MB
Q-MQA	5.212	+4.16%	1,033	88 MB

The degradation from sharing K and V shrinks with scale, from 3.1% at 300M to 2.48% at 1.2B, which is the encouraging direction: the trick looks more attractive for the large models that actually strain serving budgets, not less.

Downstream accuracy decouples from perplexity. A 2.48% perplexity gap sounds alarming until you look at the tasks:

Model	Avg 5-shot accuracy	vs QKV (points)
QKV	36.40%	reference
Q-K=V	35.99%	-0.41
GQA-8	35.86%	-0.54
MQA	36.37%	-0.03
Q-GQA-8	36.72%	+0.32

Across HellaSwag, PIQA, ARC-Easy, ARC-Challenge, and WinoGrande, Q-K=V loses 0.41 points of average accuracy for its 50% cache cut. Small pretraining-loss differences are a poor predictor of task behavior, a caution worth keeping whenever you reach for perplexity as your only yardstick.

Where the parameters and compute actually go. Cutting a projection sounds like a third off attention, and within the Q, K, and V projections it is: $3d^2$ parameters per layer in the baseline versus $2d^2$ under a pairwise share, a 33% cut. But attention is only about 30% of the model; MLP and embeddings dominate.

Quantity (300M model)	QKV	Q-K=V	Q=K=V
Attention projection ops	$3nd^2$	$2nd^2$	$nd^2$
Total parameters	305.5M	284.5M (-6.9%)	263.6M (-13.7%)
Inference MACs @2048 tok	792.7G	749.7G (-5.4%)	706.8G (-10.8%)

The lesson is to be honest about which budget you are spending. Q-K=V is a KV-cache optimization first, with modest parameter and FLOP side effects; if you came for a 33% smaller model, attention is the wrong third of it to shave.
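
A back-of-the-envelope check shows where the table's deltas come from. The sketch assumes a width of 1024 and 20 layers for the 300M model, with bias-free $d \times d$ projections; these values are assumptions chosen because they land within rounding of the table, not a configuration quoted from the paper.

```python
D_MODEL = 1024             # assumption: model width
N_LAYERS = 20              # assumption: number of transformer layers
BASELINE_PARAMS = 305.5e6  # from the table
BASELINE_MACS = 792.7e9    # from the table, at 2048 tokens
TOKENS = 2048

# One d x d projection per layer, summed over layers.
PER_PROJECTION = D_MODEL * D_MODEL * N_LAYERS

def after_removing(n_projections):
    """Parameters and inference MACs once n projections are tied away."""
    params = BASELINE_PARAMS - n_projections * PER_PROJECTION
    # Each removed weight saves one multiply-accumulate per token.
    macs = BASELINE_MACS - n_projections * PER_PROJECTION * TOKENS
    return params, macs

for name, removed in (("Q-K=V", 1), ("Q=K=V", 2)):
    params, macs = after_removing(removed)
    print(f"{name}: {params / 1e6:.1f}M params "
          f"({params / BASELINE_PARAMS - 1:+.1%}), "
          f"{macs / 1e9:.1f}G MACs ({macs / BASELINE_MACS - 1:+.1%})")
```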

Figure 2 from the paper: training loss and validation accuracy on TinyImageNet. Q=K=V trains fastest and, on this non-causal task, reaches the best top-1 accuracy.

[IMAGE: grouped bar chart of KV-cache size in GB at 32k context for QKV, Q-K=V, GQA-8, MQA, Q-GQA-8, Q-MQA on the 1.2B model, log y-axis, annotated with the 5,900 MB to 88 MB span]

A Concrete Example

Put numbers on the serving math, because that is where the abstraction becomes a procurement decision. Take the 1.2B model, 32k-token context, bf16. The per-token KV cache is set by the number of layers, heads, and head dimension; for this configuration the baseline needs roughly 5,900 MB of KV cache for a single full-length request.
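
The cache size follows from a simple product. The configuration below (22 layers, 32 heads, head dimension 64, 32,768 tokens) is a hypothetical one picked because it lands close to the quoted figure; the paper's actual layer and head counts are not given in this article.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_tokens,
                   bytes_per_value, tensors_per_token=2):
    """Resident KV-cache size for one request.

    tensors_per_token is 2 for standard attention (K and V) and 1 for Q-K=V.
    """
    return (tensors_per_token * n_layers * n_kv_heads
            * head_dim * n_tokens * bytes_per_value)

# Hypothetical 1.2B-class configuration (assumption, see lead-in).
N_LAYERS, N_HEADS, HEAD_DIM = 22, 32, 64
TOKENS = 32 * 1024
BF16_BYTES = 2

baseline_bytes = kv_cache_bytes(N_LAYERS, N_HEADS, HEAD_DIM, TOKENS, BF16_BYTES)
shared_bytes = kv_cache_bytes(N_LAYERS, N_HEADS, HEAD_DIM, TOKENS, BF16_BYTES,
                              tensors_per_token=1)

print(f"QKV   : {baseline_bytes / 1e6:,.0f} MB")
print(f"Q-K=V : {shared_bytes / 1e6:,.0f} MB")
```

Head sharing enters through `n_kv_heads` and projection sharing through `tensors_per_token`, which is why the two savings compose.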

Now imagine a code-completion service holding 100 concurrent 32k-token sessions on A100 40GB GPUs. A useful rule of thumb the paper works through:

Baseline QKV. Each request reserves about 5.9 GB of cache, so a 40 GB card (minus weights and activations) fits roughly 15 live sessions. Serving 100 users needs about 7 GPUs.
Q-K=V. Each request now reserves about 2.95 GB. The same card holds roughly 30 sessions, so 100 users fit on about 4 GPUs.

At a representative cloud rate, the paper puts that at roughly $14k/month versus $8k/month, on the order of $72k/year saved, about a 43% reduction in serving cost, in exchange for 0.41 points of downstream accuracy. The trade is asymmetric in your favor because the cache, not the weights, is what the extra GPUs were buying.
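
The serving arithmetic above can be checked in a few lines. The sessions-per-GPU figures are the ones quoted in the text; the $2,000 per GPU-month rate is an assumption, inferred from $14k across 7 GPUs.

```python
import math

USERS = 100
GPU_MONTHLY_USD = 2000  # assumption: implied by $14k/month for 7 GPUs

def monthly_cost(sessions_per_gpu):
    """GPUs needed for USERS concurrent sessions, and their monthly cost."""
    gpus = math.ceil(USERS / sessions_per_gpu)
    return gpus, gpus * GPU_MONTHLY_USD

base_gpus, base_cost = monthly_cost(15)   # baseline QKV
kv_gpus, kv_cost = monthly_cost(30)       # Q-K=V

annual_saving = (base_cost - kv_cost) * 12
reduction = (base_cost - kv_cost) / base_cost

print(f"{base_gpus} GPUs -> {kv_gpus} GPUs")
print(f"${annual_saving:,} saved per year ({reduction:.0%} lower)")
```

Because GPUs come in whole units, the cost reduction (about 43%) is smaller than the 50% cache reduction.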

Trace one decoding step to see why the cache halves. Suppose the model has generated 10,000 tokens and is producing token 10,001 in one attention layer:

Project the new token into $Q_{10001}$ and $K_{10001}$. Under Q-K=V there is no separate $V$ projection; the key is the value.
Append $K_{10001}$ to the cache. The cache now holds 10,001 key vectors and nothing else, where the baseline would hold 10,001 keys plus 10,001 values.
Compute scores $Q_{10001} K^\top$ over all cached keys, softmax, then mix the same cached keys as values to get the output.
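
The three steps above can be sketched as a tiny single-head decode loop. The class name, helpers, and toy weights are hypothetical, written for illustration; a production cache would hold batched tensors per layer and per head.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def vecmat(v, m):
    """Row vector times matrix (matrix given as a list of rows)."""
    return [dot(v, col) for col in zip(*m)]

class SharedKVCache:
    """Per-layer cache for Q-K=V: one list of key vectors, reused as values."""

    def __init__(self):
        self.keys = []

    def decode_step(self, x, w_q, w_kv):
        # Step 1: project the new token into a query and a key (no V projection).
        q = vecmat(x, w_q)
        k = vecmat(x, w_kv)
        # Step 2: append the key; nothing else is stored.
        self.keys.append(k)
        # Step 3: score against all cached keys, softmax, mix the same keys.
        scale = math.sqrt(len(k))
        scores = [dot(q, past) / scale for past in self.keys]
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        return [sum(w * past[d] for w, past in zip(weights, self.keys))
                for d in range(len(k))]

# Toy weights and three toy tokens; all values are arbitrary.
W_Q = [[0.5, -0.2], [0.1, 0.3]]
W_KV = [[0.4, 0.1], [-0.3, 0.2]]

cache = SharedKVCache()
outputs = [cache.decode_step(tok, W_Q, W_KV)
           for tok in ([1.0, 0.0], [0.0, 1.0], [1.0, 1.0])]
print(len(cache.keys), "cached vectors for 3 tokens")
```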

Half the writes, half the reads, half the resident memory, and because $Q \ne K$ the attention pattern is still directional, so the model can tell "token 10,001 looks back at token 3" from its reverse. That last clause is exactly what Q=K-V gives up, and why it sits in the results table saving nothing while costing more.

Where It Breaks

The clean win is real but narrow, and the failure modes are instructive.

Asymmetric attention is non-negotiable for language. Any variant that ties $W_Q = W_K$ produces a symmetric score matrix, and causal language modeling dies on it: +4.9% perplexity for Q=K-V, +25.4% for Q=K=V at 300M. The 2D positional + augmentation rescues these only in non-causal settings; do not expect it to save a decoder-only LLM.

Q=K-V is a benchmark mirage. In a synthetic or vision table it can look as good as Q-K=V, but it caches both K and V and so delivers no inference benefit. The paper is blunt that it is "unsuitable for production despite comparable training quality." It is the clearest case in the paper of a variant that wins the wrong metric.

The savings are diluted by everything that is not attention. A 33% cut to projection parameters becomes a 6.9% cut to the whole model, and a 5.4% cut to inference MACs at 2048 tokens, because MLP and embeddings carry most of the load. The KV-cache saving is the part that holds its full 50%.

The win grows with sequence length, but only so far. Attention's share of compute rises from about 29% at 128 tokens to roughly 53% at 4096, so projection sharing matters more for long contexts. But the paper's own evaluation stops at 2048-token sequences and 1.2B parameters; length extrapolation and behavior at 7B and beyond are explicitly uncharacterized.

The explanation is empirical, not proven. The cosine-similarity and effective-rank evidence is convincing but descriptive. There is no formal result saying K and V must be shareable, and one ablation (Q=V, tying query and value while keeping a separate key) is omitted on the argument that Q plays a fundamentally different addressing role and is not cached anyway.

Figure 3 from the paper: loss over time on the synthetic tasks for QKV, Q=K-V, and (Q=K-V)+, showing the standard model converging fastest while the shared variants track close behind.

[IMAGE: heatmap pair of attention maps for the "reverse" synthetic task, baseline QKV on the left versus the symmetric Q=K-V on the right, annotated to show the forced symmetry along the diagonal]

Alternative Designs

Projection sharing is one axis of attention efficiency among several. The honest framing is that it is complementary to the others, not a replacement.

Approach	What it shares / cuts	Typical KV cache cut	Quality cost	Best when
Multi-Query Attention	One K,V head for all query heads	up to ~93-97%	small to moderate	aggressive memory limits
Grouped-Query Attention	K,V shared within head groups	tunable (e.g. 75%)	very small	quality-sensitive serving
Multi-head Latent Attention	K,V compressed to a latent	very large (DeepSeek reports ~93%)	small, sometimes positive	long context at scale
Linformer / Performer	Approximate the softmax itself	n/a (attacks compute)	task-dependent	very long sequences
Q-K=V (this paper)	One projection matrix for K and V	50%	~2.5-3% perplexity	stacking on the above

MQA and GQA share key and value heads but keep three projection matrices; MLA keeps three projections and compresses their output. Q-K=V cuts in a different place, the projection matrix, so it composes with all of them. That composability is the most actionable claim in the paper: Q-GQA-4 and Q-MQA exist precisely because the two savings multiply rather than overlap. If you already run GQA, Q-K=V is not a rival to evaluate against it; it is a further halving of the cache you can layer on for a few percent of perplexity.

How It Is Used in Practice

No major production model ships Q-K=V today; the work is fresh and the public artifact is a research repository (Brainchip-Inc/Do-Transformers-Need-3-Projections). What the paper offers practitioners is a deployment map keyed to constraints rather than a drop-in default.

The realistic adoption path is incremental. Teams already serving long-context models on GQA or MQA can prototype the combined variants, since the change is local to the attention block and leaves the rest of the architecture untouched. The strongest case is memory-bound edge and on-device inference, where the paper points at Q-MQA's 96.9% cache reduction as what makes a 1.2B-class model fit "practical on-device" budgets. For cloud serving where quality is paramount, the data still favors plain GQA-4 (75% reduction at only 0.7% perplexity), with Q-K=V entering when you need to push past what head sharing alone can give.

The operational caveat is the one the paper itself flags: validate on your downstream tasks, not on perplexity. The 1.2B results show a 2.48% perplexity gap shrinking to 0.41 points on benchmarks, and that gap could move either way on a domain the pretraining mix did not cover.

Insights Worth Remembering

The three projections are not equally load-bearing. Keys and values are near-duplicates in weight space; the query is the one doing distinct work. "Three projections" was a convention, not a requirement.
Symmetry is the real constraint, not parameter count. The variants that hurt are the ones that tie $W_Q = W_K$ and force a symmetric attention matrix, which causal language modeling cannot tolerate.
A method can ace the benchmark and still be useless. Q=K-V can look comparable to Q-K=V on training quality yet saves zero cache, a reminder to evaluate the metric you actually deploy against.
The KV cache, not the weight matrix, is the prize. A 33% projection cut shrinks the model by only about 7%, but it removes a full 50% of the cache, and the cache is what limits concurrency.
Efficiency tricks live on different axes. Head sharing (GQA/MQA) and projection sharing (Q-K=V) multiply because they cut in different places; that is why Q-MQA can reach 96.9%.
Perplexity is a lossy proxy for capability. A 2.48% perplexity gap that becomes 0.41 points on downstream tasks should make you suspicious of any architecture decision justified by perplexity alone.
Constraints sometimes scale gracefully. The K=V sharing penalty falls from 3.1% to 2.48% going 300M to 1.2B, the opposite of what you fear, suggesting the redundancy is structural rather than an artifact of small models.

Open Questions

The evidence is solid up to 1.2B parameters and 2048-token sequences; everything past that line is inference, and the paper is careful to say so.

Does the K=V penalty keep shrinking at 7B, 70B, and beyond? The measured trend (3.1% to 2.48%) is encouraging but two data points do not make a scaling law. Whether the redundancy between keys and values persists or erodes at frontier scale is unresolved.
What happens at 128k context and beyond? Attention's share of compute keeps rising with length, so the win should grow, but length extrapolation behavior of the shared variants is untested. A symmetric pathology that is mild at 2k could compound at 128k, or not.
Is there a formal reason K and V are shareable? The cosine-similarity argument is descriptive. A theory predicting which heads or layers tolerate sharing would turn a global constraint into a per-layer one, plausibly recovering most of the lost quality.
Can the sharing be learned rather than imposed? A natural next step is a soft or gated tie between $W_K$ and $W_V$ that the model relaxes where it needs distinct values, sitting between full QKV and hard Q-K=V. The paper does not test this; it is a natural speculation, not a result.
How does Q-K=V interact with MLA-style latent compression? Both target keys and values. Whether they stack, conflict, or render each other redundant is an open and very practical question for long-context serving.
