Cache-Augmented Generation: When Preloaded KV-Caches Replace Retrieval Pipelines
June 02, 2026 · 22 min read
When the original RAG paper landed in 2020, the pitch was elegant: give a language model access to external documents at inference time so it can ground its answers in real data. Four years later, every production LLM stack seems to include a vector database, an embedding model, a chunking strategy, and a retriever, each introducing its own failure modes. Cache-Augmented Generation (CAG) asks a pointed question: what if you skipped all of that and simply loaded the knowledge into the model's memory before the first query arrived?
Why this matters: RAG pipelines fail in two distinct ways: the retriever misses relevant documents, or it surfaces irrelevant ones. Both corrupt the generation. CAG eliminates the retriever entirely by precomputing the model's key-value cache over the full knowledge base, turning inference into a single forward pass with no runtime search. For bounded knowledge domains, this is not a simplification; it is a fundamentally different architecture.
TL;DR
- CAG preloads an entire knowledge base into the LLM's extended context window and caches the resulting key-value tensors to disk, eliminating the retrieval step entirely.
- KV-cache reuse means subsequent queries skip the expensive prefill computation; only the new query tokens require a forward pass.
- On HotpotQA, CAG achieves a BERTScore of 0.7759 versus 0.7516 for dense RAG (small split), while running 40x faster on the largest split (2.3s vs. 94s).
- The approach works because modern context windows (128k+ tokens) can hold tens of thousands of document tokens, and the KV-cache makes re-reading them free after the first pass.
- CAG's hard ceiling is context window size: if the knowledge base exceeds the window, you must either compress, partition, or fall back to retrieval.
- Prefix caching in production systems (Anthropic, OpenAI, Google, vLLM) applies the same principle at the API level, delivering 50-90% cost and latency reductions for shared prompt prefixes.
- The tradeoff is explicit: CAG trades GPU memory and upfront compute for zero-latency, zero-error retrieval at query time.
At a Glance
%%{init: {'theme': 'base', 'themeVariables': {'primaryColor': '#1e40af', 'primaryTextColor': '#fff', 'primaryBorderColor': '#60a5fa', 'lineColor': '#94a3b8', 'textColor': '#e2e8f0', 'clusterBkg': '#1e293b', 'clusterBorder': '#334155', 'fontSize': '16px'}}}%%
flowchart LR
D["Knowledge<br/>Documents"] --> P["Prefill Pass"]
P --> KV["KV-Cache<br/>(stored to disk)"]
KV --> I["Load Cache +<br/>Append Query"]
I --> G["Generate<br/>Response"]
G --> R["Reset Cache<br/>(truncate query)"]
R --> I
classDef blue fill:#1e40af,stroke:#3b82f6,stroke-width:1px,color:#fff
classDef purple fill:#6d28d9,stroke:#a78bfa,stroke-width:1px,color:#fff
classDef teal fill:#0e7490,stroke:#22d3ee,stroke-width:1px,color:#fff
classDef emerald fill:#047857,stroke:#34d399,stroke-width:1px,color:#fff
class D blue
class P,I purple
class KV teal
class G,R emerald
Before CAG
The problem CAG solves is older than its name. Language models have always faced a tension between what they know (parameters) and what they can see (context). The strategies for bridging that gap define a timeline of increasingly sophisticated workarounds.
%%{init: {'theme': 'base', 'themeVariables': {'cScale0': '#1e40af', 'cScale1': '#6d28d9', 'cScale2': '#b45309', 'cScale3': '#be123c', 'cScale4': '#047857', 'cScale5': '#0e7490', 'cScale6': '#1e40af', 'cScaleLabel0': '#e2e8f0', 'cScaleLabel1': '#e2e8f0', 'cScaleLabel2': '#e2e8f0', 'cScaleLabel3': '#e2e8f0', 'cScaleLabel4': '#e2e8f0', 'cScaleLabel5': '#e2e8f0', 'cScaleLabel6': '#e2e8f0', 'textColor': '#e2e8f0', 'lineColor': '#94a3b8', 'fontSize': '16px'}}}%%
timeline
title From Fixed Context to Cached Knowledge
2017 : Transformer introduced (Vaswani et al.)
: 512-token context window
2019 : Multi-Query Attention (Shazeer)
: First KV-cache size reduction
2020 : RAG introduced (Lewis et al.)
: Retriever + generator pipeline
2023 : PagedAttention / vLLM (Kwon et al.)
: OS-style paged KV memory
: GQA paper (Ainslie et al.)
2024 : 128k-1M token context windows
: Prefix caching in production APIs
: StreamingLLM, H2O eviction
2025 : CAG paper (Chan et al., WWW 2025)
: Hybrid CAG-RAG frameworks
The original Transformer operated on 512 tokens. RAG was a reasonable response to that constraint: if the model cannot hold the documents, fetch the relevant ones at query time. But RAG introduced a retrieval dependency, and retrieval is neither free nor reliable. Sparse retrievers (BM25) miss semantic matches; dense retrievers (embedding similarity) miss lexical matches. Both require an index that must be kept current, a chunking strategy that inevitably splits relevant information across boundaries, and a top-k selection that discards potentially useful context.
The shift to 128k+ context windows in 2024 (Llama 3.1, Gemini 1.5, Claude 3) changed the economics. A model that can hold 85,000 tokens of documents in a single pass does not need a retriever for knowledge bases of that size. The remaining question was cost: processing 85k tokens on every query is expensive. KV-cache precomputation answers that question.
[IMAGE: Timeline bar chart showing context window growth across model generations: GPT-2 (1024), GPT-3 (2048), GPT-3.5 (4096), Claude 2 (100k), Llama 3.1 (128k), Gemini 1.5 Pro (1M), annotated with the year each shipped]
How Cache-Augmented Generation Actually Works
CAG operates in three phases, each corresponding to a distinct computational step. Understanding them requires first understanding what a KV-cache is and why it exists.
The KV-Cache: Why Transformers Remember
In autoregressive generation, a transformer produces one token at a time. At each step, the self-attention mechanism computes queries, keys, and values for every token in the sequence. For a model with \(L\) layers and \(h\) attention heads, each with dimension \(d_k\), the attention computation at layer \(l\) is:
\[\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right) V\]The critical observation: the key and value projections for token \(i\) at layer \(l\) depend only on token \(i\)'s hidden state at that layer. They do not change when token \(i+1\) is generated. Recomputing them is pure waste.
The KV-cache stores every previously computed \((K, V)\) pair. When generating token \(t+1\), the model computes only the new token's \(Q_{t+1}\), \(K_{t+1}\), \(V_{t+1}\), appends \(K_{t+1}\) and \(V_{t+1}\) to the cache, and runs attention against the full cached sequence. This reduces per-step complexity from \(O(t \cdot d)\) to \(O(d)\) for the projection step, though the attention dot-product itself remains \(O(t \cdot d_k)\) per head.
The memory cost of the KV-cache for a single sequence is:
\[M_{\text{KV}} = 2 \times L \times h_{\text{kv}} \times d_k \times T \times b\]where \(T\) is the sequence length, \(h_{\text{kv}}\) is the number of key-value heads (equal to \(h\) for MHA, fewer for GQA/MQA), and \(b\) is bytes per element (2 for FP16/BF16). For Llama 2 70B with 80 layers, 64 heads, \(d_k = 128\), in BF16: each token costs approximately 2.5 MB of cache. A 50,000-token document occupies roughly 125 GB of KV-cache, which is why cache management matters.
[IMAGE: Diagram showing KV-cache growth across decoding steps: at step 1, one KV pair per layer; at step t, t pairs per layer; with memory bars growing linearly, annotated with per-token memory cost for a 70B model]
Phase 1: Preloading (Offline)
The entire knowledge base \(\mathcal{D}\) (documents, manuals, FAQ entries) is concatenated and fed through the model's prefill pass. This produces the full KV-cache:
\[\mathcal{C}_{\text{KV}} = \text{KV-Encode}(\mathcal{D})\]This is the expensive step. For 85,000 tokens on Llama 3.1 8B across 8 Tesla V100 GPUs, prefill takes tens of seconds. But it happens once. The resulting \(\mathcal{C}_{\text{KV}}\) is serialized to disk and reloaded for every subsequent query, amortizing the cost over thousands of inferences.
Phase 2: Inference (Online)
A user query \(\mathcal{Q}\) arrives. The system loads \(\mathcal{C}_{\text{KV}}\) from disk into GPU memory, appends the query tokens, and generates:
\[\mathcal{R} = \mathcal{M}(\mathcal{Q} \mid \mathcal{C}_{\text{KV}})\]Because the KV-cache for \(\mathcal{D}\) is already computed, the model only runs prefill on the query tokens (typically under 100 tokens). The attention mechanism sees the full knowledge base through the cached keys and values without reprocessing a single document token.
Phase 3: Reset (Between Queries)
After generating a response, the query and response tokens have been appended to the cache. To serve the next query, the system truncates:
\[\mathcal{C}_{\text{KV}}^{\text{reset}} = \text{Truncate}(\mathcal{C}_{\text{KV}},\, t_1, t_2, \ldots, t_k)\]where \(t_1 \ldots t_k\) are the positions of the appended tokens. This restores the cache to its original document-only state, ready for the next query with no recomputation.
[IMAGE: Side-by-side memory layout diagrams: left shows RAG's per-query allocation (embedding lookup + retrieved chunk encoding + generation), right shows CAG's persistent cache block with a thin query appendix, highlighting the memory reuse]
%%{init: {'theme': 'base', 'themeVariables': {'actorBkg': '#1e40af', 'actorTextColor': '#fff', 'actorBorder': '#3b82f6', 'signalColor': '#94a3b8', 'signalTextColor': '#e2e8f0', 'labelBoxBkgColor': '#1e293b', 'labelBoxBorderColor': '#334155', 'labelTextColor': '#e2e8f0', 'loopTextColor': '#e2e8f0', 'noteBkgColor': '#1e293b', 'noteTextColor': '#e2e8f0', 'noteBorderColor': '#475569', 'activationBorderColor': '#3b82f6', 'activationBkgColor': '#1e3a5f', 'fontSize': '16px'}}}%%
sequenceDiagram
participant U as User
participant S as CAG Server
participant C as KV-Cache (Disk)
participant M as LLM
Note over S,C: Offline: one-time preloading
S->>M: Feed full knowledge base D
M-->>C: Serialize KV-cache to disk
Note over U,M: Online: per-query inference
U->>S: Query Q
S->>C: Load cached KV tensors
C-->>S: Return cached KV
S->>M: Append Q tokens + cached KV
M-->>S: Generate response R
S-->>U: Return R
S->>C: Truncate (reset to D-only state)
Seeing It in Motion
RAG vs. CAG: Architectural Comparison
The structural difference between RAG and CAG is not incremental; it is a different topology. RAG has a retrieval loop in the critical path. CAG has a cache load.
%%{init: {'theme': 'base', 'themeVariables': {'primaryColor': '#1e40af', 'primaryTextColor': '#fff', 'primaryBorderColor': '#60a5fa', 'lineColor': '#94a3b8', 'textColor': '#e2e8f0', 'clusterBkg': '#1e293b', 'clusterBorder': '#334155', 'fontSize': '16px'}}}%%
flowchart TB
subgraph RAG["RAG Pipeline"]
Q1["Query"] --> E1["Embed Query"]
E1 --> VS["Vector Search"]
VS --> RK["Rank + Top-k"]
RK --> CH["Chunk Assembly"]
CH --> LLM1["LLM Prefill<br/>(query + chunks)"]
LLM1 --> R1["Response"]
end
subgraph CAG["CAG Pipeline"]
Q2["Query"] --> LD["Load KV-Cache"]
LD --> AP["Append Query<br/>Tokens"]
AP --> LLM2["LLM Decode<br/>(cached context)"]
LLM2 --> R2["Response"]
end
classDef blue fill:#1e40af,stroke:#3b82f6,stroke-width:1px,color:#fff
classDef purple fill:#6d28d9,stroke:#a78bfa,stroke-width:1px,color:#fff
classDef teal fill:#0e7490,stroke:#22d3ee,stroke-width:1px,color:#fff
classDef amber fill:#b45309,stroke:#fbbf24,stroke-width:1px,color:#fff
classDef emerald fill:#047857,stroke:#34d399,stroke-width:1px,color:#fff
class Q1,Q2 blue
class E1,VS,RK,CH amber
class LLM1,LLM2,LD,AP purple
class R1,R2 emerald
RAG's critical path includes embedding computation, vector similarity search, ranking, and chunk assembly before the LLM ever sees the query. Each step can fail: the embedder might map a query to the wrong region of the vector space; the top-k selection might exclude a relevant passage just below the threshold; the chunking might split a key sentence across two chunks, with neither selected. CAG's critical path is: load a file, append tokens, decode.
KV-Cache Compression Landscape
The memory cost of full KV-caches has spawned a parallel research effort in cache compression. These techniques are orthogonal to CAG and can extend its effective range.
%%{init: {'theme': 'base', 'themeVariables': {'primaryColor': '#1e40af', 'primaryTextColor': '#fff', 'primaryBorderColor': '#60a5fa', 'lineColor': '#94a3b8', 'textColor': '#e2e8f0', 'clusterBkg': '#1e293b', 'clusterBorder': '#334155', 'fontSize': '16px'}}}%%
flowchart TD
KV["Full KV-Cache"] --> EV["Eviction Strategies"]
KV --> QZ["Quantization"]
KV --> AR["Architectural"]
EV --> H2O["H2O: Heavy Hitters<br/>+ Recent Tokens"]
EV --> SL["StreamingLLM:<br/>Sink Tokens + Window"]
EV --> SK["SnapKV:<br/>Attention-Based Selection"]
QZ --> KI["KIVI: Per-Channel<br/>INT4/INT2"]
QZ --> KQ["KVQuant: Outlier-Aware"]
AR --> MQA["MQA: Single KV Head"]
AR --> GQA["GQA: Grouped KV Heads"]
AR --> MLA["MLA: Latent<br/>Compression"]
classDef blue fill:#1e40af,stroke:#3b82f6,stroke-width:1px,color:#fff
classDef purple fill:#6d28d9,stroke:#a78bfa,stroke-width:1px,color:#fff
classDef teal fill:#0e7490,stroke:#22d3ee,stroke-width:1px,color:#fff
classDef amber fill:#b45309,stroke:#fbbf24,stroke-width:1px,color:#fff
class KV blue
class EV,QZ,AR purple
class H2O,SL,SK teal
class KI,KQ amber
class MQA,GQA,MLA teal
[IMAGE: Bar chart comparing KV-cache memory per token across attention variants: MHA (full), GQA-8 (1/8 KV heads), GQA-4 (1/4 KV heads), MQA (single KV head), with memory in MB on the y-axis for a 70B-class model]
By the Numbers
The CAG paper (Chan et al., 2025, Don't Do RAG, arXiv:2412.15605) evaluated against sparse RAG (BM25) and dense RAG (OpenAI embeddings) on two standard QA benchmarks using Llama 3.1 8B Instruct with a 128k context window.
Accuracy (BERTScore)
| Dataset | Split | Tokens | CAG | Sparse RAG (best top-k) | Dense RAG (best top-k) |
|---|---|---|---|---|---|
| HotpotQA | Small | 21k | 0.7759 | 0.7549 (k=5) | 0.7516 (k=10) |
| HotpotQA | Medium | 43k | 0.7696 | 0.7619 (k=3) | 0.7464 (k=3) |
| HotpotQA | Large | 85k | 0.7527 | 0.7495 (k=5) | 0.7426 (k=3) |
| SQuAD | Small | 21k | 0.8265 | 0.8191 (k=10) | 0.8035 (k=10) |
| SQuAD | Medium | 32k | 0.7512 | 0.7471 (k=3) | 0.7350 (k=10) |
| SQuAD | Large | 50k | 0.7640 | 0.7548 (k=10) | 0.7499 (k=10) |
CAG outperforms both RAG variants on every split. The margin is widest on small knowledge bases, where the retriever's top-k selection most aggressively discards relevant context.
Latency (Seconds per Query)
| Dataset | Split | CAG (with cache) | Without cache | Speedup |
|---|---|---|---|---|
| HotpotQA | Small | 0.85 | 9.25 | 10.8x |
| HotpotQA | Medium | 1.66 | 28.82 | 17.4x |
| HotpotQA | Large | 2.33 | 94.35 | 40.5x |
| SQuAD | Small | 1.07 | 10.30 | 9.6x |
| SQuAD | Medium | 1.73 | 13.36 | 7.7x |
| SQuAD | Large | 2.41 | 31.08 | 12.9x |
The speedup grows with context length because the prefill cost (avoided by CAG) scales linearly with sequence length, while cache loading from disk scales with cache file size, which benefits from sequential I/O.
KV-Cache Memory Formula
For planning a CAG deployment, the cache size in bytes for a given model and document length:
\[M_{\text{cache}} = 2 \times L \times h_{\text{kv}} \times d_k \times T \times b\]| Model | \(L\) | \(h_{\text{kv}}\) | \(d_k\) | Per-Token (BF16) | 50k Tokens |
|---|---|---|---|---|---|
| Llama 3.1 8B | 32 | 8 (GQA) | 128 | 128 KB | 6.1 GB |
| Llama 2 70B | 80 | 64 (MHA) | 128 | 2.5 MB | 122 GB |
| Llama 3.1 70B | 80 | 8 (GQA) | 128 | 320 KB | 15.3 GB |
GQA's 8x reduction in KV heads directly translates to 8x smaller caches, making CAG practical on consumer hardware for GQA models where it would be infeasible for MHA models of similar size.
[IMAGE: Log-scale line plot of KV-cache size vs. sequence length for three model architectures (MHA-70B, GQA-70B, GQA-8B), with a horizontal line marking 24GB VRAM (consumer GPU limit) and 80GB VRAM (A100), showing where each model hits the memory wall]
A Concrete Example
Consider a corporate IT helpdesk with 200 internal policy documents totaling 40,000 tokens. The team deploys CAG on Llama 3.1 8B.
Step 1: Preloading. All 200 documents are concatenated with simple separators and fed through the model. The prefill pass processes 40,000 tokens, producing the KV-cache: $2 \times 32 \times 8 \times 128 \times 40000 \times 2 = 5.24$ GB. This takes approximately 45 seconds on a single A100 GPU. The cache is serialized to an NVMe SSD.
Step 2: First query arrives. An employee asks: "What is the policy for requesting extended parental leave?" The system loads the 5.24 GB cache from SSD to GPU memory (approximately 0.5 seconds over PCIe Gen4), appends the 15 query tokens, and generates a response. Total latency: roughly 1.2 seconds. Without the cache, the model would need to process all 40,015 tokens from scratch, taking approximately 12 seconds.
Step 3: Second query, same cache. Another employee asks about the VPN setup procedure. The cache is already in GPU memory. Only the new query tokens (12 tokens) need prefill. Latency: under 0.8 seconds. The cache reset from the previous query simply truncates the last response's KV entries.
Step 4: Knowledge update. A new remote-work policy is published. The administrator appends it to the document set (now 41,200 tokens), re-runs the prefill pass (approximately 47 seconds), and replaces the cached file. Total downtime for updates: under a minute, with no embedding pipeline, no re-indexing, and no chunking strategy to revisit.
Compare this to the RAG equivalent: the same update requires re-chunking the new document, computing embeddings, inserting into a vector database, and hoping the retriever surfaces the new policy correctly for semantically related queries.
[IMAGE: Annotated timeline of a single CAG query lifecycle: disk read (0.5s), token append (negligible), decode (0.7s), reset (negligible), with total wall-clock time marked, compared to a RAG timeline showing embed (0.1s), search (0.2s), rank (0.05s), prefill (3s), decode (0.7s)]
Where It Breaks
CAG is not a universal replacement for RAG. Its failure modes are specific and predictable.
Context window ceiling. The hard constraint is the model's maximum context length. Llama 3.1's 128k tokens translates to roughly 250-300 pages of text. Enterprise knowledge bases routinely exceed this. A legal corpus, a full product documentation set, or a medical literature collection will not fit. There is no graceful degradation; you either fit or you don't.
Attention degradation over long contexts. Even within the context window, transformer attention is not uniform. The "lost in the middle" phenomenon (Liu et al., 2023, arXiv:2307.03172) shows that models retrieve information from the beginning and end of the context more reliably than from the middle. For a 85k-token CAG deployment, documents placed in positions 30k-60k may receive less effective attention, degrading answer quality without any obvious signal.
Stale knowledge. CAG's cache is a snapshot. If the underlying documents change frequently (hourly news feeds, live inventory data, real-time pricing), the cache must be rebuilt, and the prefill cost is paid again. For knowledge bases that update more than a few times per day, the amortization arithmetic favors RAG's per-query freshness.
Multi-tenant isolation. Each tenant with different access permissions needs a separate cache. Ten tenants with 40k-token knowledge bases on Llama 3.1 8B requires 52 GB of cache storage and, if served concurrently, 52 GB of GPU memory. RAG's per-query retrieval naturally supports access control by filtering at the retrieval layer.
No relevance signal. RAG's retriever provides a relevance score that can be used for confidence estimation ("the top result scored 0.92, so the answer is likely grounded"). CAG provides no such signal. The model sees everything and must internally decide what is relevant, with no external check.
[IMAGE: Heat map visualization of attention weights over a 50k-token context, showing high attention at positions 0-2k and 48k-50k, with a noticeable trough in the 20k-35k range, illustrating the "lost in the middle" effect]
Alternative Designs
| Approach | Retrieval Needed | Upfront Cost | Per-Query Latency | Handles Updates | Scales Past Context Window |
|---|---|---|---|---|---|
| CAG | No | High (prefill) | Very low | Requires rebuild | No |
| RAG (sparse) | Yes (BM25) | Low (index) | Medium | Incremental | Yes |
| RAG (dense) | Yes (embeddings) | Medium (embed all) | Medium-high | Re-embed changed docs | Yes |
| Long-context (no cache) | No | None | Very high | Immediate | No |
| Hybrid CAG-RAG | Selective | Medium | Low | Partial rebuild | Yes (via fallback) |
| Fine-tuning | No | Very high (training) | Lowest | Requires retraining | Yes (in parameters) |
Long-context without caching is the naive baseline: stuff all documents into the prompt on every query. This gives CAG's accuracy without its speed. The 40.5x speedup on HotpotQA-Large is entirely the cache's contribution.
Hybrid CAG-RAG (Agrawal and Kumar, 2025, arXiv:2505.08261) preloads a stable core knowledge base via CAG and uses selective retrieval only for queries that require information beyond the cached context. This extends CAG's effective range while preserving its speed for the common case.
Fine-tuning bakes knowledge into the model's parameters, eliminating both retrieval and context loading. But it requires training runs measured in GPU-hours, does not support rapid updates, and risks catastrophic forgetting of the base model's capabilities. CAG occupies a middle ground: knowledge is external (like RAG) but pre-processed (like fine-tuning).
How It Is Used in Practice
API-Level Prefix Caching
The production deployment of CAG principles is already widespread, though not always under the CAG name. Anthropic, OpenAI, Google, and DeepSeek have each implemented prefix caching in their inference APIs. The mechanism is identical in spirit to CAG: if two requests share the same prompt prefix, the KV-cache for that prefix is computed once and reused.
Anthropic reports 90% cost reduction and 85% latency reduction for long shared prefixes. Google's Gemini context caching offers explicit TTL-controlled caches with 75% input token cost reduction, with implicit caching (automatic prefix matching) enabled by default since May 2025.
Self-Hosted Inference Frameworks
vLLM's PagedAttention (Kwon et al., 2023, Efficient Memory Management for LLM Serving, arXiv:2309.06180) brought OS-style virtual memory to KV-cache management. Existing systems were wasting 60-80% of KV-cache memory to fragmentation and over-reservation; PagedAttention's block-based allocation improved throughput by 2-4x. vLLM now ships with automatic prefix caching enabled by default.
SGLang's RadixAttention extends this to tree-structured prefix sharing, where multiple prompts that diverge from a common prefix share cache up to the divergence point. This is particularly useful for few-shot prompting scenarios where the same examples prefix many different queries.
Where CAG Fits in Production
The natural deployment targets for full CAG (not just prefix caching) are bounded knowledge domains: customer support for a specific product line, internal documentation Q&A, compliance checking against a fixed regulatory corpus, or medical reference lookup within a specialty's guidelines. These share three properties: the knowledge base is smaller than the context window, it updates infrequently, and query latency matters.
[IMAGE: Architecture diagram of a production CAG deployment showing: document ingestion pipeline feeding into a periodic prefill worker, KV-cache stored on NVMe SSD, served by a GPU inference server with cache-hot and cache-cold paths, with a cache invalidation webhook from the CMS]
Insights Worth Remembering
-
CAG is not an optimization of RAG; it is a different architecture. RAG searches then reads. CAG reads once, then answers forever. The failure modes, scaling characteristics, and operational requirements share almost nothing.
-
The KV-cache is the model's working memory, not just a performance trick. Every token the model has processed lives in the cache as a key-value pair. Loading a precomputed cache is equivalent to giving the model perfect recall of a document it read earlier.
-
GQA made CAG practical. Moving from 64 KV heads (MHA) to 8 KV heads (GQA) reduced Llama's KV-cache by 8x. A 50k-token cache for Llama 3.1 70B fits in 15 GB instead of 122 GB. Without GQA, the memory cost would limit CAG to small models or short documents.
-
The speedup grows with context length. CAG's advantage is proportional to the prefill cost it avoids. Short contexts see modest speedups (under 10x); 85k-token contexts see 40x. This means CAG is most valuable precisely where it is most expensive to deploy.
-
Cache reset is what makes CAG a serving strategy, not just a batch trick. The ability to truncate appended tokens and revert to the document-only cache state enables request-level reuse without rebuilding.
-
Prefix caching at the API level is CAG without the name. When Anthropic caches your system prompt's KV tensors across API calls, that is cache-augmented generation. The research paper formalized a pattern that production systems had already discovered.
-
The "lost in the middle" problem means document ordering matters. In a CAG deployment, placing the most frequently queried documents at the beginning and end of the concatenated context can measurably improve answer quality, an operational detail that RAG does not require.
-
Hybrid CAG-RAG is likely the long-term equilibrium. Pure CAG for the stable knowledge core, selective retrieval for the long tail and dynamic content. This captures the latency benefits for the common case without hitting the context window ceiling.
Open Questions
Can KV-cache compression extend CAG's effective range without quality loss? Quantizing cached KV pairs to INT4 or INT2 (as in KIVI) could reduce memory by 4-8x, potentially fitting 200k+ tokens of knowledge into a single cache. Early results from H2O and StreamingLLM show that selective eviction can maintain quality for generation tasks, but their impact on the retrieval-like access patterns CAG requires (attending to arbitrary positions across a long cached context) has not been systematically measured.
Will sparse attention architectures change the equation? Models that attend to subsets of the cached context (Longformer-style local+global patterns, or learned sparse attention) could reduce the per-query attention cost from \(O(T)\) to \(O(\sqrt{T})\) or \(O(\log T)\), making very long CAG contexts computationally feasible even without eviction.
How should multi-turn conversations interact with a persistent KV-cache? The current CAG formulation resets between queries. But a conversation that builds on previous answers could benefit from retaining some query-response context across turns while keeping the document cache intact. The truncation strategy becomes more complex.
What is the right cache invalidation strategy for slowly-evolving knowledge bases? When 5% of documents change, must the entire cache be rebuilt? Partial cache updates, where only the KV entries corresponding to changed document positions are recomputed, are theoretically possible but require tracking the mapping between document positions and cache positions. No current system implements this efficiently.
Can CAG work across multiple context windows through cache chaining? If a knowledge base requires 300k tokens but the model supports 128k, three overlapping CAG caches with a routing layer could cover the full corpus. This introduces a lightweight retrieval step (which cache to load) without the full RAG pipeline, though it adds latency for cache swapping.
[IMAGE: Decision flowchart for choosing between CAG, RAG, and hybrid approaches: starts with "Knowledge base size vs. context window", branches on update frequency, query latency requirements, and multi-tenancy needs, ending at recommended architecture for each combination]
Sources and Further Reading
Foundational Papers
- Chan et al., 2025, Don't Do RAG: When Cache-Augmented Generation is All You Need for Knowledge Tasks, arXiv:2412.15605 - The paper that formalized CAG as a named paradigm, with benchmarks against RAG on SQuAD and HotpotQA.
- Lewis et al., 2020, Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks, arXiv:2005.11401 - The original RAG paper introducing the retriever-generator architecture.
- Vaswani et al., 2017, Attention Is All You Need, arXiv:1706.03762 - The Transformer paper that established the self-attention mechanism and the Q/K/V decomposition underlying all KV-cache work.
Important Follow-up Work
- Agrawal and Kumar, 2025, Enhancing Cache-Augmented Generation with Adaptive Contextual Compression for Scalable Knowledge Integration, arXiv:2505.08261 - Introduces adaptive compression and hybrid CAG-RAG frameworks.
- Kwon et al., 2023, Efficient Memory Management for Large Language Model Serving with PagedAttention, arXiv:2309.06180 - PagedAttention and vLLM; OS-inspired paged memory management for KV-caches. Best Paper at SOSP 2023.
- Ainslie et al., 2023, GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints, arXiv:2305.13245 - Grouped-Query Attention, which made large-model KV-caches practical.
- Shazeer, 2019, Fast Transformer Decoding: One Write-Head is All You Need, arXiv:1911.02150 - Multi-Query Attention, the first architectural reduction of KV-cache size.
- Liu et al., 2023, Lost in the Middle: How Language Models Use Long Contexts, arXiv:2307.03172 - Demonstrates attention degradation in the middle of long contexts.
KV-Cache Compression
- Xiao et al., 2024, Efficient Streaming Language Models with Attention Sinks, arXiv:2309.17453 - StreamingLLM: attention sink tokens plus a sliding window for infinite-length generation.
- Zhang et al., 2024, H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models, arXiv:2306.14048 - Heavy-hitter based KV-cache eviction achieving 29x throughput improvement.
Technical Blogs and Resources
- vLLM Blog, 2023, Easy, Fast, and Cheap LLM Serving with PagedAttention - Practical guide to PagedAttention deployment.
- BentoML, LLM Inference Handbook: Prefix Caching - Reference on prefix caching mechanics in serving frameworks.