Prompt caching is not just a cost optimisation - it changes what apps you can build
May 31, 2026 · 4 min read
Anthropic introduced prompt caching in 2024. The headline pitch was "save 90% on input tokens for cached content." Most teams I see treat this purely as a billing line item. They flip it on, watch their costs drop, and stop thinking about it. That is a mistake. The bigger story is what kind of product you can now ship.
The mechanics
Prompt caching lets you mark portions of your prompt as cacheable. The first call computes everything; subsequent calls within the cache TTL (5 minutes for Anthropic at the time of writing) skip the recomputation of the cached portion. The cached input tokens cost ~10% of normal input tokens.
The minimum cache size is 1024 tokens. The cache is keyed by exact content. Order matters - the cached block must come at the start.
That is it. The mechanics are simple. The implications are not.
What you used to optimise away
Before prompt caching, every API call ate full input cost. So you optimised system prompts down to a few hundred tokens. You stripped your retrieval results to bare minima. You cached completions outside the model. The product patterns you avoided were the ones that needed lots of context.
That avoidance came with hidden costs:
- Behaviour drift. Short system prompts left more room for the model to make wrong choices.
- Reduced grounding. Pruning retrieved context to save tokens hurt answer quality.
- Stateless UX. Long-running chats lost their personality between turns.
You absorbed these as inherent costs of using LLMs. They were actually the consequence of a billing model.
What you can build now
With caching, you can put 5,000-50,000 tokens of stable context at the top of every call for a tenth of the cost. That budget lets you ship product patterns that were previously latency- or cost-prohibitive:
Persistent agents with rich personalities. A 10,000-token system prompt that defines the agent's voice, refusal behaviour, citation style, and worked examples. Every turn benefits from the full specification. Cached.
Mentor systems with full curriculum context. Ship an AI tutor that "knows" your entire course library at every turn. Drop the curriculum into the cached system prompt. The model can cite any concept, any exercise, any prerequisite without retrieval.
Long-document Q&A. Cache the document. Every follow-up question runs against the full document for cents instead of dollars.
Code copilots that see your whole repo. Cache the project structure, key files, and conventions. Every code question gets the full context.
Multi-step agents. The agent loop fires multiple calls per user task. Each call reuses the cached agent definition. Loop cost drops 60-90% compared to non-cached.
The pattern: cache the slow-changing, vary the fast-changing
The mental model that helps: split every prompt into a slow-changing prefix and a fast-changing tail.
- Slow-changing prefix. System prompt, persona, examples, large retrieved documents, project context. Cache it.
- Fast-changing tail. The user's actual question, the latest conversation turn. Leave it uncached.
When you design a feature, ask: what is the largest prefix I can fix per user / per session / per task? That is your cache boundary. Every token you move into the prefix is a token you can afford to be more thorough with.
The trap: cache invalidation
Caches expire (5 minutes is short for some workflows) and they invalidate on edits to the prefix. Two consequences:
- Keep the prefix stable. If you generate the system prompt dynamically each call, you defeat the cache. Move computed content out of the cached block.
- Warm caches proactively for hot users. A background job that sends a no-op call every 4 minutes to keep the cache warm is cheaper than paying full input cost on every cache miss.
The 5-year angle
I expect prompt caching to deepen rather than disappear. Some likely trajectories:
- Longer TTLs (hour scale, day scale for stable enterprise content).
- Persistent caches that survive across sessions and devices.
- Tiered caching with deeper discounts for less-frequently-accessed cached content.
- First-class cache management APIs.
When any of these land, the product patterns I described above get even more attractive. Teams that built around aggressive caching will ship richer features at lower cost than teams that did not.
What to do this week
Audit one feature you ship today that uses an LLM. Identify the slow-changing portion of every prompt it generates. Move that portion to the top, mark it cacheable, and measure the cost reduction.
Then ask the harder question: what feature did you not ship because it was too expensive? Could you ship it now?
The default disposition of LLM engineering has been "save tokens." Prompt caching is an invitation to flip the default. Spend tokens. Spend them on context, on persona, on grounding. Make the prefix as rich as the product wants. The cache will absorb the cost.