Prompt Caching Infrastructure

A typical RAG chat request sends the same 4,000-token system prompt and the same 8,000 tokens of retrieved context to the model on every turn. The model recomputes the KV cache for all of it, every time. Prompt caching makes that absurdity go away: compute the prefix once, persist its KV cache, and on subsequent requests with the same prefix, the model resumes from where the cache left off. The cost reduction is 70-90% on the cached portion, and the latency win is the same shape (you skip the prefill for cached tokens).

What "the cache key" actually is

The shared design across providers: the cache key is the exact byte prefix of the prompt up to the caching boundary. One byte different, one extra space, one reordered system message - cache miss. The cache is content-addressed by hashing the token sequence, sometimes block by block, sometimes as the full prefix.

This has two practical consequences:

Order your prompt from most-stable to least-stable. System prompt -> tool definitions -> RAG context -> conversation history -> current user turn. Anything that changes invalidates everything after it.
Do not interpolate dynamic values into the cacheable prefix. A timestamp in the system prompt is the most common foot-gun. Once a day at midnight, your hit rate goes to zero and your bill triples.

The three providers

Anthropic prompt caching

Anthropic exposes caching via cache_control markers on content blocks. You explicitly mark up to four breakpoints in the prompt; everything before each marker is cached as a separate entry.

Default TTL: 5 minutes, optional 1-hour TTL with cache_control: {type: "ephemeral", ttl: "1h"}.
Cache writes cost 25% more than base input tokens (5-min TTL) or 2x (1-hour TTL). Cache reads cost 10% of base input tokens. So a cache hit is a 10x cost reduction on the cached portion.
Minimum cacheable length: 1,024 tokens for Sonnet/Opus, 2,048 for Haiku.
Workspace-level isolation so one tenant's cache cannot be read by another.
Pre-warming is supported via max_tokens: 0 requests.

The mental model: pay 1.25x once to write a 5-min cache; pay 0.1x to read from it. Break-even after ~3 reads. If your prefix gets reused more than 3 times in 5 minutes, caching pays. In practice, multi-turn chat hits this trivially.

OpenAI prompt caching

OpenAI's model is automatic and opaque: any prompt over 1,024 tokens is eligible, the system hashes the prefix in chunks, and you get a cached_tokens field in the usage response telling you how many were hits. No explicit markup. Cached tokens are billed at 50% of standard input price (some models 25%). TTL is unspecified but typically 5-10 minutes of inactivity.

The trade-off vs Anthropic: less control, less savings (2x vs 10x reduction), but zero integration work. If you do nothing differently, you start getting cache hits.

What "the cache key" actually is

The three providers

Anthropic prompt caching

OpenAI prompt caching

Keep reading with Pro.