Prompt Caching: A Practical Guide for Production LLM Engineering

If you're shipping LLM-powered features at scale, prompt caching is one of the highest-leverage optimizations available to you. It can cut latency by 50–80% and reduce input token costs by up to 90% on repeated prefixes, without changing your model or your output quality, and usually by reordering prompt content rather than rewriting it. But it has sharp edges: a short TTL, a minimum token threshold, and ordering rules that make or break cache hits.

Here's what engineers building production systems need to know.

What Prompt Caching Actually Does

When you send a request to an LLM, the model processes every token in your prompt from scratch — tokenization, embedding, and a forward pass through the attention layers. For a 10,000-token system prompt, that work is identical on every call.

Prompt caching tells the provider: "store the internal representation of this prefix, and reuse it next time I send the same prefix." On a cache hit, the model skips the expensive prefill for the cached portion and only processes the new tokens after the cached boundary.

The result:
- Cheaper input tokens on the cached portion (typically 10% of the normal rate on read, with a small write premium on the first call).
- Lower TTFT (time-to-first-token), because prefill is the dominant latency contributor for long prompts.
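
To make the pricing concrete, here is a rough worked example. The multipliers (0.10 for reads, 1.25 for writes), the base rate, and the workload shape are illustrative assumptions, not quoted prices; substitute your provider's actual numbers.

```python
def input_cost(cached_tokens: int, fresh_tokens: int, base_rate: float,
               hit: bool, read_mult: float = 0.10, write_mult: float = 1.25) -> float:
    """Input cost of one request, in the same currency unit as base_rate.

    base_rate is the price per token at the normal input rate.
    read_mult and write_mult are assumed multipliers, not quoted prices.
    """
    # The cacheable prefix is billed at the read rate on a hit
    # and at the write rate on a miss; new tokens pay the normal rate.
    prefix_mult = read_mult if hit else write_mult
    return (cached_tokens * prefix_mult + fresh_tokens) * base_rate


# Hypothetical workload: 10,000-token stable prefix, 200 new tokens per call,
# 100 calls inside one cache window (1 write, then 99 reads).
BASE_RATE = 3.0 / 1_000_000  # assumed: 3 currency units per million input tokens

uncached_total = 100 * (10_000 + 200) * BASE_RATE
cached_total = (input_cost(10_000, 200, BASE_RATE, hit=False)
                + 99 * input_cost(10_000, 200, BASE_RATE, hit=True))
savings = 1 - cached_total / uncached_total

print(f"uncached={uncached_total:.4f} cached={cached_total:.4f} savings={savings:.1%}")
```

Under these assumptions the write premium is paid once and amortized over the reads, which is why the savings approach but do not reach the headline 90%.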

The 5-Minute TTL

Most prompt caches use a 5-minute sliding TTL. Each cache hit refreshes the window; if no request touches the cache entry for 5 minutes, the entry is evicted.

This has real implications for system design:

- Sporadic traffic loses. A user who returns 10 minutes later pays full price again. Workloads with steady throughput benefit far more than ones with long gaps between requests.
- Warm-up requests matter. For latency-sensitive endpoints, consider issuing a synthetic "keep-alive" request every 4 minutes to keep hot prompts resident.
- Multi-region routing breaks caching. Caches are typically per-region or per-server. If your load balancer rotates regions, your hit rate craters. Pin sessions or accept the loss.

Some providers offer extended TTL (e.g., 1-hour caching on Anthropic at a higher write cost). Use this for prompts that are stable but accessed infrequently: large reference documents, schema definitions, or codebases used in agent workflows.

Rule of thumb: if your prompt is accessed more than once every 5 minutes, standard TTL works. If access is sparse but the prompt is large, pay for extended TTL.
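
The rule of thumb can be checked against your own traffic with a small simulation. This is a sketch of a sliding TTL for a single prefix, not any provider's actual eviction logic; `simulate_cache` is a hypothetical helper, and treating a request that arrives exactly at the TTL boundary as a hit is an assumption.

```python
def simulate_cache(request_times: list[float], ttl_seconds: float = 300.0) -> tuple[int, int]:
    """Return (hits, writes) for one prefix under a sliding TTL.

    request_times are seconds since an arbitrary start, in ascending order.
    A request hits if it arrives within ttl_seconds of the previous request,
    because every request (hit or write) refreshes the window.
    """
    hits = 0
    writes = 0
    last_touch = None
    for t in request_times:
        if last_touch is not None and t - last_touch <= ttl_seconds:
            hits += 1
        else:
            writes += 1  # cold start or expired entry: pay the write cost again
        last_touch = t
    return hits, writes


steady = [i * 120.0 for i in range(10)]    # one request every 2 minutes
sporadic = [i * 600.0 for i in range(10)]  # one request every 10 minutes

print("steady, 5-minute TTL:", simulate_cache(steady))
print("sporadic, 5-minute TTL:", simulate_cache(sporadic))
print("sporadic, 1-hour TTL:", simulate_cache(sporadic, ttl_seconds=3600.0))
```

Feeding real request timestamps from your logs into a function like this gives a quick estimate of whether the standard TTL is enough before you pay for the extended one.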

The 1024-Token Minimum

Caching isn't free to set up. Providers enforce a minimum cacheable prefix — typically 1024 tokens for larger models (Claude Sonnet/Opus, GPT-4 class) and 2048 for some smaller variants. Below that threshold, the cache breakpoint is silently ignored and you pay full price.

This shapes how you structure prompts:

[ SYSTEM PROMPT          ~800 tokens ]   ← too small to cache alone
[ TOOL DEFINITIONS       ~600 tokens ]
[ FEW-SHOT EXAMPLES    ~2,000 tokens ]
[ RETRIEVED CONTEXT    ~4,000 tokens ]   ← varies per request
[ USER MESSAGE             ~50 tokens ]

If you cache only the system prompt, you miss. Instead, set the cache breakpoint after the largest stable block:

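# Anthropic-style request pieces. Stable blocks go in the top-level `system`
# list; the Messages API has no "system" role inside `messages`.
system = [
    {"type": "text", "text": SYSTEM_PROMPT},
    {"type": "text", "text": TOOL_DEFINITIONS},
    {
        "type": "text",
        "text": FEW_SHOT_EXAMPLES,
        # Breakpoint: everything up to and including this block is cached.
        "cache_control": {"type": "ephemeral"},
    },
]

# Volatile content stays after the breakpoint.
messages = [{"role": "user", "content": user_query}]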

Now the cached prefix is ~3,400 tokens, comfortably above the threshold, and every subsequent request within the TTL reuses it.
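
A pre-flight check along these lines can catch undersized prefixes before they reach production. It is a sketch: the token counts are the estimates from the layout above, `MIN_CACHEABLE_TOKENS` is an assumed threshold that varies by provider and model, and `stable_prefix_tokens` is a hypothetical helper.

```python
MIN_CACHEABLE_TOKENS = 1024  # assumed threshold; check your provider and model


def stable_prefix_tokens(blocks: list[tuple[str, int, bool]]) -> int:
    """Sum token estimates for the leading run of stable blocks.

    Each block is (name, token_estimate, is_stable). Counting stops at the
    first volatile block, since nothing after it can belong to a reusable prefix.
    """
    total = 0
    for _name, tokens, is_stable in blocks:
        if not is_stable:
            break
        total += tokens
    return total


layout = [
    ("system_prompt", 800, True),
    ("tool_definitions", 600, True),
    ("few_shot_examples", 2000, True),
    ("retrieved_context", 4000, False),  # varies per request
    ("user_message", 50, False),
]

system_only = stable_prefix_tokens(layout[:1])
full_prefix = stable_prefix_tokens(layout)

print("system prompt alone cacheable:", system_only >= MIN_CACHEABLE_TOKENS)
print("full stable prefix cacheable:", full_prefix >= MIN_CACHEABLE_TOKENS)
```

In practice the estimates should come from the provider's own token counter, because a prefix that sits near the threshold can land on either side of it.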

Cache Keys Are Prefix-Exact

The cache key is the exact token sequence from position 0 up to the breakpoint. Two consequences:

- Order matters. Move your stable content (system prompt, tools, examples) to the front. Volatile content (timestamps, user input, retrieved chunks) goes at the end.
- Any change invalidates everything after it. A single-character edit to your system prompt causes a cache miss for every session that shared the old prefix.

This is why naive RAG patterns often have catastrophic hit rates: people interpolate retrieved chunks into the system prompt, which changes the prefix on every request. Instead, put retrieval results in the user message or after the cache breakpoint.
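
The effect of ordering is easy to see by measuring how much prefix two requests share. The sketch below uses whitespace splitting as a stand-in for real tokenization (an assumption made for brevity), and the prompt and chunk strings are made up for illustration.

```python
def shared_prefix_len(a: list[str], b: list[str]) -> int:
    """Number of leading tokens that are identical in both sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


# Whitespace splitting stands in for a real tokenizer here.
system = "You are a support assistant . Answer from the provided context .".split()
chunk_a = "Doc A : refunds take five days .".split()
chunk_b = "Doc B : shipping is free .".split()

# Naive layout: retrieved chunk comes first, so the prefix diverges almost immediately.
naive = shared_prefix_len(chunk_a + system, chunk_b + system)

# Cache-friendly layout: stable system text first, retrieved chunk last.
friendly = shared_prefix_len(system + chunk_a, system + chunk_b)

print("shared prefix, naive layout:", naive)
print("shared prefix, cache-friendly layout:", friendly)
print("system prompt length:", len(system))
```

Only the cache-friendly layout shares the whole system prompt between the two requests, which is the portion a breakpoint could actually reuse.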

Multiple Breakpoints

Most APIs support multiple cache breakpoints (Anthropic allows up to 4). Use them to handle layered staleness:

[ SYSTEM PROMPT       ]  ← breakpoint 1 (changes monthly)
[ TOOL DEFINITIONS    ]  ← breakpoint 2 (changes weekly)
[ CONVERSATION HISTORY]  ← breakpoint 3 (changes per turn)
[ NEW USER MESSAGE    ]

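In a multi-turn conversation, breakpoint 3 moves forward each turn while breakpoints 1 and 2 stay warm across sessions. This is the standard pattern for chat applications. Check your provider's block ordering before assigning layers: Anthropic, for example, assembles the prefix as tools, then system, then messages, so there a tool change also invalidates the cached system prompt.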
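
The layered layout can be sketched as a payload builder. The dict shape is modeled on Anthropic's Messages API (`tools`, `system`, `messages`, with `cache_control` markers), but `build_request` and `count_breakpoints` are hypothetical helpers, history contents are assumed to be plain strings, and nothing is sent to any API; verify field names against your provider's documentation.

```python
def build_request(system_prompt: str, tools: list[dict],
                  history: list[dict], new_user_message: str) -> dict:
    """Assemble a payload with one breakpoint each on tools, system, and history."""
    # Copy inputs so the caller's objects are not mutated.
    cached_tools = [dict(t) for t in tools]
    if cached_tools:
        # Marker on the last tool covers all tool definitions.
        cached_tools[-1]["cache_control"] = {"type": "ephemeral"}

    system = [{"type": "text", "text": system_prompt,
               "cache_control": {"type": "ephemeral"}}]

    messages = [dict(m) for m in history]
    if messages:
        # Marker on the last prior turn covers the conversation so far.
        last = messages[-1]
        last["content"] = [{"type": "text", "text": last["content"],
                            "cache_control": {"type": "ephemeral"}}]

    # The new message sits after every breakpoint and is never cached here.
    messages.append({"role": "user", "content": new_user_message})
    return {"tools": cached_tools, "system": system, "messages": messages}


def count_breakpoints(payload: dict) -> int:
    """Count cache_control markers across tools, system blocks, and message blocks."""
    blocks = list(payload["tools"]) + list(payload["system"])
    for message in payload["messages"]:
        if isinstance(message["content"], list):
            blocks.extend(message["content"])
    return sum(1 for block in blocks if "cache_control" in block)


example_tools = [{"name": "lookup_order", "description": "Fetch an order by id.",
                  "input_schema": {"type": "object", "properties": {}}}]
example_history = [{"role": "user", "content": "Where is my order?"},
                   {"role": "assistant", "content": "Can you share the order id?"}]

payload = build_request("You are a support assistant.", example_tools,
                        example_history, "It is 12345.")
print("breakpoints used:", count_breakpoints(payload))
```

Counting markers before sending is a cheap guard against exceeding the provider's breakpoint limit as the prompt structure evolves.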

Measuring Cache Performance

Don't trust intuition. Instrument these metrics:

- Cache hit ratio: cache_read_tokens / (cache_read_tokens + cache_creation_tokens + uncached_input_tokens)
- Effective input cost per request
- p50/p95 TTFT with and without cache hits

Most providers return per-call cache statistics in the response (on Anthropic, usage.cache_read_input_tokens and usage.cache_creation_input_tokens). Pipe these into your observability stack and alert on regressions: a hit ratio drop is usually the first sign that someone modified a system prompt or that traffic patterns shifted.
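
The hit ratio formula above translates directly into code. This sketch assumes the usage object has been converted to a plain dict with Anthropic-style field names, and that `input_tokens` counts only the uncached portion; `cache_hit_ratio` is a hypothetical helper, and other providers will need a different field mapping.

```python
def cache_hit_ratio(usage: dict) -> float:
    """Fraction of input tokens served from cache for one response.

    Assumes Anthropic-style field names; missing or null fields count as zero.
    """
    read = usage.get("cache_read_input_tokens") or 0
    created = usage.get("cache_creation_input_tokens") or 0
    uncached = usage.get("input_tokens") or 0
    total = read + created + uncached
    return read / total if total else 0.0


# Hypothetical usage records: a cold first call, then a warm second call.
cold = {"input_tokens": 4050, "cache_creation_input_tokens": 3400,
        "cache_read_input_tokens": 0}
warm = {"input_tokens": 4050, "cache_creation_input_tokens": 0,
        "cache_read_input_tokens": 3400}

print(f"cold call hit ratio: {cache_hit_ratio(cold):.2f}")
print(f"warm call hit ratio: {cache_hit_ratio(warm):.2f}")
```

For dashboards, summing the three token counts across requests and computing one ratio per time window is usually more stable than averaging per-request ratios.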

Common Pitfalls

- Dynamic timestamps in system prompts. "Current time: 2024-01-15 14:32:07" changes the prefix every second, so every request misses. Round to the hour or move it after the breakpoint.
- Non-deterministic tool serialization. If you JSON-serialize tools with unordered dict keys, byte-level differences will miss the cache. Use a stable serializer.
- Caching below the minimum. If a breakpoint sits below 1024 tokens, it is silently ignored: nothing is cached and you pay the normal input rate on every call. Verify with the response metadata; cache read and creation counts that stay at zero mean the breakpoint is not taking effect.
- Caching volatile content. Caching a per-user profile in a high-cardinality system means thousands of cold entries. Cache shared content; pass user-specific data after the breakpoint.
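
The first two pitfalls have simple mechanical fixes. The sketch below uses only the Python standard library; `stable_tools_json` and `hour_bucket` are hypothetical helper names, and the tool dicts are made up for illustration.

```python
import json
from datetime import datetime


def stable_tools_json(tools: list[dict]) -> str:
    """Serialize tool definitions so that equal content yields equal bytes.

    sort_keys orders dict keys at every nesting level; fixed separators remove
    whitespace variation. List order is preserved, so keep tool order stable too.
    """
    return json.dumps(tools, sort_keys=True, separators=(",", ":"), ensure_ascii=False)


def hour_bucket(now: datetime) -> str:
    """Round a timestamp down to the hour so the prompt text changes at most hourly."""
    return now.replace(minute=0, second=0, microsecond=0).strftime("%Y-%m-%d %H:00")


# Same tool, keys inserted in a different order.
tool_v1 = {"name": "lookup_order", "description": "Fetch an order by id."}
tool_v2 = {"description": "Fetch an order by id.", "name": "lookup_order"}

same_bytes = stable_tools_json([tool_v1]) == stable_tools_json([tool_v2])
bucket = hour_bucket(datetime(2024, 1, 15, 14, 32, 7))

print("identical serialization:", same_bytes)
print("timestamp for prompt:", bucket)
```

Rounding only reduces how often the prefix changes; if the model needs second-level precision, the timestamp belongs after the breakpoint instead.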

Why This Matters

Prompt caching is the closest thing production LLM engineers have to a free lunch — but only if you architect for it. The 5-minute TTL means your traffic shape matters as much as your prompt content; the 1024-token minimum means breakpoint placement is a design decision, not an afterthought; and prefix-exact matching means prompt hygiene determines whether you save 90% or 0%. As models get larger and context windows grow, the gap between teams that treat caching as a first-class concern and teams that don't will show up directly in unit economics and user-perceived latency. Build for cache hits the way you build for database indexes: deliberately, measurably, and from day one.