How Prompt Caching Actually Works (And Why Yours Isn't)
Prompt caching can cut LLM cost by ~90% on repeated context — but only if you understand the prefix-match rule. Here are the silent invalidators that quietly drop your cache hit rate to zero.
The first time prompt caching pays off, it feels like cheating. A request that cost a few cents drops to a fraction of a cent, the latency falls off a cliff, and you didn't change a word of the prompt. The second time you look, the cache hit rate is zero and you have no idea why.
I've debugged this enough times — in my own agents and in other people's — that I can usually name the culprit before I see the code. Almost always it's the same class of mistake, and it follows from one rule that nobody internalizes the first time they read it.
The one rule everything follows from
Prompt caching is a prefix match. Any change anywhere in the prefix invalidates everything after it.
That's it. That's the whole thing. The cache key is derived from the exact bytes of your rendered prompt up to each cache breakpoint. A single byte difference at position N invalidates the cache for every breakpoint at or after position N.
The order things render in is fixed: tools, then system, then messages. So a cache breakpoint on the last system block covers your tools and your system prompt together. A breakpoint at the end of a conversation turn covers everything up to that turn.
The economics are why you care. A cache read costs roughly a tenth of the base input price. A cache write costs a bit more than a normal request — about 1.25x for the default 5-minute time-to-live, 2x for the 1-hour TTL. So the break-even is fast: with the 5-minute TTL, you come out ahead by the second request. With the 1-hour TTL you need three. After that, every hit is near-free.
The silent invalidators
Here's what makes this hard: when caching breaks, nothing errors. You don't get an exception. You get a bigger bill and a slower app, and the only signal is a number in the usage object that you have to go looking for. So the failure mode is invisible unless you instrument for it.
These are the patterns I grep for the moment someone says "caching isn't working." Every one of them puts something that changes per-request into the cached prefix:
| Pattern | Why it breaks caching |
|---|---|
datetime.now() in the system prompt | The prefix changes every single request |
| A UUID or request ID early in the content | Same — every request is byte-unique |
json.dumps(d) without sorted keys | Non-deterministic serialization; bytes differ run to run |
| Session or user ID baked into the system prompt | Per-user prefix; nothing shares across users |
Conditional system sections (if flag: system += ...) | Every flag combination is a distinct prefix |
| A tool list built per user | Tools render at position 0; nothing caches across users |
The nastiest one is the clock. Someone helpfully adds "Current date and time: ..." to the top of the system prompt so the model knows what day it is. Reasonable instinct, catastrophic placement. That timestamp sits at the front of the prefix, so it changes on every request, so nothing behind it can ever cache — the entire system prompt, the entire tool list, all of it, re-billed at full price forever.
The fix is never to delete the dynamic thing. The model often genuinely needs the date. The fix is to move it. Stable content goes first; volatile content goes last. Put the date in a message at the end of the conversation, not in the system prompt at the front. A value at turn 5 invalidates nothing before turn 5.
Where to put the breakpoint
Assuming your prefix is actually stable, you still have to mark where to cache. A few patterns cover almost everything.
A big shared system prompt. Put one breakpoint on the last system block. Because tools render before system, that single marker caches tools and system together. Done.
A multi-turn conversation. Put the breakpoint on the last content block of the most recently added turn. Each new request reuses the entire prior conversation as a cached prefix, and the hits accrue as the conversation grows.
A shared preamble with a varying question — few-shot examples or retrieved documents that are identical across requests, followed by a different question each time. This is the one people get backwards. Put the breakpoint at the end of the shared part, not at the end of the whole prompt. If you cache the whole thing including the varying question, every request writes a brand-new cache entry and reads nothing. The marker goes on the boundary between fixed and variable.
A prompt that's different from the first token. Don't cache at all. If the opening of every request differs, there's no reusable prefix, and adding a breakpoint just charges you the write premium for zero reads.
A few hard limits to keep in mind: you get at most 4 breakpoints per request. The minimum cacheable prefix is model-dependent — on Claude Opus 4.8 it's about 4,096 tokens, while Sonnet 4.6 and Fable 5 cache from around 2,048. Below that floor, the marker is silently ignored and nothing caches. And each breakpoint only looks back about 20 content blocks for a prior entry, which matters in agentic loops: a single turn that adds more than 20 tool-call blocks can blow past the lookback window, so drop an intermediate breakpoint every dozen-or-so blocks in long turns.
Verify, don't assume
You cannot eyeball whether caching is working. You have to read the usage numbers. Every response reports three:
cache_creation_input_tokens— tokens written to the cache this request (you paid the write premium)cache_read_input_tokens— tokens served from the cache (you paid roughly a tenth)input_tokens— the uncached remainder, at full price
The diagnostic is simple: if cache_read_input_tokens is zero across repeated requests with what should be an identical prefix, a silent invalidator is at work. Diff the actual rendered bytes of two consecutive requests and the culprit shows up — a timestamp, a reordered key, a tool that came and went.
One subtlety that trips people: input_tokens is only the uncached part. If your agent ran for an hour and input_tokens reads 4,000, don't celebrate a tiny prompt — the real size is the sum of all three fields, and the rest was served from cache. Check the sum, not the single number.
The architectural decisions that matter more than markers
Marker placement is the easy 20%. The decisions that actually determine your hit rate are structural:
- Freeze the system prompt. No interpolated dates, modes, or user names at the front. Inject dynamic context later, as a message.
- Don't change tools or model mid-conversation. Tools render at position 0, so adding, removing, or reordering even one tool invalidates the entire cache. Switching models does too — caches are per-model. If you need "modes," don't swap the tool set; pass the mode as content.
- Serialize deterministically. Sort your JSON keys. Iterating a hash set into the prompt produces different byte orders on different runs, and the cache can't tell that's semantically identical.
- Make forked calls reuse the parent's exact prefix. Summarization passes, sub-agents, compaction — if they rebuild the system or tools with any difference, they miss the parent's cache entirely. Copy the prefix verbatim, append fork-specific content at the end.
Get the architecture right and most caching works for free. Get it wrong and no amount of cache_control markers will save you — you'll be sprinkling breakpoints over a prefix that changes every request and wondering why the bill never moves.
The one-sentence version
Prompt caching is a prefix match, so stable content goes first and anything that changes per request goes last — and if your hit rate is zero, there's a clock or a UUID or an unsorted dict sitting at the front of your prefix.
FAQ
What is prompt caching?
Prompt caching lets an LLM API reuse the processed form of a repeated prompt prefix across requests, charging roughly a tenth of the input price for the cached portion. On repeated context it can cut input cost by around 90% and sharply reduce latency.
Why is my prompt cache hit rate zero?
Almost always because something that changes per request sits in the cached prefix — a datetime.now() call in the system prompt, a UUID or request ID near the top, JSON serialized without sorted keys, or a tool list that varies per user. Caching is a prefix match, so any byte change before the breakpoint invalidates everything after it.
How much does prompt caching cost?
Cache reads cost roughly 0.1x the base input price. Cache writes cost about 1.25x for the default 5-minute time-to-live and 2x for the 1-hour TTL. With the 5-minute TTL you break even on the second request; with the 1-hour TTL, the third.
Where should I put the cache breakpoint?
On the last block of the stable portion of your prompt. For a shared system prompt, the last system block (which also covers tools). For a multi-turn conversation, the last block of the newest turn. For a shared preamble plus a varying question, the boundary between the two — not the end of the whole prompt.
How do I verify caching is working?
Read the usage fields on the response: cache_read_input_tokens (served from cache), cache_creation_input_tokens (written this request), and input_tokens (uncached). If the read count stays zero across identical-prefix requests, diff the rendered bytes to find the invalidator.
Does changing tools or the model break the cache?
Yes. Tools render at the very front of the prefix, so adding, removing, or reordering any tool invalidates the whole cache. Switching models also invalidates it, because caches are scoped per model.
Is there a minimum size for caching?
Yes, and it's model-dependent. On Claude Opus 4.8 the minimum cacheable prefix is about 4,096 tokens; on Sonnet 4.6 and Fable 5 it's around 2,048. Below the floor the breakpoint is silently ignored and nothing caches.
Build log
Get an email when I ship a new prototype or essay. No funnel — just the work.