Prompt caching is the most under-used cost lever on LLM workloads. Every provider ships it, the discounts are enormous (90% off cached reads on Anthropic and Gemini 2.5+, 50% off on OpenAI), and yet most production systems we audit leave half the savings on the table because of placement mistakes, bad TTL choices, or cache-busting prompts. This is a walkthrough of how the three major providers implement caching in 2026, the cost math that actually matters at scale, the placement rules that determine whether you get a 90% discount or a 10% one, and the patterns we use on client projects to hit cache rates above 80% in production.
The three cache models, compared
Anthropic, OpenAI, and Google took different paths to the same feature. Knowing which model you're paying for changes how you structure your prompts.
| Provider | Activation | Cache read discount | Write surcharge | Default TTL |
|---|---|---|---|---|
| Anthropic (Claude) | Explicit cache_control markers | 90% off (0.1× base input) | 1.25× base (5-min) or 2× (1-hr) | 5 min, 1 hr optional |
| OpenAI (GPT-5) | Automatic above 1,024 tokens | 50% off (up to 90% on newer tiers) | None | 5–10 min, undocumented |
| Google (Gemini 2.5) | Implicit auto + explicit API | 75% off (2.0) / 90% off (2.5+) | Storage fee on explicit | Implicit short; explicit configurable |
The upshot: Anthropic gives you the deepest discount and the most control, but you have to think about cache_control placement. OpenAI is the easiest — it just works above 1K tokens — but the discount is shallower. Gemini sits in the middle with the wrinkle that explicit caching has a storage cost, which matters once you're caching large contexts for long periods.
The cost math that matters
The math is simple but the intuition often isn't. Caching charges a write premium once and then gives you a read discount on every subsequent hit within the TTL. On Anthropic's 5-minute cache you break even after a single cache hit: a write plus one read costs 1.25× + 0.1× = 1.35× base, versus 2× for two uncached requests — one of the cheapest break-even points in any cloud infrastructure. Past that, every hit is pure savings.
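A quick sketch of that break-even arithmetic, using Anthropic's published multipliers of the base input price (the helper names here are illustrative, not SDK functions):

```typescript
// Anthropic pricing multipliers relative to base input price
const WRITE_5MIN = 1.25; // cache write surcharge, 5-minute TTL
const WRITE_1HR = 2.0;   // cache write surcharge, 1-hour TTL
const READ = 0.1;        // cached read price (90% off)

// Cost of one write plus n cached reads, in units of
// (base input price x prefix tokens)
const cachedCost = (writeMult: number, hits: number): number =>
  writeMult + READ * hits;

// Cost of the same n + 1 requests with no caching at all
const uncachedCost = (hits: number): number => 1 + hits;

// Smallest hit count at which caching beats no caching
function breakEvenHits(writeMult: number): number {
  let n = 0;
  while (cachedCost(writeMult, n) >= uncachedCost(n)) n++;
  return n;
}
```

Running this gives a break-even of one hit for the 5-minute tier and two hits for the 1-hour tier, which is why the 1-hour TTL only pays off on prefixes you are confident will be re-read.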
A worked example
A customer-support assistant on Claude Sonnet 4.6: 50,000 conversations per day, 10 turns per conversation, a stable 25,000-token system prompt plus knowledge base, and a 200-token user turn. That's 500,000 turns per day, each with 25,200 input tokens and — let's say — 300 output tokens.
- Without caching: 500,000 × 25,200 input tokens = 12.6B input tokens/day. At $3/MTok, that's ~$37,800/day on input alone.
- With caching (first turn writes, nine turns read): 50,000 × 25,000 tokens written at $3.75/MTok = $4,688, plus 450,000 × 25,000 tokens read at $0.30/MTok = $3,375. Add ~$300 for the uncached 200-token user turns and input cost drops to ~$8,362/day — a 78% reduction on input spend.
- Output cost is unchanged in both cases, so the end-to-end bill reduction is slightly smaller — typically 60–75% in practice, depending on how output-heavy the workload is.
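The arithmetic above can be reproduced in a few lines (prices are the Sonnet figures used in the text; the uncached 200-token user turns are included here, which adds roughly $300 on top of the cache write and read costs):

```typescript
const MTOK = 1_000_000;
const conversations = 50_000;
const turnsPerConv = 10;
const prefixTokens = 25_000; // stable system prompt + knowledge base
const userTokens = 200;      // per-turn user message, never cached

const turns = conversations * turnsPerConv; // 500,000 turns/day

// Without caching: every turn pays full price on the whole prefix
const noCacheDollars = ((turns * (prefixTokens + userTokens)) / MTOK) * 3.0;

// With caching: one write per conversation, nine reads,
// plus the user turns at the base input rate
const writeDollars = ((conversations * prefixTokens) / MTOK) * 3.75;
const readDollars =
  ((conversations * (turnsPerConv - 1) * prefixTokens) / MTOK) * 0.3;
const userDollars = ((turns * userTokens) / MTOK) * 3.0;
const cachedDollars = writeDollars + readDollars + userDollars;
```

The same structure works as a pre-deployment estimator for any workload: plug in your own turn counts and prefix size before deciding which TTL tier to pay for.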
Output tokens aren't cached. If your response is long and repetitive, structured output (JSON with a tight schema) cuts total spend faster than any cache tuning. Prompt caching is an input-side lever only.
Where to put cache_control — the rules that matter
Anthropic's caching matches on the exact prefix up to and including each cache_control marker. The order of blocks matters: tools, then system, then messages. You get up to four breakpoints per request, and longer-TTL breakpoints must appear before shorter-TTL ones. Violate any of these and your hit rate silently collapses.
```typescript
// Tool-calling agent with two cache breakpoints
// BP1: stable agent instructions — 1-hour TTL (rarely change)
// BP2: end of turn N-1 — 5-minute TTL (conversation state)
const response = await anthropic.messages.create({
  model: "claude-sonnet-4-6",
  max_tokens: 2048,
  tools,
  system: [
    {
      type: "text",
      text: AGENT_INSTRUCTIONS,
      cache_control: { type: "ephemeral", ttl: "1h" },
    },
  ],
  messages: [
    ...priorTurns.slice(0, -1),
    {
      ...priorTurns.at(-1)!,
      // Mark the last prior turn with a 5-min breakpoint
      // so the next request reads everything up to here
      content: attachCacheControl(priorTurns.at(-1)!.content, "5m"),
    },
    { role: "user", content: newUserTurn },
  ],
});

// On every response, check usage to confirm cache hit
// response.usage.cache_read_input_tokens > 0 → hit
// response.usage.cache_creation_input_tokens > 0 → write
```

- Put the most stable content first — agent instructions, tool definitions, pinned knowledge base. Anything that changes per request goes at the end, after your last cache_control marker.
- Use a 1-hour TTL for prefixes that change less than hourly (system prompts, knowledge bases). Use the 5-minute default for conversation state. Order breakpoints with the longer TTL first.
- Never interpolate a user ID, timestamp, or request ID into the cached prefix. One character difference and the cache misses.
- Verify hits in production. Log cache_read_input_tokens and cache_creation_input_tokens on every response. If your cache hit rate is below 70% on a steady-state workload, something is busting the prefix.
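The attachCacheControl helper used in the example above is not an SDK function — one way to sketch it, assuming message content is either a plain string or an array of content blocks in Anthropic's message shape:

```typescript
type CacheTTL = "5m" | "1h";

interface ContentBlock {
  type: string;
  text?: string;
  cache_control?: { type: "ephemeral"; ttl?: CacheTTL };
}

// Normalise content to block form and mark the LAST block as a
// breakpoint: the cache prefix extends up to and including it.
function attachCacheControl(
  content: string | ContentBlock[],
  ttl: CacheTTL,
): ContentBlock[] {
  const blocks: ContentBlock[] =
    typeof content === "string"
      ? [{ type: "text", text: content }]
      : [...content];
  const last = blocks[blocks.length - 1];
  blocks[blocks.length - 1] = {
    ...last,
    cache_control: { type: "ephemeral", ttl },
  };
  return blocks;
}
```

Keeping the helper non-mutating (it copies the block array rather than editing it in place) avoids accidentally carrying stale breakpoints forward in your stored conversation history.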
The traps that silently tank cache rates
Most of the caching bugs we find on client projects fall into a small set of predictable traps. Every one of them shows up the same way in metrics: cache_creation_input_tokens is high, cache_read_input_tokens is low, and the bill doesn't drop. Here's the checklist we run through first.
- Dynamic content in the system prompt. A current-date string, a user name, or a request ID interpolated into the system prompt means every request has a different prefix and every request writes a new cache entry.
- Non-deterministic tool ordering. If your tool array is built from an object's values or a registry populated at runtime, the order depends on key insertion order, which can vary across code paths and deployments. Sort tool definitions by a stable key before sending.
- Breakpoint after every message. More breakpoints isn't better — each one is a separate cache entry with its own write cost. Two well-placed breakpoints usually beat four sloppy ones.
- 5-minute TTL on high-value stable content. If you're paying to write a 50K-token system prompt and the traffic is bursty, the 1-hour TTL pays off fast. Run the math on your actual traffic pattern, not a steady-state assumption.
- Cache content below the minimum. Anthropic requires a minimum of 1,024 tokens (2,048 on some models) before cache_control is honoured. Short prefixes aren't cached, even if you mark them.
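The tool-ordering trap in particular has a one-line fix — sort by a stable key before every request. A sketch, with the tool shape abbreviated:

```typescript
interface ToolDef {
  name: string;
  description: string;
  input_schema: object;
}

// Sort by name so the serialized tool array, and therefore the
// cache prefix, is byte-identical regardless of the order in which
// tools were registered.
function stableTools(registry: Record<string, ToolDef>): ToolDef[] {
  return Object.values(registry).sort((a, b) =>
    a.name.localeCompare(b.name),
  );
}
```

The same idea applies to any dynamically assembled part of the cached prefix: canonicalise before serialising, every time.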
Stacking caching with batching
Batch APIs give you roughly 50% off for latency-tolerant workloads — overnight embeddings, bulk classification, backfills. Anthropic's batch and cache discounts stack. A workload that can use both — say, a nightly re-classification run against a stable prompt — can land at 95% off input token cost versus the naive baseline. That's not a theoretical ceiling; that's what we see on actual client invoices for well-engineered batch pipelines.
Batch + cache is the lowest-hanging fruit on any backlog workload. If you have a queue of documents to process and you haven't wired this up, it's usually a half-day of work for a 5–10× cost reduction.
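A sanity check on the stacked figure, assuming the two discounts multiply on the same tokens (50% off for batch, 90% off for cached reads):

```typescript
const batchMult = 0.5;     // batch API: 50% off base input
const cacheReadMult = 0.1; // cached read: 90% off base input

// Effective multiplier on input spend when both discounts apply
const stackedMult = batchMult * cacheReadMult;

// As a discount percentage versus the naive baseline
const discountPct = (1 - stackedMult) * 100;
```

0.5 × 0.1 = 0.05, i.e. 95% off — the number quoted above falls straight out of the multiplication.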
When caching hurts you
Caching is net-negative in a small but real set of situations. If your prompt prefix is genuinely different on every request — heavy personalisation, per-request retrieval where the retrieved chunks change every turn, one-off completions — you're paying the write surcharge on every call and getting no reads in return. The breakeven on Anthropic's 5-minute cache is a single hit, but if you never get one, caching costs you 25% more than not caching.
- If over 30% of your requests are the first-ever request for a given prefix, review whether caching is helping. Add caching to the stable part (system prompt, tool definitions) and leave the dynamic part uncached.
- If TTLs are timing out between requests on a low-traffic endpoint, either batch requests within the TTL window or drop to the 1-hour tier.
- If the prompt is below the minimum size, there's nothing to optimise — caching won't activate at all.
A migration playbook
When we take over an LLM workload that isn't caching yet, the playbook is the same every time. It takes a day or two and usually returns 50–80% of the input token spend on the first pass.
- Instrument first. Log cache_read_input_tokens and cache_creation_input_tokens on every response. You can't tune what you can't see.
- Identify stable prefixes. System prompts, tool definitions, pinned knowledge — anything that doesn't change per request. Move them to the top of the prompt if they aren't already.
- Eliminate accidental dynamism. Strip timestamps, UUIDs, user-specific data from the stable section. If you need to include user context, put it after the last cache_control marker.
- Add cache_control markers. Start with one at the end of the system prompt, and — for multi-turn flows — one at the end of the prior message. Resist the urge to add more until the first pass is proven.
- Verify. Cache hit rate on steady-state traffic should be north of 80%. If it isn't, something in the prefix is varying and you haven't found it yet.
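Step one of the playbook can be as small as an accumulator over the two usage counters. A sketch, with the usage shape matching the fields on Anthropic's response object (note that input_tokens excludes tokens served from or written to cache):

```typescript
interface Usage {
  input_tokens: number;
  cache_read_input_tokens?: number;
  cache_creation_input_tokens?: number;
}

class CacheStats {
  private read = 0;
  private written = 0;
  private uncached = 0;

  // Call once per API response with response.usage
  record(usage: Usage): void {
    this.read += usage.cache_read_input_tokens ?? 0;
    this.written += usage.cache_creation_input_tokens ?? 0;
    this.uncached += usage.input_tokens;
  }

  // Fraction of all input tokens that were served from cache
  hitRate(): number {
    const total = this.read + this.written + this.uncached;
    return total === 0 ? 0 : this.read / total;
  }
}
```

Feed this into whatever metrics pipeline you already run; a steady-state hit rate below the 80% target from step five is your signal to go hunting for prefix variation.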
Key takeaways
- Caching is the single biggest cost lever on LLM workloads. A 60–80% input-spend reduction is typical; 90%+ is achievable on RAG and agent workloads.
- Anthropic gives the deepest discount and the most control; OpenAI is automatic but shallower; Gemini 2.5 sits between. Pick the provider based on workload, not on caching alone.
- cache_control placement is rule-driven, not heuristic. Stable content first, longer TTL before shorter, never interpolate dynamic values into the cached prefix.
- Verify in production with cache_read and cache_creation counters. Cache hit rate below 70% on steady-state traffic means something is silently cache-busting.
- Stack caching with batch APIs for async workloads — 95% off input token cost is realistic, not theoretical.