
Prompt caching strategies that cut Claude API bills by 70%

Deep dive on prompt caching across Anthropic, OpenAI, and Gemini — cache-hit math, cache_control placement, 5-minute TTL management, and worked cost examples.

Prompt caching is the most under-used cost lever on LLM workloads. Every provider ships it, the discounts are enormous (90% off cached reads on Anthropic and Gemini 2.5+, 50% off on OpenAI), and yet most production systems we audit leave half the savings on the table because of placement mistakes, bad TTL choices, or cache-busting prompts. This is a walkthrough of how the three major providers implement caching in 2026, the cost math that actually matters at scale, the placement rules that determine whether you get a 90% discount or a 10% one, and the patterns we use on client projects to hit cache rates above 80% in production.

The three cache models, compared

Anthropic, OpenAI, and Google took different paths to the same feature. Knowing which model you're paying for changes how you structure your prompts.

| Provider | Activation | Cache read discount | Write surcharge | Default TTL |
|---|---|---|---|---|
| Anthropic (Claude) | Explicit cache_control markers | 90% off (0.1× base input) | 1.25× base (5-min) or 2× (1-hr) | 5 min; 1 hr optional |
| OpenAI (GPT-5) | Automatic above 1,024 tokens | 50% off (up to 90% on newer tiers) | None | 5–10 min, undocumented |
| Google (Gemini 2.5) | Implicit auto + explicit API | 75% off (2.0) / 90% off (2.5+) | Storage fee on explicit | Implicit short; explicit configurable |

The upshot: Anthropic gives you the deepest discount and the most control, but you have to think about cache_control placement. OpenAI is the easiest — it just works above 1K tokens — but the discount is shallower. Gemini sits in the middle with the wrinkle that explicit caching has a storage cost, which matters once you're caching large contexts for long periods.

The cost math that matters

The math is simple but the intuition often isn't. Caching charges a write premium once and then gives you a read discount on every subsequent hit within the TTL. On Anthropic's 5-minute cache you break even after a single cache hit: the first 90%-discounted read more than repays the 25% write surcharge. Past that, every hit is pure savings.
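The break-even arithmetic fits in a few lines. The 1.25×/0.1× multipliers below are Anthropic's published 5-minute-tier rates; the function itself is illustrative:

```typescript
// Cost of n requests sharing one cached prefix, in units of the
// base (uncached) input price. The first request writes the cache
// at writeMult; the remaining n-1 read it at readMult.
function cachedCost(n: number, writeMult = 1.25, readMult = 0.1): number {
  if (n <= 0) return 0;
  return writeMult + (n - 1) * readMult;
}

// Uncached: every request pays the base rate.
const uncachedCost = (n: number): number => n;

// One request, zero hits: 1.25 vs 1.0, caching loses by 25%.
// Two requests, one hit:  1.35 vs 2.0, caching already wins.
```

Running the two functions side by side for your actual requests-per-prefix distribution tells you whether a cache breakpoint pays before you ship it.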

A worked example

A customer-support assistant on Claude Sonnet 4.6: 50,000 conversations per day, 10 turns per conversation, a stable 25,000-token system prompt plus knowledge base, and a 200-token user turn. That's 500,000 turns per day, each with 25,200 input tokens and — let's say — 300 output tokens.

  • Without caching: 500,000 × 25,200 input tokens = 12.6B input tokens/day. At $3/MTok, that's ~$37,800/day on input alone.
  • With caching (first turn writes, nine turns read): 50,000 × 25,000 tokens written at $3.75/MTok = $4,688, plus 450,000 × 25,000 tokens read at $0.30/MTok = $3,375, plus 500,000 × 200 uncached user-turn tokens at $3/MTok = $300. Input cost drops to ~$8,363/day — a 78% reduction on input spend.
  • Output cost is unchanged in both cases, so the end-to-end bill reduction is slightly smaller — typically 60–75% in practice, depending on how output-heavy the workload is.
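The worked example can be reproduced directly. Prices are the rates used in the text ($3/MTok base input, $3.75/MTok 5-minute write, $0.30/MTok read), and the calculation counts the uncached 200-token user turns, which the input bill also carries:

```typescript
const MTOK = 1_000_000;
const conversations = 50_000;
const turnsPerConvo = 10;
const prefixTokens = 25_000; // system prompt + knowledge base
const turnTokens = 200;      // per-turn user input

const turns = conversations * turnsPerConvo; // 500,000 turns/day

// Baseline: every turn pays full price on the whole prefix + turn.
const baseline = ((turns * (prefixTokens + turnTokens)) / MTOK) * 3.0;

// Cached: the first turn of each conversation writes, the other nine read.
const writes = ((conversations * prefixTokens) / MTOK) * 3.75;
const reads = (((turns - conversations) * prefixTokens) / MTOK) * 0.3;
const uncachedSuffix = ((turns * turnTokens) / MTOK) * 3.0;
const cached = writes + reads + uncachedSuffix;

const reduction = 1 - cached / baseline; // input-spend reduction, ~0.78
```

Swapping in your own traffic numbers is the fastest way to sanity-check a caching proposal before touching any code.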

Output tokens aren't cached. If your response is long and repetitive, structured output (JSON with a tight schema) cuts total spend faster than any cache tuning. Prompt caching is an input-side lever only.

Where to put cache_control — the rules that matter

Anthropic's caching matches on the exact prefix up to and including each cache_control marker. The order of blocks matters: tools, then system, then messages. You get up to four breakpoints per request, and longer-TTL breakpoints must appear before shorter-TTL ones. Violate any of these and your hit rate silently collapses.

import Anthropic from "@anthropic-ai/sdk";

const anthropic = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment

// Tool-calling agent with two cache breakpoints
// BP1: stable agent instructions — 1-hour TTL (rarely change)
// BP2: end of turn N-1 — 5-minute TTL (conversation state)
const response = await anthropic.messages.create({
  model: "claude-sonnet-4-6",
  max_tokens: 2048,
  tools,
  system: [
    {
      type: "text",
      text: AGENT_INSTRUCTIONS,
      cache_control: { type: "ephemeral", ttl: "1h" },
    },
  ],
  messages: [
    ...priorTurns.slice(0, -1),
    {
      ...priorTurns.at(-1)!,
      // Mark the last prior turn with a 5-min breakpoint
      // so the next request reads everything up to here
      content: attachCacheControl(priorTurns.at(-1)!.content, "5m"),
    },
    { role: "user", content: newUserTurn },
  ],
});

// On every response, check usage to confirm cache hit
// response.usage.cache_read_input_tokens > 0 → hit
// response.usage.cache_creation_input_tokens > 0 → write
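attachCacheControl above is not an SDK function. A minimal sketch, simplified to text-only content blocks (real messages can also hold tool_use and tool_result blocks), might look like this:

```typescript
type TTL = "5m" | "1h";

interface TextBlock {
  type: "text";
  text: string;
  cache_control?: { type: "ephemeral"; ttl: TTL };
}

// Normalise string content into block form, then mark the final
// block as a cache breakpoint with the requested TTL.
function attachCacheControl(content: string | TextBlock[], ttl: TTL): TextBlock[] {
  const blocks: TextBlock[] =
    typeof content === "string"
      ? [{ type: "text", text: content }]
      : content.map((b) => ({ ...b })); // shallow copy: don't mutate the caller's turns
  if (blocks.length === 0) throw new Error("cannot cache empty content");
  blocks[blocks.length - 1].cache_control = { type: "ephemeral", ttl };
  return blocks;
}
```

Only the last block gets the marker: the cache matches on the whole prefix up to that point, so one breakpoint covers everything before it.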
  • Put the most stable content first — agent instructions, tool definitions, pinned knowledge base. Anything that changes per request goes at the end, after your last cache_control marker.
  • Use a 1-hour TTL for prefixes that change less than hourly (system prompts, knowledge bases). Use the 5-minute default for conversation state. Order breakpoints with the longer TTL first.
  • Never interpolate a user ID, timestamp, or request ID into the cached prefix. One character difference and the cache misses.
  • Verify hits in production. Log cache_read_input_tokens and cache_creation_input_tokens on every response. If your cache hit rate is below 70% on a steady-state workload, something is busting the prefix.
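A small accumulator over the usage counters is enough to track hit rate in production. The field names are the ones Anthropic returns on response.usage; the 0.7 threshold is the steady-state heuristic from the list above:

```typescript
interface CacheUsage {
  cache_read_input_tokens: number;
  cache_creation_input_tokens: number;
  input_tokens: number; // uncached input tokens
}

class CacheStats {
  private read = 0;
  private written = 0;
  private uncached = 0;

  record(usage: CacheUsage): void {
    this.read += usage.cache_read_input_tokens;
    this.written += usage.cache_creation_input_tokens;
    this.uncached += usage.input_tokens;
  }

  // Share of cacheable-prefix tokens served from cache rather than rewritten.
  hitRate(): number {
    const total = this.read + this.written;
    return total === 0 ? 0 : this.read / total;
  }

  healthy(threshold = 0.7): boolean {
    return this.hitRate() >= threshold;
  }
}
```

Call record after every response and alert on healthy() flipping false; a token-weighted rate catches a busted 50K-token prefix that a simple request count would hide.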

The traps that silently tank cache rates

Most of the caching bugs we find on client projects fall into a small set of predictable traps. Every one of them shows up the same way in metrics: cache_creation_input_tokens is high, cache_read_input_tokens is low, and the bill doesn't drop. Here's the checklist we run through first.

  1. Dynamic content in the system prompt. A current-date string, a user name, or a request ID interpolated into the system prompt means every request has a different prefix and every request writes a new cache entry.
  2. Non-deterministic tool ordering. If your tool array is assembled from an unordered source — object keys that differ across code paths, a Set, or a database query without an ORDER BY — the serialized order can vary between requests, and any variation busts the prefix. Sort tool definitions by a stable key before sending.
  3. Breakpoint after every message. More breakpoints isn't better — each one is a separate cache entry with its own write cost. Two well-placed breakpoints usually beat four sloppy ones.
  4. 5-minute TTL on high-value stable content. If you're paying to write a 50K-token system prompt and the traffic is bursty, the 1-hour TTL pays off fast. Run the math on your actual traffic pattern, not a steady-state assumption.
  5. Cache content below the minimum. Anthropic requires a minimum of 1,024 tokens (2,048 on some models) before cache_control is honoured. Short prefixes aren't cached, even if you mark them.
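Trap 2 has a one-line fix: canonicalise tool order before every request. A sketch, assuming tools carry the standard name field:

```typescript
interface ToolDef {
  name: string;
  description?: string;
  input_schema: Record<string, unknown>;
}

// Sort by name with a plain byte-order comparison so the serialized
// prefix is identical regardless of how the tool list was assembled.
function canonicalToolOrder(tools: ToolDef[]): ToolDef[] {
  return [...tools].sort((a, b) => (a.name < b.name ? -1 : a.name > b.name ? 1 : 0));
}
```

A plain comparison is used instead of localeCompare on purpose: locale-sensitive collation can differ between environments, and the goal is a byte-identical prefix everywhere.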

Stacking caching with batching

Batch APIs give you roughly 50% off for latency-tolerant workloads — overnight embeddings, bulk classification, backfills. Anthropic's batch and cache discounts stack. A workload that can use both — say, a nightly re-classification run against a stable prompt — can land at 95% off input token cost versus the naive baseline. That's not a theoretical ceiling; that's what we see on actual client invoices for well-engineered batch pipelines.
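The stacking is multiplicative. This assumes the batch discount applies to cache reads too, which is how Anthropic prices batched requests at the time of writing:

```typescript
// Effective input-price multipliers when discounts stack.
const BATCH = 0.5;      // batch API: 50% off everything in the request
const CACHE_READ = 0.1; // cached read: 90% off base input

const effective = BATCH * CACHE_READ; // 0.05 of the base rate
const discount = 1 - effective;       // 95% off versus the naive baseline
```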

Batch + cache is the lowest-hanging fruit on any backlog workload. If you have a queue of documents to process and you haven't wired this up, it's usually a half-day of work for a 5–10× cost reduction.

When caching hurts you

Caching is net-negative in a small but real set of situations. If your prompt prefix is genuinely different on every request — heavy personalisation, per-request retrieval where the retrieved chunks change every turn, one-off completions — you're paying the write surcharge on every call and getting no reads in return. One hit already repays Anthropic's 5-minute write surcharge, but a write that never gets a single read costs you 25% more than not caching at all.

  • If over 30% of your requests are the first-ever request for a given prefix, review whether caching is helping. Add caching to the stable part (system prompt, tool definitions) and leave the dynamic part uncached.
  • If TTLs are timing out between requests on a low-traffic endpoint, either batch requests within the TTL window or drop to the 1-hour tier.
  • If the prompt is below the minimum size, there's nothing to optimise — caching won't activate at all.
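The decision reduces to expected hits per write. A sketch using Anthropic's published multipliers (1.25× write for the 5-minute tier, 2× for the 1-hour tier, 0.1× reads on both):

```typescript
type Tier = "5m" | "1h";

const WRITE_MULT: Record<Tier, number> = { "5m": 1.25, "1h": 2.0 };
const READ_MULT = 0.1;

// Net result per cached prefix, in units of the base input price:
// each hit saves (1 - readMult); the write costs (writeMult - 1) extra.
function cachingSaves(expectedHits: number, tier: Tier): boolean {
  const saved = expectedHits * (1 - READ_MULT);
  const surcharge = WRITE_MULT[tier] - 1;
  return saved > surcharge;
}
```

Feed it the observed hits-per-write ratio from your usage logs rather than a guess; bursty traffic makes the expected value very different from the steady-state one.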

A migration playbook

When we take over an LLM workload that isn't caching yet, the playbook is the same every time. It takes a day or two and usually returns 50–80% of the input token spend on the first pass.

  1. Instrument first. Log cache_read_input_tokens and cache_creation_input_tokens on every response. You can't tune what you can't see.
  2. Identify stable prefixes. System prompts, tool definitions, pinned knowledge — anything that doesn't change per request. Move them to the top of the prompt if they aren't already.
  3. Eliminate accidental dynamism. Strip timestamps, UUIDs, user-specific data from the stable section. If you need to include user context, put it after the last cache_control marker.
  4. Add cache_control markers. Start with one at the end of the system prompt, and — for multi-turn flows — one at the end of the prior message. Resist the urge to add more until the first pass is proven.
  5. Verify. Cache hit rate on steady-state traffic should be north of 80%. If it isn't, something in the prefix is varying and you haven't found it yet.

Key takeaways

  • Caching is the single biggest cost lever on LLM workloads. A 60–80% input-spend reduction is typical; 90%+ is achievable on RAG and agent workloads.
  • Anthropic gives the deepest discount and the most control; OpenAI is automatic but shallower; Gemini 2.5 sits between. Pick the provider based on workload, not on caching alone.
  • cache_control placement is rule-driven, not heuristic. Stable content first, longer TTL before shorter, never interpolate dynamic values into the cached prefix.
  • Verify in production with cache_read and cache_creation counters. Cache hit rate below 70% on steady-state traffic means something is silently cache-busting.
  • Stack caching with batch APIs for async workloads — 95% off input token cost is realistic, not theoretical.
#prompt-caching #anthropic #cost-optimization #openai #gemini #llm
Working on something similar?

Let's build it together.

We ship production SaaS, marketplaces, and web apps. If you want an engineering partner — not a consultancy — let's talk.