A SaaS founder who has never projected a monthly LLM bill from first principles is a SaaS founder about to be surprised. Token economics is the most important spreadsheet in a 2026 AI-enabled product, and most teams don't build it until a month-end invoice forces the conversation. The team has run this cost model enough times on enough client projects to see the same shapes recur: the same mis-sized workloads, the same missed optimizations, the same cost traps. This is the structure the team uses when sizing spend for a new feature — current pricing, the levers that actually move the bill, a worked example, and the single biggest mistake to avoid.
What the frontier models cost in April 2026
Standard input and output pricing sets the ceiling — caching, batching, and model routing bring the realized cost down from there. Below is the pricing the team is working against this month, verified against each vendor's public rate sheet.
| Model | Input / 1M tok | Output / 1M tok | Cached input | Batch discount |
|---|---|---|---|---|
| Claude Opus 4.7 | $5.00 | $25.00 | ~$0.50 (10%) | 50% off |
| Claude Sonnet 4.6 | $3.00 | $15.00 | ~$0.30 (10%) | 50% off |
| Claude Haiku 4.5 | $0.25 | $1.25 | ~$0.025 (10%) | 50% off |
| GPT-5.4 Pro | $30.00 | $180.00 | 25% of input | 50% off |
| GPT-5.4 | $2.50 | $15.00 | 25% of input | 50% off |
| GPT-5.2 | $1.75 | $14.00 | 25% of input | 50% off |
| Gemini 3.1 Pro (≤200K) | $2.00 | $12.00 | 25% of input | 50% off |
| Gemini 3 Flash | $0.50 | $3.00 | 25% of input | 50% off |
| Gemini 3.1 Flash-Lite | $0.10 | $0.40 | 25% of input | 50% off |
Two patterns pop out. First: output tokens cost 4–8x input tokens on every frontier model. A product that outputs long prose burns budget far faster than one that outputs structured JSON. Second: Anthropic's prompt caching is materially more aggressive than the competition — cached reads drop to roughly 10% of fresh-read price, against 25% on OpenAI and Google. At high cache-hit rates, that gap is the difference between margin and no margin.
The five levers that move the bill
- Model choice — route cheap work to cheap models. This is the single biggest lever and the one teams most consistently leave on the table.
- Prompt caching — 75–90% off on the stable prefix of every request. Anthropic caching hits 10% of fresh-read cost; OpenAI and Google sit at 25%.
- Batch APIs — 50% off for any workload that tolerates a 24-hour turnaround. Embeddings backfills, overnight classification, bulk summarization.
- Context engineering — the tokens you don't send are the cheapest tokens. Tight system prompts and retrieval-capped context are free money.
- Output discipline — cap max_tokens, use structured output, and don't ask for prose the user won't read.
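The routing lever can be sketched in a few lines. The heuristic below (message length plus escalation keywords) is an illustrative assumption, not a production policy; model names and prices follow the table above.

```python
# Minimal two-tier router sketch. The routing heuristic and the
# escalation keyword list are illustrative assumptions; prices are
# $/1M tokens from the pricing table above.
PRICES = {
    "claude-haiku-4.5":  {"input": 0.25, "output": 1.25},
    "claude-sonnet-4.6": {"input": 3.00, "output": 15.00},
}

ESCALATION_HINTS = ("refund", "legal", "cancel my account", "complaint")

def pick_model(user_message: str) -> str:
    """Route simple turns to Haiku; escalate long or sensitive ones."""
    text = user_message.lower()
    if len(user_message) > 800 or any(h in text for h in ESCALATION_HINTS):
        return "claude-sonnet-4.6"
    return "claude-haiku-4.5"

def turn_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one turn at sticker price (no caching, no batch)."""
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000
```

In production the heuristic is usually replaced by a cheap classifier or a confidence signal, but even a keyword-and-length rule captures most of the savings.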
Prompt caching is the lever most teams miss
Prompt caching is underused and undervalued. On a RAG workload where the system prompt plus a stable knowledge-base prefix runs 20,000 tokens, a cached read on Claude Sonnet 4.6 costs roughly $0.006 per turn; a fresh read costs $0.060. That 10x gap compounds across every user, every turn, every day. Stacked with batch processing for async paths, Anthropic workloads can hit effective rates of $0.15 per million cached input tokens — 95% below the sticker price.
If your cache hit rate under production load is under 70%, there is substantial cost on the floor. Reorder the message payload so the stable prefix (system prompt, persona, static context) comes first and the volatile user turn comes last — that single change has moved hit rates from 30% to 85% on client projects.
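The reordering is mechanical. A minimal sketch of a cache-friendly request builder, assuming an Anthropic-style payload with a `cache_control` breakpoint — treat the exact field shapes as an assumption to check against your SDK version:

```python
# Cache-friendly request builder sketch: everything stable (system
# prompt, static KB context) comes first and is marked for caching;
# the volatile user turn comes last so it never invalidates the
# prefix. Payload shape assumes Anthropic-style prompt caching.
def build_request(system_prompt: str, kb_context: str, user_turn: str) -> dict:
    return {
        "model": "claude-sonnet-4.6",
        "max_tokens": 1024,  # always cap output (see cost traps below)
        "system": [
            # Stable prefix: identical bytes on every request.
            {"type": "text", "text": system_prompt},
            # Cache breakpoint at the end of the stable prefix.
            {"type": "text", "text": kb_context,
             "cache_control": {"type": "ephemeral"}},
        ],
        # Volatile content last.
        "messages": [{"role": "user", "content": user_turn}],
    }
```

The key property is byte stability: any per-request value (timestamp, user name, session ID) in the prefix breaks the cache for every turn that follows it.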
A worked example — projecting a monthly bill
Say you are shipping a customer-support assistant. 2,000 daily active users, 8 conversation turns per DAU per day on average, 15,000 tokens of retrieved context per turn, 250 output tokens per turn. That is 16,000 turns per day, 480,000 turns per month.
Naive pricing — no caching, no routing
Send every turn to Claude Sonnet 4.6 at full price. Input: 480,000 × 15,000 = 7.2 billion tokens at $3/M = $21,600/month. Output: 480,000 × 250 = 120M tokens at $15/M = $1,800/month. Total: $23,400/month. If you are charging $20/user/month, your AI cost is 58% of revenue. The feature is losing money.
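The naive numbers fall out of a few lines of arithmetic, using the volumes and Sonnet prices stated above:

```python
# Naive monthly bill: every turn to Sonnet at sticker price.
DAU, TURNS_PER_DAU, DAYS = 2_000, 8, 30
INPUT_TOK, OUTPUT_TOK = 15_000, 250
SONNET_IN, SONNET_OUT = 3.00, 15.00  # $/1M tokens, from the table

turns = DAU * TURNS_PER_DAU * DAYS                          # 480,000 turns
input_cost = turns * INPUT_TOK * SONNET_IN / 1_000_000      # $21,600
output_cost = turns * OUTPUT_TOK * SONNET_OUT / 1_000_000   # $1,800
naive_total = input_cost + output_cost                      # $23,400
```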
Optimized — caching plus routing
Route 70% of turns (simple questions, no escalation) to Claude Haiku 4.5. Route 30% to Sonnet. Cache the 15,000-token context prefix on both models; assume 80% cache hit rate under load.
- Haiku input: 336K turns × 15K tokens × (0.8 × $0.025 + 0.2 × $0.25) / 1M = $353/month.
- Haiku output: 336K × 250 × $1.25 / 1M = $105/month.
- Sonnet input: 144K × 15K × (0.8 × $0.30 + 0.2 × $3.00) / 1M = $1,814/month.
- Sonnet output: 144K × 250 × $15 / 1M = $540/month.
- Total: ~$2,812/month — an 88% reduction on the naive version.
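The optimized numbers can be reproduced the same way; the 70/30 routing split and 80% cache hit rate are the assumptions stated above, and prices come from the table.

```python
# Optimized monthly bill: 70/30 Haiku/Sonnet routing plus an 80%
# cache hit rate on the 15K-token context prefix.
TURNS, INPUT_TOK, OUTPUT_TOK, HIT = 480_000, 15_000, 250, 0.8

def blended_input_rate(cached: float, fresh: float, hit: float) -> float:
    """Effective $/1M input tokens at a given cache hit rate."""
    return hit * cached + (1 - hit) * fresh

def monthly_cost(turns: float, in_rate: float, out_rate: float) -> float:
    return (turns * INPUT_TOK * in_rate + turns * OUTPUT_TOK * out_rate) / 1e6

haiku = monthly_cost(TURNS * 0.7, blended_input_rate(0.025, 0.25, HIT), 1.25)
sonnet = monthly_cost(TURNS * 0.3, blended_input_rate(0.30, 3.00, HIT), 15.00)
optimized_total = haiku + sonnet  # ~$2,812/month
```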
Same user count, same quality bar, same feature. The difference between $23,400 and $2,812 is pure architecture. At $20/user/month revenue, AI cost drops from 58% to roughly 7% of revenue. That is a viable product.
The team runs this exercise at the start of every AI project. If the naive number does not fit a realistic price point and the optimized number cannot be modeled with credible assumptions, the feature as scoped does not ship. Better to redesign on a spreadsheet than on a burning invoice.
Input versus output — the ratio that decides everything
Output tokens dominate cost on short-input workloads and input tokens dominate on long-context workloads. Which bucket your feature sits in is a design decision, not a fact of nature.
- Chatbots with long responses — output-heavy. Cap max_tokens, use structured outputs where you can, and stream early so users don't wait.
- RAG with short answers — input-heavy. Cache the context prefix, tighten retrieval, and stop passing chunks the model won't use.
- Classification and extraction — input-heavy, output-tiny. Use the cheapest capable model; don't reach for Sonnet when Haiku scores the same on your eval set.
- Agentic workflows — both, plus re-prompting. Every retry, every tool-use error, every re-plan is paid for again. Instrument turn counts and budget them.
The biggest cost trap — uncapped output and silent retries
The most expensive mistake the team sees in production is a feature that does not cap output tokens and does not cap retries. The combination is explosive. A user hits the regenerate button five times on a long-form feature with no max_tokens set, the model returns 8,000-token responses each time, and a single interaction burns what 100 normal interactions cost. Multiply across a thousand users on a bad day and the bill doubles overnight.
Always set max_tokens on every call. Always cap regenerations per session. Always log full token usage per user. The team has audited client systems where the absence of these three controls was responsible for more than half of the monthly bill — and the fix was a week of work that paid for itself on the next invoice.
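A minimal sketch of the three controls wired together — `call_model` here is a hypothetical stand-in for the provider SDK call, and the cap values are illustrative, not recommendations:

```python
# Three guardrails in one wrapper: a hard max_tokens cap, a
# per-session regeneration cap, and per-user token-usage logging.
# `call_model` is a stand-in for your provider SDK call and is
# expected to return a dict with input/output token counts.
import logging

MAX_OUTPUT_TOKENS = 1024   # illustrative cap
MAX_REGENERATIONS = 3      # illustrative cap

log = logging.getLogger("token_usage")

class RegenerationLimit(Exception):
    """Raised when a session exhausts its regeneration budget."""

def guarded_call(call_model, session: dict, user_id: str, prompt: str) -> dict:
    """Enforce output and retry caps; log token usage per user."""
    if session.setdefault("regens", 0) >= MAX_REGENERATIONS:
        raise RegenerationLimit("regeneration cap hit for this session")
    session["regens"] += 1
    response = call_model(prompt, max_tokens=MAX_OUTPUT_TOKENS)
    log.info("user=%s in_tok=%d out_tok=%d",
             user_id, response["input_tokens"], response["output_tokens"])
    return response
```

The logged per-user counts are what make the audit possible later: without them, the outlier sessions that double the bill are invisible until the invoice arrives.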
Context engineering — the tokens you don't send
Every token in the system prompt is paid for on every request. Every unused retrieval chunk is paid for on every request. Every example in a few-shot template is paid for on every request. Context engineering is the unglamorous discipline of shaving these — and at scale it buys real money.
- Write a tight system prompt. Most are 2–3x longer than they need to be. Cutting a 1,500-token prompt to 500 tokens saves two-thirds of every prefix read.
- Compress examples. Few-shot prompts work; five examples rarely outperform two. Pick the best two and stop.
- Cap retrieval context. If your RAG pipeline passes 20 chunks and the model cites two, you are paying for 18 chunks of noise. Start at top-5 and only raise it if evals show you need to.
- Summarize on write, not on read. Pre-compute chunk summaries at ingest; retrieve the summaries for classification and routing, retrieve the full chunks only for final answer generation.
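The last two bullets can be sketched together. The `Chunk` shape and the top-5 default are illustrative assumptions, not a prescribed schema:

```python
# "Cap retrieval" plus "summarize on write": each chunk carries a
# summary pre-computed at ingest time. Routing and classification
# read the cheap summaries; only final answer generation pays for
# the full chunk text. Dataclass shape and top_k default are
# illustrative.
from dataclasses import dataclass

@dataclass
class Chunk:
    text: str      # full chunk, sent only for final answer generation
    summary: str   # pre-computed at ingest; used for routing/classification
    score: float   # retrieval relevance

def select_context(chunks: list[Chunk], top_k: int = 5,
                   full_text: bool = False) -> str:
    """Keep only the top-k chunks; send summaries unless full text is needed."""
    best = sorted(chunks, key=lambda c: c.score, reverse=True)[:top_k]
    return "\n\n".join(c.text if full_text else c.summary for c in best)
```

Raising `top_k` should be an eval-driven decision, per the bullet above: start low, and pay for more context only when the eval set shows it earns its tokens.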
Key takeaways
- Output tokens cost 4–8x input tokens on every frontier model in 2026. Output discipline is a first-class cost lever.
- Prompt caching on Anthropic cuts cached input to ~10% of fresh-read price. Stacked with batch pricing, effective rates drop 95% below sticker.
- A two-model router (Haiku or Flash-Lite for cheap work, Sonnet or GPT-5 for hard work) typically cuts LLM spend 60–80% with no visible quality drop.
- Project your monthly bill from first principles before you launch. If the naive number is unaffordable and the optimized number is uncertain, redesign the feature.
- Always cap max_tokens and retries. Uncapped output plus silent retries is the single biggest cost trap in production AI systems.
- Context engineering (tight system prompts, capped retrieval, pre-computed summaries) is the quietest and most consistent saving available.