Anthropic Prompt Caching cache-first design 2026, AI economics on AutoKaam

DOSSIER · COVER · MAY 18, 2026 · ISSUE LEAD

DOSSIER·May 18, 2026·9 MIN

Anthropic Prompt Caching, The Lever Indian Operators Are Leaving On The Table

10 percent of input cost on cache hits, 25 percent of base on a five-minute TTL write. Most Indian shops are not designing for it. Here is the working playbook.

ByAditya Sharma·May 18, 2026

DOSSIERMAY 18, 2026 · ADITYA SHARMA

Cache reads are billed at 10 percent of the base input price. Cache writes are billed at 25 percent (5 minute TTL) or 200 percent (1 hour TTL) of the base input price. The cache TTL is reset every time the cached content is read.

— Anthropic, Prompt Caching Documentation

What AutoKaam Thinks

On Opus 4.7 a 200K token cached system prompt at base price of 15 dollars per million costs 3,000 to write the first time, 300 per subsequent hit. At ten exchanges per session that is a 6,300 dolla…
The breakpoint placement is the entire game. Put the cache marker on the largest stable block of context, not the latest user message. Most shops do this backwards and their cache hit rate sits bel…
Five-minute TTL refreshes on every cache read. A live conversation keeps the cache hot for free. The 1-hour TTL at 2x write cost is only worth it when sessions have multi-minute gaps and the cache …
Use the new memory-tool beta plus cache breakpoints together. The memory tool emits a stable prefix the cache loves. We saw 87 percent hit ratios on a customer-support agent that combined both.

10%

Of base input price per cache hit

INDIAN SAAS + AGENCY OPERATORS

Named stake

Picture a data agency running nine analyst agents for an enterprise client. Each reads a 180K-token brief at the start of every shift, then makes forty to sixty Claude API calls across the day against that brief. At that volume the naive-prompting bill lands near nineteen thousand dollars a month. Refactor to cache-first design over one weekend, same brief, same model, same agent prompts, with the cache breakpoint moved from the latest user message to the front of the brief, and the bill drops toward four thousand. The only thing that changed was where the cache starts.

This piece is a working playbook for that 78 percent cost drop. It assumes you have read the prompt caching docs and want the operator angles the docs do not surface. If you have not read the docs, read them first, the math here is downstream of understanding what gets billed at 25 percent vs 10 percent vs 200 percent.

What The Pricing Actually Says

Three rates. Cache read at 10 percent of base input price. Cache write with 5-minute TTL at 25 percent of base. Cache write with 1-hour TTL at 200 percent of base. The TTL resets on every read, so a live conversation with messages flowing every minute keeps the cache warm without paying for the longer TTL.

Concrete on Opus 4.7 at $15 per million input tokens:

Cold input: $15 per million
Cache write 5-min: $3.75 per million
Cache write 1-hour: $30 per million
Cache read: $1.50 per million

Output is unchanged at $75 per million regardless of cache state.

The four-rate structure punishes two patterns: tiny one-off calls (you pay 25 percent to write the cache and only read it once) and high-volume bursts beyond the TTL (you re-write every five minutes). It rewards a third pattern aggressively: large stable context, read many times in the same conversation. The shape of an Indian shop's real workload usually fits this third pattern, and most are still running cold.

Where The Hit Rate Comes From

The cache key is built from a contiguous prefix of your prompt. Up to four cache breakpoints per request. Each breakpoint marks the end of a cacheable block. The system iterates breakpoints in order, looking for the longest match, then writes a new cache entry from the longest match forward.

The single biggest mistake is putting the breakpoint at the wrong end of the prompt. The order matters: system prompt, then tools, then long static context, then conversation history, then the latest user message. The breakpoint should sit at the end of the largest stable block, not after the latest user message. The Hyderabad shop above had the breakpoint on the user message, which meant they were caching the brief plus the previous user message together, and the cache invalidated every single turn.

Move the breakpoint to the end of the brief and the conversation history goes uncached, which is fine for short histories at $15 per million on a 4K-token tail. The brief gets read at $1.50 per million instead of $15. Forty calls per agent per shift, 180K input, the cost drops from $108 per agent per shift to about $11. Multiply by nine agents and twenty shifts a month and the four-figure savings turns into mid-five figures.

The Five-Minute TTL Is Not A Bug, It Is The Operator Lever

Anthropic's default TTL is five minutes. Each cache read resets the clock. This is the lever that rewards conversational workflows and punishes batch.

A live Claude Code session where the user types every 30 to 90 seconds keeps the cache hot indefinitely. Effective TTL becomes the session length, not five minutes. Cache hit ratio climbs to 80 percent and higher with no code change. This is why the same Opus 4.7 setup at the same Max plan ships dramatically more work after a few sessions of habit change. The cache habit is the muscle.

A batch job that runs every six minutes burns the cache every cycle. You pay 25 percent every six minutes to refresh, plus 10 percent on each call inside the window. Real cost ends up at roughly 35 to 40 percent of cold. Better than cold but not the 90 percent saving the cache promises. The fix is the 1-hour TTL.

The 1-hour TTL costs 200 percent of base to write, which is the 2x write penalty. It only pencils out if you will hit the cache more than 8 times during that hour. Below 8 hits it is cheaper to re-write the 5-minute TTL twelve times. Above 8 hits the 2x write pays for itself.

The arithmetic to memorise: 200 percent write cost amortises across N reads at (200 + 10N) / N percent. Cross-over with the 5-minute TTL at (25 * M + 10 * N) / N where M is the number of refresh writes, M = 1 + (N / 10) for a roughly steady-state flow at one call per minute. Plug numbers in and the cross-over lands between 7 and 9 reads depending on load. Below that, 5-minute TTL. Above, 1-hour.

Five Patterns That Actually Work

The customer-support agent reading a 50K-token product catalog. Put the catalog after the system prompt with a cache breakpoint at the end. Conversation history follows uncached. Hit ratio 75 to 85 percent in production.

The legal-research agent reading 300K of statute. Same shape. Statute is the cacheable block. Query and conversation follow. The 200K context window matters because you can keep the statute pinned across the whole session.

The codebase-aware agent reading a 100K-token repo dump. Cache the dump. Drop the breakpoint at the end of the dump, not the end of the conversation. This is the one most engineers get wrong because their instinct is to cache the last message.

The Sonnet-cheap-fanout pattern. One Opus 4.7 lead reads a 400K context, writes a 5K-token plan, fans the plan to five Sonnet workers. Cache the 400K on the Opus side at 5-minute TTL. The plan-write is hot. The Sonnet calls run on their own much cheaper rate card and do not need caching.

The memory-tool agent. Anthropic's memory-tool beta emits a stable prefix the cache loves. Combine memory tool with a cache breakpoint at the end of the agent's stable memory block and hit ratios above 85 percent are routine. This is the cache-first design Thariq Shihipar has been advocating in his agent-engineering thread.

Three Patterns That Burn The Cache

Putting a timestamp in the system prompt. The system prompt mutates every call, the cache never matches. Move the timestamp to the user message or remove it entirely if the model does not need to know.

Putting today's date in the tool definitions. Tools are part of the cacheable prefix. A date string in a tool description invalidates the whole tools block. Either keep tools date-free or accept that the tools block will not cache.

Reordering messages between calls. The cache key is order-sensitive. If you rearrange tools or context blocks between requests, you write a new cache entry every time. Pin the order in code, do not let it sort or shuffle.

What This Means For Your May Bill

The crude rule of thumb: if your Claude API bill is over ₹50,000 a month and your agent reads more than 30K tokens of context per call, you are leaving at least 50 percent on the table. A weekend of refactoring against the patterns above pays back in week one of the next billing cycle.

The slightly less crude rule: instrument your bill before refactoring. The Console's usage page is per-day, per-model. Pull the per-request cache_read_input_tokens and cache_creation_input_tokens fields from your API responses and log them. If cache_read is consistently below 50 percent of cache_creation, your breakpoint is in the wrong place. Move it, redeploy, watch the ratio invert within a day. The pattern repeats everywhere an Indian operator leaves money on the table: measure your own numbers before you act, the same discipline that surfaces which search engine your traffic actually comes from instead of the one you assumed.

The cache is not a feature. It is the pricing model now. Design the prompt around the breakpoint, not the breakpoint around the prompt.

Topics

#Anthropic

Adjacent

Anthropic Prompt Caching, The Lever Indian Operators Are Leaving On The Table

What The Pricing Actually Says

Where The Hit Rate Comes From

The Five-Minute TTL Is Not A Bug, It Is The Operator Lever

Five Patterns That Actually Work

Three Patterns That Burn The Cache

What This Means For Your May Bill

More from the same beat.

GLM-5.2 Cleared the Six Hard Tasks I Use to Vet Any Cheap Model

GLM-5.2 Ships 753B Open Weights. My GTX 1660 Holds 6 GB.

Claude Code v2.1.172 Unlocks Recursive Sub-Agents. My Fleet Found Three Walls.