Memoize Your LLM Calls to Stop Burning Quota on Answers You Have
A content-addressed cache that replays deterministic LLM and CLI output at zero tokens, fenced off from anything that must stay fresh.

I kept paying for the same answer.
A classifier prompt in a pipeline. A schema-extraction call I ran every time I re-tested the script. A CLI step in CI that asked a model the exact same question on every push. The inputs had not changed. The model had not changed. The output was identical every time. And every run spent tokens, spent a slice of my pool quota, and made me wait for a reply I already had on screen an hour ago.
So I built a memoization layer. It wraps any deterministic model or CLI call, stores the output keyed on the request, and on a repeat call replays it from disk at zero tokens and zero quota. I call the tool deja. This is how it works and, more important, where I refuse to point it.
Memoization vs prompt caching
These get confused constantly, so let me draw the line first.
Server-side prompt caching is prefix-shrinking. You send a long stable prefix (a system prompt, a big document), the provider caches the processed prefix on their side, and your next call that reuses that prefix is cheaper and faster. But you still make the call. You still get billed, at a reduced rate for the cached tokens and full rate for everything new. The model still runs. It is a discount on work, not a refusal to do it. (If you want the deep version of that, I wrote about cache-first design for prompt caching.)
Memoization is different in kind. It is client-side. The full request is the cache key. If that exact request has been answered before, you do not call the provider at all. No tokens. No quota. No latency past a disk read. The model never runs because there is nothing new to compute.
| Prompt caching | Memoization (deja) | |
|---|---|---|
| Where it lives | Provider side | My machine |
| Keyed on | Stable prefix of the prompt | The whole request |
| On a repeat | Cheaper call, still billed | No call at all |
| Tokens spent | Reduced | Zero |
| Best for | Long shared context, fresh tail | Deterministic work you re-run |
One is a discount. The other is a refusal to re-spend on work already done. You can use both, and they do not overlap.
How it works
The whole design sits on one idea: a deterministic call is a pure function of its inputs, so cache it like one.
The key is sha256(model + prompt-or-argv + content-hashes-of-any-input-files). Three pieces:
- Model. The pin is in the key. My codex bridge runs
gpt-5.5atxhigh, and that pin lives in the hash. The day I bump the model, every key shifts and the old answers stop matching. A model upgrade auto-invalidates the cache. No stale replay from a model you retired. - Prompt or argv. The exact question, or the exact command line. Change one byte and you get a new key.
- Input file content-hashes. If the call reads files, their content is hashed in. Edit an input and the key moves on its own.
The loop is boring, which is the point. Compute the key. Look it up. Hit means replay the stored output from disk. Miss means run the real call, then store the output under that key. Change any input byte and you get a correct, automatic miss. Nothing to invalidate by hand.
The store is a local SQLite file. Mine sits under my Claude state directory and rides along in my normal backups. No service, no daemon, no network.
Using it
The interface is a wrapper. You prefix your existing command with deja and a verb. Nothing else changes.
For a one-shot model question through the codex bridge:
# First run: real call, answer stored under the key
deja gpt ask "List the HTTP idempotent methods, one per line."
# Same question later: replayed from disk, zero tokens, zero quota
deja gpt ask "List the HTTP idempotent methods, one per line."
The first call is a normal miss and costs a pool message. The second is a hit and costs nothing. In my own runs I have watched a roughly 5-second miss collapse to a 0-second hit on the repeat, with the saved-message ledger ticking up.
For a headless Claude call, which spends the metered session window rather than a token pool:
deja claude -p "Classify this commit message as feat, fix, chore, or docs: $MSG"
It captures stdin too, so a heredoc or a pipe is part of the key. The cache stays honest about what actually went in.
For any deterministic CLI or pipeline step, there is a generic wrapper:
# Memoize a deterministic extraction step in a build
deja run --tag extract --file schema.json -- python extract_entities.py schema.json
--file folds that file's content into the key, so editing schema.json forces a clean re-run. Only exit-0 runs get cached, so a failure never poisons the store. And deja's own chatter goes to stderr, which means stdout is byte-for-byte what the wrapped command produced. It is a drop-in. Pipe it, redirect it, diff it, and nothing downstream can tell deja is in the path.
There is a quiet bonus here: incremental pipelines for free. Point a 200-item cron job through deja, and on a run where only two inputs changed, the 198 unchanged items hit the cache and you pay for two. You did not write any diffing logic. The content-addressed key did it.
Where it's safe and where it's dangerous
This is the part to read twice.
deja is safe for deterministic machine work. Classify. Extract. Parse. Score. Summarize something stable. Embed. Anything where the same input should always give the same output, and where a stale-but-identical answer is exactly right. Repeated CI checks, re-tested dev scripts, and audits over inputs that have not moved are the sweet spot.
deja is dangerous for anything that must be fresh. I never point it at voice-critical content. The tutorials and posts under my bylines get written live, every time, because freshness and a human edge are the entire product there. A replayed paragraph is a dead paragraph. That fence is hard. There is no clever exception to it.
One more fence, subtle but real. A hit replays the model's own prior answer to the identical request. So when I genuinely want a fresh re-roll, a second opinion, a different take, I do not call deja. I call the tool directly. Memoization is not a substitute for asking again when asking again is the point.
I learned this fence the unsexy way. A keyword filter I wrote tried to auto-classify which jobs were cache-safe and lumped a scoring audit (safe) together with a content-generation pipeline (not safe). They looked similar from the outside and were opposites in fact. The lesson stuck: I do not auto-enable memoization everywhere. I wire it into deterministic callers one at a time, by hand, and leave interactive and creative calls untouched.
What I save in practice
I am not going to hand you a fake percentage. The honest answer is that the saving scales with how repetitive your work actually is, and that varies wildly by project.
Where it pays clearly for me:
- Dev and debug re-runs. I run the same deterministic prompt twenty times while fixing the code around it. Nineteen of those are now free.
- CI on quiet inputs. A push that did not touch the inputs to a model step replays instead of re-spending.
- Incremental crons. The 200-item job that sees two new items pays for two.
Where it does not pay, and I am up front about this: many of my report scripts embed today's date or the latest data, so the prompt changes daily and never hits. Memoization only helps when the request is genuinely repeated byte-for-byte. If your prompts mutate every run, you will get misses, which is the correct behaviour, not a bug.
When NOT to memoize
A short, blunt checklist. Skip deja when:
- The output must be fresh or creative. Voice content, anything customer-facing that should not read as canned.
- You want a different answer than last time. A re-roll, a second opinion, a creative variation.
- The call has side effects or hits live data. Web search, image generation, anything that writes or fetches the current state of the world. I keep those verbs as straight passthrough; they are never cached.
- The inputs are unique every run. A daily report stamped with the date will miss anyway, so the wrapper only adds noise.
Everything else that is deterministic and repeated is fair game. Wrap it once and stop paying twice.
Related
- Cheap API automation with DeepSeek V3.2. Pair a cheap model with memoization and the marginal cost of a repeat call goes to zero.
- Anthropic prompt caching and cache-first design. The server-side discount that complements client-side memoization.
- How I built a 5-model LLM fallback router. The routing layer that decides which model runs before deja decides whether it runs at all.
- Claude Code setup for Indian developers. The local toolchain I wrap these calls around.
More Automation

Cloudflare API Token Gotchas: The PUT That Wiped Mine Twice
I broke production twice by updating a Cloudflare token's scopes through the public API, then learned the wrangler auth fix and a secret-scrub habit the hard way. This is exactly what bit me and how I handle tokens now.

Fix NVIDIA Cursor and Video Stutter on Linux: GPU Clock Thrash
Cursor jitter and dropped video frames on NVIDIA Linux get blamed on the compositor every time. On my GTX 1660 the real cause was the driver bouncing graphics and VRAM clocks under light load. Here is the fix that held.

Litestream to Cloudflare R2: Disaster Recovery for SQLite
SQLite on one free box is one disk failure away from gone. Here is the exact Litestream-to-R2 setup I run across every PocketBase backend in my stack, including the restore drill and the gotcha that bites first.