⚡ Automation · intermediate

Memoize Your LLM Calls to Stop Burning Quota on Answers You Have

A content-addressed cache that replays deterministic LLM and CLI output at zero tokens, fenced off from anything that must stay fresh.

By Aditya Sharma · Jun 1, 2026 · 7 min read

A terminal showing a memoized LLM call returning a cached hit in zero seconds

I kept paying for the same answer.

A classifier prompt in a pipeline. A schema-extraction call I ran every time I re-tested the script. A CLI step in CI that asked a model the exact same question on every push. The inputs had not changed. The model had not changed. The output was identical every time. And every run spent tokens, spent a slice of my pool quota, and made me wait for a reply I already had on screen an hour ago.

So I built a memoization layer. It wraps any deterministic model or CLI call, stores the output keyed on the request, and on a repeat call replays it from disk at zero tokens and zero quota. I call the tool deja. This is how it works and, more important, where I refuse to point it.

Memoization vs prompt caching

These get confused constantly, so let me draw the line first.

Server-side prompt caching is prefix-shrinking. You send a long stable prefix (a system prompt, a big document), the provider caches the processed prefix on their side, and your next call that reuses that prefix is cheaper and faster. But you still make the call. You still get billed, at a reduced rate for the cached tokens and full rate for everything new. The model still runs. It is a discount on work, not a refusal to do it. (If you want the deep version of that, I wrote about cache-first design for prompt caching.)

Memoization is different in kind. It is client-side. The full request is the cache key. If that exact request has been answered before, you do not call the provider at all. No tokens. No quota. No latency past a disk read. The model never runs because there is nothing new to compute.

|  | Prompt caching | Memoization (deja) |
| --- | --- | --- |
| Where it lives | Provider side | My machine |
| Keyed on | Stable prefix of the prompt | The whole request |
| On a repeat | Cheaper call, still billed | No call at all |
| Tokens spent | Reduced | Zero |
| Best for | Long shared context, fresh tail | Deterministic work you re-run |

One is a discount. The other is a refusal to re-spend on work already done. You can use both, and they do not overlap.

How it works

The whole design sits on one idea: a deterministic call is a pure function of its inputs, so cache it like one.

The key is sha256(model + prompt-or-argv + content-hashes-of-any-input-files). Three pieces:

Model. The pin is in the key. My codex bridge runs gpt-5.5 at xhigh, and that pin lives in the hash. The day I bump the model, every key shifts and the old answers stop matching. A model upgrade auto-invalidates the cache. No stale replay from a model you retired.
Prompt or argv. The exact question, or the exact command line. Change one byte and you get a new key.
Input file content-hashes. If the call reads files, their content is hashed in. Edit an input and the key moves on its own.
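The three pieces above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not deja's actual code: `cache_key`, `_field`, and the model names are hypothetical, and the length-prefixing is one possible way to keep adjacent fields from bleeding into each other when they are concatenated.

```python
import hashlib
from pathlib import Path
from typing import Sequence


def _field(data: bytes) -> bytes:
    # Length-prefix every field so ("ab", "c") and ("a", "bc") hash differently.
    return len(data).to_bytes(8, "big") + data


def cache_key(model: str, request: Sequence[str], input_files: Sequence[str] = ()) -> str:
    """Hypothetical key builder: model pin + prompt/argv + input file content hashes."""
    h = hashlib.sha256()
    h.update(_field(model.encode("utf-8")))
    # Record how many request parts follow, then each part (prompt or argv entry).
    h.update(len(request).to_bytes(8, "big"))
    for part in request:
        h.update(_field(part.encode("utf-8")))
    # Hash file contents, not paths or mtimes, so only a real edit moves the key.
    for path in input_files:
        h.update(_field(hashlib.sha256(Path(path).read_bytes()).digest()))
    return h.hexdigest()


question = ["ask", "List the HTTP idempotent methods, one per line."]
same_model = cache_key("model-v1", question) == cache_key("model-v1", question)
bumped_model = cache_key("model-v1", question) == cache_key("model-v2", question)
print(same_model, bumped_model)
```

The same request under the same pin yields the same key, while changing only the model string yields a different one, which is the auto-invalidation described above.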

The loop is boring, which is the point. Compute the key. Look it up. Hit means replay the stored output from disk. Miss means run the real call, then store the output under that key. Change any input byte and you get a correct, automatic miss. Nothing to invalidate by hand.

The store is a local SQLite file. Mine sits under my Claude state directory and rides along in my normal backups. No service, no daemon, no network.
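To make the hit-or-miss loop concrete, here is a small sketch on top of Python's built-in `sqlite3` module. The table layout and the names `open_store` and `memoized` are assumptions for illustration; the real schema is not shown in this post.

```python
import sqlite3
from typing import Callable, Tuple


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    # Hypothetical schema: one row per key. Pass a file path for an on-disk store.
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS cache (key TEXT PRIMARY KEY, output BLOB NOT NULL)"
    )
    return conn


def memoized(conn: sqlite3.Connection, key: str, compute: Callable[[], bytes]) -> Tuple[bytes, bool]:
    """Return (output, was_hit). Runs `compute` only when the key is missing."""
    row = conn.execute("SELECT output FROM cache WHERE key = ?", (key,)).fetchone()
    if row is not None:
        return row[0], True  # Hit: replay the stored bytes.
    output = compute()  # Miss: make the real call.
    conn.execute("INSERT OR REPLACE INTO cache (key, output) VALUES (?, ?)", (key, output))
    conn.commit()
    return output, False
```

Nothing here invalidates entries. A changed input produces a different key, so the old row is simply never looked up again.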

Using it

The interface is a wrapper. You prefix your existing command with deja and a verb. Nothing else changes.

For a one-shot model question through the codex bridge:

# First run: real call, answer stored under the key
deja gpt ask "List the HTTP idempotent methods, one per line."

# Same question later: replayed from disk, zero tokens, zero quota
deja gpt ask "List the HTTP idempotent methods, one per line."

The first call is a normal miss and costs a pool message. The second is a hit and costs nothing. In my own runs I have watched a roughly 5-second miss collapse to a 0-second hit on the repeat, with the saved-message ledger ticking up.

For a headless Claude call, which spends the metered session window rather than a token pool:

deja claude -p "Classify this commit message as feat, fix, chore, or docs: $MSG"

It captures stdin too, so a heredoc or a pipe is part of the key. The cache stays honest about what actually went in.

For any deterministic CLI or pipeline step, there is a generic wrapper:

# Memoize a deterministic extraction step in a build
deja run --tag extract --file schema.json -- python extract_entities.py schema.json

--file folds that file's content into the key, so editing schema.json forces a clean re-run. Only exit-0 runs get cached, so a failure never poisons the store. And deja's own chatter goes to stderr, which means stdout is byte-for-byte what the wrapped command produced. It is a drop-in. Pipe it, redirect it, diff it, and nothing downstream can tell deja is in the path.
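The three guarantees in that paragraph (only exit-0 runs cached, chatter on stderr, stdout untouched) could look roughly like this in Python. It is a sketch, not deja's implementation: `run_memoized` is a hypothetical name, an in-memory dict stands in for the SQLite store, and the key is simplified to the argv alone.

```python
import hashlib
import subprocess
import sys
from typing import Dict, List, Tuple


def run_memoized(argv: List[str], store: Dict[str, bytes]) -> Tuple[int, bytes]:
    """Hypothetical wrapper: replay stdout on a hit, cache only exit-0 runs."""
    # argv entries cannot contain NUL, so joining on it is unambiguous.
    key = hashlib.sha256("\0".join(argv).encode("utf-8")).hexdigest()
    if key in store:
        print(f"[memo] hit {key[:12]}", file=sys.stderr)
        return 0, store[key]
    proc = subprocess.run(argv, stdout=subprocess.PIPE)
    if proc.returncode == 0:
        store[key] = proc.stdout  # Store the exact bytes the command produced.
        print(f"[memo] miss {key[:12]}, stored", file=sys.stderr)
    else:
        # A failed run is passed through but never written to the store.
        print(f"[memo] exit {proc.returncode}, not cached", file=sys.stderr)
    return proc.returncode, proc.stdout
```

A command-line front end would write the returned bytes to `sys.stdout.buffer` and exit with the returned code, so anything downstream sees the same stream it would have seen without the wrapper.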

There is a quiet bonus here: incremental pipelines for free. Point a 200-item cron job through deja, and on a run where only two inputs changed, the 198 unchanged items hit the cache and you pay for two. You did not write any diffing logic. The content-addressed key did it.
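A toy version of that 200-item run shows why no diffing logic is needed. This sketch assumes each item is sent as its own call and is keyed on the model plus the item's content; `process_items` and `fake_model` are hypothetical stand-ins for the real job and the real model call.

```python
import hashlib
from typing import Callable, Dict, Tuple


def process_items(
    items: Dict[str, str],
    cache: Dict[str, str],
    call_model: Callable[[str], str],
    model: str = "model-v1",
) -> Tuple[Dict[str, str], int]:
    """Return (results, paid_calls). Unchanged items are served from the cache."""
    results: Dict[str, str] = {}
    paid = 0
    for name, content in items.items():
        key = hashlib.sha256(f"{model}\0{content}".encode("utf-8")).hexdigest()
        if key not in cache:
            cache[key] = call_model(content)  # Only new or edited content gets here.
            paid += 1
        results[name] = cache[key]
    return results, paid


def fake_model(text: str) -> str:
    # Stand-in for a deterministic model call.
    return text.upper()


cache: Dict[str, str] = {}
items = {f"item-{i}": f"content {i}" for i in range(200)}
_, first_run = process_items(items, cache, fake_model)

items["item-7"] = "content 7 edited"
items["item-42"] = "content 42 edited"
_, second_run = process_items(items, cache, fake_model)
print(first_run, second_run)
```

The first run pays for all 200 items. The second pays only for the two whose content changed.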

Where it's safe and where it's dangerous

This is the part to read twice.

deja is safe for deterministic machine work. Classify. Extract. Parse. Score. Summarize something stable. Embed. Anything where the same input should always give the same output, and where a stale-but-identical answer is exactly right. Repeated CI checks, re-tested dev scripts, and audits over inputs that have not moved are the sweet spot.

deja is dangerous for anything that must be fresh. I never point it at voice-critical content. The tutorials and posts under my bylines get written live, every time, because freshness and a human edge are the entire product there. A replayed paragraph is a dead paragraph. That fence is hard. There is no clever exception to it.

One more fence, subtle but real. A hit replays the model's own prior answer to the identical request. So when I genuinely want a fresh re-roll, a second opinion, a different take, I do not call deja. I call the tool directly. Memoization is not a substitute for asking again when asking again is the point.

I learned this fence the unsexy way. A keyword filter I wrote tried to auto-classify which jobs were cache-safe and lumped a scoring audit (safe) together with a content-generation pipeline (not safe). They looked similar from the outside and were opposites in fact. The lesson stuck: I do not auto-enable memoization everywhere. I wire it into deterministic callers one at a time, by hand, and leave interactive and creative calls untouched.

What I save in practice

I am not going to hand you a fake percentage. The honest answer is that the saving scales with how repetitive your work actually is, and that varies wildly by project.

Where it pays clearly for me:

Dev and debug re-runs. I run the same deterministic prompt twenty times while fixing the code around it. Nineteen of those are now free.
CI on quiet inputs. A push that did not touch the inputs to a model step replays instead of re-spending.
Incremental crons. The 200-item job where only two items changed since the last run pays for two.

Where it does not pay, and I am up front about this: many of my report scripts embed today's date or the latest data, so the prompt changes daily and never hits. Memoization only helps when the request is genuinely repeated byte-for-byte. If your prompts mutate every run, you will get misses, which is the correct behaviour, not a bug.
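The date case is easy to see in miniature. In this sketch, `request_key` and `report_prompt` are hypothetical helpers and the key is simplified to model plus prompt. A prompt that embeds the date differs by a few bytes each day, and that is enough to produce a new key.

```python
import hashlib
from datetime import date


def request_key(model: str, prompt: str) -> str:
    # Simplified key: model pin and prompt only.
    return hashlib.sha256(f"{model}\0{prompt}".encode("utf-8")).hexdigest()


def report_prompt(day: date) -> str:
    # Hypothetical report prompt that stamps the date into the request.
    return f"Today is {day.isoformat()}. Summarize the open incidents."


monday = request_key("model-v1", report_prompt(date(2026, 6, 1)))
tuesday = request_key("model-v1", report_prompt(date(2026, 6, 2)))
monday_again = request_key("model-v1", report_prompt(date(2026, 6, 1)))
print(monday == tuesday, monday == monday_again)
```

Two runs on the same day share a key. Runs on different days never do, so a script that runs once a day will miss every time.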

When NOT to memoize

A short, blunt checklist. Skip deja when:

The output must be fresh or creative. Voice content, anything customer-facing that should not read as canned.
You want a different answer than last time. A re-roll, a second opinion, a creative variation.
The call has side effects or hits live data. Web search, image generation, anything that writes or fetches the current state of the world. I keep those verbs as straight passthrough; they are never cached.
The inputs are unique every run. A daily report stamped with the date will miss anyway, so the wrapper only adds noise.

Everything else that is deterministic and repeated is fair game. Wrap it once and stop paying twice.
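One way to encode that checklist is a default-deny gate: nothing is memoized unless a caller was opted in by hand, and passthrough verbs are never cached regardless. The sketch below is an assumption about how such a gate could look. The verb and caller names are invented for illustration and are not deja's real configuration.

```python
# Hypothetical names throughout; none of these come from deja itself.
PASSTHROUGH_VERBS = frozenset({"search", "image"})  # side effects or live data
MEMOIZED_CALLERS = frozenset({"ci-classify", "schema-extract"})  # opted in by hand


def should_memoize(caller: str, verb: str) -> bool:
    """Default deny: cache only hand-picked callers, and never a passthrough verb."""
    if verb in PASSTHROUGH_VERBS:
        return False
    return caller in MEMOIZED_CALLERS


print(should_memoize("ci-classify", "ask"))     # opted-in caller, cacheable verb
print(should_memoize("ci-classify", "search"))  # passthrough verb wins
print(should_memoize("blog-draft", "ask"))      # never opted in
```

Keeping the allowlist explicit avoids the failure mode of guessing cache-safety from how a job looks from the outside.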

Related

Cheap API automation with DeepSeek V3.2. Pair a cheap model with memoization and the marginal cost of a repeat call goes to zero.
Anthropic prompt caching and cache-first design. The server-side discount that complements client-side memoization.
How I built a 5-model LLM fallback router. The routing layer that decides which model runs before deja decides whether it runs at all.
Claude Code setup for Indian developers. The local toolchain I wrap these calls around.

