I Gave My AI Agents a Memory With SQLite FTS5 (No Vector DB)
2000+ markdown notes, BM25 retrieval in under a second at zero API cost, queried by agents before they bother me.

An agent that re-asks a question you already answered is worse than one with no memory at all, because you trusted it. Mine used to do this. I run a self-hosted stack of coding and ops agents on a Linux box, and they share a folder of markdown notes I have built up over months: people, projects, decisions, things that broke and how I fixed them. One day an agent stopped mid-task to ask me who a particular collaborator was. The answer was sitting in ten files. It just had no way to look.
The reflex fix is a vector database. Embed every note, stand up Pinecone or pgvector, do similarity search. I did not do that. I put the whole corpus behind SQLite FTS5 with BM25 ranking, and the agents now query it before they interrupt me. Retrieval is under a second and costs zero tokens. This is the build, and the honest account of where lexical search wins and where it does not.
Why not a vector DB
Reach for vectors when the query and the document share meaning but no words. "How do I stop the model returning blank output" should find a note titled "empty content field counts as a failover trigger." Lexical search misses that. Embeddings catch it. That is real and it matters.
For a personal corpus of a couple thousand markdown notes, it is mostly not the problem you have. My queries are names, project slugs, error codes, config flags, the literal token I half-remember. When an agent wants the file on a specific person or a specific env variable, it is searching for that exact string, and BM25 over an inverted index is not just good enough, it is better. A cross-encoder cannot score a random ID it has never seen; the exact lexical hit can.
The cost ledger settled it. A vector setup means an embedding model in the loop, an API bill or a GPU resident in memory, a second store to keep in sync, and a rebuild every time a note changes. FTS5 ships inside SQLite. No server, no embeddings, no network call. The index is one file on disk. Retrieval is a query against a C extension that has been hardened for twenty years. For an agent tool that fires on every session, "instant and free" beats "smart and slow."
The rule I landed on: lexical first, vectors only where you measure them earning their cost. More on that measurement below, because I did eventually add a semantic path and it taught me exactly where the line sits.
The build
FTS5 is a built-in SQLite extension. No install. You create a virtual table, insert your text, and query it with MATCH. BM25 ranking is built in too, exposed as a bm25() function that returns a relevance score (lower is more relevant in SQLite's convention, so you sort ascending).
Here is the indexer. It walks a directory, reads every markdown file, and stores the path and body in an FTS5 table.
#!/usr/bin/env python3
"""Index a folder of markdown into a SQLite FTS5 table for BM25 search."""
import sqlite3
from pathlib import Path
DB = Path("knowledge.db")
NOTES = Path("notes") # your markdown corpus
def build_index():
con = sqlite3.connect(DB)
con.execute("DROP TABLE IF EXISTS docs")
# external-content-free FTS5 table: stores path + body, tokenizes body
con.execute("""
CREATE VIRTUAL TABLE docs USING fts5(
path UNINDEXED, -- returned, not searched
body, -- tokenized + searched
tokenize = 'porter unicode61'
)
""")
rows = []
for f in NOTES.rglob("*.md"):
if "/archive/" in str(f): # skip retired notes
continue
rows.append((str(f), f.read_text(encoding="utf-8", errors="ignore")))
con.executemany("INSERT INTO docs (path, body) VALUES (?, ?)", rows)
con.commit()
print(f"indexed {len(rows)} files into {DB}")
con.close()
if __name__ == "__main__":
build_index()
Two choices worth calling out. The porter unicode61 tokenizer folds case, splits on unicode boundaries, and stems English so "indexing" matches "index." And path is marked UNINDEXED, so it rides along in results without polluting the search. Index roughly two thousand markdown files this way and the database lands in the low tens of megabytes and builds in a few seconds.
The query is where BM25 does the work:
def search(query: str, limit: int = 10):
con = sqlite3.connect(DB)
con.row_factory = sqlite3.Row
rows = con.execute(
"""
SELECT
path,
bm25(docs) AS score,
snippet(docs, 1, '<<', '>>', ' ... ', 12) AS snip
FROM docs
WHERE docs MATCH ?
ORDER BY score -- BM25: lower = more relevant
LIMIT ?
""",
(query, limit),
).fetchall()
con.close()
return [(r["path"], r["score"], r["snip"]) for r in rows]
for path, score, snip in search("fallback router empty content", limit=5):
print(f"{score:7.2f} {path}\n {snip}\n")
bm25(docs) is the ranking function, weighting term frequency against how rare the term is across the corpus, so a query word that appears in two files outranks one that appears in two hundred. snippet() returns a short window of the matched text with the hit wrapped in markers, which is exactly what an agent wants: enough to judge relevance without reading the whole file. The MATCH operator also takes FTS5 query syntax, so "exact phrase", term1 AND term2, and prefix matches like index* all work without extra code.
The whole engine is a table, an insert loop, and one SELECT. No model, no vectors, no service.
Wiring it so agents query before asking
A retrieval function does nothing if the agent does not call it. The wiring that changed everything was exposing it as a tool the agent discovers on its own. In my case that is an MCP server, the same small-Python-server pattern I use to give Claude Code direct access to my stack. The server wraps the search() function above and advertises a handful of tools: a knowledge search, a session-history search, and an index-stats call.
The lever is the tool description. The agent reads it to decide when to reach for the tool, so it has to say, in plain terms, use this before asking the human. Mine reads close to: "search my own notes for a person, project, or past decision; use before asking who or what something is." Once that description was right, the behaviour shifted without any prompting in the system message. The agent hits a name it does not recognise, calls the tool, gets ten ranked snippets in about a second, and keeps working. It stopped interrupting me to ask things the corpus already knew.
This is the whole point of the pattern. The memory is not something you paste into context at the start of a session and hope survives. It is a tool the agent pulls from on demand, mid-task, the moment it hits an unknown. Pull beats push for this, because the agent only retrieves what the current step actually needs, and a couple thousand notes never have to fit in the context window.
One operational nicety: the index rebuilds itself on every call unless told to skip, picking up new and changed files. Freshness is on-demand, not a cron you forget to run.
What broke, and what I learned
The lexical-only version had a real ceiling, and I only saw its exact shape after I built the semantic path to compare against. I stood up a local embedding model (bge-m3, running on CPU to dodge a GPU quirk), embedded the corpus, and measured both retrievers against a fixed set of test queries with a precision-at-3 gate. The numbers were blunt:
| retriever | keyword queries (p@3) | paraphrase queries (hit@3) |
|---|---|---|
| FTS5 BM25 | 93.5% | 38% |
| pure vector (bge-m3) | 97.4% | 50% |
FTS5 is strong on the queries that are literally how I search: names, slugs, flags. It is weak on paraphrase, the case where I describe a thing in words the note does not use. Vectors flip it: better on paraphrase, and it turned out a touch better on keyword too, but they came with a cost FTS5 never has.
Then the part I did not expect. I tried fusing the two with reciprocal-rank fusion, the textbook hybrid move, and it regressed below both. When one ranker is strong on both axes, blending a weaker lexical ranker into it drags a noise file up the list. Fusing a strong ranker with a weak one is not free; it can be negative.
So the shipped design is a router, not a blend. A small regex checks whether the query looks lexical: an ID, an error code, a snake_case flag, a hash, a quoted phrase. If yes, it goes to FTS5 BM25, which nails the exact hit a vector model cannot. Everything else goes to the vector path. There was a genuine failure mode underneath this: pure vector almost never returns zero hits, so an exact-token query for a specific ID would get plausible-looking neighbours instead of the one correct file, and the FTS5 fallback never fired. The router is what closes that gap. The cheap lexical engine did not get replaced. It got promoted to the thing that handles the queries it is actually best at.
If I were starting over on a smaller corpus, I would ship FTS5 alone and stop. The semantic layer earns its place only once you can measure a paraphrase miss rate you actually care about.
What it costs
Nothing. SQLite ships with Python. FTS5 is compiled in. The index is a file on local disk in the low tens of megabytes for a couple thousand notes. A query returns in under a second and makes zero API calls and burns zero tokens, so an agent can hit it on every uncertain entity without a second thought about cost. That last property is what makes the pull-on-demand pattern viable: a retrieval you can call freely is a retrieval the agent will actually use. The semantic path I added later does cost a resident embedding model in memory, which is the whole reason it is gated behind the router and not the default.
When NOT to use this
Skip pure FTS5 if your queries are mostly semantic: a support bot over docs where users ask in their own words, or any case where the match is meaning, not shared tokens. There the 38% paraphrase rate is the headline number and you want embeddings leading, the way I argue for matching machinery to load when I pick PocketBase over Supabase for bootstrapped founders.
Skip the self-managed file-and-index route entirely if your "corpus" is already in a database with a search product attached, or if it is large enough and multi-user enough that you need a real search service with replication and access control. FTS5 is a single file on one box. That is its strength for a personal agent memory and its ceiling for anything that has to scale out. For thousands of markdown notes that one operator owns, queried by agents that must not stop to ask, it is the right tool, and the cheapest one in the room.
Related
More AI Coding

Building a Custom MCP Server in Python: Claude Reaches My Stack
Claude Code is sharp until it hits the edge of your machine and your private tools. I wrote three small MCP servers in Python to close that gap. Here is the real pattern, the real gotcha that bit me, and what it costs.

Claude Code Subagents in Practice: Fork Flag, Cache Leak, Worktree Trap
Fanning out subagents in Claude Code looks free until you hit the cap or your forks clobber each other's commits. These are the real fixes I learned running fanouts: the fork env flag that shares the parent's cache, the WebFetch cache leak, and the worktree pattern for parallel writers.

How I Built a 5-Model LLM Fallback Router
I run a production router that fails over across DeepSeek, two OpenRouter free models, Azure Foundry, and Cerebras. By the end you will have the exact ordered chain I use, the failover triggers, and the one model bug that silently ate my output for ten minutes.