Building a Custom MCP Server in Python: Claude Reaches My Stack
How I wrote a handful of Python MCP servers so Claude Code can do PDF work, local OCR, and memory recall without leaving the terminal

Claude Code is sharp inside the terminal and useless one inch past it. Ask it to extract a table from a scanned PDF, OCR a screenshot sitting in /tmp, or look up what you decided about a project three weeks ago, and it shrugs. The model has no hands on your machine, your files, or the private tools you actually run. You end up copy-pasting context in, copy-pasting output back, and the agent loop dies.
MCP is the fix, and the part nobody tells you is that writing your own server is small. I run a self-hosted stack of little AI services on a Linux box, and I gave Claude Code direct access to three things it could not touch on its own: Adobe-grade PDF operations, a local vision model for OCR, and full-text recall over my own notes. Three Python files. Each one is under 500 lines. This is exactly how they work.
What MCP actually is
Model Context Protocol is a thin contract between an AI client (Claude Code, in my case) and a server you control. The server advertises a list of tools. Each tool is a function with a name, a docstring, and typed arguments. The client reads that list, decides when a tool fits the task, calls it with arguments, and gets a string back. No model weights, no magic. The server is just a process that speaks the protocol over stdin and stdout.
Two facts make this practical. First, the transport for a local server is plain stdio, so there is no port, no web server, no auth dance. Claude Code launches your script as a subprocess and talks to it over pipes. Second, the Python SDK gives you a decorator, so a tool is a normal function with @mcp.tool() on top. The docstring you write becomes the description the model reads to decide whether to call it. Write a vague docstring and the model calls the tool at the wrong moment. Write a precise one and it behaves.
The server is the bridge between a capable model and the boring, specific, private parts of your stack. Nothing more.
What I built
I run three custom servers, all registered in one config file. Each exists because Claude Code genuinely could not do the job without it.
| Server | Tools | Why it is local |
|---|---|---|
adobe-pdf |
9: create, export, extract, compress, ocr, merge, protect, linearize, status | Wraps the Adobe PDF Services REST API. Govt forms, invoices, study material, deck exports all flow through it. |
gemma-vision |
6: analyze_image, ocr, describe_screenshot, extract_data, compare_images, ask_gemma | Local Ollama vision model. OCR and screenshot debugging without burning Claude's context or sending images to a cloud. |
recall |
8: recall, recall_history, recall_stats, get_entity, who_works_on, list_decisions, find_supersedes, traverse_links | Full-text search over 2011 of my own markdown files. Lets Claude look up a person, project, or past decision before asking me. |
The adobe-pdf server is the workhorse. Its extract_pdf tool preserves table structure and reading order that pdfplumber and PyPDF can not match, which matters when the input is a scanned government form. The gemma-vision server points at an Ollama daemon on localhost:11434 running gemma3:4b-it-qat, so screenshot OCR happens on my own GPU and never leaves the box. The recall server shells out to a small FTS5 index over my notes and returns ranked snippets, so the agent stops asking me "who is X" when X is documented in ten files.
The pattern across all three is identical, which is the point. Learn it once.
A minimal MCP server in Python
Here is a real, runnable server. It uses FastMCP from the official mcp package, exposes two tools, and runs over stdio. This is the same skeleton my three production servers are built on.
#!/usr/bin/env python3
"""Toolbox MCP server: a minimal stdio server with two real tools."""
import shutil
import subprocess
from pathlib import Path
from mcp.server.fastmcp import FastMCP
mcp = FastMCP(
"toolbox",
instructions=(
"Local utility tools for the developer's machine. "
"Use word_count to count words in a text file; "
"use disk_free to report free space on a path."
),
)
@mcp.tool()
def word_count(file_path: str) -> str:
"""Count words, lines, and characters in a UTF-8 text file.
Args:
file_path: Absolute path to the text file.
Returns: A one-line summary, or an error string starting with ERROR.
"""
p = Path(file_path).expanduser().resolve()
if not p.is_file():
return f"ERROR: file not found: {p}"
text = p.read_text(encoding="utf-8", errors="replace")
words = len(text.split())
lines = text.count("\n") + 1
return f"OK: {words:,} words, {lines:,} lines, {len(text):,} chars in {p.name}"
@mcp.tool()
def disk_free(path: str = "/") -> str:
"""Report free disk space on the filesystem holding the given path.
Args:
path: Any path on the target filesystem (default: root).
Returns: Free / total in GB, or an error string.
"""
try:
usage = shutil.disk_usage(Path(path).expanduser())
except OSError as e:
return f"ERROR: {type(e).__name__}: {e}"
gb = 1024 ** 3
return f"OK: {usage.free / gb:.1f} GB free of {usage.total / gb:.1f} GB on {path}"
if __name__ == "__main__":
mcp.run()
A few things are deliberate. Every tool returns a string, and on failure it returns a string that starts with ERROR instead of raising. The model handles a returned error far better than a stack trace, and it can decide to retry or report. Every argument is typed and described in the docstring, because the docstring is the only thing the model sees when it decides whether to call the tool. Paths are resolved to absolute and existence-checked, because an MCP server has no working directory you can rely on. The mcp.run() call with no arguments defaults to stdio transport, which is exactly what a locally launched server wants.
Install the dependency in a virtual environment, not system Python. On a modern Ubuntu box, system pip is PEP 668 locked and refuses to touch system packages:
python3 -m venv ~/.venvs/mcp
source ~/.venvs/mcp/bin/activate
pip install "mcp[cli]"
Run it once by hand to confirm it does not crash on import. A clean stdio server started from the terminal will sit silently waiting for protocol input, which is correct. Hit Ctrl-C to exit.
Wiring it into Claude Code
Claude Code reads MCP servers from a JSON config. The shape is the same whether you use a project-level .mcp.json or the user config. Each server entry is just a command and its arguments. Here are my three real servers, lightly trimmed:
{
"mcpServers": {
"gemma-vision": {
"command": "python3",
"args": ["/home/you/.claude/mcp-servers/gemma-vision/server.py"],
"env": {
"OLLAMA_URL": "http://localhost:11434",
"GEMMA_MODEL": "gemma3:4b-it-qat",
"GEMMA_TIMEOUT": "120"
}
},
"adobe-pdf": {
"command": "python3",
"args": ["/home/you/.claude/mcp-servers/adobe-pdf/server.py"],
"env": {}
},
"recall": {
"command": "python3",
"args": ["/home/you/.claude/mcp-servers/recall/server.py"],
"env": {}
}
}
}
The command is the interpreter, the args array points at your script with an absolute path, and env injects configuration. I keep the Ollama URL and model name in env so I can repoint gemma-vision at a different model without editing code. Credentials never go in this file. The adobe-pdf server reads its OAuth secret from a separate file in a vault directory locked to mode 600, and the config carries nothing sensitive. After editing the config, restart Claude Code so it relaunches the subprocesses, then run /mcp to confirm each server is listed and its tools loaded.
If you are already running Claude Code on Linux and want the base install dialed in first, the Claude Code setup walkthrough covers that ground.
What broke for me
The honest one. I tried to give recall a semantic search layer using a local embedding model, bge-m3, served through Ollama on my home GTX 1660. Short test strings embedded fine. Then real-length input, anything past roughly two thousand characters, started returning HTTP 500: failed to encode response: json: unsupported value: NaN. Every time, regardless of content.
My first guess was an input-length bug, so I truncated the chunks. It did not help. The actual cause is hardware: the GTX 16xx line has crippled, half-rate FP16, and the model was being served in F16. The float16 matmuls overflow to NaN, the server can not serialize NaN to JSON, and you get a 500 that looks like a length bug but is not. The fix was to force the embedder onto CPU by passing "options": {"num_gpu": 0} in the embed request, which costs about a fifth of a second per query and returns clean vectors. The vision model survived untouched because gemma3:4b-it-qat is quantization-aware, not raw F16.
The lesson generalizes. Any F16 model on that GPU can NaN on non-trivial input, and short-string smoke tests hide it completely. When you wire a new local model into a server, test it on real-length input before you trust it, not on a three-word ping.
What it costs
The servers themselves are free. The mcp package is open source, the code is yours, and stdio servers consume nothing when idle because they only run when Claude Code launches them.
The backends are where rupees show up, and mine are deliberately cheap. The gemma-vision server runs entirely on local hardware, so OCR and screenshot analysis cost nothing per call and never touch a metered API. The adobe-pdf server rides the Adobe PDF Services free tier at 500 document transactions a month, which covers a steady stream of forms and invoices at zero cost. The recall server is a local SQLite FTS5 index, so it costs nothing and returns in about a second. Across the three, the marginal cost of a tool call is effectively zero, which is the entire reason to self-host them instead of renting an equivalent. Google's official analytics-mcp server, which I also run, installs through pipx and is free as well; the only thing it touches is your own GA data.
The hidden saving is context. Letting a local vision model do OCR keeps large images out of Claude's context window, and letting recall answer "who is X" stops the agent from asking me and burning a turn on it.
When NOT to build a custom server
Not every gap needs a server. If the work is a single shell command the agent already has permission to run, a server is overhead. If an official MCP server exists for the thing you want, like Google's analytics server, use it instead of reinventing it. If the tool needs interactive auth on every call, or a long-lived stateful session, stdio gets awkward and you are better off with a small local HTTP API the agent calls through a thin wrapper.
Build a server when there is a private capability the agent will reach for repeatedly: your own data, a paid API you want gated behind a free-tier budget, local hardware like a GPU model, or an internal tool with its own auth. This is the line. Repeated, private, and worth a clean tool boundary. Everything else is a one-off command.
The payoff is that Claude Code stops being a chat box that happens to live in a terminal and becomes something that can actually operate your stack. The first server takes an afternoon. The second takes twenty minutes, because the pattern never changes.
Related
More AI Coding

Claude Code Subagents in Practice: Fork Flag, Cache Leak, Worktree Trap
Fanning out subagents in Claude Code looks free until you hit the cap or your forks clobber each other's commits. These are the real fixes I learned running fanouts: the fork env flag that shares the parent's cache, the WebFetch cache leak, and the worktree pattern for parallel writers.

I Gave My AI Agents a Memory With SQLite FTS5 (No Vector DB)
Most agent-memory setups reach for Pinecone or pgvector by reflex. I put 2000+ markdown files behind SQLite FTS5 with BM25 ranking, and my agents now answer their own 'who is X' questions in under a second for zero tokens. Here is the schema, the query, and the one place lexical search loses.

How I Built a 5-Model LLM Fallback Router
I run a production router that fails over across DeepSeek, two OpenRouter free models, Azure Foundry, and Cerebras. By the end you will have the exact ordered chain I use, the failover triggers, and the one model bug that silently ate my output for ten minutes.