AutoKaam Playbook
Ollama, the Local Model Runtime I Actually Trust
One binary, one model registry, zero cloud dependency. The default I reach for first.
Last reviewed:
The operator take
I run Ollama every day on my ThinkCentre M75q. It is the layer that lets the empire's gemma-vision MCP work, the layer that backs my screenshot OCR pipeline, and the fastest path I have found to "model in, JSON out" without paying anyone. The install is one curl command, the model registry resolves like Docker Hub for weights, and the OpenAI-compatible API surface means I can wire any tool that already speaks OpenAI to it without writing glue.
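The base_url swap in practice: a minimal sketch using the stock openai Python client against Ollama's OpenAI-compatible /v1 endpoint on the default port. The model tag and prompt are just placeholders from my own setup; use whatever you have pulled locally.

```python
# Point the stock OpenAI client at a local Ollama instance.
# Ollama exposes an OpenAI-compatible API at /v1 on its default port 11434;
# the api_key is required by the client but ignored by Ollama.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",  # placeholder, Ollama does not check it
)

response = client.chat.completions.create(
    model="qwen-2.5:7b",  # whatever `ollama pull` has fetched locally
    messages=[{"role": "user", "content": "Return {\"status\": \"ok\"} as JSON."}],
)
print(response.choices[0].message.content)
```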
What surprised me is how stable it has been. I have been on Ollama through three minor revs in the last two months and have not had to debug a single launch failure. The team gets the things that matter: model warm-up, GPU offload tiers, context caching across requests. Memory headroom on a 32GB box with no GPU is fine for 7B-class models at q4 quantization. I keep gemma-2:2b loaded for the vision pipeline because the latency is reliably under a second per page on my CPU, and qwen-2.5:7b for everything that needs a longer reply.
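Keeping a model resident is one request parameter away. A minimal sketch, assuming Ollama's documented keep_alive field on the native generate endpoint; a negative value keeps the model loaded indefinitely, and 0 unloads it.

```python
# Preload and pin a model so the first request of the day is not a cold start.
# Per Ollama's docs, a generate call with no prompt just loads the model,
# and keep_alive=-1 keeps it resident until explicitly unloaded.
import requests

requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "gemma-2:2b", "keep_alive": -1},
    timeout=120,
)
```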
The trade-off I hit early was the model registry. The first time I pulled gemma-2:9b over my Jio fibre at 50Mbps, it stalled at 47% for ten minutes; I switched mirrors via the OLLAMA_HOST_URL trick and it sailed through the second time. If you see a stall on an Indian connection, it is almost always the mirror, not your box. The other rough edge is Pi 4B-class hardware, where I have to be honest with myself: anything above 2B with a 4K context starts swapping, and the box becomes unresponsive within 90 seconds. So on the Pi my rule is: only the e2b variants, never the dense 7B, and only with --num-ctx 1024 unless I want to babysit it. The sketch below shows what that cap looks like over the API.
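What that --num-ctx 1024 cap amounts to over the API, as a minimal sketch against Ollama's native generate endpoint. The model name here is a stand-in, not my exact Pi tag; the same limit can also live in a Modelfile as PARAMETER num_ctx 1024.

```python
# Cap the context window on Pi-class hardware so the model fits in RAM.
# num_ctx rides in the "options" field of Ollama's native generate API.
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "gemma-2:2b",          # stand-in; use your e2b variant here
        "prompt": "Summarize: local inference on a Pi 4B.",
        "stream": False,
        "options": {"num_ctx": 1024},   # 1K context keeps the box out of swap
    },
    timeout=300,
)
print(resp.json()["response"])
```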
Cost in INR is the easy part because the answer is zero; you pay for electricity. For my workload that is roughly Rs 4 to Rs 8 per hour of active inference on the M75q, which does not register on my monthly bill, and exactly zero marginal cost on the Pi, which I leave running anyway as a Telegram bot. Compare that to the Rs 8 lakh per year I would otherwise pay for a comparable Anthropic spend and the math defends itself.
Where Ollama bites is anything you would call production scale. Concurrency is the missing piece: you get one inference at a time per model, and warming a second model evicts the first. For my single-operator, single-machine empire that is fine, but if you are running an internal tool for a team of fifteen, you want vLLM or a managed API. The other place it bites is fine-tuned models: the Modelfile syntax is good for prompts and parameters, but it is not a substitute for actually training, and the LoRA merge story is rougher than llama.cpp's.
What it does not replace is API quality at the frontier. Gemma 2:9b is good for grunt work; never confuse it with Claude Sonnet for actual reasoning. I use Ollama where I would otherwise burn Cerebras free quota or eat OpenRouter cents, not where the work needs to be right.
Why it matters in 2026
In 2026 the cost ceiling on hosted LLMs went up faster than the local-model floor came down. For high-volume grunt work like extraction, classification, OCR captioning, the math now favors local inference for any operator processing more than about 50K tokens a day. Ollama is the tool that makes that switch happen without re-architecting your stack.
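A back-of-envelope sketch of that break-even. The hosted price and local throughput below are my assumptions for illustration, not figures from this page; plug in what your provider and your box actually do.

```python
# Rough break-even for local vs hosted grunt work.
# ASSUMPTIONS, not from the playbook: a Sonnet-class hosted price of
# Rs 500 per million tokens, and ~10 tokens/s sustained on CPU locally.
tokens_per_day = 50_000
hosted_rs_per_m = 500          # assumed hosted blended price per million tokens
local_rs_per_hour = 8          # the playbook's upper electricity figure
local_tokens_per_s = 10        # assumed CPU throughput, 7B-class at q4

hosted_daily = tokens_per_day / 1e6 * hosted_rs_per_m
local_hours = tokens_per_day / local_tokens_per_s / 3600
local_daily = local_hours * local_rs_per_hour

print(f"hosted Rs {hosted_daily:.0f}/day vs local Rs {local_daily:.0f}/day")
# -> hosted Rs 25/day vs local Rs 11/day at these assumptions
```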
Cost in INR
Free, open source. Compute cost on consumer hardware is electricity, roughly Rs 4 to Rs 8 per active inference hour on a 65W desktop.
Use when
- Bulk extraction or classification across many documents
- Privacy-required workflows where data cannot leave the box
- Local development with no internet, train commute coding
- Anything OpenAI-compatible where you can swap base_url
- Pi-class edge deployments with the e2b model variants
Skip when
- Multi-user concurrent serving; you want vLLM instead
- Frontier reasoning quality, where Sonnet or Opus still pull ahead
- Heavy fine-tuning; the LoRA merge story is rougher than llama.cpp's
- Models above your VRAM or RAM ceiling; swap is not a substitute
Alternatives I would consider
- llama.cpp, the Engine Under Most Local Inference. Free, open source. Compile time on an M75q is under two minutes, on a Pi 4B about ten minutes.
- LM Studio, the GUI On-Ramp for People Who Hate Terminals. Free for personal use. The commercial-use license is in flux for 2026; treat it as not licensed for production.
- vLLM, Serving Throughput That Defends a GPU Bill. Free, open source. Compute cost via RunPod is about Rs 42 per hour for an L4 (24GB VRAM) and Rs 250 to Rs 800 per hour for an A100 or H100.