AutoKaam Playbook
Ollama, the Local Model Runtime I Actually Trust
One binary, one model registry, zero cloud dependency. The default I reach for first.
Last reviewed:
The operator take
I run Ollama every day on my ThinkCentre M75q. It is the layer that lets the empire's gemma-vision MCP work, the layer that backs my screenshot OCR pipeline, and the fastest path I have found to "model in, JSON out" without paying anyone. The install is one curl command, the model registry resolves like Docker Hub for weights, and the OpenAI-compatible API surface means I can wire any tool that already speaks OpenAI to it without writing glue.
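The base_url swap in practice: a minimal sketch using the stock openai Python client against Ollama's OpenAI-compatible /v1 endpoint on the default port. The model tag and prompt are just placeholders from my own setup; use whatever you have pulled locally.

```python
# Point the stock OpenAI client at a local Ollama instance.
# Ollama exposes an OpenAI-compatible API at /v1 on its default port 11434;
# the api_key is required by the client but ignored by Ollama.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",  # placeholder, Ollama does not check it
)

response = client.chat.completions.create(
    model="qwen-2.5:7b",  # whatever `ollama pull` has fetched locally
    messages=[{"role": "user", "content": "Return {\"status\": \"ok\"} as JSON."}],
)
print(response.choices[0].message.content)
```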
What surprised me is how stable it has been. I have been on Ollama through three minor revs in the last two months and have not had to debug a single launch failure. The team gets the things that matter: model warm-up, GPU offload tiers, context caching across requests. Memory headroom on a 32GB box with no GPU is fine for 7B-class models at q4 quantization. I keep gemma-2:2b loaded for the vision pipeline because the latency is reliably under a second per page on my CPU, and qwen-2.5:7b for everything that needs a longer reply.
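Keeping a model resident is one request parameter away. A minimal sketch, assuming Ollama's documented keep_alive field on the native generate endpoint; a negative value keeps the model loaded indefinitely, and 0 unloads it.

```python
# Preload and pin a model so the first request of the day is not a cold start.
# Per Ollama's docs, a generate call with no prompt just loads the model,
# and keep_alive=-1 keeps it resident until explicitly unloaded.
import requests

requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "gemma-2:2b", "keep_alive": -1},
    timeout=120,
)
```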
The trade-off I hit early was the model registry. The first time I pulled gemma-2:9b over my Jio fibre at 50Mbps, it stalled at 47% for ten minutes; I switched mirrors via the OLLAMA_HOST_URL trick and it sailed through the second time. If you see a stall on an Indian connection, it is almost always the mirror, not your box. The other rough edge is Pi 4B-class hardware, where I have to be honest with myself: anything above 2B with a 4K context starts swapping, and the box becomes unresponsive within 90 seconds. So on the Pi my rule is: only the e2b variants, never the dense 7B, and only with --num-ctx 1024 unless I want to babysit it. The sketch below shows what that cap looks like over the API.
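What that --num-ctx 1024 cap amounts to over the API, as a minimal sketch against Ollama's native generate endpoint. The model name here is a stand-in, not my exact Pi tag; the same limit can also live in a Modelfile as PARAMETER num_ctx 1024.

```python
# Cap the context window on Pi-class hardware so the model fits in RAM.
# num_ctx rides in the "options" field of Ollama's native generate API.
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "gemma-2:2b",          # stand-in; use your e2b variant here
        "prompt": "Summarize: local inference on a Pi 4B.",
        "stream": False,
        "options": {"num_ctx": 1024},   # 1K context keeps the box out of swap
    },
    timeout=300,
)
print(resp.json()["response"])
```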
Cost in INR is the easy part because the answer is zero; you pay for electricity. For my workload that is roughly Rs 4 to Rs 8 per hour of active inference on the M75q, which does not register on my monthly bill, and exactly zero marginal cost on the Pi, which I leave running anyway as a Telegram bot. Compare that to the Rs 8 lakh per year I would otherwise pay for a comparable Anthropic spend and the math defends itself.
Where Ollama bites is anything you would call production scale. Concurrency is the missing piece: you get one inference at a time per model, and warming a second model evicts the first. For my single-operator, single-machine empire that is fine, but if you are running an internal tool for a team of fifteen, you want vLLM or a managed API. The other place it bites is fine-tuned models: the Modelfile syntax is good for prompts and parameters, but it is not a substitute for actually training, and the LoRA merge story is rougher than llama.cpp's.
What it does not replace is API quality at the frontier. Gemma 2:9b is good for grunt work; never confuse it with Claude Sonnet for actual reasoning. I use Ollama where I would otherwise burn Cerebras free quota or eat OpenRouter cents, not where the work needs to be right.
Why it matters in 2026
In 2026 the cost ceiling on hosted LLMs went up faster than the local-model floor came down. For high-volume grunt work like extraction, classification, OCR captioning, the math now favors local inference for any operator processing more than about 50K tokens a day. Ollama is the tool that makes that switch happen without re-architecting your stack.
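A back-of-envelope sketch of that break-even. The hosted price and local throughput below are my assumptions for illustration, not figures from this page; plug in what your provider and your box actually do.

```python
# Rough break-even for local vs hosted grunt work.
# ASSUMPTIONS, not from the playbook: a Sonnet-class hosted price of
# Rs 500 per million tokens, and ~10 tokens/s sustained on CPU locally.
tokens_per_day = 50_000
hosted_rs_per_m = 500          # assumed hosted blended price per million tokens
local_rs_per_hour = 8          # the playbook's upper electricity figure
local_tokens_per_s = 10        # assumed CPU throughput, 7B-class at q4

hosted_daily = tokens_per_day / 1e6 * hosted_rs_per_m
local_hours = tokens_per_day / local_tokens_per_s / 3600
local_daily = local_hours * local_rs_per_hour

print(f"hosted Rs {hosted_daily:.0f}/day vs local Rs {local_daily:.0f}/day")
# -> hosted Rs 25/day vs local Rs 11/day at these assumptions
```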
Cost in INR
Free, open source. Compute cost on consumer hardware is electricity, roughly Rs 4 to Rs 8 per active inference hour on a 65W desktop.
Use when
- Bulk extraction or classification across many documents
- Privacy-required workflows where data cannot leave the box
- Local development with no internet, train commute coding
- Anything OpenAI-compatible where you can swap base_url
- Pi-class edge deployments with the e2b model variants
Skip when
- Multi-user concurrent serving; you want vLLM instead
- Frontier reasoning quality, where Sonnet or Opus still pull ahead
- Heavy fine-tuning; the LoRA merge story is rougher than llama.cpp's
- Models above your VRAM or RAM ceiling; swap is not a substitute
Alternatives I would consider
- llama.cpp, the Engine Under Most Local Inference. Free, open source. Compile time on an M75q is under two minutes, on a Pi 4B about ten minutes.
- LM Studio, the GUI On-Ramp for People Who Hate Terminals. Free for personal use. The commercial-use license is in flux for 2026; treat it as not licensed for production.
- vLLM, Serving Throughput That Defends a GPU Bill. Free, open source. Compute cost via RunPod is about Rs 42 per hour for an L4 (24GB VRAM) and Rs 250 to Rs 800 per hour for an A100 or H100.