AutoKaam Playbook

llama.cpp, the Engine Under Most Local Inference

Compile once, run anything. Where I go when Ollama does not expose the knob I need.

Last reviewed:

The operator take

llama.cpp is what powers a surprising chunk of the local-LLM ecosystem. Ollama wraps it, LM Studio wraps it, half the desktop apps ship a copy. So when I want full control over quantization choice, custom sampling, KV-cache tweaks, or offloading layers across CPU plus a tiny GPU, I drop down to llama.cpp directly.

The build experience is honest. On my M75q with no GPU, a fresh clone plus make takes about ninety seconds and gives me a single self-contained binary. On the Pi 4B it is slower but worth it: the ARM NEON path gives me about 4 to 5 tokens per second for a 7B Q4 model, which is usable for a chat agent if you do not mind reading at speaking pace. The first time I tried to build on the Pi I was on a 32-bit OS by accident; the compile succeeded, but performance was a third of what I expected. Switching to the 64-bit Pi OS brought the same model back up to the honest 4 to 5 tokens per second. Lesson for me: always check uname -m on the Pi before building.
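Roughly what that ritual looks like, as a sketch. Recent checkouts have been moving from the bare make path to cmake, so adjust to whatever your clone expects.

    uname -m                      # want aarch64 on the Pi, not armv7l
    git clone https://github.com/ggerganov/llama.cpp
    cd llama.cpp
    make -j4                      # older build path, what I describe above
    # newer checkouts: cmake -B build && cmake --build build -j4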

Quantization choices are where llama.cpp earns its place over Ollama. I have spent enough time on Q3_K_M vs Q4_K_M vs Q5_K_M tradeoffs to know that for my grunt-extraction workloads Q4_K_M is the right default: the size and speed gain over Q5 is real, and the quality loss is invisible for structured output. For anything where I want better long-context memory, Q5_K_S with a 16K context fits in my 32GB just fine. The Q8 variants I rarely touch; the size cost outweighs the marginal quality gain on a CPU-bound box.
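A sketch of how I pin the quant myself rather than taking whatever a wrapper downloads. The model filenames here are placeholders, and the quantize binary has been renamed across versions (quantize, then llama-quantize), so match it to your build.

    # Requantize a full-precision GGUF down to the variant I actually want.
    ./llama-quantize mistral-7b-f16.gguf mistral-7b-Q4_K_M.gguf Q4_K_M

    # Long-context run: Q5_K_S with a 16K window, threads pinned to the physical cores.
    ./llama-cli -m mistral-7b-Q5_K_S.gguf -c 16384 -t 8 -p "..."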

The CLI is rough by modern standards: the flag matrix is large, the docs trail the code, and the OpenAI-compatible server is a separate executable. None of that has slowed me down in years. The compile-from-source ritual is a feature, not a bug: you get an exact, reproducible binary that does what you asked, with no JS layer in between.
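For completeness, here is the shape of that separate server binary and the endpoint it exposes. The flags are from builds I have used; check --help against yours, and the model filename is a placeholder.

    # The OpenAI-compatible server is its own executable, not a flag on llama-cli.
    ./llama-server -m mistral-7b-Q4_K_M.gguf -c 4096 --host 127.0.0.1 --port 8080

    # Anything that speaks the OpenAI chat API can then point at it:
    curl http://127.0.0.1:8080/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{"messages":[{"role":"user","content":"Extract the invoice total."}]}'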

Where llama.cpp does not fit is when you want a polished UI, a model browser, or first-class macOS support; use Ollama or LM Studio there. It also does not replace vLLM for serving: the throughput per dollar on a real GPU is not in the same league. But for the operator who wants to know exactly which sampling temperature was used and which quant variant is loaded, this is the canonical tool.
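What that control looks like in practice, as a sketch with a placeholder model file: every sampling knob is an explicit flag on the command line, and the quant variant is right there in the filename you pass.

    ./llama-cli -m mistral-7b-Q4_K_M.gguf \
      --temp 0.2 --top-k 40 --top-p 0.9 --seed 42 \
      -n 256 \
      -p "Summarise the complaint in one line:"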

For Indian operators specifically, the case for compiling llama.cpp yourself is the lower-spec hardware path. A Pi 5 with 8GB RAM, llama.cpp, and Phi-3-mini at Q4 is a real edge LLM: it is built from about Rs 6,500 of hardware I already had, and it runs a Telegram bot for friends from my balcony. Try doing that with a managed API and you are paying every month, forever.
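Roughly how that balcony box runs, assuming a Phi-3-mini Q4 GGUF already on disk (the filename below is a placeholder). The Telegram bot itself is a small separate process that just POSTs to the local endpoint.

    # On the Pi 5: four threads, modest context, serve on the LAN.
    ./llama-server -m phi-3-mini-instruct-Q4_K_M.gguf \
      -c 2048 -t 4 --host 0.0.0.0 --port 8080
    # The bot talks to http://<pi-address>:8080/v1/chat/completions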

Why it matters in 2026

Most local-LLM tools are wrappers around it. When you need to debug below the wrapper or pick exact quant variants, this is the level you operate at. It is also the only realistic path to running a useful model on Pi-class hardware in 2026.

Cost in INR

Free and open source. Compile-time cost on an M75q is under two minutes; on a Pi 4B, about ten minutes.

Use when

  • Pi or other ARM edge hardware where you need every cycle
  • Custom quantization choices that Ollama does not expose
  • Reproducible inference runs where you want exact binary control
  • Single-machine grunt-extraction at the absolute lowest cost

Skip when

  • Polished desktop chat use, LM Studio is the better path
  • Multi-user serving, vLLM beats it on throughput per dollar
  • Quick prototyping, Ollama is faster to set up

Alternatives I would consider