AutoKaam Playbook

llama.cpp, the Engine Under Most Local Inference

Compile once, run anything. Where I go when Ollama does not expose the knob I need.

Last reviewed:

The operator take

llama.cpp is what powers a surprising chunk of the local-LLM ecosystem. Ollama wraps it, LM Studio wraps it, half the desktop apps ship a copy. So when I want full control over quantization choice, custom sampling, KV-cache tweaks, or offloading layers across CPU plus a tiny GPU, I drop down to llama.cpp directly.

The build experience is honest. On my M75q with no GPU, a fresh clone plus make takes about ninety seconds and gives me a single self-contained binary. On the Pi 4B it is slower but worth it: the ARM NEON path gives me about 4 to 5 tokens per second for a 7B Q4 model, which is usable for a chat agent if you do not mind reading at speaking pace. The first time I tried to build on the Pi I was on a 32-bit OS by accident; the compile succeeded, but performance was a third of what I expected. Switching to the 64-bit Pi OS brought the same model back up to the honest 4 to 5 tokens per second. Lesson for me: always check uname -m on the Pi before building.
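Roughly what that ritual looks like, as a sketch. Recent checkouts have been moving from the bare make path to cmake, so adjust to whatever your clone expects.

    uname -m                      # want aarch64 on the Pi, not armv7l
    git clone https://github.com/ggerganov/llama.cpp
    cd llama.cpp
    make -j4                      # older build path, what I describe above
    # newer checkouts: cmake -B build && cmake --build build -j4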

Quantization choices are where llama.cpp earns its place over Ollama. I have spent enough time on Q3_K_M vs Q4_K_M vs Q5_K_M tradeoffs to know that for my grunt-extraction workloads Q4_K_M is the right default: the size and speed gain over Q5 is real, and the quality loss is invisible for structured output. For anything where I want better long-context memory, Q5_K_S with a 16K context fits in my 32GB just fine. The Q8 variants I rarely touch; the size cost outweighs the marginal quality gain on a CPU-bound box.
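A sketch of how I pin the quant myself rather than taking whatever a wrapper downloads. The model filenames here are placeholders, and the quantize binary has been renamed across versions (quantize, then llama-quantize), so match it to your build.

    # Requantize a full-precision GGUF down to the variant I actually want.
    ./llama-quantize mistral-7b-f16.gguf mistral-7b-Q4_K_M.gguf Q4_K_M

    # Long-context run: Q5_K_S with a 16K window, threads pinned to the physical cores.
    ./llama-cli -m mistral-7b-Q5_K_S.gguf -c 16384 -t 8 -p "..."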

The CLI is rough by modern standards: the flag matrix is large, the docs trail the code, and the OpenAI-compatible server is a separate executable. None of that has slowed me down in years. The compile-from-source ritual is a feature, not a bug: you get an exact, reproducible binary that does what you asked, with no JS layer in between.
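For completeness, here is the shape of that separate server binary and the endpoint it exposes. The flags are from builds I have used; check --help against yours, and the model filename is a placeholder.

    # The OpenAI-compatible server is its own executable, not a flag on llama-cli.
    ./llama-server -m mistral-7b-Q4_K_M.gguf -c 4096 --host 127.0.0.1 --port 8080

    # Anything that speaks the OpenAI chat API can then point at it:
    curl http://127.0.0.1:8080/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{"messages":[{"role":"user","content":"Extract the invoice total."}]}'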

Where llama.cpp does not fit is when you want a polished UI, a model browser, or first-class macOS support; use Ollama or LM Studio there. It also does not replace vLLM for serving: the throughput per dollar on a real GPU is not in the same league. But for the operator who wants to know exactly which sampling temperature was used and which quant variant is loaded, this is the canonical tool.
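What that control looks like in practice, as a sketch with a placeholder model file: every sampling knob is an explicit flag on the command line, and the quant variant is right there in the filename you pass.

    ./llama-cli -m mistral-7b-Q4_K_M.gguf \
      --temp 0.2 --top-k 40 --top-p 0.9 --seed 42 \
      -n 256 \
      -p "Summarise the complaint in one line:"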

For Indian operators specifically, the case for compiling llama.cpp yourself is the lower-spec hardware path. A Pi 5 with 8GB RAM, llama.cpp, and Phi-3-mini at Q4 is a real edge LLM: it is built from about Rs 6,500 of hardware I already had, and it runs a Telegram bot for friends from my balcony. Try doing that with a managed API and you are paying every month, forever.
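Roughly how that balcony box runs, assuming a Phi-3-mini Q4 GGUF already on disk (the filename below is a placeholder). The Telegram bot itself is a small separate process that just POSTs to the local endpoint.

    # On the Pi 5: four threads, modest context, serve on the LAN.
    ./llama-server -m phi-3-mini-instruct-Q4_K_M.gguf \
      -c 2048 -t 4 --host 0.0.0.0 --port 8080
    # The bot talks to http://<pi-address>:8080/v1/chat/completions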

Why it matters in 2026

Most local-LLM tools are wrappers around it. When you need to debug below the wrapper or pick exact quant variants, this is the level you operate at. It is also the only realistic path to running a useful model on Pi-class hardware in 2026.

Cost in INR

Free and open source. Compile-time cost on an M75q is under two minutes; on a Pi 4B, about ten minutes.

Use when

  • Pi or other ARM edge hardware where you need every cycle
  • Custom quantization choices that Ollama does not expose
  • Reproducible inference runs where you want exact binary control
  • Single-machine grunt-extraction at the absolute lowest cost

Skip when

  • Polished desktop chat use, LM Studio is the better path
  • Multi-user serving, vLLM beats it on throughput per dollar
  • Quick prototyping, Ollama is faster to set up

Alternatives I would consider