AutoKaam Playbook
llama.cpp, the Engine Under Most Local Inference
Compile once, run anything. Where I go when Ollama does not expose the knob I need.
Last reviewed:
The operator take
llama.cpp is what powers a surprising chunk of the local-LLM ecosystem. Ollama wraps it, LM Studio wraps it, half the desktop apps ship a copy. So when I want full control over quantization choice, custom sampling, KV-cache tweaks, or offloading layers across CPU plus a tiny GPU, I drop down to llama.cpp directly.
The build experience is honest. On my M75q with no GPU, a fresh clone plus make takes about ninety seconds and produces a single self-contained binary. On the Pi 4B the build is slower but worth it: the ARM NEON path gives me about 4 to 5 tokens per second for a 7B q4 model, which is usable for a chat agent if you do not mind reading at speaking pace. The first time I tried to build on the Pi I was on a 32-bit OS by accident; the compile succeeded, but performance was a third of what I expected. Switching to the 64-bit Pi OS brought the same model back to an honest 4 to 5 tokens per second. Lesson for me: always check uname -m on the Pi first.
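That uname -m check is cheap to script before you waste a ten-minute compile. A minimal sketch, assuming a Debian-flavoured Pi OS; the build commands in the comments follow the upstream llama.cpp README, nothing here is specific to my boxes:

```shell
# A 32-bit userland (armv7l) builds fine but loses most of the NEON speedup.
arch="$(uname -m)"
case "$arch" in
  aarch64|arm64) echo "64-bit ARM ($arch): full NEON path available" ;;
  armv7l|armv6l) echo "32-bit ARM ($arch): reinstall 64-bit Pi OS before building" ;;
  *)             echo "non-ARM host ($arch): NEON check not applicable" ;;
esac
# The build itself, per the llama.cpp README:
#   git clone https://github.com/ggerganov/llama.cpp
#   cd llama.cpp && cmake -B build && cmake --build build --config Release -j
```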
Quantization choice is the part where llama.cpp earns its place over Ollama. I have spent enough time on the Q3_K_M vs Q4_K_M vs Q5_K_M tradeoff to know that for my grunt-extraction workloads Q4_K_M is the right default: the size and speed gain over Q5 is real, and the quality loss is invisible for structured output. For anything where I want better quality at long context, Q5_K_S with a 16K context fits in my 32GB just fine. The Q8 variants I rarely touch; on a CPU-bound box the size cost outweighs the marginal quality gain.
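The size side of that tradeoff is easy to estimate on the back of an envelope. A quick sketch; the bits-per-weight figures are approximate community numbers for each quant type, not exact, and real GGUF files vary a little with metadata and tensor layout:

```shell
# Rough GGUF file sizes for a 7B-parameter model at the quants discussed.
# size_bytes ≈ params * bits_per_weight / 8
for pair in "Q4_K_M 4.85" "Q5_K_S 5.55" "Q8_0 8.50"; do
  set -- $pair
  awk -v q="$1" -v bpw="$2" \
    'BEGIN { printf "%-7s ~%.1f GB\n", q, 7e9 * bpw / 8 / 1e9 }'
done
```

On my numbers that puts Q4_K_M a couple of gigabytes under Q8_0 for the same 7B model, which is exactly the gap that matters on a RAM-bound CPU box.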
The CLI is rough by modern standards: the flag matrix is large, the docs trail the code, and the OpenAI-compatible server is a separate executable. None of that has slowed me down in years. The compile-from-source ritual is also a feature, not a bug; you get an exact, reproducible binary that does what you asked, with no JS layer in between.
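For orientation, a sketch of the two entry points I mean. The flag names are real llama.cpp spellings; the model path is a hypothetical placeholder:

```shell
# Hypothetical model path; swap in whatever GGUF you quantized.
MODEL="./models/7b-q4_k_m.gguf"

# One-shot generation with the sampling knobs pinned explicitly:
#   llama-cli -m "$MODEL" -c 16384 --temp 0.2 --top-p 0.9 -p "Extract the fields: ..."

# The OpenAI-compatible server is its own binary, not a mode of llama-cli:
SERVER_CMD="llama-server -m $MODEL -c 16384 --port 8080"
echo "$SERVER_CMD"
```

The point of pinning --temp and --top-p on the command line is that the run is reproducible from the shell history alone, with no wrapper defaulting anything behind your back.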
Where llama.cpp does not fit is when you want a polished UI, a model browser, or first-class macOS support; use Ollama or LM Studio there. It also does not replace vLLM for serving, where throughput per dollar on a real GPU is not in the same league. But for the operator who wants to know exactly which sampling temperature was used and which quant variant is loaded, this is the canonical tool.
For Indian operators specifically, the case for compiling llama.cpp yourself is the lower-spec hardware path. A Pi 5 with 8GB RAM, llama.cpp, and Phi-3-mini at Q4 is a real edge LLM: about Rs 6,500 of hardware I already had, running a Telegram bot for friends from my balcony. Try doing that with a managed API and you are paying every month forever.
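Why that fits in 8GB is simple arithmetic. A rough sketch, assuming Phi-3-mini's roughly 3.8B parameters, an approximate 4.85 bits per weight for Q4_K_M, and ballpark allowances for KV cache and OS headroom (the last two are my assumptions, not measured values):

```shell
# Does Phi-3-mini at Q4 fit a Pi 5 with 8 GB? Back-of-envelope check.
awk 'BEGIN {
  weights = 3.8e9 * 4.85 / 8 / 1e9   # ~2.3 GB of quantized weights
  total   = weights + 0.5 + 1.0      # + assumed KV cache (4K ctx) + OS headroom
  printf "est. footprint: %.1f GB of 8 GB\n", total
}'
```

Even with generous padding the estimate lands well under half the RAM, which is why this class of deployment is comfortable rather than borderline.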
Why it matters in 2026
Most local-LLM tools are wrappers around it. When you need to debug below the wrapper or pick exact quant variants, this is the level you operate at. It is also the only realistic path to running a useful model on Pi-class hardware in 2026.
Cost in INR
Free, open source. Compile-time cost is under two minutes on an M75q and about ten minutes on a Pi 4B.
Use when
- Pi or other ARM edge hardware where you need every cycle
- Custom quantization choices that Ollama does not expose
- Reproducible inference runs where you want exact binary control
- Single-machine grunt-extraction at the absolute lowest cost
Skip when
- Polished desktop chat use: LM Studio is the better path
- Multi-user serving: vLLM beats it on throughput per dollar
- Quick prototyping: Ollama is faster to set up
Alternatives I would consider
Read next
Adjacent in the playbook
- Ollama, the Local Model Runtime I Actually Trust: Free, open source. Compute cost on consumer hardware is electricity, roughly Rs 4 to Rs 8 per active inference hour on a 65W desktop.
- LM Studio, the GUI On-Ramp for People Who Hate Terminals: Free for personal use. Commercial use license is in flux for 2026, treat as not licensed for production.
- vLLM, Serving Throughput That Defends a GPU Bill: Free, open source. Compute cost via RunPod is about Rs 42 per hour for L4 (24GB VRAM) and Rs 250 to Rs 800 per hour for A100 or H100.