AutoKaam Playbook

vLLM: Serving Throughput That Defends a GPU Bill

Production inference for teams. I only run it on RunPod, never local, and the math works.

Last reviewed:

The operator take

vLLM is the only self-hosted inference engine where I have actually justified the GPU cost on a per-call basis. I do not own a GPU; my M75q has integrated graphics that are useless for inference, so vLLM lives on rented compute for me. The pattern I have settled on is RunPod L4 instances at about USD 0.50 per hour, roughly Rs 42 per hour at the rate I last fetched, and I spin them up only when batch work demands it.

The reason vLLM beats Ollama for serving is PagedAttention, which manages the KV cache in fixed-size pages so the server can keep many requests in flight without fragmenting VRAM. On a fresh L4 with Llama-3-8B I see roughly 2,000 tokens per second of aggregate throughput with 50 concurrent requests, somewhere around 50x what I would get out of Ollama on the same hardware. For batch extraction across thousands of documents, that throughput delta is the difference between an hour of GPU billing and a day of it.
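
A minimal sketch of what that concurrent load looks like from the client side, assuming the vLLM OpenAI-compatible server is already up. The base URL, model ID, and prompt are placeholders, not my production values; the server's continuous batching handles the 50 in-flight requests.

    # Fire N concurrent requests at a vLLM OpenAI-compatible server.
    # BASE_URL and MODEL are assumptions; point them at your own pod.
    import asyncio
    from openai import AsyncOpenAI

    BASE_URL = "http://localhost:8000/v1"                  # assumption: server address
    MODEL = "meta-llama/Meta-Llama-3-8B-Instruct"          # assumption: served model name

    client = AsyncOpenAI(base_url=BASE_URL, api_key="EMPTY")  # vLLM ignores the key

    async def one_call(doc: str) -> str:
        resp = await client.chat.completions.create(
            model=MODEL,
            messages=[{"role": "user", "content": f"Extract the invoice total:\n{doc}"}],
            max_tokens=128,
            temperature=0.0,
        )
        return resp.choices[0].message.content

    async def run_batch(docs: list[str]) -> list[str]:
        # Keep 50 requests in flight; the server batches them via PagedAttention.
        sem = asyncio.Semaphore(50)
        async def guarded(d: str) -> str:
            async with sem:
                return await one_call(d)
        return await asyncio.gather(*(guarded(d) for d in docs))

    # results = asyncio.run(run_batch(documents))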

What I learned the hard way is the start-up cost. vLLM warm-up plus model load on an L4 takes about three minutes from cold. If I forget that and submit a job with a five-second timeout, I get nothing. So my pattern now is: kick off the spin-up, wait for the health endpoint to return 200, then submit the batch. Also: I once forgot to stop a RunPod instance after a 9PM batch finished and woke up at 6AM with eight hours of L4 billed. The bleed was about Rs 320, which is small money, but the precedent was bad. Now I always set a stop-after-N-idle-minutes hook and a separate maximum-runtime cap.
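
A sketch of the wait-for-200 step, assuming the OpenAI-compatible server's /health route on the default port; the URL, timeouts, and the submit/stop helpers are placeholders you would swap for your own.

    # Block until the vLLM server's health endpoint returns 200, then submit.
    import time
    import requests

    HEALTH_URL = "http://localhost:8000/health"   # assumption: default server address

    def wait_until_ready(timeout_s: int = 600, poll_s: int = 10) -> bool:
        """Poll /health until it returns 200 or we give up (cold start ~3 min on an L4)."""
        deadline = time.time() + timeout_s
        while time.time() < deadline:
            try:
                if requests.get(HEALTH_URL, timeout=5).status_code == 200:
                    return True
            except requests.RequestException:
                pass  # server not listening yet; keep waiting
            time.sleep(poll_s)
        return False

    # if wait_until_ready():
    #     submit_batch()         # hypothetical: your batch submission entry point
    # else:
    #     stop_pod_and_alert()   # hypothetical: do not leave the GPU billing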

Quantization on vLLM is a different story from llama.cpp. AWQ-Int4 is the production sweet spot for 8B-class models: I can fit Llama-3-8B-AWQ in a single L4 with 24GB VRAM and still get the throughput numbers above. For 70B-class models the L4 is not enough; you want an A100 or H100, which puts you in the Rs 250 to Rs 800 per hour range, and at that point you should question whether self-hosting is right at all.
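
For the offline batch path, a minimal sketch of loading an AWQ checkpoint on a single 24GB card with vLLM's Python API. The model repo, context length, and memory fraction are assumptions, not a recommendation; adjust for your own checkpoint.

    # Load an AWQ-quantized 8B model with vLLM's offline API on one L4.
    from vllm import LLM, SamplingParams

    llm = LLM(
        model="casperhansen/llama-3-8b-instruct-awq",  # assumption: an AWQ checkpoint
        quantization="awq",
        max_model_len=4096,            # shorter context leaves more VRAM for KV cache
        gpu_memory_utilization=0.90,   # leave a little headroom on the 24GB card
    )

    params = SamplingParams(temperature=0.0, max_tokens=256)
    outputs = llm.generate(["Extract the invoice total:\n<doc text>"], params)
    print(outputs[0].outputs[0].text)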

The case where vLLM beats hosted APIs cleanly is high-volume bulk inference with strict data residency. For empire workloads I evaluated this against the kaam-tracker invoice classification pipeline, roughly 50,000 docs per day. Anthropic Sonnet would cost me roughly USD 75 per day; RunPod L4 with vLLM running Mistral-7B-Instruct would cost about USD 12 plus my own time. I went with Sonnet anyway because Mistral 7B was not accurate enough on my domain, but the math is real, and for the right workload vLLM wins.
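
The arithmetic is simple enough to keep in a script. Every number below is a placeholder assumption, not my pipeline's actual figures; plug in your own volume, token counts, and rates before trusting the verdict.

    # Back-of-envelope break-even check: self-hosted GPU vs hosted API.
    DOCS_PER_DAY = 50_000
    TOKENS_PER_DOC = 1_200          # assumption: prompt + completion per document
    THROUGHPUT_TOK_S = 2_000        # assumption: measured on the target GPU
    GPU_USD_PER_HOUR = 0.50         # assumption: RunPod L4 on-demand rate
    WARMUP_HOURS = 0.05             # ~3 minutes of cold start
    HOSTED_USD_PER_DAY = 75.0       # assumption: hosted-API quote for the same volume

    gpu_hours = DOCS_PER_DAY * TOKENS_PER_DOC / THROUGHPUT_TOK_S / 3600 + WARMUP_HOURS
    self_hosted_usd = gpu_hours * GPU_USD_PER_HOUR

    print(f"GPU hours per day:   {gpu_hours:.1f}")
    print(f"Self-hosted USD/day: {self_hosted_usd:.2f}")
    print(f"Hosted USD/day:      {HOSTED_USD_PER_DAY:.2f}")
    print("Self-hosting wins" if self_hosted_usd < HOSTED_USD_PER_DAY else "Hosted wins")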

Where it does not fit: one-off prototypes, anything where you cannot guarantee batch volume, or any case where the model needs to be Sonnet-quality. vLLM makes seven-billion-parameter models cheap; it does not make them smart. For Indian operators specifically, I would suggest treating vLLM as the "scale up after the prompt is right" tool, not the starting point.

Why it matters in 2026

Self-hosted inference for teams went from speculative to defensible in 2024. By 2026, AWQ quantization plus L4-class GPUs put 8B-model serving inside Rs 50 per hour at roughly 50x Ollama's throughput under concurrent load. For high-volume bulk work, the math now beats hosted APIs in a meaningful chunk of cases.

Cost in INR

Free, open source. Compute cost via RunPod is about Rs 42 per hour for L4 (24GB VRAM) and Rs 250 to Rs 800 per hour for A100 or H100.

Use when

  • Batch inference with thousands of documents per run
  • Internal tool for a team of 5 to 50 users
  • Workloads with strict data-residency or privacy needs
  • Open-weight 7B to 13B models where quality matches the task

Skip when

  • One-off prototypes, the warm-up cost destroys the math
  • Frontier-quality reasoning, hosted Anthropic or OpenAI is right
  • No-GPU machines, this is not a CPU engine
  • Single-user occasional use, Ollama is right

Alternatives I would consider