AutoKaam Playbook
DeepSeek Local, the Pricing Disruptor I Mostly Run Hosted
V3 weights are open. I downloaded them, learned the lesson, went back to the API.
Last reviewed:
The operator take
DeepSeek's local story is honest: the weights are open, and the team has been consistent about releasing them. I have downloaded both V3 base and the V3.2 distill, but after spending an evening trying to run the larger variants on my M75q I went back to the API, and I have stayed there for three months.
The reason is hardware. DeepSeek-V3 in any usable quant is far too big for a 32GB CPU box; you need GPU memory, or you are swapping within two minutes. I tried V3.2-Distill, which is sized to fit on consumer hardware, ran it at Q4 on my M75q, and the throughput was honest, but the quality fell off a cliff versus the hosted V3.2 model. My conclusion: DeepSeek wins on pricing precisely because they run V3 at scale on their own hardware, and replicating that locally is not cost-justified for most workloads.
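The "too big for a 32GB box" claim is easy to sanity-check with back-of-envelope math. A minimal sketch, assuming V3 is roughly 671B total parameters (MoE, so all experts must be resident even though only a fraction are active per token) and typical llama.cpp quant densities; the overhead factor for KV cache and runtime buffers is my assumption:

```python
# Back-of-envelope: does a quantized model fit in RAM/VRAM?
# Bits-per-weight figures are approximate llama.cpp quant averages.

def quant_footprint_gb(params_b: float, bits_per_weight: float,
                       overhead: float = 1.2) -> float:
    """Rough resident size in GB: weights plus ~20% for KV cache
    and runtime buffers (the overhead factor is an assumption)."""
    bytes_total = params_b * 1e9 * bits_per_weight / 8
    return bytes_total * overhead / 1e9

# DeepSeek-V3 at ~671B total params, Q4-class quant (~4.5 bits/weight):
v3_q4 = quant_footprint_gb(671, 4.5)
print(f"V3 at Q4: ~{v3_q4:.0f} GB")      # hundreds of GB, not 32

# An 8B-class model is a different story.
small_q4 = quant_footprint_gb(8, 4.5)
print(f"8B at Q4: ~{small_q4:.1f} GB")   # fits comfortably
```

That is why the 8B-class local story works and the V3-class one does not: the gap is more than an order of magnitude, not something a better quant closes.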
What I learned the hard way: do not assume "open weights" means "useful at home". For Llama-3-8B, Qwen-2.5-7B, or Gemma-2-9B, yes, the local story works on consumer hardware. For DeepSeek-V3-class or Llama-3-70B-class models, no, you need real GPU money, and at that point the API is cheaper. I tried to run V3.2-Distill on my Pi 4B on principle; the model never even loaded, and swap died inside a minute.
The hosted DeepSeek API is where I do spend: the empire uses it for cost-sensitive workflows where I want better quality than the Cerebras free tier but lower cost than Anthropic. V3.2 at USD 0.14 input and USD 0.28 output per 1M tokens is the cheapest credible-quality API I know of in 2026. For high-volume customer-facing chat I have benchmarked it against Sonnet; the quality gap on my domain is real but not fatal, and the cost gap is roughly 20x.
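The cost-gap math is worth showing concretely. The DeepSeek V3.2 rates below are the ones quoted above; the Sonnet rates are my assumption of list pricing and should be re-checked before budgeting. Note the multiple depends on your input/output mix, since the output-rate gap is wider than the input-rate gap:

```python
# Worked cost math behind the "cost gap" claim.
# Sonnet rates ($3 in / $15 out per 1M tokens) are an assumption.

def monthly_cost_usd(in_tok_m: float, out_tok_m: float,
                     in_rate: float, out_rate: float) -> float:
    """Cost in USD for a month of traffic; rates are per 1M tokens."""
    return in_tok_m * in_rate + out_tok_m * out_rate

# Example workload: 100M input tokens, 20M output tokens per month.
deepseek = monthly_cost_usd(100, 20, 0.14, 0.28)   # V3.2 rates
sonnet   = monthly_cost_usd(100, 20, 3.00, 15.00)  # assumed rates

print(f"DeepSeek V3.2: ${deepseek:.2f}")  # $19.60
print(f"Sonnet:        ${sonnet:.2f}")    # $600.00
print(f"Ratio: {sonnet / deepseek:.0f}x")
```

On a pure input-rate basis the gap is about 21x (3.00 / 0.14); output-heavy traffic pushes it higher, which is why the ratio printed above exceeds 20x.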
The risk with DeepSeek as a vendor is off-peak rate volatility. I have seen the quoted price shift between versions and by time of day. For the empire I treat DeepSeek as a tier-2 vendor: real spend goes through it, but I keep Anthropic plus OpenRouter as fallbacks for any flow where DeepSeek goes down or the price changes overnight.
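The tier-2-with-fallbacks pattern above can be sketched in a few lines. This is a minimal illustration, not any real SDK: the provider callables and their signatures are hypothetical placeholders you would replace with your actual clients.

```python
# Minimal fallback-routing sketch: try the cheap tier-2 vendor first,
# fall through to backups on any failure. Provider callables below are
# hypothetical stand-ins, not real SDK calls.

from typing import Callable

Provider = Callable[[str], str]  # prompt -> completion (placeholder type)

def with_fallback(providers: list[tuple[str, Provider]], prompt: str) -> str:
    """Call providers in order; return the first successful completion."""
    errors = []
    for name, call in providers:
        try:
            return call(prompt)
        except Exception as exc:  # real code should catch specific errors
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))

# Usage sketch with stand-in callables:
def deepseek(prompt):   raise TimeoutError("upstream timeout")
def openrouter(prompt): return "fallback answer"

print(with_fallback([("deepseek", deepseek),
                     ("openrouter", openrouter)], "hello"))
# prints "fallback answer"
```

In production you would also want per-provider timeouts and a cost log, so an overnight price change shows up in your numbers rather than your invoice.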
Local-DeepSeek does have one defensible use case: offline reasoning quality on a laptop with a real GPU. On a 24GB or 48GB consumer GPU running V3.2-Distill at Q5, you get real reasoning quality offline, and that is the only setup where I would prefer local-DeepSeek to Ollama-with-Qwen. I do not own that hardware, so I cannot speak from operator experience there.
For Indian operators specifically, my recommendation is to forget local-DeepSeek and use the hosted API. The cost is so low that the only argument for local is pure privacy or air-gapping, and at that point Qwen or Gemma on Ollama covers 80 percent of the use case at zero cost.
Why it matters in 2026
DeepSeek pioneered the pricing disruption that pulled API costs down across 2024 to 2026. The local story confirms that frontier-class open weights exist, even if running them at home rarely beats the API math.
Cost in INR
Free open weights. Compute cost on consumer hardware is unfavorable above the 8B class. The hosted API is roughly Rs 12 per 1M input tokens and Rs 23 per 1M output for V3.2.
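The INR figures above are the USD list prices converted at an exchange rate I am assuming to be about Rs 83 per USD; re-check the rate before budgeting, since it moves:

```python
# INR conversion behind the figures above.
# USD_INR = 83 is an assumption, not a quoted rate.
USD_INR = 83.0

for label, usd in [("input", 0.14), ("output", 0.28)]:
    print(f"V3.2 {label}: Rs {usd * USD_INR:.0f} per 1M tokens")
# input: Rs 12, output: Rs 23
```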
Use when
- Air-gapped or strict-privacy environments with real GPU hardware
- Research and benchmarking against open-weight families
- Distillation experiments where V3.2-Distill fits on consumer GPUs
Skip when
- CPU-only consumer hardware: the model is too big
- Cost optimization: the hosted API wins
- Anything where quality matters and you do not have 24GB+ VRAM
Alternatives I would consider
Read next
Adjacent in the playbook
llama.cpp, the Engine Under Most Local Inference
Free, open source. Compile time on an M75q is under two minutes; on a Pi 4B, about ten minutes.
Gemma, the Open Family I Actually Reach For
Free, open weights. Compute cost is local hardware electricity, effectively zero for personal use.
Qwen, Where Cerebras Speed Plus Open Weights Actually Compose
Free open weights. Cerebras Cloud free tier is 30 RPM; the paid tier is roughly Rs 50 per 1M input tokens and Rs 100 per 1M output for qwen-3-235B.
Free, open source. Compute cost on consumer hardware is electricity, roughly Rs 4 to Rs 8 per active inference hour on a 65W desktop.