
Where a Cheap DeepSeek API Actually Fits in a Real Routing Stack

I run a multi-model router across my whole empire. DeepSeek is one tier in it, not the whole thing. Here is the cost math, the exact profiles, and the two jobs I refuse to give it.


DeepSeek at a fraction of frontier pricing is real, and the "cut your AI bill by 98 percent" headline is technically true if you only look at per-token cost. But a full migration to it is the wrong move, and I learned that running a multi-model router across every automated job in my empire. DeepSeek is one tier inside that router. It is the cheap workhorse for grunt work. It is not the writer and it is not the planner, and the difference is where all the money and all the reliability live.

This is the version written by someone who actually routes production traffic through a cheap model every night, not someone summarising a pricing page.

The mental model: a router, not a swap

I have a single file, llm_router.py, that every cron job, content critic, and extraction worker in the empire imports. The full build of that 5-model fallback router is its own piece; here I focus on where the cheap tier fits inside it. Nothing calls a model directly. Each job calls chat(system=..., user=..., profile="json") and the router decides who serves the request. I built it in May 2026 after one too many free-tier providers either yanked their model or started 429-ing during a burst.

The whole point is that the cheap model sits at the front of a chain with paid models behind it. If DeepSeek 402s, 429s, times out, or returns an empty content field, the router silently falls through to the next provider. The caller never knows. This is the only sane way to lean on a cheap API in production: never single-source it, and put the failover in before you flip any traffic, not after the first 3 a.m. page.

Pricing reality

The cost gap is the reason this is worth doing at all.

Model                 | Input (per 1M tokens) | Output (per 1M tokens) | Approximate INR (input / output)
GPT-5.4               | $5.00                 | $20.00                 | Rs 415 / Rs 1,660
Claude Opus 4.6       | $15.00                | $75.00                 | Rs 1,245 / Rs 6,225
Claude Sonnet 4.5     | $3.00                 | $15.00                 | Rs 249 / Rs 1,245
DeepSeek (cheap tier) | $0.14                 | $0.28                  | Rs 12 / Rs 23

Roughly 50x cheaper than GPT-5.4 on a blended basis, and over 250x cheaper than Opus on output tokens (about 100x on input). On a job that fires thousands of times a night, that gap is the entire business case. On a job that fires fifty times and ships to a reader, the gap is a rounding error and you should not be optimising it.
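To see where the headline multiples come from, here is a small sketch that derives the per-side ratios from the table. The prices are copied from the table; the function and variable names are mine, for illustration only.

```python
# Prices in USD per 1M tokens (input, output), copied from the pricing table.
PRICES = {
    "GPT-5.4": (5.00, 20.00),
    "Claude Opus 4.6": (15.00, 75.00),
    "Claude Sonnet 4.5": (3.00, 15.00),
    "DeepSeek": (0.14, 0.28),
}


def price_ratios(model, baseline="DeepSeek"):
    """Return (input_ratio, output_ratio): how many times pricier `model` is than `baseline`."""
    m_in, m_out = PRICES[model]
    b_in, b_out = PRICES[baseline]
    return m_in / b_in, m_out / b_out


for name in PRICES:
    if name != "DeepSeek":
        r_in, r_out = price_ratios(name)
        print(f"{name}: {r_in:.0f}x on input, {r_out:.0f}x on output")
```

The ratio differs between input and output, so the multiple you actually see depends on how output-heavy the job is.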

The profiles I actually run

The router does not have one model list. It has eight named profiles, each with its own fallback order, because the right model depends entirely on the job class. These are the ones in production:

Profile    | Fallback order                                 | Use case
judgment   | DeepSeek -> Azure GPT-5.4-mini -> Cerebras     | content critique, council votes, contradiction-find
json       | DeepSeek -> Azure GPT-5.4-mini -> Cerebras     | strict JSON extraction (forces response_format=json_object)
literary   | DeepSeek -> Owl Alpha -> Azure Kimi -> Cerebras | long-form prose
research   | Owl Alpha -> DeepSeek -> Azure Kimi            | 1M-context primary
bulk_prose | DeepSeek -> Owl Alpha -> Cerebras -> Azure     | nightly fanout, throughput-tilted
fast       | Laguna -> DeepSeek -> Cerebras -> Azure        | low-latency
tool       | Laguna -> DeepSeek -> Azure GPT-5.4-mini       | tool-call routing
code       | DeepSeek -> Owl Alpha -> Azure                 | repo refactor, PR review

Notice the pattern. DeepSeek leads the chain on json, judgment, bulk_prose, and code because those are high-volume, structured, forgiving jobs where the cost gap compounds. On research it is the second leg behind a 1M-context model, and on fast and tool it sits behind a faster model because latency beats cost there. The cheap model is everywhere, but it is rarely alone and it is not always first.
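A rough sketch of how a profile table like this can drive dispatch. This is not the actual llm_router.py: the provider labels are transcribed from the table, while call_provider is a hypothetical adapter you would write per provider, and returning the provider name alongside the content is my own choice so the caller can log it.

```python
# Fallback order per profile, transcribed from the profile table.
PROFILES = {
    "judgment":   ["DeepSeek", "Azure GPT-5.4-mini", "Cerebras"],
    "json":       ["DeepSeek", "Azure GPT-5.4-mini", "Cerebras"],
    "literary":   ["DeepSeek", "Owl Alpha", "Azure Kimi", "Cerebras"],
    "research":   ["Owl Alpha", "DeepSeek", "Azure Kimi"],
    "bulk_prose": ["DeepSeek", "Owl Alpha", "Cerebras", "Azure"],
    "fast":       ["Laguna", "DeepSeek", "Cerebras", "Azure"],
    "tool":       ["Laguna", "DeepSeek", "Azure GPT-5.4-mini"],
    "code":       ["DeepSeek", "Owl Alpha", "Azure"],
}


def chat(system, user, profile, call_provider):
    """Walk the profile's chain and return (provider, content) from the first leg that answers.

    call_provider(provider, system, user, **kwargs) is a hypothetical adapter that
    returns the completion text or raises on any provider error.
    """
    # The json profile forces JSON mode, using the OpenAI-style response_format shape.
    extra = {"response_format": {"type": "json_object"}} if profile == "json" else {}
    errors = []
    for provider in PROFILES[profile]:
        try:
            content = call_provider(provider, system, user, **extra)
            if content:
                return provider, content
            errors.append((provider, "empty content"))
        except Exception as exc:  # any failure moves on to the next leg
            errors.append((provider, repr(exc)))
    raise RuntimeError(f"all providers failed for profile {profile!r}: {errors}")
```

Keeping the chains in one dictionary means reordering a profile is a data change, and every caller picks it up without touching its own code.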

When DeepSeek wins

These are the jobs I route to the cheap tier without hesitation:

  • High-volume batch classification, the single biggest win
  • Structured extraction with strict JSON output
  • Field parsing from rich sources (pricing, features, changelogs)
  • Dedup-against-corpus and relevance scoring
  • Translation, including Indic languages
  • First-pass critique where a paid model does the final call

The common thread is volume and tolerance for the occasional miss. When a classifier mislabels one ticket in a thousand and the eval set already told you the error rate, the 50x cost saving wins every time.

When it does not, and the two jobs I refuse to give it

This is the part the migration guides skip. There are jobs where cost is not the constraint, and routing them to a cheap model is how you quietly poison your output.

Reader-facing prose. I do not let any cheap open-weights model write copy that ships to a human. The reason is blunt: my content sites earn from traffic and trust, and slop is a blast-radius risk, not a cost line. My paid writing tier has effectively no marginal per-call cost because of how the subscription is billed, so there is zero reason to trade quality for a saving that does not exist on a job firing a few dozen times. Cheap models stay in the router as a fallback for a total outage of the good one, never as the primary writer. That rule cost me one rewrite of my own content architecture to learn.

Long structured generation through a reasoning model. This one bit me directly. I once tested a reasoning-mode model on a content writer with a dense 9,233-token prompt full of rules and brand constraints. It reasoned for 11,999 tokens straight, hit the token cap mid-thought, and returned zero characters of content. Cost per failed call: about four cents, for nothing. The lesson: classify the task first. Extract-and-transform from a rich source can take a reasoning model. Generate-from-thin-source needs a non-reasoning model, and you raise max_tokens to at least three times your target before the first smoke run, with the full real prompt, never a toy one. A toy-prompt smoke hides the reasoning-budget blowout completely.
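The budgeting rule and the failure signature above can be written down as two small helpers. This is a sketch under assumptions: the function names are mine, and it assumes an OpenAI-compatible response where finish_reason is "length" when the token cap is hit.

```python
def output_budget(target_tokens, multiplier=3):
    """max_tokens to request: at least `multiplier` times the target output length."""
    return target_tokens * multiplier


def is_budget_blowout(finish_reason, content):
    """True when the model hit the token cap and returned no usable content.

    On OpenAI-compatible APIs finish_reason == "length" means the cap was reached;
    combined with empty content it suggests the budget went to reasoning.
    """
    return finish_reason == "length" and not (content or "").strip()


# Example: a 1,500-token article needs at least a 4,500-token cap for the smoke run.
print(output_budget(1500))
```

Run the check on the full production prompt, since the blowout only appears when the real rule set is in the context.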

The rest of the avoid list is the standard one: complex multi-step planning, deeply nested agentic tool chains where a paid model is more reliable, and frontier-quality creative work.

Wiring the failover

The router falls through on HTTP 402, 429, and 500 through 504, on a timeout, on a JSON-decode failure, and, importantly, on an empty content field. That last trigger matters because some thinking models return a populated reasoning channel and an empty content field, which a naive client treats as success. The router treats empty content as a failure and moves to the next leg.

The minimal version of the pattern, if you are building your own, is a tier list and a loop:

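import logging
import os

from openai import OpenAI  # assumption: an OpenAI-compatible SDK; any similar client works

log = logging.getLogger("llm_router")

# Assumption: one OpenRouter key serves every tier, matching the model IDs below.
client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"],
)

MODEL_TIERS = [
    "deepseek/deepseek-chat",        # cheap, try first
    "anthropic/claude-sonnet-4.5",   # paid, fallback
    "openai/gpt-5.4",                # premium, last resort
]

def call_with_failover(messages, max_attempts=3):
    """Try each tier in order and return the first non-empty completion."""
    for model in MODEL_TIERS[:max_attempts]:
        try:
            r = client.chat.completions.create(
                model=model, messages=messages, timeout=60,
            )
            content = r.choices[0].message.content
            if not content:                       # empty-content = failure
                raise ValueError("empty content")
            log.info("served=%s", model)          # log who served it
            return content
        except Exception as e:                    # any error falls through to the next tier
            log.warning("%s failed: %s", model, e)
    raise RuntimeError("all models failed")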
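The loop above falls through on any exception. If you want the trigger list spelled out explicitly, a sketch like the following works. The function name and arguments are hypothetical, and I am reading "JSON-decode failure" as the returned content failing to parse when JSON was expected.

```python
import json

# Status codes that trigger a fall-through: 402, 429, and 500 through 504.
FALLTHROUGH_STATUS = {402, 429, 500, 501, 502, 503, 504}


def should_fall_through(status=None, content=None, timed_out=False, expect_json=False):
    """Decide whether the router should give up on this leg and try the next one."""
    if timed_out:
        return True
    if status in FALLTHROUGH_STATUS:
        return True
    if not content or not content.strip():
        # A thinking model can fill its reasoning channel and leave content empty.
        return True
    if expect_json:
        try:
            json.loads(content)
        except json.JSONDecodeError:
            return True
    return False
```

Making the decision a pure function keeps it easy to unit-test without touching any provider.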

OpenRouter is the simplest way to hold all of these behind one key, because swapping a model is a string change, not a code rewire. The free tier there gives 1000 requests a day, which covers a surprising amount of nightly grunt work at zero burn.

Two production details that are easy to miss. First, set the timeout to 60 seconds, not the 30-second default. A cheap model with no India point of presence runs slower from here, and the default trips during peak. Second, log which provider served every call. Every request through my router writes a line to a daily JSONL with the provider, latency, and status code. That log is how I see DeepSeek starting to flap during a burst before it shows up as a quality dip, and it is how I know whether the paid fallback is actually getting exercised or just sitting there.
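A minimal sketch of that per-call log, assuming one JSONL file per day. The directory name, field names, and the served_share helper are my own illustration, not the author's actual schema.

```python
import json
import time
from datetime import date
from pathlib import Path


def log_call(provider, latency_s, status, log_dir="router_logs"):
    """Append one JSON line per request to today's log file and return its path."""
    path = Path(log_dir) / f"{date.today().isoformat()}.jsonl"
    path.parent.mkdir(parents=True, exist_ok=True)
    record = {
        "ts": time.time(),
        "provider": provider,
        "latency_s": round(latency_s, 3),
        "status": status,
    }
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
    return path


def served_share(path):
    """Fraction of successful (HTTP 200) calls served by each provider."""
    counts = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            rec = json.loads(line)
            if rec["status"] == 200:
                counts[rec["provider"]] = counts.get(rec["provider"], 0) + 1
    total = sum(counts.values())
    return {p: n / total for p, n in counts.items()} if total else {}
```

A rising share for the fallback providers is the early sign that the cheap leg is flapping.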

One more gotcha from running this against multiple free providers: some of them put a Cloudflare WAF in front that rejects a bare urllib user-agent with an error before you even reach the model. Set a real User-Agent header on every request and that class of mystery failure disappears.
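For reference, a sketch of setting the header with the standard library. The default urllib agent is Python-urllib/x.y; the agent string, URL, and helper name here are placeholders of mine.

```python
import json
import urllib.request


def build_request(url, payload, api_key, user_agent="my-router/1.0 (+https://example.com)"):
    """Build a POST request that carries an explicit User-Agent instead of urllib's default."""
    data = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url,
        data=data,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
            "User-Agent": user_agent,
        },
    )


# Send it with urllib.request.urlopen(req, timeout=60) once the URL and key are real.
```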

The cost math, where it actually compounds

The headline saving is real but it only matters at scale. Take a nightly batch that fires the model 500 times, each call around 2K input and 500 output tokens, with a stable system prefix that caches well.

Model                 | Per-call cost (cached prefix) | 500 calls/night
DeepSeek (cheap tier) | ~Rs 0.30                      | Rs 150
Claude Sonnet 4.5     | ~Rs 7                         | Rs 3,500
GPT-5.4               | ~Rs 12                        | Rs 6,000

Rs 150 a night against Rs 3,500 to Rs 6,000. Over a month that is the difference between a free hobby and a real bill. But run the same arithmetic on a job firing fifty times a day to render reader-facing text and the monthly gap is a few hundred rupees, which is not worth one unit of quality risk. Route by where the cost compounds, not by reflex.
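To put a monthly number on the nightly batch, here is the same arithmetic as a sketch. The nightly totals are the ones from the table; the 30-night month and the helper names are my assumptions.

```python
# Nightly totals in rupees for the 500-call batch, copied from the table.
NIGHTLY_RS = {"DeepSeek": 150, "Claude Sonnet": 3_500, "GPT-5.4": 6_000}


def monthly_rs(nightly_rs, nights=30):
    """Monthly cost, assuming the batch runs every night of a 30-night month."""
    return nightly_rs * nights


def monthly_saving_rs(paid, cheap="DeepSeek", nights=30):
    """Rupees saved per month by running the batch on `cheap` instead of `paid`."""
    return monthly_rs(NIGHTLY_RS[paid] - NIGHTLY_RS[cheap], nights)


for model, nightly in NIGHTLY_RS.items():
    print(f"{model}: Rs {monthly_rs(nightly):,} per month")
```

On these figures the batch costs Rs 4,500 a month on the cheap tier, against Rs 105,000 on Sonnet and Rs 180,000 on GPT-5.4.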

Prompt caching stacks on top of this. A cheap model typically charges a fraction of the input price for a cached read, so a stable system prefix with many queries behind it multiplies the base saving. Design the prefix to be stable from day one and the cache hits accumulate for free. For deterministic calls you fire repeatedly with the same inputs, the next lever is to skip the model entirely: memoize the LLM calls to a content-addressed cache and replay the answer at zero tokens.
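A minimal sketch of that content-addressed memoization, assuming a file-per-answer cache on local disk. call_model is a hypothetical function that performs the real request, and the cache layout is my own illustration.

```python
import hashlib
import json
from pathlib import Path


def cache_key(model, system, user, **params):
    """Hash everything that affects the answer, so identical inputs map to one key."""
    blob = json.dumps(
        {"model": model, "system": system, "user": user, "params": params},
        sort_keys=True,
        ensure_ascii=False,
    )
    return hashlib.sha256(blob.encode("utf-8")).hexdigest()


def cached_call(call_model, model, system, user, cache_dir="llm_cache", **params):
    """Replay a stored answer when the exact same call was made before; otherwise call and store."""
    path = Path(cache_dir) / f"{cache_key(model, system, user, **params)}.json"
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))["content"]
    content = call_model(model, system, user, **params)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps({"content": content}), encoding="utf-8")
    return content
```

This only makes sense for calls you treat as deterministic, for example temperature 0 extraction, because a cached answer is replayed verbatim.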

Latency is the hidden cost

DeepSeek's pricing is the headline and latency is the quiet tax. From an India fibre line the cheap tier on OpenRouter's free tier lands around 4 to 17 seconds for a full completion, against single-digit seconds for a paid model with a nearby point of presence. For a background cron or an agent loop, nobody is watching the cursor and the cost wins outright. For an interactive chat where a user is staring at a spinner, that extra latency shows. The split that works: conversational endpoints go to the paid tier, batch and overnight endpoints go to the cheap one. You take the cost saving where it compounds and the speed where it is seen.

Bottom line

A cheap DeepSeek-class API is one of the best automation levers available in 2026, but only as a tier, never as a religion. Put it at the front of a router with paid models behind it on failover. Lead the chain with it on high-volume classification, extraction, and structured jobs. Keep it out of reader-facing prose and out of long generation through a reasoning model. Log every call so you see it flap before your users do. Do that and the 50x cost gap turns into a real monthly saving without ever turning into a 3 a.m. rollback.