
How I Built a 5-Model LLM Fallback Router

One file every cron job imports, with five providers behind it, so no single 429 or yank kills the empire.

[Figure: Ordered LLM provider fallback chain with five tiers and auto-failover]

Every automated job in my empire imports one file. Nothing calls a model directly. The content critics, the nightly fanout, the JSON extractors, all of them call chat(system=..., user=..., profile="json") and let a router pick who serves the request. I built it after one too many free-tier providers either yanked a model mid-week or started 429-ing during a burst. If you are a solo founder running LLM automation on free and cheap tiers, and a single provider's bad afternoon has ever taken down your pipeline, this is the pattern that fixes it.

Why a router, not a single API

A single API key is a single point of failure, and on free tiers that failure is not rare, it is scheduled. Two real events from my own logs make the case better than any argument.

On 16 May 2026 I yanked two whole providers in one sitting. Ring 2.6 retired its OpenRouter free tier, and my smoke test came back with a plain string: no longer available as a free model. The same day I pulled the entire Nous Qwen 3.6 family after it burst-drained a subscription bucket in eight hours. If either of those had been my only model, the empire's automation would have gone dark until I noticed and rewrote code. Because they were tiers inside a chain, the router fell through and nothing downstream even blinked.

The thesis is simple. Put the failover in before you flip any traffic, not after the first 3 a.m. page. A cheap or free model is fine to lean on in production on one condition: never single-source it.

The fallback chain I actually run

The router lives at one path and is vendored byte-identical into two CI repos so the pipelines and my local box behave the same. It is not one chain but eight profiles, because a low-latency tool call and a long-form prose draft want different orders. Here are the seven that matter:

| profile    | order                                              | use case                                        |
|------------|----------------------------------------------------|-------------------------------------------------|
| judgment   | DeepSeek-V4-Flash, Azure gpt-5.4-mini, Cerebras    | default critique, contradiction-find            |
| json       | DeepSeek, Azure gpt-5.4-mini, Cerebras             | strict JSON (sets response_format=json_object)  |
| literary   | DeepSeek, Owl Alpha, Azure Kimi K2.6, Cerebras     | long-form prose                                 |
| research   | Owl Alpha, DeepSeek, Azure Kimi                    | 1M-context primary                              |
| bulk_prose | DeepSeek, Owl Alpha, Cerebras, Azure gpt-5.4-mini  | nightly fanout, throughput-tilted               |
| fast       | Laguna XS.2, DeepSeek, Cerebras, Azure gpt-5.4-mini | low-latency                                    |
| code       | DeepSeek, Owl Alpha, Azure                         | repo refactor, PR review                        |
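
One way to keep these orders honest is to hold them as plain data and resolve them to callables at import time. The sketch below is an illustration, not the router's actual source: the provider keys (deepseek, owl_alpha, azure_kimi and so on), PROFILE_ORDER, and build_chains are hypothetical names, and the "azure" entry under code is left generic because the table does not name a model for it.

```python
# Hypothetical provider keys; the real router may name these differently.
PROFILE_ORDER = {
    "judgment":   ["deepseek", "azure_gpt_mini", "cerebras"],
    "json":       ["deepseek", "azure_gpt_mini", "cerebras"],
    "literary":   ["deepseek", "owl_alpha", "azure_kimi", "cerebras"],
    "research":   ["owl_alpha", "deepseek", "azure_kimi"],
    "bulk_prose": ["deepseek", "owl_alpha", "cerebras", "azure_gpt_mini"],
    "fast":       ["laguna", "deepseek", "cerebras", "azure_gpt_mini"],
    "code":       ["deepseek", "owl_alpha", "azure"],  # Azure model not specified
}


def build_chains(order, providers):
    """Resolve provider keys to callables, failing fast on a typo.

    order:     mapping of profile name -> ordered list of provider keys
    providers: mapping of provider key -> callable
    """
    chains = {}
    for profile, keys in order.items():
        missing = [key for key in keys if key not in providers]
        if missing:
            raise KeyError(f"profile {profile!r} names unknown providers: {missing}")
        chains[profile] = [providers[key] for key in keys]
    return chains
```

Failing at import time means a misspelled tier surfaces in the smoke run rather than at 3 a.m. when the chain reaches it.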

Read the tiers as a cost-and-reliability ladder, not a quality ranking.

The front is DeepSeek-V4-Flash, the cheap workhorse. It serves the bulk of grunt traffic. I keep it first for the same reason most operators reach for a cheap DeepSeek API in a real routing stack: the per-token cost gap is large enough that you want it taking every request it can handle.

Next are two OpenRouter free models. Owl Alpha is the long-context and prose backup. Laguna XS.2 (poolside/laguna-xs.2:free) is the low-latency option that fronts the fast and tool profiles. The OpenRouter free tier gives roughly 1000 requests a day across free models, which is plenty for backup duty but not something to build a primary on.

Then Azure Foundry, which is credit-covered rather than free. Depending on profile it routes to gpt-5.4-mini for cheap structured work or Kimi K2.6 for prose that wants a thinking model. This is the paid safety net: when the free tiers are all having a bad day, credit keeps the lights on.

Last is Cerebras Qwen (qwen-3-235b-a22b-instruct-2507), the free 1M-tokens-per-day last resort. The daily token budget is generous, but the requests-per-minute window is tighter than the docs imply, so I only want it catching overflow, never leading.

The last time I ran the full smoke across DeepSeek, Owl, Laguna, Azure, and Cerebras, all five came back green. Five for five is the bar I expect before I trust the chain in production.

The routing logic

The core is a list of provider callables per profile. The router tries them in order and returns the first clean response. A response is only clean if it has a populated content field, so a 200 OK with an empty body counts as a failure and falls through.

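from json import JSONDecodeError

# PROFILE_CHAIN, _postprocess, and the RateLimited, ServerError, Timeout, and
# AllProvidersDown exceptions are assumed to be defined elsewhere in the router.
def chat(system, user, profile="judgment", max_tokens=4096, json_mode=False):
    """Try each provider in the profile's chain; return the first clean reply."""
    for call in PROFILE_CHAIN[profile]:
        try:
            text = call(system, user, max_tokens, json_mode)
            if text and text.strip():          # empty content == failure
                return _postprocess(text)        # strip emoji + em-dash, rstrip
        except (RateLimited, ServerError, Timeout, JSONDecodeError):
            continue                             # silent fall-through
    raise AllProvidersDown(profile)              # every tier failed or came back empty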

The failover triggers are deliberately broad:

  • HTTP 402, 429, or 500 through 504
  • An empty content field
  • A request timeout
  • A JSON-decode failure on the response body
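
The triggers above can be sketched as a single classification step that each provider callable runs on its raw response. This is a minimal illustration under stated assumptions: ProviderFailure, is_failover_status, and extract_content are hypothetical names, and the body is assumed to follow the OpenAI-style chat completion shape (choices[0].message.content). Timeouts are raised by the HTTP layer before this function is reached, so they do not appear here.

```python
import json


class ProviderFailure(Exception):
    """Hypothetical exception: any reason to move on to the next tier."""


def is_failover_status(status):
    """True for HTTP 402, 429, or 500 through 504."""
    return status in (402, 429) or 500 <= status <= 504


def extract_content(status, body):
    """Return the reply text, or raise ProviderFailure so the chain continues.

    Assumes an OpenAI-style body: {"choices": [{"message": {"content": ...}}]}.
    Statuses outside the failover list are not special-cased; they fall
    through to the body checks below.
    """
    if is_failover_status(status):
        raise ProviderFailure(f"HTTP {status}")
    try:
        payload = json.loads(body)
    except json.JSONDecodeError as exc:
        raise ProviderFailure("undecodable JSON body") from exc
    try:
        content = payload["choices"][0]["message"]["content"]
    except (KeyError, IndexError, TypeError) as exc:
        raise ProviderFailure("unexpected response shape") from exc
    # A 200 OK with nothing in content still counts as a failure.
    if not isinstance(content, str) or not content.strip():
        raise ProviderFailure("empty content")
    return content
```

Collapsing every trigger into one exception type keeps the except clause in the router short, at the cost of losing the distinction between a rate limit and a bad body unless the message is logged.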

Two small things save real debugging time. Every request carries a User-Agent: empire-cron/1.0 header, because Cerebras sits behind a Cloudflare WAF that returns error 1010 on a bare urllib agent. And every call logs provider, latency, HTTP code, and an ok-flag to a dated JSONL file, so when a tier starts degrading I can see it in the logs before a job fails.
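
Both habits fit in a few lines of standard-library Python. The sketch below assumes urllib is the HTTP client, as the bare-agent remark suggests; build_request, log_call, the llm-YYYY-MM-DD.jsonl filename, and the record field names are hypothetical and only illustrate the idea.

```python
import json
import urllib.request
from datetime import datetime, timezone
from pathlib import Path

USER_AGENT = "empire-cron/1.0"


def build_request(url, payload, extra_headers=None):
    """Build a JSON POST that always carries the custom User-Agent."""
    headers = {"Content-Type": "application/json", "User-Agent": USER_AGENT}
    headers.update(extra_headers or {})  # e.g. provider-specific auth headers
    data = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(url, data=data, headers=headers, method="POST")


def log_call(log_dir, provider, latency_ms, http_code, ok):
    """Append one record per call to a dated JSONL file and return its path."""
    now = datetime.now(timezone.utc)
    path = Path(log_dir) / f"llm-{now:%Y-%m-%d}.jsonl"  # hypothetical naming
    path.parent.mkdir(parents=True, exist_ok=True)
    record = {
        "ts": now.isoformat(),
        "provider": provider,
        "latency_ms": latency_ms,
        "http_code": http_code,
        "ok": ok,
    }
    with path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")
    return path
```

One JSON object per line means a degrading tier can be spotted with a quick grep or a few lines of Python over the day's file, without a log pipeline.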

A --smoke flag probes every provider in one shot. I run it after any provider change and before trusting the chain again.
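
A smoke mode along these lines is small enough to sketch. The version below is an assumption about shape, not the router's real CLI: smoke and main are hypothetical names, the probe prompt is made up, and providers is taken to be a mapping of name to a callable with the same (system, user, max_tokens, json_mode) signature the chain uses.

```python
import argparse
import time


def smoke(providers, prompt="Reply with the single word: ok"):
    """Call every provider once and report ok, detail, and latency per name."""
    results = {}
    for name, call in providers.items():
        start = time.monotonic()
        try:
            text = call("You are a health check.", prompt, 16, False)
            ok = bool(text and text.strip())  # same empty-content rule as chat()
            detail = "ok" if ok else "empty content"
        except Exception as exc:  # a probe should report, never crash
            ok, detail = False, f"{type(exc).__name__}: {exc}"
        latency_ms = round((time.monotonic() - start) * 1000)
        results[name] = {"ok": ok, "detail": detail, "latency_ms": latency_ms}
    return results


def main(argv=None, providers=None):
    """Exit code 0 only when every provider is green."""
    parser = argparse.ArgumentParser(description="LLM router (sketch)")
    parser.add_argument("--smoke", action="store_true",
                        help="probe every provider once and exit")
    args = parser.parse_args(argv)
    if not args.smoke:
        return 0
    results = smoke(providers or {})
    for name, r in results.items():
        flag = "GREEN" if r["ok"] else "RED"
        print(f"{flag:5} {name:12} {r['latency_ms']:>6} ms  {r['detail']}")
    return 0 if results and all(r["ok"] for r in results.values()) else 1
```

Returning a non-zero exit code on any red tier lets a CI step gate on the probe instead of relying on someone reading the output.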

What broke for me

The nastiest failure was not an outage, it was a model returning success with nothing in it. Azure's Kimi K2.6 is a thinking model. It generates a large reasoning_content block, its internal debate and self-edits, before it writes the final prose into content. At a normal token budget the reasoning eats the whole allocation and the API returns 200 OK with finish_reason: "length" and content: "". It looks like a model failure. It is a budget mismatch.

I measured it during a prose bakeoff on 5 May 2026. A request of 1,130 prompt tokens and 16,000 completion tokens for a roughly 1,200-word chapter came back with zero characters in content and 66,020 characters in reasoning_content. The finished draft was sitting inside the reasoning under the marker "Full draft assembly:", followed by self-edit checks that ran out of budget before re-emitting it. It took about ten minutes to work out the output was buried, not missing.

The fix is two-part and lives in the router so no caller has to know:

  1. Bump max_tokens to a 32,000 minimum for any K2.6 call, and append to the system prompt verbatim: "Output ONLY the requested content. Do NOT include reasoning, notes, self-edits, or compliance checks."
  2. If content still comes back empty, salvage by parsing reasoning_content for the Full draft assembly: marker. Fragile, last resort only.
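
The two steps above can be sketched as a pair of helpers wrapped around the K2.6 call. The function and constant names here (prepare_k26, salvage, K26_MIN_TOKENS) are hypothetical, and the salvage makes two assumptions worth stating: it takes the last occurrence of the marker, and it returns everything after it, which may still include trailing self-edit notes that a real implementation would need to trim.

```python
K26_MIN_TOKENS = 32_000
K26_SUFFIX = (
    "Output ONLY the requested content. Do NOT include reasoning, notes, "
    "self-edits, or compliance checks."
)
DRAFT_MARKER = "Full draft assembly:"


def prepare_k26(system, max_tokens):
    """Step 1: append the output-only instruction and enforce the token floor."""
    return f"{system.rstrip()}\n\n{K26_SUFFIX}", max(max_tokens, K26_MIN_TOKENS)


def salvage(message):
    """Step 2: return content, else whatever follows the draft marker.

    message is the assistant message dict, assumed to carry "content" and
    "reasoning_content" keys. Returns "" when nothing usable is found, so the
    router's empty-content check still triggers failover.
    """
    content = (message.get("content") or "").strip()
    if content:
        return content
    reasoning = message.get("reasoning_content") or ""
    _, found, tail = reasoning.rpartition(DRAFT_MARKER)  # last occurrence wins
    return tail.strip() if found else ""
```

Returning an empty string instead of raising keeps the salvage path on the same footing as any other empty reply: if the marker is missing, the chain simply moves to the next tier.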

This is exactly why the empty-content check is a first-class failover trigger. A naive router that only watches HTTP status would have happily returned an empty string into a production pipeline and silently lost the output.

What it costs

The honest read is that most of this chain is free or credit-covered, and the paid leg only earns its keep when the free tiers fail at once. Rough per-1M-token figures, INR at about Rs 83 to the dollar:

| Tier | Provider                      | Cost                | Notes                                    |
|------|-------------------------------|---------------------|------------------------------------------|
| 1    | DeepSeek-V4-Flash             | cheap pay-as-you-go | takes the bulk of traffic                |
| 2    | Owl Alpha (OpenRouter free)   | $0                  | ~1000 req/day shared across free models  |
| 3    | Laguna XS.2 (OpenRouter free) | $0                  | same free-tier pool, low latency         |
| 4    | Azure Foundry                 | credit-covered      | gpt-5.4-mini or Kimi K2.6                |
| 5    | Cerebras Qwen 235B            | $0                  | 1M tok/day, tight RPM, last resort       |

The whole point of the ladder is that the expensive request is the rare one. On a normal day DeepSeek and the free OpenRouter models absorb almost everything, and my actual cash outlay rounds to the DeepSeek bill plus credit I was going to spend anyway.

When NOT to build this

Skip the router if you only make a handful of LLM calls a day from one app and an occasional failure is something a human notices and retries. The failover machinery, the per-profile chains, the smoke test, and the logging are overhead that only pays off at volume or when an unattended cron must not silently die.

Skip it too if you need a single model's specific behaviour end to end, a fixed tokenizer or an exact output format, where falling through to a different model would corrupt the result rather than save it. A router trades model-identity guarantees for availability. If you are running nightly automation on free and cheap tiers, that is the right trade. If you are not, one well-monitored API is simpler and fine. The same calculus applies to picking a backend, which is why I argue PocketBase over Supabase for bootstrapped founders: match the machinery to the actual load.
