Every LLM provider has bad hours. Rate limits tighten during launches, regional incidents take endpoints offline, and latency quietly doubles under load long before a status page admits anything. If your application talks to exactly one provider, all of that becomes your incident too.

Fallback is the standard answer, but naive fallback creates its own problems: responses that change personality mid-conversation, retry storms that amplify outages, and failovers you cannot prove happened. This post covers the patterns that hold up in production — and the constraints that make fallback trustworthy rather than just busy.

What should actually trigger a fallback

Not every error deserves a retry elsewhere. In gateway design, three failure classes are commonly evaluated:

5xx responses — the upstream failed; trying the same route again mostly re-queues you into the same incident.
Network errors and timeouts — a connection that cannot complete may justify trying another available route.
Degradation trends — rising failure rates and latency can inform how a gateway ranks candidates before another attempt begins.

Notably absent from this general rule: most 4xx errors. A malformed request will usually be malformed everywhere, and retrying a 401 with a different provider only repeats the same client-side problem. In Router One, eligible upstream failures can retry another available candidate under the gateway's routing policy. Candidate routing can use latency, cost, and reliability signals; the exact public boundary is documented in the routing methodology.

Keep the requested model unchanged

The biggest practical decision is what to fall back to. Swapping the requested model may technically return a response, but it changes tone, formatting behavior, tool-calling quirks, and prompt-sensitivity in the middle of someone's session — a silent quality incident layered on top of the availability one.

Router One's narrower contract is same-model provider fallback: an eligible retry can use another healthy provider route serving the exact model requested by the application. It does not silently substitute a different model variant. If no eligible route for that model is available, the request returns an error instead of a surprise model swap.

model="auto" is a separate path: Router One owns that candidate set and can retry within it. The client still sends the standard OpenAI-compatible body; it does not provide a custom router object or per-request weights.

Budget the handoff

Fallback that doubles your latency is a degradation of its own. Two measurements keep it honest:

Failed-attempt time — how long the upstream attempt ran before the gateway could move on.
Retry time — how long the next candidate took to complete or fail.

End-to-end fallback latency includes both measurements and varies by upstream behavior. Router One does not publish a universal 200ms fallback guarantee, so latency-sensitive applications should evaluate the full observed trace rather than assume a fixed handoff cost.

If you can't verify it, it doesn't exist

The most common fallback failure is silent: the config looks right, and nobody can prove what actually served a request. Router One's per-request trace records the final model and provider route together with usage, cost, latency, and status. It does not promise a customer-visible chain for every failed intermediate attempt, so incident analysis should stay within the evidence the trace actually exposes.

Teams with compliance or evaluation constraints sometimes need the opposite guarantee — that a request never moves between providers. The current public Router One contract does not expose a per-project fallback opt-out, so do not design around that control without confirming a separately documented capability.

The production checklist

Only eligible upstream failures retry another available candidate; do not assume every error is retryable.
Exact-model provider fallback keeps that model unchanged; model="auto" retries within its server-owned candidates.
Measure the failed attempt and retry together; there is no fixed 200ms guarantee.
The trace shows the final model, provider, usage, cost, latency, and status; do not assume every intermediate attempt is exposed.
Do not assume a per-project retry cap or fallback opt-out exists in the public configuration surface.

The LLM fallback and smart routing pages describe how the pieces fit together, and the China latency benchmark shows the measurement discipline behind the routing claims. One base-URL change at router.one puts the whole pattern in front of your existing code.

LLM Fallback Strategies: Production Failover That Holds

What should actually trigger a fallback

Keep the requested model unchanged

Budget the handoff

If you can't verify it, it doesn't exist

The production checklist

Related canonical pages

Related reads

AI Agents in Production: Observability, Cost Caps, Recovery

Multi-Agent Orchestration: Patterns for Production AI Systems

Claude Skills Explained: Building Custom Agent Capabilities