
How to Run AI Agents in Production: Observability, Cost Control, and Fault Recovery

Router One Team
ai-agents · production · observability · cost-control · reliability · best-practices

Your AI agent works in dev. Requests go out, responses come back. The agent reads files, calls an LLM, writes code, maybe calls the LLM again to verify. It feels solid. Then you deploy to production, and reality hits.

Three things happen in the first week. First, invisible cost accumulation — one agent instance enters a reasoning loop and burns through $200 in tokens before anyone notices. Second, silent failures with no trace — a customer reports that the agent "stopped working" and your team has no way to reconstruct what happened because there is no run-level trace, just scattered request logs. Third, cascading outages with no recovery — your primary LLM provider experiences degraded performance for 40 minutes, and every agent in your system hangs waiting for responses that never arrive.

These are not hypothetical scenarios. They are the operational reality of running AI agents at scale, and they apply equally to coding agents, customer support chatbots, internal automation workflows, and any system where an LLM makes decisions autonomously.

Production AI agents need three pillars: observability to understand what agents are doing, cost control to prevent budget disasters, and fault recovery to keep running when providers fail. This guide covers all three with concrete patterns, configuration examples, and a production readiness checklist.

Why AI Agents Are Harder to Operationalize Than Traditional APIs

Before diving into solutions, it is worth understanding why standard API operations playbooks fall short for AI agents. The failure modes are fundamentally different.

Non-deterministic behavior. Traditional APIs are deterministic — same input, same output, same cost. AI agents are not. The same user prompt can produce different reasoning paths, different tool call sequences, and different token counts on every invocation. This makes capacity planning, cost forecasting, and anomaly detection significantly harder. You cannot simply multiply request count by average cost and get a reliable estimate.

Compound costs. A traditional API call has a fixed cost. An AI agent call can trigger 2, 10, or 50 LLM invocations depending on the complexity of the task, the agent's reasoning strategy, and whether it enters a retry or self-correction loop. A single user request to "refactor the authentication module" might generate 15 LLM calls across planning, code generation, verification, and revision steps. The cost of that one user action is unknowable in advance and varies by an order of magnitude between runs.

Multi-provider dependency. Most production agent systems rely on multiple LLM providers — perhaps Claude for complex reasoning, GPT for certain tool use patterns, and a smaller model for classification tasks. Each provider is an independent failure domain with its own uptime, latency characteristics, rate limits, and degradation patterns. The combinatorial failure space grows with every provider you add.

Stateful workflows. Unlike stateless API calls, agents maintain context across multiple steps. A coding agent might be midway through a 12-step refactoring task when a provider outage hits. Simply retrying the request is not an option — you need to resume from the last successful step, which means you need to have persisted the intermediate state in the first place. Most teams discover this requirement after their first production outage.

Pillar 1 — Observability: Know What Your Agents Are Doing

The first thing you lose when deploying agents to production is visibility. In development, you watch the agent work in real time — you see each LLM call, each tool invocation, each decision point. In production, you have logs. And if those logs are not structured correctly, you have nothing.

Request-Level Tracing Is Not Enough

Traditional API observability centers on individual requests: this request took 340ms, returned a 200, consumed 1,247 tokens. For agents, this granularity is insufficient because a single agent action comprises multiple requests with dependencies and intermediate state.

What you need is Run-level tracing — a hierarchical trace that captures every step an agent takes within a single user-initiated action, including LLM calls, tool invocations, token counts, costs, and timing:

Run #r_abc123 — "Refactor auth module"
├─ Step 1: claude-4-sonnet → Plan (312 in / 1,847 out) $0.008 — 1.2s
├─ Step 2: tool:file_read → auth.ts (no LLM cost)
├─ Step 3: claude-4-sonnet → Generate code (4,102 in / 3,291 out) $0.031 — 3.4s
├─ Step 4: tool:file_write → auth.ts (no LLM cost)
├─ Step 5: claude-4-sonnet → Verify (2,847 in / 892 out) $0.014 — 1.1s
└─ Total: 7,261 in / 6,030 out — $0.053 — 5.7s

This trace tells you everything: the agent made three LLM calls and two tool calls, the code generation step was the most expensive, total cost was $0.053, and the entire run completed in 5.7 seconds. Without this structure, you are left piecing together individual request logs and guessing which ones belong to the same agent action.
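The same hierarchy is straightforward to model in application code. Here is a minimal sketch using plain Python dataclasses — `Step`, `Run`, and `totals` are illustrative names, not the API of any particular tracing SDK:

```python
from dataclasses import dataclass, field

@dataclass
class Step:
    kind: str              # "llm" or "tool"
    name: str              # model name or tool name
    tokens_in: int = 0
    tokens_out: int = 0
    cost_usd: float = 0.0
    duration_s: float = 0.0

@dataclass
class Run:
    run_id: str
    label: str
    steps: list[Step] = field(default_factory=list)

    def totals(self) -> tuple[int, int, float, float]:
        # Aggregate tokens, cost, and wall time across every step in the run.
        return (
            sum(s.tokens_in for s in self.steps),
            sum(s.tokens_out for s in self.steps),
            round(sum(s.cost_usd for s in self.steps), 3),
            round(sum(s.duration_s for s in self.steps), 1),
        )

# Reconstruct the trace from the example above.
run = Run("r_abc123", "Refactor auth module")
run.steps.append(Step("llm", "claude-4-sonnet", 312, 1847, 0.008, 1.2))
run.steps.append(Step("tool", "file_read"))
run.steps.append(Step("llm", "claude-4-sonnet", 4102, 3291, 0.031, 3.4))
run.steps.append(Step("tool", "file_write"))
run.steps.append(Step("llm", "claude-4-sonnet", 2847, 892, 0.014, 1.1))
print(run.totals())  # (7261, 6030, 0.053, 5.7)
```

The key design point is that the `Run` is the unit of aggregation: per-run metrics like cost and end-to-end latency fall out of the structure for free.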

Metrics That Matter

Once you have run-level traces, you can derive the metrics that actually matter for production operations:

  • Cost per run — not cost per request. A run that makes 12 LLM calls at $0.01 each costs $0.12, but if you only track individual requests you see twelve cheap calls and miss the aggregate picture.
  • P95 latency per run — end-to-end time from run start to completion. Individual request latency is less relevant when the user is waiting for a 10-step agent workflow.
  • Success/failure rate by agent type — different agents have different baseline reliability. Your code generation agent might fail 8% of the time while your summarization agent fails 0.5%. Tracking them separately lets you set meaningful alerts.
  • Token consumption trends — are your agents getting more expensive over time? A prompt drift or model update might increase token usage by 30% without any code change on your side.
  • Model distribution — which models are actually handling your traffic? If your routing is configured but 95% of traffic is still going to the most expensive model, something is wrong.

Dashboard vs. Logging

Structured logging to stdout is a reasonable starting point. But logs alone cannot answer aggregate questions like "what was our total agent cost yesterday?" or "which project had the highest P95 run latency this week?" without a query layer on top. You need aggregated dashboards that surface trends, anomalies, and breakdowns by project, API key, model, and time window.

Platforms like Router One capture these traces automatically at the gateway level — every llm.invoke call is logged with full context, and the dashboard surfaces aggregated metrics without custom instrumentation. Your application code stays clean, and your operations team gets the visibility they need from day one.

For more on how an AI API gateway provides this observability layer, see What is an AI API Gateway.

Pillar 2 — Cost Control: Budget Guardrails for Autonomous Systems

AI agents are semi-autonomous by design. They decide how many LLM calls to make, which tools to invoke, and when to retry or self-correct. This autonomy is what makes them useful — and what makes them dangerous to your budget.

Why Agents Need Budget Limits More Than Traditional APIs

A traditional API call has a bounded cost. You know the price per request, you can estimate monthly spend from projected volume, and a bug that doubles your request count doubles your cost — painful but predictable.

Agents break this model. A coding agent asked to "refactor the entire codebase" might loop for 50+ iterations, each involving multiple LLM calls. A research agent that cannot find a satisfactory answer might retry with progressively longer contexts, increasing token consumption exponentially. A customer support agent handling an adversarial user might enter a clarification loop that generates hundreds of unnecessary tokens.

The cost of a single agent run is unbounded unless you explicitly bound it. Without budget guardrails, one bad run can consume more budget than the entire system normally uses in a day.

Layered Budget Architecture

Effective cost control requires limits at multiple levels. A single global budget is too coarse — it protects the organization but does not prevent one project from starving another. Per-request limits are too fine — they prevent long-running but legitimate agent workflows from completing.

The right approach is layered:

{
  "budgets": {
    "organization": { "monthly_limit_usd": 5000 },
    "project": { "monthly_limit_usd": 1000 },
    "api_key": { "daily_limit_usd": 50 },
    "per_run": { "max_cost_usd": 10 }
  },
  "alerts": {
    "soft_warning": 0.8,
    "hard_cutoff": 1.0
  }
}

Each layer catches a different class of problem. The per-run limit prevents individual runaway agents. The API key daily limit caps exposure from a single integration or developer. The project monthly limit keeps team spending within allocated budgets. The organization limit is the last line of defense against systemic issues.

The soft warning at 80% gives teams time to investigate and adjust before the hard cutoff kicks in. When the hard cutoff is reached, you have options.
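The evaluation itself reduces to a loop over the layers. A sketch, assuming in-memory spend counters in place of a real metering store (`LIMITS` and `check_budgets` are illustrative names, not Router One's API):

```python
# Hypothetical layered budget check: every layer must pass before a
# request is forwarded; the soft threshold fires a warning callback.
LIMITS = {"organization": 5000.0, "project": 1000.0, "api_key": 50.0, "per_run": 10.0}
SOFT, HARD = 0.8, 1.0

def check_budgets(spend: dict[str, float], warn=print) -> bool:
    for layer, limit in LIMITS.items():
        used = spend.get(layer, 0.0)
        if used >= limit * HARD:
            return False                                   # hard cutoff: block
        if used >= limit * SOFT:
            warn(f"{layer} budget at {used / limit:.0%}")  # soft warning at 80%
    return True

print(check_budgets({"api_key": 42.0, "per_run": 1.5}))  # warns on api_key (84%), allows
print(check_budgets({"per_run": 10.0}))                  # per-run limit hit: blocked
```

Checking before forwarding, rather than reconciling after the fact, is what makes overspend impossible rather than merely detectable.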

What Happens When a Limit Is Hit

The right response depends on the agent type and the business context:

  • Block the request — safest option for non-critical workloads. The agent receives a budget-exceeded error and stops.
  • Downgrade to a cheaper model — for agents where completion matters more than quality. Route to a smaller model whose per-token price is 10-20x lower and let the agent finish.
  • Alert and continue — for revenue-critical agents where stopping is worse than overspending. Send an alert to the operations team and allow the run to complete.
  • Queue for human approval — for high-value, high-cost operations. Pause the agent, notify a human, and resume only with explicit approval.

Most production systems use a combination: per-run limits that block, daily limits that downgrade, and monthly limits that alert.

Router One enforces these limits in real time at the gateway layer — the budget is checked before the request is forwarded to the provider, so overspend is physically prevented rather than detected after the fact.

For more cost optimization strategies, see 5 Ways to Reduce Your LLM API Costs.

Pillar 3 — Fault Recovery: When Things Go Wrong

LLM providers are not as reliable as traditional cloud infrastructure. Major providers each experience 2 to 5 degraded performance incidents per month, ranging from elevated latency to full outages lasting minutes to hours. If your agent system depends on a single provider with no failover, you are accepting multiple incidents per month as baseline.

Automatic Failover Patterns

There are three common approaches to provider failover, each with different tradeoffs:

Hot standby. You designate a primary model and one or more fallback models. All traffic goes to the primary under normal conditions. When the primary degrades, traffic switches to the first fallback. Simple to implement, but the fallback models receive no production traffic during normal operation, which means you have less confidence in their behavior under load.

Active-active. Traffic is distributed across multiple providers simultaneously, weighted by routing scores. When one provider degrades, its weight drops and the others absorb the traffic. More resilient than hot standby because all providers are continuously exercised, but more complex to manage.

Graceful degradation. When the primary model is unavailable, instead of failing over to an equivalent model at a different provider, you fall back to a simpler, faster, cheaper model that handles a subset of the agent's capabilities. The agent continues to function but with reduced capability — like a car switching to limp mode.

A practical failover configuration looks like this:

routing:
  primary: claude-4-sonnet
  fallback:
    - gpt-4.1
    - gemini-2.5-pro
  failover_trigger:
    error_rate_threshold: 0.1
    latency_threshold_ms: 5000
  recovery:
    probe_interval_seconds: 30
    min_healthy_probes: 3

The failover_trigger defines when to switch: either when the error rate exceeds 10% or when latency exceeds 5 seconds. The recovery section defines how to restore the primary: send a health probe every 30 seconds, and only reintroduce the primary after 3 consecutive healthy responses. This prevents the system from flapping between providers during intermittent failures.
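In code, the trigger and recovery rules reduce to two small predicates. A sketch mirroring the YAML thresholds above (function names are illustrative):

```python
# Thresholds copied from the failover configuration above.
ERROR_RATE_THRESHOLD = 0.1
LATENCY_THRESHOLD_MS = 5000
MIN_HEALTHY_PROBES = 3

def should_failover(error_rate: float, p95_latency_ms: float) -> bool:
    # Switch away from the primary when either trigger fires.
    return error_rate > ERROR_RATE_THRESHOLD or p95_latency_ms > LATENCY_THRESHOLD_MS

def can_recover(probe_results: list[bool]) -> bool:
    # Reintroduce the primary only after N consecutive healthy probes,
    # which is what prevents flapping on intermittent failures.
    return (len(probe_results) >= MIN_HEALTHY_PROBES
            and all(probe_results[-MIN_HEALTHY_PROBES:]))

print(should_failover(0.12, 900))                    # True: error rate above 10%
print(can_recover([False, True, True, True]))        # True: last 3 probes healthy
```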

State Recovery for Multi-Step Agents

Failover handles the provider side, but agent workflows have a separate recovery challenge: what happens to a multi-step run when a failure occurs midway through?

Consider a coding agent that is on step 7 of a 12-step refactoring task. Steps 1 through 6 have completed successfully — files have been read, plans have been made, code has been generated. Then the LLM call in step 7 fails. Without state persistence, you have two bad options: retry the entire 12-step workflow from scratch (wasting time and money) or fail the entire run (wasting the work already done).

The better approach is to persist Run/Step state at each checkpoint. When a failure occurs, the agent can resume from the last successful step. The persisted state includes the Run's current status, the output of each completed step, and any accumulated context. This is the same concept as checkpointing in distributed computing — and it is just as essential for production agents.
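A minimal checkpointing sketch makes the idea concrete. A JSON file stands in for the persistence layer purely for illustration — a production system would use a database or durable queue, and `save_checkpoint`/`resume_from` are hypothetical names:

```python
import json
import pathlib
import tempfile

def save_checkpoint(path: pathlib.Path, run_id: str, step: int, output: dict) -> None:
    # Persist the output of each completed step, keyed by step number.
    state = json.loads(path.read_text()) if path.exists() else {"run_id": run_id, "steps": {}}
    state["steps"][str(step)] = output
    path.write_text(json.dumps(state))

def resume_from(path: pathlib.Path) -> int:
    # Return the next step to execute: one past the highest completed step.
    if not path.exists():
        return 1
    steps = json.loads(path.read_text())["steps"]
    return max(map(int, steps)) + 1 if steps else 1

# Usage: steps 1-2 completed, then the run failed mid-flight.
state_file = pathlib.Path(tempfile.mkdtemp()) / "run_r_abc123.json"
save_checkpoint(state_file, "r_abc123", 1, {"plan": "refactor auth"})
save_checkpoint(state_file, "r_abc123", 2, {"file": "auth.ts"})
print(resume_from(state_file))  # 3: completed steps are skipped on restart
```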

Circuit Breakers and Retry Budgets

Naive retry logic is one of the most common causes of agent cost blowups. An agent encounters a transient error, retries with exponential backoff, and the retries themselves trigger more errors — or succeed but at enormous token cost because the retry includes progressively longer context windows.

Circuit breakers prevent this cascade. After a configurable number of failures within a time window, the circuit opens and all subsequent requests to that provider fail fast without actually being sent. This protects both your budget and the degraded provider from retry storms.

Retry budgets complement circuit breakers by capping the total cost of retries. Instead of limiting the number of retries (which does not account for cost variance between attempts), you set a maximum dollar amount for retry attempts per run. Once the retry budget is exhausted, the run fails cleanly with a clear error.

Router One detects provider degradation using EWMA-based latency scoring and reroutes traffic within milliseconds — without any changes to your application code. The scoring adapts continuously to real-time conditions, so transient blips are smoothed out while genuine degradation triggers immediate action.

For the technical details of how EWMA-based routing works, see AI Model Routing Explained.

Production Readiness Checklist

Before deploying agents to production, validate each of these items. They are ordered roughly by priority — the first five are non-negotiable, the remaining five are strongly recommended.

  • Every LLM call is traced with tokens in, tokens out, cost, latency, and model name
  • Traces are aggregated by run/session, not just individual requests — you can see the full lifecycle of a single agent action
  • Cost dashboards show spend by project, API key, model, and time window
  • Budget limits are enforced at organization, project, and API key levels with real-time evaluation
  • Soft alerts fire at 80% budget consumption to give teams time to investigate before hard cutoffs
  • At least one fallback model/provider is configured for every primary model in use
  • Failover is tested, not just configured — run a chaos test that simulates provider failure and verify traffic reroutes correctly
  • Multi-step agents persist state for resume capability — a failure at step 7 does not require restarting from step 1
  • Retry logic has a budget cap — retries cannot consume more than a configured dollar amount per run
  • You can answer "how much did agent X cost yesterday?" in under 30 seconds using your dashboard

If you check all ten items, your agent system is ready for production traffic. If you are missing even one of the first five, you are operating with significant blind spots that will cause problems at scale.

Router One provides all three pillars — observability, cost control, and fault recovery — out of the box. If you are building agents for production, start at router.one. Browse available models at /models, or see how teams integrate with coding agents at /use-cases/claude-code.

Conclusion

The gap between a demo agent and a production agent is not in the prompts or the model selection — it is in the operational infrastructure around it. A demo agent calls an LLM and returns a result. A production agent calls an LLM through a system that traces every step, enforces budget limits at multiple levels, and automatically recovers from provider failures.

Calling LLMs directly is a black box. You get a response, but you have no ledger of what happened, no trace of how the agent reasoned, and no governance over what it spent. Running agents through proper infrastructure gives you a ledger, a trace, and governance — the same accountability you expect from every other production system.

As agents become more autonomous — making more decisions, calling more tools, running longer workflows — these three pillars become more critical, not less. The agent that runs five LLM calls today might run fifty tomorrow as capabilities expand. The infrastructure that supports it needs to scale accordingly.

Build the operational foundation now. Your agents — and your budget — will thank you.