Single-agent loops are easy to start with and easy to grow out of. The first time you watch one agent try to research a topic, write a report, fact-check itself, and produce final output in one chat — and watch it get distracted, lose context, and consume 200K tokens — you understand why people start splitting work across multiple agents.
Multi-agent orchestration is not one pattern; it's four. Sequential, parallel, hierarchical, and human-in-the-loop each solve different problems and have different failure modes. This post walks through when to use each, what to watch for in production, and how the orchestration layer (the part that holds Run/Step state, retries, budgets, and traces) glues them together.
Why Split at All
Before the patterns, the motivating constraint. Splitting work across agents pays off when one of these things is true:
- Different roles want different prompts. A "research" agent's system prompt is different from a "writing" agent's. Putting both into one prompt doubles its size and weakens both.
- Different roles want different models. Cheap fast model for triage, expensive deep model for the hard step. Routing decisions get cleaner when each step has its own model identity.
- Steps want to run in parallel. Three independent searches should not run sequentially.
- Steps want different tool access. A code-review agent should not have web search; a research agent should not have shell access.
- You want to checkpoint. Stopping after step 1 to inspect, then resuming step 2 — much easier when steps are explicit agent boundaries.
If none of those apply, a single-agent loop is fine. Most production systems eventually grow into at least one of these.
Pattern 1: Sequential
The simplest extension of single-agent. Output of agent A becomes input of agent B.
[user input] → Researcher → Outline → Writer → Draft → Reviewer → Final
When to use:
- Writing tasks where research, drafting, and review are genuinely different skills.
- Pipelines where each step's output is small and well-shaped (a JSON object, a list, a paragraph).
- Anything where you want to inspect intermediate output for debugging.
Failure modes:
- Information loss between agents — Researcher's tone disappears in the Writer's output. Mitigation: pass more context downstream than you think you need; let the next agent decide what to ignore.
- Compounding errors — a wrong fact in step 1 propagates. Mitigation: add a verification step between research and writing for high-stakes outputs.
- Cost stacking — three agents at $X each cost $3X per task. Use cheaper models for narrower roles.
A real production content pipeline at Router One scale uses exactly this shape: a researcher agent on Gemini 3.1 Pro (long context, cheap), a writer on Claude Sonnet 4.6 (good prose), and a reviewer on Claude Opus 4.7 (high judgment). Three different models, three different prompts, one composed task.
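In code, the sequential shape is just awaited calls in order. Here is a minimal sketch, assuming a hypothetical `callModel` helper against a generic gateway endpoint; the URL, model names, and response shape are placeholders, not a real API:

```ts
// Hypothetical helper: POST one completion request to your gateway.
// Endpoint, payload, and response shape are placeholders.
async function callModel(model: string, system: string, input: string): Promise<string> {
  const res = await fetch("https://gateway.example.com/v1/complete", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ model, system, input }),
  });
  if (!res.ok) throw new Error(`step failed: ${res.status}`);
  const { output } = await res.json();
  return output as string;
}

// Sequential pipeline: each agent gets its own prompt and its own model,
// and each step's output is inspectable before the next step runs.
async function contentPipeline(topic: string): Promise<string> {
  const research = await callModel(
    "research-model", // long-context, cheap
    "You are a researcher. Return findings as compact bullets with sources and caveats.",
    topic,
  );
  const draft = await callModel(
    "writer-model",
    "You are a writer. Turn the findings into a coherent draft. Preserve caveats.",
    research, // pass more context downstream than you think you need
  );
  return callModel(
    "reviewer-model",
    "You are a reviewer. Fix factual and structural problems; return the final text.",
    draft,
  );
}
```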
Pattern 2: Parallel (Fan-Out / Fan-In)
When a task has independent sub-tasks, run them at the same time and combine the results.
                 ┌─→ Search A ─┐
[user input] ────┼─→ Search B ─┼─→ Synthesizer → Final
                 └─→ Search C ─┘
When to use:
- Multi-source research (search A, search B, internal docs, etc.).
- Multi-perspective analysis (review the same code as a security expert, a performance expert, a reliability expert).
- Independent transformations of the same data.
Failure modes:
- Wasteful concurrency — running 10 parallel searches when one would have answered the question. Mitigation: have a triage step decide if fan-out is needed.
- Combination explosion — synthesizer gets so much input it loses focus. Mitigation: have each parallel agent return a structured, compact result rather than a long blob.
- Cost — parallelism multiplies fixed-cost calls (system prompts, schemas). Mitigation: cache the system prompt at the routing layer; budget each parallel branch.
Parallel patterns are where prompt caching matters. If three agents share a long system prompt, caching it once means only the divergent input is paid for per branch, which can come out 5-10x cheaper than the naive version.
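A sketch of fan-out/fan-in with `Promise.all`, reusing the hypothetical `callModel` helper from the sequential sketch; model names, the shared prompt, and the JSON shape are illustrative:

```ts
// Assumed to exist as in the sequential sketch above.
declare function callModel(model: string, system: string, input: string): Promise<string>;

// The long prompt is identical across branches, so a caching gateway pays
// for it once; only the short per-branch suffix diverges.
const SHARED_REVIEW_PROMPT =
  "You are a code reviewer. Return at most 5 findings as a JSON array of " +
  '{ "severity": "low" | "med" | "high", "finding": string } objects.';

async function parallelReview(code: string): Promise<string> {
  const perspectives = ["security", "performance", "reliability"];
  // Fan-out: three perspectives on the same input, run concurrently.
  const reviews = await Promise.all(
    perspectives.map((p) =>
      callModel("cheap-fast-model", SHARED_REVIEW_PROMPT, `Perspective: ${p}\n\n${code}`),
    ),
  );
  // Fan-in: the synthesizer sees compact structured results, not transcripts.
  return callModel(
    "strong-model",
    "Synthesize these review findings into one prioritized report.",
    reviews.map((r, i) => `## ${perspectives[i]}\n${r}`).join("\n\n"),
  );
}
```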
Pattern 3: Hierarchical (Manager + Workers)
A manager agent decomposes a task and dispatches sub-tasks to specialist workers. Workers report back; manager decides whether to dispatch more, replan, or finish.
[user input] → Manager
                  ├─→ Worker 1 (returns)
                  ├─→ Worker 2 (returns)
                  └─→ Manager replans
                          └─→ Worker 3 (returns) → Final
When to use:
- Tasks where the structure is unknown until the manager has thought about them.
- Long-horizon work where a top-level plan has to adjust based on intermediate findings.
- Code agents that work on a feature spanning multiple files: manager decides which files; workers edit them.
Failure modes:
- Manager-as-bottleneck — every worker output flows through the manager, which builds up context and slows down. Mitigation: keep manager prompts short and have workers produce structured summaries, not full transcripts.
- Loop divergence — manager keeps dispatching workers without converging. Mitigation: enforce a maximum step count and a budget; require each worker to produce progress, not just activity.
- Lost provenance — at step 14, you don't remember why step 7 happened. Mitigation: log a structured trace of every dispatch with the manager's reasoning.
This is where Router One's L5 orchestration layer earns its keep. The Run/Step abstraction means each manager dispatch is a first-class step with its own tokens, latency, and cost; you can replay the trace, pause the run, inject a human review, or kill the run when the budget hits its ceiling. See How to run AI agents in production for the production wrapper around this.
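A sketch of the manager loop with a hard step ceiling. The JSON protocol between manager and orchestrator is an assumption (a real system would more likely use tool calls or structured output), and the model and worker names are placeholders:

```ts
// Assumed to exist as in the sequential sketch above.
declare function callModel(model: string, system: string, input: string): Promise<string>;

type Dispatch =
  | { done: false; worker: string; task: string }
  | { done: true; answer: string };

async function hierarchicalRun(goal: string, maxSteps = 8): Promise<string> {
  const history: string[] = [];
  for (let step = 0; step < maxSteps; step++) {
    // The manager sees structured summaries, not full worker transcripts.
    const decision: Dispatch = JSON.parse(
      await callModel(
        "manager-model",
        'Plan the next step. Reply {"done":false,"worker":...,"task":...} or {"done":true,"answer":...}.',
        `Goal: ${goal}\nProgress so far:\n${history.join("\n")}`,
      ),
    );
    if (decision.done) return decision.answer;
    const result = await callModel(
      "worker-model",
      `You are the ${decision.worker} specialist. Return a compact structured summary.`,
      decision.task,
    );
    // Provenance: log why this step happened, not just that it happened.
    history.push(`[step ${step}] ${decision.worker}: ${decision.task} -> ${result}`);
  }
  throw new Error("budget exceeded: manager did not converge within maxSteps");
}
```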
Pattern 4: Human-in-the-Loop
At specific points, the agent pauses and waits for a human decision before continuing.
[user input] → Agent → [proposes plan] → human approves → Agent → ...
When to use:
- High-stakes decisions (sending email, deploying code, charging customers).
- Tasks with non-determinism the model cannot resolve (which tone? which feature first?).
- Internal tools where the human is the actual user.
- Anything regulated.
Failure modes:
- Blocking forever — agent waits for an approval that never comes. Mitigation: timeout + escalation policy.
- "Approve" theatrics — humans rubber-stamp without reading. Mitigation: only ask for approval at meaningful gates.
- Lost context — when the human comes back hours later, the agent has lost cached state. Mitigation: persistent runs with explicit pause/resume support; budget windows that survive across pauses.
Pause/resume is one of the harder things to implement well. It's fragile if you store agent state in memory, durable if you persist Run/Step state and resume by re-reading it. The Router One L5 layer handles this: every run can be paused, resumed, or cancelled, with a complete audit trail of what happened in between.
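A sketch of the durable approach. The run-state shape is an assumption, and the Map stands in for a database; the point is that nothing about the pending approval lives only in process memory:

```ts
type RunState = {
  id: string;
  status: "running" | "awaiting_approval" | "done" | "cancelled";
  steps: { name: string; output: string }[];
  pendingProposal?: string;
};

// Stand-in for durable storage; in production this is a database so a
// different process can resume the run hours later.
const store = new Map<string, RunState>();

// The agent proposes a plan, persists it, and stops holding state in memory.
// A timeout + escalation policy would also schedule a reminder here.
function proposeAndPause(id: string, proposal: string): void {
  const run = store.get(id)!;
  run.pendingProposal = proposal;
  run.status = "awaiting_approval";
  store.set(id, run);
}

// Called whenever the human responds; the run resumes from persisted state.
function resolveApproval(id: string, approved: boolean): void {
  const run = store.get(id)!;
  if (run.status !== "awaiting_approval") throw new Error("nothing pending");
  if (approved) {
    run.steps.push({ name: "approved-plan", output: run.pendingProposal! });
    run.status = "running"; // the next worker step picks up from here
  } else {
    run.status = "cancelled";
  }
  store.set(id, run);
}
```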
Choosing Among the Patterns
A rough decision flow:
| Question | Pattern |
|---|---|
| Is it a pure pipeline, where each step has a clear input → output? | Sequential |
| Are sub-tasks truly independent and runnable concurrently? | Parallel |
| Is the structure unknown until an agent thinks about it, and likely to change mid-flight? | Hierarchical |
| Is there a step where a human must decide? | Human-in-the-loop (combine with above) |
In practice, most production systems combine two or three patterns. A code review agent might be hierarchical (a top agent dispatches review by file), parallel (three specialist reviewers per file), and human-in-the-loop at "ready to merge."
Cost and Latency in Multi-Agent Systems
The two things that consistently bite teams scaling multi-agent systems:
Hidden cost compounding. Single-agent cost is roughly linear in conversation length. Multi-agent cost is the sum of all agents' inputs and outputs, including system prompts duplicated across calls. A naive 3-step sequential pipeline can be 3-5× the cost of a single-agent equivalent.
Mitigations:
- Aggressive prompt caching at the gateway layer. Router One caches system prompts and shared context across requests; without that, system prompt cost dominates (see the back-of-envelope sketch after this list).
- Cheap-then-expensive routing. Use Haiku/Flash for triage and structuring; reserve Opus/Pro for the hard step.
- Truncate as you go. Pass forward what's needed, not full transcripts.
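A back-of-envelope illustration of where the caching savings come from. Every number here is made up for the arithmetic; real prices and token counts vary by model:

```ts
// Hypothetical pricing and token counts, for illustration only.
const PRICE_PER_1K_INPUT = 0.003; // $ per 1K input tokens
const systemPromptTokens = 4_000; // shared system prompt, resent per call
const taskTokens = 1_000;         // divergent per-step input
const steps = 3;

// Naive: every step pays for the full system prompt again.
const naive = (steps * (systemPromptTokens + taskTokens) * PRICE_PER_1K_INPUT) / 1000;

// Cached: the system prompt is paid roughly once (cache reads are not free,
// but are far cheaper than full-price tokens; ignored here for simplicity).
const cached = ((systemPromptTokens + steps * taskTokens) * PRICE_PER_1K_INPUT) / 1000;

console.log({ naive, cached }); // { naive: 0.045, cached: 0.021 }
```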
Latency stacking. Sequential agents add their latencies; a 3-step pipeline at 5s/step is 15s end-to-end. Users feel that.
Mitigations:
- Parallelize wherever the data dependency allows.
- Stream from the final agent so the user sees progress (sketched after this list).
- Pre-warm caches when you know a sequential step is coming.
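A sketch of streaming only the final step; the endpoint and wire format are placeholders for whatever your gateway actually exposes:

```ts
// Stream the final agent's output chunk-by-chunk to the caller.
// Endpoint, query flag, and chunk format are placeholders.
async function streamFinalStep(
  body: object,
  onChunk: (text: string) => void,
): Promise<void> {
  const res = await fetch("https://gateway.example.com/v1/complete?stream=true", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(body),
  });
  if (!res.ok || !res.body) throw new Error(`stream failed: ${res.status}`);
  const reader = res.body.getReader();
  const decoder = new TextDecoder();
  for (;;) {
    const { done, value } = await reader.read();
    if (done) break;
    onChunk(decoder.decode(value, { stream: true }));
  }
}
```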
The AI model routing explanation covers per-step model selection. The LLM cost reduction guide covers gateway-level cost levers.
Observability: Trace, Don't Print
Single-agent debugging is "read the conversation." Multi-agent debugging is impossible without a structured trace. Your orchestration layer should give you:
- Every step's model, prompt, output, tokens, latency, and cost.
- The parent-child relationship between dispatches.
- The reason a step happened (manager's plan, error retry, human approval).
- The replay capability — given a trace, can you reproduce the path?
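Concretely, a step record might look like the following sketch; field names are assumptions for illustration, not Router One's actual schema:

```ts
type StepTrace = {
  runId: string;
  stepId: string;
  parentStepId: string | null;                  // parent-child between dispatches
  reason: "plan" | "retry" | "human_approval";  // why the step happened
  model: string;
  prompt: string;
  output: string;
  tokens: { input: number; output: number };
  latencyMs: number;
  costUsd: number;
};

// Replay: outputs are recorded in the trace, so re-walking the path is
// deterministic and does not re-call any model.
function replay(trace: StepTrace[]): void {
  for (const s of trace) {
    console.log(
      `${s.stepId} <- ${s.parentStepId ?? "root"} [${s.reason}] ` +
        `${s.model} ${s.latencyMs}ms $${s.costUsd.toFixed(4)}`,
    );
  }
}
```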
This is exactly the L7 observability layer that Router One ships out of the box. For more on what observability means at agent scale see How to run AI agents in production.
Failure Recovery
Multi-agent systems fail differently from single-agent loops:
- Partial failure: 9 of 10 parallel branches succeed, one times out. Decide upfront: aggregate what you have, or fail the whole run? In production, partial aggregation with a surfaced warning is usually right.
- Mid-pipeline failure: step 4 of 7 fails. Retry that step alone; don't restart the whole run. Requires Run/Step persistence.
- Cascade failure: an upstream agent's bad output corrupts every downstream step. Cheap mitigation: schema validation between steps, so the next agent rejects malformed input rather than reasoning about garbage.
A useful operational rule: every step must declare what counts as "success" before being dispatched. The orchestrator can then enforce it.
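A sketch combining two of these ideas: tolerate partial failure with `Promise.allSettled`, and check each branch against its declared success predicate before aggregating. The `Step` shape and names are illustrative:

```ts
// Each step declares, before dispatch, what counts as success.
type Step = {
  name: string;
  run: () => Promise<string>;
  isSuccess: (output: string) => boolean; // e.g. schema validation
};

// Fan out, tolerate partial failure, and validate each branch's output
// before it reaches the aggregator.
async function runBranches(steps: Step[]): Promise<{ ok: string[]; failed: string[] }> {
  const settled = await Promise.allSettled(steps.map((s) => s.run()));
  const ok: string[] = [];
  const failed: string[] = [];
  settled.forEach((result, i) => {
    if (result.status === "fulfilled" && steps[i].isSuccess(result.value)) {
      ok.push(result.value);
    } else {
      failed.push(steps[i].name); // surface as a warning; don't fail the run
    }
  });
  return { ok, failed };
}
```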
FAQ
Should I use LangGraph, Mastra, Inngest, or build it myself? Most teams start with one of the frameworks. LangGraph for graph-shaped flows, Inngest for event-driven, Mastra for typed agent definitions. The framework choice matters less than getting the orchestration concepts right; you can swap frameworks later.
Do I need Router One specifically for orchestration? Frameworks handle in-process orchestration. Router One handles the cross-cutting concerns: per-step routing decisions, gateway-level caching, persistent Run/Step state, traces, budgets, observability. They compose: your framework calls Router One per agent step.
How many agents is too many? Above ~10 distinct agent roles, the system becomes hard to reason about. If you're getting there, look for opportunities to merge — two specialists with overlapping descriptions can usually become one.
What about agentic frameworks like AutoGPT? These are early prototypes of hierarchical patterns. They popularized the idea but their default loop dynamics tend to diverge in production. The patterns above are what survives.
Streaming through multiple agents? Most teams stream only the final agent's output. Intermediate agents produce structured outputs that don't benefit from streaming. The exception is very long sequential pipelines where the user benefits from seeing progress; in that case, stream short progress messages between steps.
Conclusion
Multi-agent systems are not magic; they're a way to encode different roles, different models, and different access boundaries explicitly. Sequential, parallel, hierarchical, and human-in-the-loop are the four shapes that hold up. Cost and latency compound; observability and Run/Step persistence are no longer optional.
For the gateway-level support that makes multi-agent systems affordable in production, see how Router One's model routing keeps each step on the right model and how skills and MCP make individual agents in the system materially more useful.