AI & Industry

AI Agent Orchestration at Scale — What Actually Works in Production

Patterns and hard lessons from running multi-agent systems at 80M+ user scale: routing, fallback chains, context management, and why most agent architectures fail.

Laurent Goudet · February 21, 2026 · 10 min read

Everyone is building AI agents. Almost nobody is building them in a way that survives contact with production traffic. I know because I’ve spent the last year wiring up multi-agent systems at Freelancer.com — MCP servers, agentic workflows, LLM routing — serving a platform of 80+ million users. What follows are the patterns that actually work and the mistakes I’ve watched teams (including mine) make along the way.

This isn’t a tutorial. It’s an architecture post. If you’re evaluating whether to build agentic systems or trying to figure out why yours keeps breaking, this is for you.

I. The God Agent Anti-Pattern

The first thing every team builds is a single agent with a massive system prompt that tries to do everything. It can query databases, write code, send emails, search documents, and update project management tools. It feels magical in a demo. It falls apart in production within a week.

The failure mode is predictable: as the system prompt grows past 4,000 tokens, the model starts ignoring instructions. Tool calls become unreliable. The agent picks the wrong tool 15% of the time, which means roughly one in seven user interactions produces garbage. And because everything runs through one context window, a failure in the email tool corrupts the state for the database query that follows.

The fix is decomposition. Instead of one god agent, you build specialized agents with narrow responsibilities and a lightweight router that dispatches to the right one. This is the same pattern that made microservices work — not because small things are inherently better, but because failure boundaries matter.

Decompose, Don't Consolidate

A monolithic agent with 20 tools and a 6,000-token system prompt will fail unpredictably.

Specialized agents with 3-5 tools each, coordinated by a router, are dramatically more reliable.

The router itself can be an LLM (a cheap, fast one like Haiku) or a deterministic classifier. What matters is that each agent operates in a clean context with a focused task.
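To make the decomposition concrete, here is a minimal Python sketch of a specialized-agent registry. The agent names, prompts, and tool names are illustrative placeholders, not our actual registry; the point is the enforced constraint of a narrow toolset per agent.

```python
from dataclasses import dataclass

# Hypothetical agents and tools for illustration only.
@dataclass(frozen=True)
class AgentSpec:
    name: str
    system_prompt: str       # kept short and focused
    tools: tuple[str, ...]   # narrow responsibility: 3-5 tools, never 20

    def __post_init__(self):
        # Enforce the decomposition rule at construction time.
        if not 3 <= len(self.tools) <= 5:
            raise ValueError(f"{self.name}: expected 3-5 tools, got {len(self.tools)}")

REGISTRY = {spec.name: spec for spec in [
    AgentSpec("sql", "Answer questions by querying the database.",
              ("run_query", "list_tables", "describe_table")),
    AgentSpec("email", "Draft and send emails on the user's behalf.",
              ("draft_email", "send_email", "search_contacts")),
]}

def dispatch(agent_name: str) -> AgentSpec:
    """The router hands each request to exactly one specialized agent."""
    return REGISTRY[agent_name]
```

Making the 3-5 tool limit a hard constructor error, rather than a convention, is the cheapest way to stop a god agent from quietly growing back.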

Agent Architecture

Four-layer architecture: Router → Agents → Orchestrator → MCP

  • Router (Layer 1): hybrid deterministic + LLM classifier
  • Specialized agents (Layer 2): 3–5 tools each, focused system prompts
  • Orchestrator (Layer 3): workflow state, context, retries
  • MCP servers (Layer 4): typed tool interfaces for external systems

II. The Routing Layer

Routing is the most underrated part of an agentic system. The router decides which agent handles a request, and if it gets this wrong, nothing downstream matters. We’ve iterated through three generations of routing at this point.

Generation 1: keyword matching. If the user mentions “database” or “query”, route to the SQL agent. This works for about 60% of requests. The other 40% are ambiguous (“show me the latest numbers” — is that a database query or a dashboard request?).

Generation 2: LLM-based classification. A fast model (Haiku) reads the request and outputs a routing decision. This gets you to ~90% accuracy, but adds 200-400ms of latency to every request. Fine for async workflows, painful for interactive ones.

Generation 3: hybrid. Deterministic rules handle the obvious cases (structured commands, explicit tool mentions). The LLM classifier handles the ambiguous rest. A confidence threshold gates the decision — if the classifier is below 70% confidence, it asks the user to clarify instead of guessing. This gets us to ~97% routing accuracy with p50 latency under 100ms for the common cases.
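Here is a minimal sketch of that hybrid router. The rules, agent names, and the `classify_with_llm` stub are placeholders; a real system would call a fast model like Haiku where the stub sits.

```python
import re

# Deterministic rules handle the obvious cases: explicit commands,
# unambiguous keywords. (Patterns are illustrative.)
RULES = [
    (re.compile(r"^/sql\b|SELECT\s", re.I), "sql"),
    (re.compile(r"\bemail\b|\bsend .* to\b", re.I), "email"),
]

def classify_with_llm(text: str) -> tuple[str, float]:
    """Stand-in for a cheap LLM classifier returning (agent, confidence)."""
    # Hypothetical stub: a real system calls the model here.
    if "numbers" in text:
        return "dashboard", 0.55   # ambiguous request -> low confidence
    return "dashboard", 0.90

def route(text: str, threshold: float = 0.70) -> str:
    for pattern, agent in RULES:                 # fast deterministic path
        if pattern.search(text):
            return agent
    agent, confidence = classify_with_llm(text)  # LLM fallback
    if confidence < threshold:
        return "clarify"                         # ask the user, don't guess
    return agent
```

The confidence gate is the important part: below the threshold, the router returns a clarification request instead of routing on a coin flip.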

Hybrid Routing Flow

Deterministic rules first, LLM classifier for the ambiguous rest:

  • User input hits the deterministic rules first; on a match, route straight to the agent.
  • On no match, the LLM classifier decides; above 70% confidence, route to the agent.
  • Below 70% confidence, ask the user to clarify.

III. Context Management Is the Hard Problem

Context is the silent killer of agentic systems. Every agent call consumes context window, and context windows are finite. At 200K tokens, you’d think there’s plenty of room. In practice, a multi-step workflow with tool results can burn through 50K tokens in four turns.

The naive approach is to dump everything into the context: full conversation history, all tool results, all intermediate reasoning. This works for the first few interactions and then quality degrades as the model starts losing track of what’s relevant. We’ve measured this — accuracy on our internal benchmarks drops by 12% when context exceeds 40K tokens, even when all the needed information is present.

What works is aggressive context pruning. Each agent gets only what it needs: the current task, relevant tool results, and a compressed summary of prior steps. The orchestration layer maintains a structured state object that tracks what’s been done and what’s pending, and each agent receives a focused slice of it. Think of it as the agentic equivalent of passing function arguments instead of global variables.
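A minimal sketch of that structured state object, with field names invented for illustration:

```python
from dataclasses import dataclass, field

@dataclass
class WorkflowState:
    """Structured state the orchestrator keeps OUTSIDE any context window."""
    task: str
    completed_steps: list[str] = field(default_factory=list)
    tool_results: dict[str, str] = field(default_factory=dict)

    def slice_for(self, agent: str, relevant_tools: list[str],
                  max_summary_steps: int = 3) -> dict:
        """Build the minimal context an agent needs: function arguments,
        not global variables."""
        return {
            "task": self.task,
            # compressed summary of prior steps, not the full transcript
            "summary": "; ".join(self.completed_steps[-max_summary_steps:]),
            # only the tool results this agent actually needs
            "tool_results": {k: v for k, v in self.tool_results.items()
                             if k in relevant_tools},
        }
```

Each agent receives only its slice; the full history never enters any single context window.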

Context Is Not Free

Even with 200K token windows, quality degrades well before you hit the limit. Our measurements show 12% accuracy drop past 40K tokens in multi-tool workflows.

Treat context like memory allocation: give each agent the minimum it needs, prune aggressively between steps, and maintain structured state outside the LLM’s context window.

Context Management

Full context bloat vs. pruned context per agent:

Full context (~50K tokens):
  • Full conversation history
  • All tool results
  • All intermediate reasoning
  • Prior agent outputs

Pruned context (~8K tokens per agent):
  • Current task only
  • Relevant tool results
  • Compressed summary
  • Structured state slice

Past 40K tokens, accuracy drops by 12%.

IV. MCP: The Tool Integration Layer

Model Context Protocol (MCP) is what makes agent orchestration practical at scale. Before MCP, every tool integration was a custom adapter: write a function, define the schema, handle errors, serialize results. For five tools, this is manageable. For fifty, it’s a maintenance nightmare.

MCP standardizes the interface between agents and tools. An MCP server exposes typed tools with descriptions, parameter schemas, and structured responses. The agent doesn’t need custom code for each tool — it reads the MCP schema and knows how to call it. We run MCP servers for Phabricator (code review and task management), internal databases, deployment pipelines, and monitoring dashboards.

The real win isn’t the protocol itself — it’s composability. When you add a new MCP server, every agent in the system can immediately use it. When you fix a bug in a tool’s error handling, every agent benefits. This is the same leverage that REST APIs gave web applications: a standard interface that makes integration cheap.
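To illustrate the shape of the idea (this is not the real MCP SDK, just a minimal stand-in): each tool publishes a name, a description, and a parameter schema, so any agent can discover and call it without custom adapter code.

```python
from dataclasses import dataclass
from typing import Callable

# Minimal illustration of MCP-style typed tools -- NOT the actual MCP SDK.
@dataclass
class Tool:
    name: str
    description: str
    schema: dict            # JSON Schema describing the parameters
    handler: Callable[..., dict]

class ToolServer:
    def __init__(self):
        self._tools: dict[str, Tool] = {}

    def register(self, tool: Tool) -> None:
        self._tools[tool.name] = tool

    def list_tools(self) -> list[dict]:
        """What an agent reads to learn how to call each tool."""
        return [{"name": t.name, "description": t.description, "schema": t.schema}
                for t in self._tools.values()]

    def call(self, name: str, **kwargs) -> dict:
        return self._tools[name].handler(**kwargs)

server = ToolServer()
server.register(Tool(
    name="get_task",
    description="Fetch a task by id",
    schema={"type": "object",
            "properties": {"task_id": {"type": "integer"}},
            "required": ["task_id"]},
    handler=lambda task_id: {"task_id": task_id, "status": "open"},
))
```

The composability payoff falls out of the pattern: registering one more tool on the server makes it discoverable to every agent that reads `list_tools()`.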

V. Fallback Chains and Failure Handling

LLM APIs go down. It happens more often than you’d like — rate limits, timeouts, transient 500s, model degradation during provider incidents. If your agent system has a single point of failure at the LLM layer, your uptime is capped at the provider’s uptime, which is not 100%.

We run a fallback chain: primary model → secondary model → simplified prompt with a smaller model → graceful degradation message. Each step in the chain has its own timeout and circuit breaker. If the primary model (Opus) times out after 30 seconds, we fall back to Sonnet with a simplified prompt. If Sonnet also fails, we try Haiku with a minimized prompt. If everything is down, the user gets a clear error message, not a hung request.

The key insight is that a slightly worse answer delivered in 2 seconds is better than a perfect answer that times out. Most users would rather get a Haiku-quality response immediately than wait 45 seconds for an Opus response that might never arrive. The fallback chain encodes this preference explicitly.
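A sketch of the chain in Python. The `call_model` stub and the prompt variants are placeholders; a real implementation would enforce each step's timeout at the HTTP client level and wrap each step in a circuit breaker.

```python
# (model, per-step timeout in seconds, prompt variant) -- values from the text.
CHAIN = [
    ("opus",   30.0, "full prompt"),
    ("sonnet", 15.0, "simplified prompt"),
    ("haiku",   5.0, "minimal prompt"),
]

def call_model(model: str, prompt: str) -> str:
    """Placeholder for an actual LLM API call."""
    if model == "opus":
        raise TimeoutError("provider incident")   # simulate an outage
    return f"{model} answered: {prompt}"

def answer(prompt_variants: dict[str, str]) -> str:
    for model, timeout, variant in CHAIN:
        try:
            # A real client enforces `timeout` on the request itself.
            return call_model(model, prompt_variants[variant])
        except Exception:
            continue                              # fall through to next model
    # Every model failed: a clear error, never a hung request.
    return "Sorry, the assistant is temporarily unavailable."
```

Note that each step gets a progressively simpler prompt, not just a cheaper model: the fallback degrades both cost and ambition together.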

Fallback Chain

Progressive degradation: best model → fastest model → error:

  • Opus, 30s timeout
  • on failure: Sonnet, 15s timeout
  • on failure: Haiku, 5s timeout
  • on failure: graceful error with a clear message, no hung request

VI. Observability: You Can’t Fix What You Can’t See

Debugging an agentic system without observability is like debugging a distributed system without logs — theoretically possible, practically impossible. Every agent call, every tool invocation, every routing decision needs to be traced.

We instrument three things: traces (the full request lifecycle from user input to final response, including every agent hop and tool call), evaluations (automated quality checks on agent outputs, comparing against known-good examples), and cost tracking (token usage per agent, per workflow, per user — because a runaway agent loop can burn through $100 in tokens before anyone notices).

The most useful metric we track is tool call success rate per agent. When this drops below 95% for any agent, it’s usually a sign that the system prompt has drifted or the underlying tool has changed its behavior. This single metric has caught more issues than all our other monitoring combined.
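A minimal sketch of that per-agent success-rate tracking. The class name, window size, and minimum-sample guard are my own illustrative choices, not our production monitor.

```python
from collections import defaultdict, deque

class ToolCallMonitor:
    """Rolling tool-call success rate per agent; alert below a threshold."""
    def __init__(self, window: int = 200, threshold: float = 0.95):
        self.threshold = threshold
        # One fixed-size rolling window of booleans per agent.
        self._calls: dict[str, deque] = defaultdict(lambda: deque(maxlen=window))

    def record(self, agent: str, success: bool) -> None:
        self._calls[agent].append(success)

    def success_rate(self, agent: str) -> float:
        calls = self._calls[agent]
        return sum(calls) / len(calls) if calls else 1.0

    def agents_below_threshold(self, min_samples: int = 20) -> list[str]:
        """Agents whose recent success rate warrants an alert."""
        return [agent for agent, calls in self._calls.items()
                if len(calls) >= min_samples
                and sum(calls) / len(calls) < self.threshold]
```

The rolling window matters: a lifetime average hides a tool that broke yesterday behind months of healthy history.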

The One Metric That Matters

Track tool call success rate per agent. When an agent’s success rate drops below 95%, something has changed — the system prompt, the tool’s behavior, or the nature of incoming requests.

This single metric catches prompt drift, tool API changes, and routing errors before they become user-facing incidents.

VII. The Architecture That Survived Production

After a year of iteration, here’s the architecture we’ve converged on. It’s not elegant. It’s not novel. It works.

Layer 1: Router. Hybrid deterministic + LLM classifier. Receives user input, decides which agent handles it. Stateless. Fast. If uncertain, asks for clarification.

Layer 2: Specialized agents. Each agent has a focused system prompt (under 2,000 tokens), 3-5 MCP tools, and a specific output format. Agents don’t talk to each other directly — they return structured results to the orchestrator.

Layer 3: Orchestrator. Maintains workflow state, manages context windows, handles multi-step tasks by chaining agent calls. Implements retry logic and fallback chains. This is the only stateful component.

Layer 4: MCP servers. Typed tool interfaces for all external systems. Each server is independently deployable and testable. Adding a new tool means deploying a new MCP server — no changes to any agent.

VIII. What I’d Do Differently

If I were starting over, three things would change:

Start with evaluations, not features. We built agents first and evaluations second. This meant we couldn’t measure whether changes improved quality until weeks after deploying them. Build your eval suite before your first agent.
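Even a trivial eval harness beats none. A sketch of the shape, with cases and the scoring rule invented for illustration:

```python
# Known-good examples, checked on every prompt or routing change.
EVAL_CASES = [
    {"input": "list open tasks for project 42", "expect_agent": "sql"},
    {"input": "email the client a status update", "expect_agent": "email"},
]

def run_evals(route_fn) -> float:
    """Return the fraction of cases the router gets right."""
    passed = sum(1 for case in EVAL_CASES
                 if route_fn(case["input"]) == case["expect_agent"])
    return passed / len(EVAL_CASES)

def keyword_router(text: str) -> str:
    """A deliberately naive baseline router to evaluate against."""
    return "email" if "email" in text else "sql"
```

The point is less the harness than the habit: every change to a prompt or a router runs against the same fixed cases, so a regression shows up as a number, not an anecdote weeks later.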

Version system prompts like code. System prompts are code. They should be in version control, reviewed in PRs, tested against eval suites, and deployed with rollback capability. We learned this after a “minor wording change” to a system prompt caused a 30% drop in one agent’s accuracy.

Budget tokens like you budget compute. Every agent call has a cost. Without explicit budgets, workflows silently become expensive. Set per-workflow token budgets, alert when they’re exceeded, and treat token cost as a first-class metric alongside latency and accuracy.
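A per-workflow budget can be as simple as a counter with a hard ceiling. This sketch fails loudly on overrun; a real system might alert and degrade instead, and the numbers here are arbitrary.

```python
class TokenBudget:
    """Per-workflow token budget; raises once the workflow exceeds it."""
    def __init__(self, limit: int):
        self.limit = limit
        self.used = 0

    def charge(self, agent: str, tokens: int) -> None:
        self.used += tokens
        if self.used > self.limit:
            # A runaway agent loop hits this long before it burns $100.
            raise RuntimeError(
                f"workflow over budget after {agent}: {self.used}/{self.limit} tokens")

    @property
    def remaining(self) -> int:
        return max(self.limit - self.used, 0)
```

Charging per agent call also gives you the per-agent cost breakdown for free, which is the same data the observability layer wants.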

IX. The Bottom Line

AI agent orchestration at scale is fundamentally an infrastructure problem, not an AI problem. The LLMs are powerful enough. The tools exist. What most teams lack is the discipline to treat agentic systems with the same rigor they’d apply to any distributed system: clear interfaces, failure handling, observability, and incremental deployment.

The teams that succeed aren’t the ones with the most sophisticated prompts. They’re the ones that build the most boring, reliable infrastructure around the model — and then iterate relentlessly on what the model does within those guardrails.

Frequently Asked Questions

Why do monolithic AI agents fail at scale?

Monolithic agents with large system prompts (4,000+ tokens) and 20+ tools lose coherence — tool call accuracy drops to ~85%, errors cascade across the shared context window, and debugging becomes impossible. Decomposing into specialized agents with 3-5 tools each, coordinated by a lightweight router, restores reliability.

What is the difference between a single LLM call and an agentic system?

A single LLM call is stateless: prompt in, completion out. An agentic system maintains state across multiple calls, uses tools, makes decisions about what to do next, and can recover from failures. The orchestration layer is what turns isolated LLM calls into a coherent workflow.

Why do agents need MCP servers instead of custom tool integrations?

MCP (Model Context Protocol) servers provide a standardized, typed interface for agents to interact with external systems. Without MCP, each tool integration requires custom adapter code — manageable for five tools, a maintenance nightmare for fifty. With MCP, adding a new tool means deploying a new server; every agent in the system can immediately use it.

How do you handle LLM failures in production agent systems?

Production agent systems use fallback chains: if the primary model fails or returns low-confidence output, the system automatically retries with a different model or a simplified prompt. Combined with circuit breakers and timeout policies, this achieves 99.9%+ effective uptime even when individual LLM providers have outages.

What is the biggest mistake teams make with AI agent architectures?

The biggest mistake is building a monolithic “god agent” that tries to handle everything. This fails because LLMs lose coherence with long contexts, errors cascade unpredictably, and the system becomes impossible to debug. The solution is decomposition: specialized agents with clear responsibilities, coordinated by a lightweight router.

Laurent Goudet

CTO at Freelancer.com

AI agents, networking, and infrastructure at scale


© 2026 Laurent Goudet · Bordeaux, France · lepro.dev
