The Coordination Tax Nobody Budgeted For
March 11, 2026 · uneasy.in/0601c7e
The math is simple and merciless. If each agent in a pipeline performs at 95% accuracy — generous, honestly — a ten-step chain succeeds about 60% of the time. That's the textbook version. The production version is worse, because errors don't merely accumulate. They cascade destructively. An agent that misinterprets its task at step three feeds corrupted context to step four, which amplifies the distortion at step five, and by step eight you're debugging entropy.
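The arithmetic is worth seeing plainly. A toy model, assuming each step succeeds independently (which, as noted, flatters production systems where errors cascade rather than merely accumulate):

```python
# Success probability of an n-step pipeline where each step
# independently succeeds with probability p. Real failures
# cascade, so production behaves worse than this model.
def chain_success(p: float, steps: int) -> float:
    return p ** steps

print(round(chain_success(0.95, 10), 2))  # prints 0.6
```

Ten steps at 95% each and the chain already fails two times in five, before any destructive cascade is accounted for.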
This is the coordination tax. Everyone building multi-agent systems pays it, and almost nobody budgets for it.
Anthropic published a detailed account of building their own multi-agent research system earlier this year, and the failure modes they describe are instructive. Their agents spawned fifty subagents for simple queries. They scoured the web endlessly for nonexistent sources. They duplicated each other's work. Token consumption ran 15x that of normal chat interactions — not because the system was doing fifteen times more useful work, but because coordination overhead grows faster than capability.
The numbers from production deployments are stark. Between 41% and 87% of multi-agent LLM systems fail in production, depending on whose survey you trust. Nearly 80% of those failures stem from specification ambiguity and coordination breakdowns. Not infrastructure. Not hallucination. Organizational problems — who does what, who talks to whom, who has authority to override.
I've written before about the orchestra-without-a-conductor problem in enterprise multi-agent deployments. The metaphor holds up uncomfortably well. Individual agents can be brilliant. The ensemble performance still falls apart without clear governance.
The framework landscape reflects this confusion. LangGraph offers deterministic graph execution but can't prevent runaway loops — one engineer burned four dollars in API costs on eleven revision cycles before adding a manual counter. CrewAI hits a ceiling the moment you need anything beyond straightforward sequential handoffs. AutoGen's auto speaker selection makes arbitrary decisions about which agent acts next, sometimes skipping critical steps entirely. These aren't bugs. They're design consequences of a problem space nobody has cleanly solved.
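The manual-counter fix that engineer landed on is worth sketching, because it generalizes. This is not LangGraph code — just a framework-agnostic illustration of capping a critique-and-revise loop, with hypothetical `critique` and `revise` callables standing in for agent calls:

```python
def revise_until_approved(draft, critique, revise, max_revisions=3):
    """Critique/revise loop with a hard iteration cap.

    `critique` returns None when the draft passes, else feedback;
    `revise` produces a new draft from feedback. The cap is the
    whole point: without it, a disagreeable critic loops forever
    (or, as in the anecdote, for eleven billable cycles).
    """
    for _ in range(max_revisions):
        feedback = critique(draft)
        if feedback is None:
            return draft
        draft = revise(draft, feedback)
    # Cap reached: return the best effort rather than keep burning tokens.
    return draft
```

The uncomfortable part is that this guardrail lives outside the agents entirely. The loop bound is a governance decision, not a model capability.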
A growing body of research suggests the answer might be simpler than we want it to be. As frontier models improve at long-context reasoning and tool use, the gap between single-agent and multi-agent performance narrows. One study found that a hybrid approach — try single-agent first, escalate to multi-agent only when needed — improved accuracy while cutting costs by up to 20%. Deloitte's 2026 survey is blunter: over 40% of agentic AI projects could be cancelled by 2027.
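The hybrid escalation pattern that study describes can be stated in a few lines. A minimal sketch, assuming a `single_agent`, a `multi_agent`, and an `is_confident` check, all hypothetical names for whatever your stack provides:

```python
def answer(query, single_agent, multi_agent, is_confident):
    """Try the cheap single-agent path first; escalate to the
    multi-agent pipeline only when the result fails a confidence
    check. The savings come from how rarely escalation fires."""
    result = single_agent(query)
    if is_confident(result):
        return result
    return multi_agent(query)
```

The design choice hiding in there is `is_confident`: a self-grading model, a verifier model, or a plain heuristic. Get that wrong in either direction and you pay twice or ship garbage.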
I keep returning to a line from Anthropic's engineering post: "agent-tool interfaces are as critical as human-computer interfaces." The hard part of multi-agent orchestration isn't writing agents. It's writing the contracts between them — the handoff protocols, the error boundaries, the authority hierarchies. Subagent architecture is contract law, not software engineering.
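If subagent architecture is contract law, the contracts deserve to be explicit objects, not vibes embedded in prompt text. A minimal sketch of a validated handoff, with a hypothetical schema (the field names here are illustrative, not from any framework):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Handoff:
    """A handoff contract between agents: the receiving agent
    validates before acting, instead of trusting raw text."""
    task_id: str
    instruction: str
    context: str
    authority: str  # e.g. "may_revise" or "read_only"

    def validate(self) -> None:
        if not self.task_id or not self.instruction:
            raise ValueError("handoff missing required fields")
        if self.authority not in {"may_revise", "read_only"}:
            raise ValueError(f"unknown authority level: {self.authority}")
```

A rejected handoff at step three is an error boundary. The same misinterpretation passed along as free text is the entropy you debug at step eight.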
The compounding error math doesn't care how clever your agents are individually. It only cares how well they coordinate. And coordination, it turns out, is exactly the thing large language models are worst at faking.
Sources:
- How We Built Our Multi-Agent Research System — Anthropic Engineering
- Why Your Multi-Agent System Is Failing: Escaping the 17x Error Trap — Towards Data Science
- Why Multi-Agent LLM Systems Fail — Augment Code
- AutoGen vs LangGraph vs CrewAI — DEV Community
- Unlocking Exponential Value with AI Agent Orchestration — Deloitte TMT Predictions