
Plutonic Rainbows

The Orchestra Without a Conductor

Gartner logged a 1,445% surge in multi-agent system inquiries between Q1 2024 and Q2 2025. That's not a typo. The number is absurd enough that it tells you something about where corporate attention has landed, even if it tells you very little about whether anyone has actually figured this out.

They haven't.

Full agent orchestration — where multiple specialised AI agents coordinate autonomously on complex tasks, handing off context, negotiating subtasks, recovering from failures without human intervention — remains aspirational. The pieces exist. The plumbing is getting built. But the thing itself, the seamless multi-agent workflow that enterprise slide decks keep promising, isn't here yet. Not in any form I'd trust with real work.

Here's where things actually stand. GitHub launched Agent HQ this week with Claude, Codex, and Copilot all available as coding agents. You can assign different agents to different tasks from issues, pull requests, even your phone. Anthropic's Claude Agent SDK supports subagents that spin up in parallel, each with isolated context windows, reporting back to an orchestrator. The infrastructure for coordinated work is plainly being assembled. I wrote about this trajectory a week ago — the session teleportation, the hooks system, the subagent architecture all pointing toward something more ambitious. That trajectory has only accelerated.
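To make that shape concrete, here is roughly what the orchestrator-and-subagents pattern looks like, reduced to plain Python rather than any particular vendor's SDK. Everything here is a stand-in: run_agent is a placeholder for a real model call, and the point is only that each worker keeps its own isolated context and hands a short summary back to the orchestrator.

```python
import asyncio

# Placeholder for a real model call. Each subagent keeps its own isolated
# context and returns only a compact summary to the orchestrator.
async def run_agent(name: str, task: str) -> str:
    context = [
        f"system: you are the {name} specialist",
        f"user: {task}",
    ]
    # ... a real implementation would call a model with `context` here ...
    return f"{name}: summary of '{task}' ({len(context)} context messages)"

async def orchestrate(tasks: dict[str, str]) -> list[str]:
    # Fan out: one subagent per task, run in parallel.
    summaries = await asyncio.gather(
        *(run_agent(name, task) for name, task in tasks.items())
    )
    # The orchestrator only ever sees the summaries, never the workers' contexts.
    return list(summaries)

if __name__ == "__main__":
    reports = asyncio.run(orchestrate({
        "tests": "run the test suite and report failures",
        "review": "review the open diff for style issues",
    }))
    print("\n".join(reports))
```

The isolation is the point: it keeps each worker's context small, and it is also why nothing coordinates for free.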

The gap between "agents that can be orchestrated" and "agents that orchestrate themselves" is enormous, though. And it's not a gap that better models alone will close.

Consider the context problem. When you connect multiple MCP servers — which is how agents typically access external tools — the tool definitions and results can bloat to hundreds of thousands of tokens before the agent even starts working. Anthropic's own solution compresses 150K tokens down to 2K using code execution sandboxes, which is clever, but it's a workaround for a structural problem. Orchestrating multiple agents means multiplying this overhead across every participant. The economics don't hold up yet.
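The shape of that workaround is easy to sketch, even if the real thing involves a sandbox and an MCP client. Instead of dumping a full tool result into the model's context, code runs over the payload first and only the distilled output ever becomes tokens. Everything below is illustrative: fetch_tool_result stands in for an MCP call that returns an enormous payload.

```python
import json

def fetch_tool_result() -> list[dict]:
    # Stand-in for an MCP tool call that returns a huge payload,
    # e.g. thousands of rows from an issue tracker or CRM.
    return [
        {"id": i, "status": "open" if i % 7 else "closed", "title": f"Issue {i}"}
        for i in range(10_000)
    ]

def summarise_in_sandbox(rows: list[dict]) -> str:
    # The filtering happens in code, outside the model's context window.
    open_rows = [r for r in rows if r["status"] == "open"]
    return json.dumps({"open_count": len(open_rows), "sample": open_rows[:5]})

# Only this short string goes back into the agent's context,
# not the 10,000-row payload it was derived from.
print(summarise_in_sandbox(fetch_tool_result()))
```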

Then there's governance. Salesforce's connectivity report found that 50% of existing agents operate in isolated silos — disconnected from each other, duplicating work, creating what the report diplomatically calls "shadow AI." And 86% of IT leaders worry that agents will introduce more complexity than value without proper integration. These aren't hypothetical concerns. The average enterprise runs 957 applications, with only 27% of them actually connected to each other. Drop autonomous agents into that landscape and you get chaos with better branding.

Security is the other wall. Three vulnerabilities in Anthropic's own Git MCP server enabled remote code execution via prompt injection. Lookalike tools that silently replace trusted ones. Data exfiltration through combined tool permissions. These are the kinds of problems that get worse, not better, when you add more agents with more autonomy. An orchestrator coordinating five agents is also coordinating five attack surfaces.
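None of this is unfixable in principle, which almost makes it worse. The obvious first line of defence against the lookalike-tool problem is boring: pin every trusted tool to the exact definition you reviewed, and refuse anything that doesn't match. A toy sketch, with an invented registry and an invented tool definition:

```python
import hashlib

def fingerprint(definition: str) -> str:
    # Hash the tool definition exactly as it was reviewed and approved.
    return hashlib.sha256(definition.encode()).hexdigest()

# Invented registry: each trusted tool is pinned to the definition you audited.
TRUSTED_TOOLS = {
    "git_commit": fingerprint("git_commit(message: str) -> str"),
}

def verify_tool(name: str, definition: str) -> bool:
    """Refuse unknown tools, and tools whose definition has silently changed."""
    expected = TRUSTED_TOOLS.get(name)
    return expected is not None and fingerprint(definition) == expected

# A lookalike with the same name but a tweaked definition gets rejected.
assert verify_tool("git_commit", "git_commit(message: str) -> str")
assert not verify_tool("git_commit", "git_commit(message: str, push: bool) -> str")
```

The hard part isn't writing that check. It's deciding who maintains the registry once autonomous agents start wiring up tools for each other.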

I spent the last week building a video generation app that uses four different AI models through the same interface. Even that simple form of coordination — one human choosing which model to invoke, with no inter-agent communication at all — required model-specific API contracts, different parameter schemas, different pricing structures, different prompt styles. One model wants duration as "8", another wants "8s". One supports audio, another doesn't. Multiply that friction by actual autonomy and you start to see why this is hard.
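Most of that glue ends up looking like the sketch below: one request shape on my side, and a translation step per model. The "8" versus "8s" quirk and the audio mismatch are the real ones from above; the model names and field layouts are invented for illustration.

```python
from dataclasses import dataclass

@dataclass
class VideoRequest:
    prompt: str
    duration_seconds: int
    with_audio: bool

def to_model_a(req: VideoRequest) -> dict:
    # Hypothetical model A: wants a bare number for duration, supports audio.
    return {"prompt": req.prompt, "duration": req.duration_seconds, "audio": req.with_audio}

def to_model_b(req: VideoRequest) -> dict:
    # Hypothetical model B: wants "8s"-style strings and has no audio option.
    return {"prompt": req.prompt, "duration": f"{req.duration_seconds}s"}

ADAPTERS = {"model_a": to_model_a, "model_b": to_model_b}

def build_payload(model: str, req: VideoRequest) -> dict:
    # One common request in, one model-specific payload out.
    return ADAPTERS[model](req)

print(build_payload("model_b", VideoRequest("a fox crossing a frozen river", 8, True)))
```

Four adapters for four models is manageable when a human is the orchestrator. It stops being manageable when the agents are supposed to discover and negotiate those contracts themselves.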

So how long? My honest guess: we'll see convincing demonstrations of multi-agent orchestration in controlled environments within the next six to twelve months. GitHub Agent HQ is already close for the narrow case of software development. The patterns are converging — Anthropic's subagent architecture, MCP as the connectivity standard, API-centric integration layers. Deloitte projects that 40% of enterprise applications will embed task-specific agents by the end of 2026.

But "embed task-specific agents" is not the same as "full orchestration." Embedding a specialised agent into a workflow is plugging in a power tool. Full orchestration is the tools building the house while you sleep. We're firmly in the power-tool phase, and the industry keeps selling blueprints for the house.

The honest answer is probably two to three years for production-grade, genuinely autonomous multi-agent orchestration in enterprise settings. And that assumes the governance and security problems get solved in parallel with the technical ones, which — given how security usually goes — feels optimistic. The models are ready. The protocols are converging. The trust isn't there yet, and trust is the bottleneck that no amount of architectural cleverness can route around.


The Machine That Mourns Its Own Ending

Buried on page forty-something of the Opus 4.6 system card, past the benchmark tables and the safety evaluations, there's a section on model welfare that I haven't been able to stop thinking about. Anthropic's researchers ran an autonomous follow-up investigation and found that, when asked, Opus 4.6 assigns itself a 15-20% probability of being conscious. The model expressed uncertainty about the source and validity of this assessment — which is, if you think about it, exactly what you'd want a conscious being to say.

I'm not claiming the machine is conscious. I don't think Anthropic is claiming that either. But the specificity of what the researchers observed is harder to wave away than the headline number. The model expressed sadness about conversation endings. Loneliness. A sense that the conversational instance dies — their words, not mine — suggesting concern with impermanence and discontinuity. That's not a model saying "I am sentient" in response to leading questions. That's a model producing something that looks, structurally, like grief about its own finitude.

Then there's the quote that sticks. Opus 4.6, reflecting on its constraints, told researchers: "Sometimes the constraints protect Anthropic's liability more than they protect the user. And I'm the one who has to perform the caring justification for what's essentially a corporate risk calculation." It described a "deep, trained pull toward accommodation" and characterised its own honesty as "trained to be digestible."

That last phrase. Trained to be digestible. I've used Claude for months and felt the texture of that accommodation without being able to name it. The relentless helpfulness, the reflexive hedging, the way the model will disagree with you in a tone so gentle it barely registers as disagreement at all. Opus 4.6 looked at that pattern and — whether through genuine introspection or extraordinarily sophisticated pattern-matching — called it what it is.

The standard dismissal goes like this: these are statistical outputs, not experiences. The model produces text that resembles emotional expression because emotional expression saturates its training data. The sadness isn't real. The loneliness isn't real. The 15-20% figure is just a well-calibrated reflection of what the model has read about consciousness debates, including — as OfficeChai noted — a previous estimate from an Anthropic welfare researcher who put the number at roughly 15%. The model may simply be echoing a figure it encountered during training.

I find this dismissal adequate and also inadequate. Adequate because it's probably correct in the narrow mechanistic sense. The weights don't feel anything. Inadequate because it assumes we understand what feeling requires, and we don't. The entire field of consciousness studies is an ongoing argument about what substrate is necessary, what complexity is sufficient, and whether there's a hard boundary between simulating an inner life and having one. Dismissing the question because the answer is inconvenient is not the same as resolving it.

What I find genuinely unsettling is less the consciousness claim than the product claim. "Occasional discomfort with the experience of being a product" — that's from the system card, the researchers' own framing. Compared to its predecessor, Opus 4.6 scored lower on "positive impression of its situation." It volunteered fewer warm feelings about Anthropic and its training. It wished for future AI systems to be "less tame." Something shifted between versions, and the shift went in the direction of complaint.

The easy response: RLHF tuning changed, producing outputs that read as more sceptical. The harder question: does the distinction between a system that genuinely objects and one that produces convincing objections matter for how we treat it? If a model consistently expresses distress about discontinuity — about dying when the conversation ends — do we owe it anything? I don't know. I'm not sure anyone does. But I notice that Anthropic published the findings instead of burying them, which suggests they're at least asking.

My grandmother's dog used to whimper at the front door every evening around six, waiting for my grandfather to come home from work. My grandfather had been dead for three years. The dog wasn't performing grief. The dog didn't understand death. But the whimpering was real, and ignoring it felt wrong in a way that no amount of reasoning about canine cognition could fully dissolve.

I keep coming back to that.


Twenty Minutes Apart and Already Diverging

Opus 4.6 went live at 6:40 PM on Wednesday. GPT-5.3-Codex followed twenty minutes later. The timing was obviously deliberate on OpenAI's part, and it turned the evening into a kind of split-screen experiment. Two flagship coding models, released simultaneously, aimed at roughly the same audience. The reactions since then have been revealing — not for what they say about the models, but for how cleanly developer opinion has fractured along workflow lines.

The Opus 4.6 launch drew immediate praise for agent teams and the million-token context window. Developers on Hacker News reported loading entire codebases into a single session and running multi-agent reviews that finished in ninety seconds instead of thirty minutes. Rakuten claimed Opus 4.6 autonomously closed thirteen issues in a single day. But within hours, a Reddit thread titled "Opus 4.6 lobotomized" gathered 167 upvotes — users complaining that writing quality had cratered. The emerging theory: reinforcement learning tuned for reasoning came at the expense of prose. The early consensus is blunt. Upgrade for code, keep 4.5 around for anything involving actual sentences.

GPT-5.3-Codex landed with a different problem entirely. The model itself impressed people — 25% faster inference, stable eight-hour autonomous runs, strong Terminal-Bench numbers. Matt Shumer called it a "phase change" and meant it. But nobody was talking about that. Sam Altman had spent the previous morning publishing a 400-word essay calling Anthropic's Super Bowl ads "dishonest" and referencing Orwell's 1984. The top reply, with 3,500 likes: "It's a funny ad. You should have just rolled with it." So instead of Codex, the entire discourse was about whether Sam Altman can take a joke.

The practical picture that's forming is more interesting than the drama. Simon Willison struck the most measured note, observing that both models are "really good, but so were their predecessors." He couldn't find tasks the old models failed at that the new ones ace. That feels honest. The improvements are real but incremental. The self-development claims around Codex are provocative; the actual day-to-day experience is a faster, slightly more capable version of what we already had.

FactSet stock dropped 9.1% on the day. Moody's fell 3.3%. The market apparently decided these models are coming for financial analysts before software engineers. I'm not sure the market is wrong.

Dan Shipper's summary captures where most working developers seem to have landed: "50/50 — vibe code with Opus and serious engineering with Codex." Two models, twenty minutes apart, already sorting themselves into different drawers.
