Moonshot AI released Kimi K2.6 yesterday, and the numbers on the release page read less like a model card and more like a shift log. Twelve hours optimising inference in Zig, four thousand tool calls, throughput going from roughly 15 to roughly 193 tokens a second. Thirteen hours rewriting a financial matching engine. One K2.6-backed agent, according to the company, ran autonomously for five days managing infrastructure monitoring and incident response.

Five days.

That's the headline the benchmarks don't quite capture. The benchmarks are there, of course — 58.6% on SWE-Bench Pro, up from K2.5's 50.7%; 66.7% on Terminal-Bench 2.0, up from 50.8% — and they're competitive with anything open-weight you can download today. But the duration numbers are a different claim entirely. They're about how long the thing can stay coherent before it loses the plot.

A one-trillion-parameter Mixture-of-Experts with 32 billion activated parameters. 256K context. Modified MIT license. Up to 300 sub-agents coordinating across 4,000 simultaneous steps, triple the agent count and nearly three times the step budget of the K2.5 release. The shape of the thing is increasingly that of a small fleet rather than a model, and you can pull the weights down today.
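If you want to kick the tires yourself, fetching open weights from the Hugging Face Hub is usually a few lines. A minimal sketch, assuming the weights land under Moonshot's existing moonshotai org; the exact K2.6 repo id below is a guess, so check the release page for the real name.

```python
# Minimal sketch: pull open weights from the Hugging Face Hub.
# NOTE: "moonshotai/Kimi-K2.6" is a hypothetical repo id -- the actual
# repo name for the K2.6 release may differ.
from huggingface_hub import snapshot_download

local_path = snapshot_download(
    repo_id="moonshotai/Kimi-K2.6",              # hypothetical repo id
    local_dir="./kimi-k2.6",                     # where to put the files
    allow_patterns=["*.safetensors", "*.json"],  # weights + config only
)
print(f"Weights downloaded to {local_path}")
```

At a trillion total parameters the download is not small, which is one practical reason the 32-billion-activated MoE design matters: you pay storage for the whole fleet but compute for one boat.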

I keep returning to the Zig story. Twelve hours, four thousand tool calls, one quietly improving benchmark. The reason it's interesting isn't that a human couldn't have done it — a good systems engineer could have, given the time. It's that the failure mode people assumed would kick in somewhere around the second or third hour apparently didn't. The agent didn't forget what it was doing. It didn't spiral into a cul-de-sac of retries. It kept working.

Whether Moonshot's numbers hold up in independent hands is the usual question. Lab-reported benchmarks have a habit of softening in the wild, and "autonomous for five days" is the kind of claim that wants scrutiny. But the weights are public, so the scrutiny can actually happen. That's the part that matters more than any single benchmark. Last year the open-weight tier was catching up on static tasks. This year it's catching up on the dimension that used to separate toys from tools: can the thing run long enough to be useful?

We've been tracking this trajectory for a while. Every month the gap narrows in some new direction: cost, then licensing, now wall-clock endurance. Frontier labs still have a lead on raw intelligence per token, but the measurement most people actually care about is closer to "how long can I leave it running before I need to come back." On that axis, something genuinely new is showing up in the open.

Sources: