Plutonic Rainbows

The $1-an-Hour Frontier Model

MiniMax released M2.5 this week, and the number I keep returning to isn't the benchmark score — it's the price. One dollar per hour of continuous generation at 100 tokens per second. That's the Lightning variant. The standard version halves even that.

The benchmarks are strong enough to make the pricing genuinely strange. SWE-Bench Verified: 80.2%, which puts it within 0.6 points of Claude Opus 4.6 at 80.8%. On Multi-SWE-Bench — the multilingual coding benchmark — M2.5 actually leads at 51.3% versus Opus's 50.3%. Tool calling scores 76.8%, which is 13 points clear of Opus. These aren't cherry-picked metrics from a press release. OpenHands ran independent evaluations and confirmed the numbers hold up.

The architecture is Mixture of Experts — 230 billion parameters total, 10 billion active per token. That's how you get frontier performance at commodity pricing. MiniMax trained it using what they call Forge, a reinforcement learning framework running across 200,000 simulated real-world environments. Their custom RL algorithm — CISPO — claims a 40x speedup over standard approaches. Whether that number survives independent scrutiny, I don't know. However, the outputs speak for themselves.

The weights are fully open on HuggingFace. You can download M2.5 right now and run it locally. This is the part that matters more than any single benchmark. When DeepSeek dropped R1 as open source thirteen months ago, it triggered genuine panic in Silicon Valley. MiniMax is doing the same thing but with a model that competes at the very top of the leaderboard, not just near it.

MiniMax itself is an interesting company to watch. They IPO'd in Hong Kong in January, raising HK$4.8 billion. The stock has more than tripled since. Over 70% of their revenue comes from overseas markets — primarily through Talkie, their companion app, and Hailuo, their video generation tool. CEO Yan Junjie recently met with Premier Li Qiang. This isn't a scrappy lab operating out of a garage. It's a well-funded operation with state-level attention.

What I find myself thinking about: the cost differential between M2.5 and the American frontier models is now roughly 20x to 33x on the prices quoted here. Opus 4.6 charges $5/$25 per million tokens. M2.5 charges $0.15/$1.20. At those margins, the question isn't whether open-weight models are good enough. It's whether the closed models can justify the premium when the gap is this thin.
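The ratio can be checked directly from the per-million-token prices quoted above. This is a minimal sketch that takes those figures at face value. The names are mine, and a real bill depends on the input/output mix, caching and any discounts.

```python
# Prices in US dollars per million tokens, as quoted in the post.
# These are the post's figures, not values fetched from any price list.
OPUS_INPUT, OPUS_OUTPUT = 5.00, 25.00
M25_INPUT, M25_OUTPUT = 0.15, 1.20


def price_ratio(closed: float, open_weight: float) -> float:
    """How many times more the closed model costs per token."""
    return closed / open_weight


def blended_ratio(input_share: float) -> float:
    """Cost ratio for a workload where `input_share` of all tokens are input."""
    opus = input_share * OPUS_INPUT + (1 - input_share) * OPUS_OUTPUT
    m25 = input_share * M25_INPUT + (1 - input_share) * M25_OUTPUT
    return opus / m25


input_ratio = price_ratio(OPUS_INPUT, M25_INPUT)
output_ratio = price_ratio(OPUS_OUTPUT, M25_OUTPUT)
print(f"input:  {input_ratio:.1f}x")
print(f"output: {output_ratio:.1f}x")
print(f"80% input workload: {blended_ratio(0.8):.1f}x")
```

The blended figure moves with the workload: input-heavy jobs sit nearer the input ratio, generation-heavy jobs nearer the output ratio.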

Deep Think Crosses the Human Line

Google upgraded Gemini 3 Deep Think yesterday, and the number that matters is 84.6%. That's its score on ARC-AGI-2, the abstract reasoning benchmark designed to resist brute-force pattern matching. Humans average around 60%. Claude Opus 4.6 — which landed last week to genuine excitement — scores 68.8%. GPT-5.2 manages 52.9%. Deep Think clears the human baseline by nearly 25 points and leads the next-best model by almost 16.

I'm trying to figure out what to do with that.

The Codeforces result is harder to dismiss as benchmark theatre. Deep Think hit 3,455 Elo — Legendary Grandmaster territory, better than all but seven active human programmers on the platform. No external tools. No retrieval. Just inference-time compute and whatever Google means by "parallel hypothesis exploration." The top human competitor, Benq, sits at 3,792. That gap is closing fast enough to make competitive programming feel like it has an expiration date.

What changed from the previous version: scope. Earlier iterations of Deep Think were narrowly focused on mathematics. This upgrade pushes into chemistry, physics, and engineering. Gold medals on the written portions of the International Math, Physics, and Chemistry Olympiads. A mathematician at Rutgers used it to peer-review a paper on high-energy physics structures bridging gravity and quantum mechanics. It caught a subtle logical flaw that human reviewers had missed. That's not a benchmark. That's a real research contribution, however narrow.

The architecture Google describes — they call it "Aletheia" — uses a generator, a natural language verifier, and a reviser working in concert. Parallel hypothesis exploration rather than a single reasoning chain. The interesting detail is that the system can acknowledge failure and stop rather than burning compute on dead-end paths. Most reasoning models I've used have no concept of giving up gracefully. They hallucinate forward until they hit a token limit. If Aletheia genuinely knows when it's stuck, that's a meaningful advance in how these systems manage uncertainty.

Google's approach here is fundamentally different from what Anthropic and OpenAI are doing. They're scaling inference-time compute — giving the model more time to think rather than making a bigger model. The base is still Gemini 3 Pro, not some trillion-parameter behemoth. Deep Think is a reasoning mode, not a separate model. The distinction matters because it suggests the ceiling on what you can extract from existing architectures is higher than most people assumed. You don't need a fundamentally new model. You need to let the current one actually think.

That feels right to me, intuitively. When I use extended thinking in Claude, the quality jump over instant responses is enormous — not because the model suddenly knows more, but because it has room to work through contradictions and dead ends before committing to an answer. Google is doing the same thing with significantly more compute thrown at the problem. Anthropic shows you the reasoning. Google hides it. Both approaches produce results that make the non-thinking versions look careless by comparison.

The pricing is interesting. Deep Think is included in the Google AI Ultra subscription at $249.99 per month. API access requires applying for an early programme. I keep thinking about how o3 was positioned as the reasoning breakthrough that would change everything, and then Deep Think shows up a year later scoring nearly 30 times higher on the same class of benchmark. The pace of obsolescence in this space is genuinely disorienting.

Demis Hassabis called it "new records on the most rigorous benchmarks in maths, science & reasoning." MarkTechPost ran with "Is This AGI?" — which, no. But I understand the impulse. A system that reasons better than the average human on abstract pattern recognition, codes better than 99.99% of programmers, and catches errors in peer-reviewed physics papers occupies territory that didn't exist twelve months ago.

Google DeepMind published a research impact taxonomy alongside the release, rating contributions from Level 0 to Level 4. They classify Deep Think's current output at Levels 0-2 — autonomous solutions and publishable collaborations, not landmark breakthroughs. The fact that they felt the need to temper expectations tells you something about the temperature of the conversation. When the company releasing the model is the one saying "calm down," the benchmarks have moved past what anyone's frameworks were built to accommodate.

Three Folders and a Paywall

I've been using Dub.co for short link management on this blog — it handles the uneasy.in links that appear on every post. It's a well-made product. The dashboard is clean, the API is solid, and the analytics are genuinely useful. So when they announced folders last year as a way to organise links, I thought: great, I'll set up a proper structure. Blog links in one folder, project links in another, maybe a third for experiments.

Then I hit the limit. Three folders. That's it on the Pro plan at $25 a month.

Three folders is not an organisational system — it's a tease. It's the SaaS equivalent of giving someone a filing cabinet with three drawers welded shut and a price tag dangling from the fourth. If you want twenty folders, you'll need the Business plan at $75 a month. Fifty folders? That's $250 a month on Advanced. The jump from three to twenty costs you an extra $600 a year, and the primary thing you're paying for is the right to put links into named groups.

What makes this especially irritating is how clearly it's designed as a conversion lever rather than a technical constraint. Folders are metadata. They're a label on a row in a database. There's no compute cost, no storage overhead, no bandwidth implication. The limit exists purely to create friction — to let you taste the feature just enough that you'll upgrade when you inevitably need a fourth folder. It's the same pattern I've seen with other services: let you in at a reasonable price, then gate the basics behind a tier that costs three times as much.

The broader context makes it worse. Pro also caps you at 25 tags and 1,000 new links per month. Business unlocks unlimited tags — which tells you exactly how much those tag and folder limits cost Dub to enforce: nothing. They're artificial scarcity on a digital product, and the pricing tiers are structured so that basic organisational hygiene requires a plan designed for teams of ten.

I ended up cancelling the whole folder exercise. Not because I couldn't afford the upgrade, but because I don't want to reward a pricing model that treats elementary features as premium upsells. I'll keep using Dub for what it does well — the API, the custom domain, the click tracking — but the folders will stay empty. Three is too few to be useful and too many to ignore.

This is the quiet frustration of modern SaaS. The product is good. The engineering is thoughtful. And then someone in product or finance decides that the difference between $25 and $75 should be the ability to put things in named groups. It's a small thing, but small things accumulate. Every service you use has its own version of this — the feature that's obviously trivial to provide but sits just above your current tier, daring you to pay more for what should have been included from the start.

I remain a paying customer. But I'm filing this complaint in one of my three allotted folders.

The Thousand-Token Gambit

OpenAI shipped Codex-Spark yesterday — a smaller GPT-5.3-Codex distilled for raw speed, running on Cerebras Wafer Scale Engine 3 hardware at over a thousand tokens per second. Four weeks from a $10 billion partnership announcement to a shipping product. 128k context, text-only, ChatGPT Pro research preview.

The pitch is flow state — edits so fast the latency disappears and you stay in the loop instead of watching a spinner. Anthropic is chasing the same thing with Opus fast mode. Everybody is.

I wrote about speed becoming the only moat last month. Codex-Spark is that thesis made silicon.

Two Billion in Efficiency Savings and What Gets Lost

Barclays posted £9.1 billion in pre-tax profit for 2025 — up 13% — and CEO C.S. Venkatakrishnan used the results announcement to outline an AI-driven efficiency programme targeting £2 billion in gross savings by 2028. Fraud detection, client analytics, internal process automation. Fifty thousand staff getting Microsoft Copilot, doubling in early 2026. Dozens of London roles relocated to India. A £1 billion share buyback and £800 million dividend to round things off. The shareholders are happy. The spreadsheet is immaculate.

I don't doubt the savings are real. Every bank running these numbers is finding the same thing — operations roles that involve documents, repeatable steps, and defined rules are precisely where large language models excel. Wells Fargo is already budgeting for higher severance in anticipation of a smaller 2026 workforce. JPMorgan reports 6% productivity gains in AI-adopting divisions, with operations roles projected to hit 40-50%. Goldman Sachs has folded AI workflow redesign directly into its headcount planning. This isn't speculative anymore. The back offices are getting thinner.

What bothers me is the framing. "Efficiency" is doing a lot of heavy lifting in these announcements. When Barclays says it will "harness new technology to improve efficiency and build segment-leading businesses," what that means in practice is fewer people answering phones, fewer people reviewing transactions, fewer people in the building. The GenAI colleague assistant that "instantly provides colleagues with the information needed to support customers" is, by design, an argument for needing fewer colleagues. The call handling times go down. Then the headcount follows.

The banking industry's own estimates are stark. Citigroup found that 54% of financial jobs have high automation potential — more than any other sector. McKinsey projects up to 20% net cost reductions across the industry. Yet 76% of banks say they'll increase tech headcount because of agentic AI. The jobs don't disappear. They migrate — from the person who knew the process to the person who maintains the model that replaced the process. Whether that's a net positive depends entirely on which side of the migration you're standing on.

Barclays will likely hit its targets. The efficiency savings will materialise. The return on tangible equity will climb toward that 14% goal. The question nobody at the investor presentation is asking — because it isn't their question to ask — is what a bank actually is when you've automated the parts where humans used to make judgement calls about other humans. A fraud model is faster than a fraud analyst. It's also completely indifferent to context, to the phone call where someone explains they've just been scammed and needs a person, not a pipeline, to understand what happened.

Two billion pounds is a lot of understanding to optimise away.

The Album That Won't Stay Mastered

Discogs lists hundreds of distinct pressings of The Dark Side of the Moon. Not hundreds of copies — hundreds of versions. Different countries, different plants, different lacquers, different decades. The album has been remastered, remixed, repackaged, and re-released so many times that cataloguing it has become its own cottage industry. And the question nobody seems willing to answer plainly is: why?

The charitable explanation is format migration. Every time the music industry invents a new way to sell you sound, Dark Side gets dragged through the machine again. The original 1973 vinyl. The first CD pressing in 1984, harsh and tinny on those early CBS/Sony discs with their pre-emphasis problems. The 1992 remaster for the Shine On box set. The 2003 SACD and 180-gram vinyl cut by Doug Sax and Kevin Gray at AcousTech — a thirteen-hour session from the original master tape that many collectors still consider the definitive pressing. The 2011 Immersion Edition. The 2023 50th Anniversary remaster by James Guthrie, now with Dolby Atmos because apparently forty-three minutes of music required spatial audio to finally be complete.

That's seven major versions across fifty-two years, and I've skipped the Mobile Fidelity half-speed mastering from 1979, the UHQR limited edition from 1981 (5,000 numbered copies on 200-gram JVC Super Vinyl, housed in a foam-padded box like a medical instrument), and whatever picture disc or coloured vinyl variant was being sold in airport gift shops during any given decade. Each one presented as essential. Each one implying — sometimes quietly, sometimes on the sticker — that the previous version hadn't quite got it right.

The format argument holds up to a point. Moving from vinyl to CD requires a new master. Moving from stereo to 5.1 surround requires a remix. Moving from 5.1 to Atmos requires another remix. These are genuinely different processes with different sonic results, and James Guthrie — who engineered The Wall and has overseen Pink Floyd's catalogue since the early 1980s — isn't a hack. The 2003 SACD's 5.1 mix is a legitimate reinterpretation. The Atmos version reportedly uses the original multitracks and places instruments in three-dimensional space in ways that serve the music rather than showing off the format. I haven't heard it, so I'll stop there.

But format migration doesn't explain the sheer volume. It doesn't explain why the 2023 box set contains substantially less material than the 2011 Immersion box while costing nearly three times as much. It doesn't explain the 2024 clear vinyl reissue — two LPs with only one playable side each, so the UV artwork can be printed on the blank side. That's not a format improvement. That's a shelf ornament.

The real answer is simpler and more uncomfortable: Dark Side of the Moon is the safest bet in recorded music. It has sold somewhere north of forty-five million copies. It spent 937 weeks on the Billboard 200. Every reissue is guaranteed to sell because the album occupies a category beyond mere popularity — it's become a cultural default, the record people buy when they buy a turntable, the disc they reach for when demonstrating a new pair of speakers. The music is secondary to the ritual. Not because it isn't good — it's extraordinary — but because the purchasing decision has decoupled from the listening experience. People buy Dark Side the way they buy a bottle of wine for a dinner party. It's a known quantity. It cannot embarrass you.

This makes it uniquely exploitable. A record label can repackage it every five years with minor sonic tweaks and a new essay in the liner notes, and the installed base of buyers will absorb the inventory. The audiophile press will review it. Forums will debate whether the new pressing sounds warmer or brighter or more "analogue" than the last one. And none of this requires the band's active participation, which is convenient given that Roger Waters hasn't been in the same room as David Gilmour voluntarily since approximately 1985.

I own three versions. The 2003 vinyl, which sounds superb. A CD rip from the 2011 remaster, which sounds almost identical. And a 192kHz/24-bit Blu-ray extraction that I recently compared sample-by-sample against a supposedly different edition and found to be bit-for-bit the same audio with different folder names. That last discovery crystallised something for me — how much of the remaster economy runs on labelling rather than substance. A new sticker, a new anniversary number, occasionally a new mastering engineer. Sometimes genuinely different audio. Sometimes not.

The 2003 AcousTech pressing remains the one I'd recommend to anyone who asks, though the asking itself has become part of the problem. "Which version of Dark Side should I buy?" is a question that sustains an entire ecosystem of forum threads, YouTube comparisons, and Discogs archaeology. The answer should be: whichever one you already own probably sounds fine. But that answer doesn't sell records.

The album itself — the actual forty-three minutes of music that Alan Parsons engineered at Abbey Road in 1973 — hasn't changed. The cash registers have just been cleared for another run, the heartbeat fading out on "Eclipse" before looping back to the beginning. Which, if you think about it, is rather on the nose.

175,000 Open Doors

SentinelOne and Censys mapped 175,000 exposed AI hosts across 130 countries. Alibaba's Qwen2 sits on 52% of multi-model systems, paired with Meta's Llama on over 40,000 of them. Nearly half advertise tool-calling — meaning they can execute code, not just generate it. No authentication required.

While Western labs retreat behind API gates and safety reviews, Chinese open-weight models fill the vacuum on commodity hardware everywhere. The guardrails debate assumed someone controlled the deployment surface. Nobody does.

The Edit You Never Made

Elizabeth Loftus showed participants footage of a car accident in 1974, then asked how fast the vehicles were going when they "smashed" into each other. A separate group got the word "hit" instead. The smashed group estimated higher speeds. A week later, they were also more likely to remember broken glass at the scene. There was no broken glass.

That experiment is more than fifty years old now, and nothing about its conclusion has softened. Memory does not record. It reconstructs, and it reconstructs according to whatever pressures happen to be present at the moment of recall. A leading question. An emotional state. A conversation with someone who remembers the same event differently. Each retrieval is an act of editing. Steve Ramirez at Boston University describes it plainly: every time you recall something, you are hitting "save as" on a file and updating it with new information. The version you remember today is not the version you remembered last year.

What unsettles me is not that memory is inaccurate. I can accept inaccuracy. What unsettles me is that memory feels accurate — feels like retrieval rather than reconstruction. The confidence is the problem. I have memories I would defend in court, memories that feel as solid as furniture, and I know from Loftus's work that solidity means almost nothing. The vividness of a memory has no reliable relationship to its truth.

Negative experiences make this worse. Research into emotional valence and false recall shows that negatively charged events produce higher rates of false memory than neutral ones. The things that hurt you are the things most likely to be rewritten. Not erased — rewritten. The pain stays. The facts shift underneath it.

I keep returning to Ramirez's framing because it is the only one that does not pretend this is a flaw. Memory updates because a mind locked permanently in the past would be useless. The editing is the point. It just happens to make accuracy a casualty.

The Comparative Baseline Nobody Saved

There's a particular kind of knowledge that disappears not because it's wrong but because the conditions that produced it no longer exist. The pre-internet world — and I mean the actual texture of daily life before always-on connectivity — is becoming that kind of knowledge. Not because the people who lived through it are dying off, though they are. Because the experiences themselves are structurally incompatible with how life works now. You can't stream what it felt like to not know something and have no immediate way to find out.

Friction was the defining feature, though nobody called it that at the time. Information took effort. You went to a library, or you asked someone who might know, or you simply didn't find out. Communication had latency built in — you wrote a letter, waited days, received a reply that might or might not address what you'd asked. Choices were constrained by geography, opening hours, physical stock. None of this felt romantic while it was happening. It felt normal. The constraints were invisible in the way that water is invisible to fish.

What those constraints trained was a particular kind of cognitive stamina. When finding information costs time and effort, you develop a relationship with the information you do find. You hold it longer because acquiring it required investment. You synthesise rather than accumulate, because accumulation is expensive. You commit to decisions because reversing them means repeating the friction. Research published in Frontiers in Public Health found that intensive internet search behaviour measurably affects how we encode and retain information — when we know Google will remember for us, we let it, and something in the cognitive chain softens.

I'm not sure "softens" is the right word. It's more like a muscle that adapts to a lighter load. It doesn't atrophy overnight. It just gradually stops being asked to do what it once did.

The spatial memory research makes this uncomfortably concrete. A study in Scientific Reports found that people with greater lifetime GPS use showed worse spatial memory during self-guided navigation, and that the decline was progressive. The more you outsource, the less you retain. This isn't generalised anxiety about screens. It's measured cognitive change in a specific domain, and it maps onto the broader pattern: navigation, arithmetic, phone numbers, directions to a friend's house. Each delegation is individually trivial. Collectively, they represent a wholesale transfer of embodied competence to external systems.

The social texture was different too, in ways that are harder to quantify. Relationships were local, bounded, and — this is the uncomfortable part — more durable partly because leaving was harder. You couldn't algorithmically replace your social circle. You couldn't find a new community by searching for one. You were stuck with the people near you, which meant you developed tolerance, negotiation, and the particular skill of maintaining connections with people you didn't entirely like. I'm not romanticising this. Some of those constraints were suffocating. But they produced a form of social resilience that frictionless connection and equally frictionless disconnection don't replicate. The exit costs created a kind of civic muscle memory. When you couldn't easily leave, you learned to stay — imperfectly, sometimes resentfully, but with a persistence that builds something. What it builds is hard to name. Continuity, maybe. The knowledge that not every disagreement requires a door.

There's something related happening with culture, and it troubles me more than the cognitive stuff. Before global networks, culture varied sharply by place. Music scenes were geographically specific. Fashion moved through cities at different speeds. Slang was regional. You could travel two hundred miles and encounter genuinely different aesthetic assumptions. The internet collapsed that distance, and what rushed in to fill the gap was algorithmic homogenisation — platforms optimising for universal palatability, training data drawn overwhelmingly from dominant cultural archives, trend cycles that now complete in weeks rather than years. The result isn't the death of diversity exactly. It's the flattening of it. Regional difference still exists, but it's increasingly performed rather than lived.

I've written before about objects that outlive their context — the particular unease of encountering something from the pre-internet era that still functions perfectly but belongs nowhere. A compact disc, a theatre programme, a paper map. These objects assumed finitude. They were made for a world where things ended, where events didn't persist in feeds, where places could close and stay closed. When I handle them now, the dissonance isn't aesthetic. It's temporal. They're evidence of a different relationship with time itself.

That temporal difference matters more than people tend to acknowledge. Digital platforms compress time into an endless present. Feeds refresh. History scrolls away. Nothing quite arrives and nothing quite leaves. The pre-internet experience of time was linear in a way that sounds banal until you try to describe what replaced it. Waiting was an experience, not a failure of the system. Seasons structured cultural consumption because distribution channels were physical. Anticipation — real anticipation, the kind that builds over weeks — required scarcity. When everything is available immediately, anticipation doesn't intensify. It evaporates. There was a rhythm to cultural life that physical distribution imposed whether you wanted it or not. Albums had release dates that meant something because you couldn't hear them early. Television programmes aired once, and if you missed them, you missed them. Books arrived in bookshops and either found you or didn't. This sounds like deprivation described from the outside, but from the inside it felt like structure. Things had their time, and that time was finite.

Nicholas Carr has been making versions of this argument since The Shallows in 2010, and his more recent Superbloom extends it into the social fabric. The concern isn't that the internet is making us stupider in some crude measurable way. The concern is that it's restructuring cognition and social behaviour so thoroughly that we're losing the capacity to notice what's changed. When the baseline disappears, critique becomes harder. You need a reference point to identify a shift, and if nobody remembers the reference point, the shift becomes invisible.

This is the part that feels urgent to me. Not the nostalgia — nostalgia is cheap and usually dishonest. The urgent part is the comparative function. Remembering what pre-internet life actually felt like — not a golden-age fantasy of it but the real experience, including the boredom and the limitations — provides a structural check on the present. It lets you distinguish between convenience and cognitive cost. It lets you ask whether frictionless access to everything has made us richer or just busier. It lets you notice that memory itself has changed shape, not through damage but through delegation.

Without that baseline, you get what I'd call passive inevitability. The assumption that the present is the only way things could possibly work. That algorithmic curation is simply how culture operates now. That constant connectivity is a law of physics rather than a design choice made by specific companies for specific commercial reasons. Every system benefits from the erasure of its alternatives, and the pre-internet world is the most comprehensive alternative to the digital present that most living people can personally recall.

None of this is about wanting the old world back. Plenty of it was terrible. Information gatekeeping was often unjust. Communication barriers isolated people who needed connection. Cultural insularity bred ignorance as often as it bred character. The point isn't that friction was good. The point is that friction was informative. It taught things — patience, commitment, tolerance of uncertainty, the ability to sit with not-knowing — that the frictionless environment doesn't teach and may be actively unlearning.

I keep coming back to something that probably doesn't belong in this argument. When I was young, you could be unreachable. Not dramatically, not by fleeing to a cabin — just by leaving the house. No one could contact you. No one expected to. The hours between leaving and returning were genuinely yours in a way that requires explanation now, which is itself the point. That availability wasn't a moral obligation. That silence between people wasn't a crisis. That the default state of a human being was not "online."

Lighter, Faster, Meaner

jQuery was 87KB. Lightbox2 was another 10KB, plus a CSS file and four UI images the library needed for its close buttons and navigation arrows. All of that is gone now — replaced by a single vanilla JS file under 7KB that does everything the old stack did. Gallery navigation with wrap-around, keyboard support, scroll locking, caption display, adjacent image preloading. Same IIFE pattern as the video lightbox I built last week.
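Wrap-around navigation and adjacent preloading both reduce to modular arithmetic on the image index. The sketch below is in Python for brevity and is not the site's actual script; the function names are hypothetical. One caveat when porting: JavaScript's `%` can return a negative number when the left operand is negative, so the usual idiom there is `((i + step) % n + n) % n`.

```python
def step_index(current: int, step: int, count: int) -> int:
    """Move `step` images forward (backward if negative), wrapping at both ends."""
    if count <= 0:
        raise ValueError("gallery must contain at least one image")
    # Python's % always yields a result in [0, count) for a positive count.
    return (current + step) % count


def neighbours_to_preload(current: int, count: int) -> list[int]:
    """Indices of the previous and next images, skipping duplicates and the current one."""
    candidates = [step_index(current, -1, count), step_index(current, 1, count)]
    result: list[int] = []
    for i in candidates:
        if i != current and i not in result:
            result.append(i)
    return result


print(step_index(0, -1, 5))         # 4
print(step_index(4, 1, 5))          # 0
print(neighbours_to_preload(0, 5))  # [4, 1]
print(neighbours_to_preload(0, 2))  # [1]
print(neighbours_to_preload(0, 1))  # []
```

The deduplication matters for very small galleries, where "previous" and "next" can be the same image or the current one.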

The video player got its own round of trimming. I'd been eagerly loading hls.js and the video lightbox script on every page that contained video links — 149KB of transfer whether anyone clicked play or not. Now both scripts lazy-load on the first actual click. The hls.js library itself moved from a CDN with a seven-day cache to self-hosted with a one-year immutable header. And the encryption got a quiet upgrade: random IVs instead of deterministic MD5-based ones.
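The difference between the two IV schemes is easy to show. The sketch below rests on an assumption about what "deterministic MD5-based" means here: an IV derived by hashing something predictable, such as a segment name. The post does not say what was hashed, and the function and file names are hypothetical. It also assumes a 16-byte IV, which matches the AES block size and happens to be the length of an MD5 digest.

```python
import hashlib
import secrets

IV_LENGTH = 16  # bytes; the AES block size


def deterministic_iv(segment_name: str) -> bytes:
    """Old-style IV: the MD5 digest of a predictable input (assumed to be the name)."""
    return hashlib.md5(segment_name.encode("utf-8")).digest()


def random_iv() -> bytes:
    """New-style IV: 16 bytes from the operating system's CSPRNG."""
    return secrets.token_bytes(IV_LENGTH)


name = "segment_000.ts"  # hypothetical file name

# Same input, same IV, every time: anyone who knows the name can recompute it.
print(deterministic_iv(name) == deterministic_iv(name))  # True

# Two random IVs collide only with negligible probability.
print(random_iv() == random_iv())
```

The trade-off is that a random IV can no longer be recomputed, so it has to travel with the ciphertext. In HLS that is typically done through the `IV` attribute of the `EXT-X-KEY` tag in the playlist.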

None of these changes alter how anything looks. Every byte saved is invisible.