Nvidia Opens the Agent Stack

Nvidia's new Nemotron 3 Ultra is not subtle about its intended job. The company describes it as an open 550-billion-parameter mixture-of-experts model for long-running agents: systems that plan, call tools, write code, research, and keep working past the tidy two-minute demo. The useful number is not only 550 billion. It is 55 billion active parameters, the slice doing work for each token, because Nvidia's pitch is about capability that can still be served without turning every agent run into a boutique compute event.
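
For a rough sense of what those two numbers imply, here is a back-of-envelope sketch. It uses only the 550B and 55B figures quoted above. The function names are made up for illustration, and the flat 4 bits per weight is my simplifying assumption for a 4-bit checkpoint, ignoring any scaling-factor overhead, so treat the footprint as an approximate floor rather than a measured size.

```python
# Back-of-envelope arithmetic on the parameter counts quoted in the post.
# Helper names are hypothetical; 4 bits per weight is an assumption.

def active_fraction(total_params: float, active_params: float) -> float:
    """Share of the model's parameters used for each token."""
    return active_params / total_params


def weight_gigabytes(params: float, bits_per_param: float) -> float:
    """Approximate weight storage in GB (10^9 bytes), ignoring overhead."""
    return params * bits_per_param / 8 / 1e9


TOTAL_PARAMS = 550e9   # total parameters, per Nvidia's description
ACTIVE_PARAMS = 55e9   # active parameters per token, per Nvidia's description

if __name__ == "__main__":
    share = active_fraction(TOTAL_PARAMS, ACTIVE_PARAMS)
    footprint = weight_gigabytes(TOTAL_PARAMS, 4)
    print(f"active share per token: {share:.0%}")
    print(f"approx. 4-bit weight footprint: {footprint:.0f} GB")
```

The point of the sketch is the asymmetry: all the weights still have to live somewhere, but only about a tenth of them do work on any given token, which is where the serving argument comes from.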

The release landed on June 4 in an Nvidia Developer post by Chris Alexiuk and Chintan Patel, with the technical report and model card filling in the harder details. Nemotron 3 Ultra uses a hybrid Mamba-Transformer design, LatentMoE, multi-token prediction, and NVFP4 quantization. MarkTechPost's summary adds that the model was pre-trained on 20 trillion tokens and extended to a one-million-token context window. The technical report gives the fuller title: "Open, Efficient Mixture-of-Experts Hybrid Mamba-Transformer Model for Agentic Reasoning." A title like that doesn't leave much room for mystery.

I don't read this as Nvidia trying to become another chatbot company. That would be too small, and probably the wrong business. The interesting move is that Nvidia is making the model legible to the machinery around it: NIM packaging, Hugging Face weights, build.nvidia.com, OpenRouter, Perplexity, Anaconda, and the usual cloud-adjacent routes by which a research artefact becomes a thing people can actually run. This is the company that already owns too much of the physical substrate of AI deciding that the software layer above the chips should also speak its language.

The open part matters because it changes where trust is supposed to sit. Nvidia is not asking developers to admire a leaderboard number from outside the glass. It is offering weights, data, recipes, quantized checkpoints, and deployment routes, then daring the rest of the stack to form around them. The Hugging Face card lists the OpenMDW 1.1 license, a June 4 release date, the 550B total and 55B active parameter counts, and the same one-million-token context length. That is a very specific kind of openness: not a hobbyist toy, not a sealed API, but an industrial object with enough handles for other companies to build around.

There is a familiar Nvidia pattern here. In the AlexNet story, two GTX 580 cards helped make a neural-network result suddenly impossible to ignore. The hardware did not invent deep learning, but it changed what could be repeated quickly enough to matter. Nemotron 3 Ultra sits much further up the stack, yet the logic rhymes. Nvidia keeps turning constraints into platforms: memory limits, throughput, quantization, inference cost, now the long, expensive drift of agent work.

The risk is that "open" becomes another way of tightening the ecosystem. If the model, serving layer, quantization format, optimization story, and preferred deployment surface all line up around Nvidia, developers gain access while the centre of gravity still moves toward the same vendor. That is not hypocrisy. It is strategy. A closed API can own the user relationship. An open model can own the default architecture, especially when the default architecture has to run somewhere expensive.

Nvidia's supporting video claims up to five times faster inference and up to 30 percent lower cost. Those numbers need the usual caution, because vendor launch claims are not neutral field reports. Still, they point at the real argument. Long-running agents are not only a model-quality problem. They are a patience problem, a scheduling problem, a bill problem, a question of whether anyone wants to wait while a machine thinks, searches, checks, fails, and tries again. If Nvidia can make that loop cheaper and more deployable, the model itself is only part of the release.

This also explains why the China-chip story keeps haunting Nvidia's AI news. I wrote recently about Beijing banning Nvidia's compromised 5090D V2, a reminder that hardware access is political before it is technical. Nemotron 3 Ultra is the other side of the same company: not merely selling scarce accelerators, but shaping the work those accelerators are meant to do. The agent future, if it arrives in the form Nvidia wants, won't just need chips. It will need a stack, a license, a recipe, and somewhere to run.

Sources:

NVIDIA Nemotron 3 Ultra Powers Faster, More Efficient Reasoning for Long-Running Agents — NVIDIA Developer
Nemotron 3 Ultra Technical Report — NVIDIA Research
NVIDIA Nemotron 3 Ultra 550B A55B NVFP4 — Hugging Face
NVIDIA AI Releases Nemotron 3 Ultra — MarkTechPost
Introducing NVIDIA Nemotron 3 Ultra — NVIDIA Developer

Plutonic Rainbows
