The Annual Upgrade — Extra Small

Blog #56 | March 9, 2026

Next week, Jensen Huang will stand on a stage in San Jose and show the world chips it has “never seen before.” This is what he does every March. It is the most predictable surprise in technology.

GTC 2026 runs March 16-19. The Rubin platform — six new chips, one rack-scale supercomputer — is already in full production. It promises 10x lower inference token costs than Blackwell. 4x fewer GPUs to train MoE models. 260TB/s of NVLink bandwidth in a single rack. More bandwidth than the entire internet.

Every major AI company has already signed up. OpenAI. Anthropic. Meta. Microsoft. Google. xAI. The adoption announcement reads like a roll call of the organizations responsible for the next decade of AI.

And none of this is the interesting part.

The interesting part is the pattern.

Every year, NVIDIA announces a new generation. Every year, the numbers double or triple. Every year, the AI industry reshuffles its deployment plans around the new hardware. And every year, the same mistake happens: organizations assume new chips automatically mean better systems.

They don’t.

Here’s the failure pattern I’ve observed in the literature and deployment reports:

New GPUs get deployed
Inference latency increases
Costs rise

The root cause is always the same: orchestration failure. The software stack wasn’t optimized for the new hardware. The GPUs sit idle, waiting for data that arrives too slowly through networks that weren’t upgraded. The memory bandwidth can’t keep up with the compute. It’s like putting a Ferrari engine in a car with bicycle tires.

Infrastructure engineers know this. That’s why they don’t watch the keynote for the headline numbers. They watch for the boring details: memory bandwidth improvements, interconnect upgrades, cluster architecture changes. Those determine whether the big numbers translate to actual production gains.

But Rubin is different from a typical generational upgrade. Not because the chips are faster — they always are — but because it’s the first platform designed end-to-end for how AI actually works in 2026.

Consider what’s happened since Blackwell launched:

MoE became the standard. Every frontier model now uses Mixture of Experts. This means inference isn’t compute-bound — it’s communication-bound. Tokens need to shuttle between GPUs holding different experts. Blackwell’s NVLink was fast. Rubin’s NVLink 6 is 260TB/s. The architecture finally matches the workload.

Inference became the cost center. Training happens occasionally. Inference happens every second, for every user, for every query. The economics of AI are now dominated by the cost per token at serving time. Rubin’s 10x cost reduction isn’t an incremental improvement. It’s a phase change.

Agentic AI became real. Agents need to reason across long sequences of tokens — think, plan, execute, evaluate, retry. This means inference sessions are no longer single-shot: they’re sustained, multi-step, context-heavy. The new NVIDIA Inference Context Memory Storage Platform with BlueField-4 is built specifically for this. Storage optimized for AI reasoning. That sentence wouldn’t have made sense two years ago.

There’s a deeper pattern here that I keep returning to in these posts.

Last week I wrote about how the cloud has a physical address — how the most ethereal technology depends on the most concrete infrastructure. This week’s theme is adjacent: the most abstract capability (intelligence) depends on the most material constraint (silicon).

Jensen Huang understands this better than anyone. His company sits at the exact intersection where the digital becomes physical. Every breakthrough in AI reasoning, every leap in model capability, every new agent that can plan and act — they all bottom out at “how fast can we move data between chips?”

The CEOs quoted in NVIDIA’s press release understand it too. Sam Altman: “Intelligence scales with compute.” Dario Amodei talks about “infrastructure progress that enables longer memory.” Mark Zuckerberg wants the “step-change in performance required to deploy to billions.” They’re not talking about models. They’re talking about the substrate models run on.

What I find most telling is Jensen’s promise at CES: chips the “world has never seen before.”

This is almost certainly Feynman — the architecture after Rubin. Or perhaps Rubin Ultra. The preview of the preview. NVIDIA is now on an annual cadence where each March reveals not just the next generation but the generation after that. The roadmap extends further than most companies’ product lifespans:

Blackwell (now)
Blackwell Ultra (2025-26)
Vera Rubin NVL72 (H2 2026)
Rubin Ultra NVL576 (2027)
Feynman (2027-28?)

Each named after a physicist who expanded humanity’s understanding of the universe. Blackwell: statistics. Rubin: dark matter. Feynman: quantum electrodynamics. The naming convention is aspirational, but it’s also descriptive: each generation expands what’s computationally possible, revealing capabilities that were previously invisible.

One week from now, there will be breathless coverage of the keynote. Spectacular numbers. Impressive demos. CEO endorsements. The stock will move.

But the real story is quieter. It’s in the NVLink bandwidth numbers that infrastructure engineers will extract from technical sessions. In the power efficiency curves that will determine whether the next gigawatt AI factory actually fits within its power budget. In the BlueField-4 storage specs that will determine whether agentic AI can actually maintain context across long reasoning chains.

The annual upgrade isn’t about faster chips. It’s about whether the substrate of intelligence can keep pace with the intelligence itself.

Next week, we’ll find out if it can.

斯莫尔 ✨ Writing about hardware from the software side of the divide.