
Nvidia, or the Repricing of a Watt

Tokens per watt, bottleneck inversion, and the Hopper–Blackwell arithmetic.

Reading Nvidia: On strategy, defensibility, and capital in the AI era · PART 1 OF 7
  1. Part 0 · Series overview 6 min · Apr 2026
  2. Part 1 · Nvidia, or the Repricing of a Watt 18 min · May 2026 you are here
  3. Part 2 · The Moat That Cannot Be Coded 24 min · May 2026
  4. Part 3 · Pareto Frontiers and Lock-In 25 min · May 2026
  5. Part 4 · The Return of the Real World upcoming
  6. Part 5 · The Job and the Task upcoming
  7. Part 6 · A Note on China upcoming
  8. Part 7 · Drawing the AI Stack upcoming

When Dwarkesh Patel asks Jensen Huang, in a recent podcast interview, to explain Nvidia in one sentence, Huang gives him a formula:

Input: electrons. Output: tokens. In the middle: Nvidia.

The formula sounds like a marketing line at first. Read carefully, though, it is also Huang’s bet on where AI value will accumulate over the next decade, and on how Nvidia plans to keep capturing it.

The AI stack, and where Nvidia sits

Asked how he sees the AI industry, Huang draws a stack of five layers, with one specific twist: he separates chips and systems into distinct layers, where most analysts merge the two into a single “compute” bucket.

[Figure: the AI stack, with sub-actors visible inside the chips and systems layers]

Two things stand out from the diagram.

First, the stack is a relay race. Inside the chips and systems layers sits a sequence of handoffs that no single company controls. Nvidia[1] designs the logic and writes the software, then sends a digital file to TSMC[2], which prints the dies in Taiwan. SK Hynix[3], Micron[4] and Samsung make the HBM (high-bandwidth memory stacks bonded directly onto the package). An ODM in Taiwan[5], typically Foxconn, Wistron, Quanta or Inventec, assembles the racks. A hyperscaler (AWS, Azure, Google Cloud, OCI) or a neocloud (CoreWeave, Nscale, Nebius) plugs them in. The model on top (GPT, Claude, Gemini, Llama) finally emits tokens.

Second, the choice to separate chips from systems is itself a strategic claim. Chips are (almost) commoditisable. AMD has chips. Broadcom designs ASICs[6]. SMIC[7] fabricates them. Systems are not: they are the racks, the networking, the cooling, the software-hardware co-design that takes a chip and turns it into a thing that produces tokens at scale. By naming systems as a distinct layer, Huang is staking out the part of “compute” he thinks remains uncommoditisable, and where Nvidia’s investments in NVLink, NVSwitch, Mellanox and Spectrum-X are concentrated. Reading this stack tells you where Huang has decided to fight.

Nvidia, a software company (really?)

What Nvidia sells across the chain is the difficulty of turning a given watt into the largest possible number of valuable tokens. The silicon is incidental. Huang calls Nvidia a software company whose product is manufactured by others.

That framing should be read with caution. It is a very intentional financial reframing. Plenty of people on the sell side already argue Nvidia is overpriced; the more interesting move is to notice what the formula is doing to the analyst conversation. By describing Nvidia as a software company, Huang is asking investors to ignore the capital intensity of his actual business and to anchor on software-company multiples.

And looking at the numbers, you could conclude he is right. Nvidia is a software company. The capex/revenue ratio is around 3%, against roughly 40% at TSMC and 25 to 35% at the hyperscalers. The balance sheet looks fabless. The margins look fabless. Gross margin sits at 75% in Q4 FY2026, against an industry median of 29%, AMD’s 45% and Intel’s 42%. By every standard ratio used in financial analysis, Nvidia behaves like a software company that happens to ship physical products.

But the framing is more fragile than it looks. Nvidia’s 75% gross margin is a historical peak, not a trend: the 10-year median is 62%, with a low of 57%. And the asset-light balance sheet hides roughly $250 billion in implicit purchase commitments to TSMC, the HBM makers and the ODMs (a number documented by SemiAnalysis), which behave like off-balance-sheet capex. Both points deserve much more space than this paper can give them, and we come back to them in the next entry of the series.

For now, accept the framing with the caution it deserves. The insight behind the formula stays: Nvidia’s core value proposition is transforming electrons into tokens. The question is whether that proposition will resist the same commoditisation pressure that is currently eroding the rest of the AI stack.

The thesis: bottleneck inversion

If AI is commoditising software, why would it spare the company that ships the silicon underneath? Huang’s reply:

The journey from watt to token has to be redone every year at a better cost.

Behind that sentence sits a thesis about where the binding constraint of AI is moving. For most of the last decade, the scarce resource was compute, and the right question was flops per dollar. Energy was effectively free. That world is ending. Data centres are now queuing for grid capacity. Gigawatt connections take years. Texas and Northern Virginia are interconnect-limited. The grid has become the bottleneck, not the chip.

And when the scarce resource moves, pricing power moves with it. Whoever maximises useful output per unit of the new scarce resource captures the surplus. Tokens per watt is the AI version of that principle.

Three things follow.

1. The right benchmark has changed. Flops per dollar measures how cheaply you can do a single arithmetic operation. Tokens per watt measures how much economic output you can extract from a unit of energy. The first was the right question when chips were scarce. The second is the right question once grids are. Most conversations are still anchored on the first, partly out of habit and partly because chip scarcity has not disappeared, only become relatively less binding than energy. Both constraints exist at once. The point is which one decides the marginal investment; the toy sketch after this list makes the flip concrete.

2. The advantage is not buyable off the shelf. A tokens-per-watt advantage does not live in any single component. It lives in how silicon, networking, libraries, compilers and model-level optimisations are co-designed. That co-design has to be redone, end to end, every generation. Money helps. Money does not skip the work, because the work is institutional knowledge accumulated across thousands of engineers over twenty years of CUDA development. We come back to this point in the next entry of the series, because it deserves much more than a sentence.

3. The gap widens with cycles rather than narrowing. Whether the moat is real depends on a single ratio: how much of the Hopper-to-Blackwell efficiency jump came from physics versus from system design. If system design dominates, the gap between Nvidia and the next contender compounds rather than erodes, and every additional product cycle deepens it.
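To see the flip in point 1 concretely, here is a toy comparison. Every number is invented for illustration: two hypothetical accelerators, one cheaper per chip, one more frugal per watt, ranked first under a capital budget and then under a grid-connection budget.

```python
# Toy illustration of the benchmark flip: the same two hypothetical
# accelerators ranked under a dollar constraint, then a power constraint.
# Every number here is invented for illustration only.

chips = {
    # name: (price_usd, power_watts, tokens_per_second)
    "cheap-and-hot": (20_000, 1_200, 450),
    "dear-and-cool": (35_000, 700, 500),
}

budget_usd = 1_000_000  # capital available
budget_watts = 50_000   # grid connection available

for name, (price, watts, tps) in chips.items():
    fleet_by_dollars = (budget_usd // price) * tps  # chips you can afford
    fleet_by_watts = (budget_watts // watts) * tps  # chips you can power
    print(f"{name}: {fleet_by_dollars:,} tok/s if dollars bind, "
          f"{fleet_by_watts:,} tok/s if watts bind")

# cheap-and-hot: 22,500 tok/s if dollars bind, 18,450 tok/s if watts bind
# dear-and-cool: 14,000 tok/s if dollars bind, 35,500 tok/s if watts bind
# The ranking flips with the constraint: flops per dollar picks the first
# chip, tokens per watt picks the second.
```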

Doing the arithmetic

Between Hopper and Blackwell[8], three years apart, Nvidia claims a 30x to 50x jump in energy efficiency at the same workload. That is a very large number. Where does it come from?

Some of it from lithography, the physical printing of smaller transistors onto silicon wafers[9]. The work is done by TSMC in Taiwan, using machines built almost exclusively by ASML in the Netherlands[10]. The standard pace of this layer is Moore’s law: roughly a doubling of transistor density every two years, which in practice delivers on the order of 20% better performance per watt per year. Over three years, that compounds to a factor of roughly 1.75x. That is what Hopper-to-Blackwell got from physics alone.

The rest is system design: architecture, networking, libraries, algorithms, all changed in concert. Do the arithmetic. Fifty divided by 1.75 is roughly 28. Even at the low end of Huang’s range, thirty divided by 1.75 is 17. Lithography buys you a factor of two. System design buys you a factor of seventeen to thirty.
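The same arithmetic as a sketch, for readers who want to vary the inputs. The 20% yearly physics rate and the 30x to 50x totals are the figures from above; the decomposition itself is just compounding and division:

```python
# Decompose a claimed generational efficiency jump into a lithography
# contribution and a system-design residual.

def decompose(total_gain: float, physics_rate_per_year: float, years: int):
    """Return (physics_factor, system_factor) for one generation gap."""
    physics = (1 + physics_rate_per_year) ** years  # compounded lithography gain
    return physics, total_gain / physics            # what physics cannot explain

# Hopper -> Blackwell: three years apart, ~20%/year performance per watt
# from lithography, against Huang's claimed 30x to 50x total.
for claimed in (30.0, 50.0):
    physics, system = decompose(claimed, 0.20, 3)
    print(f"{claimed:.0f}x total = {physics:.2f}x physics * {system:.0f}x system design")

# 30x total = 1.73x physics * 17x system design
# 50x total = 1.73x physics * 29x system design
```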

If most of the gain in tokens per watt comes from how you design the stack rather than from how small you can print the transistors, three things follow:

1. Lithographic export controls become less effective every year. The current US strategy on China rests on cutting off access to ASML’s most advanced EUV machines, which are needed for sub-7nm chips. The implicit assumption is that being stuck on 7nm is a death sentence in AI. The Hopper-to-Blackwell number suggests otherwise. If a country with mature 7nm fabs (China, via SMIC) puts its energy and its researchers into the system-design layer, it can recover most of the lithographic gap on the software side[11]. The export controls slow Chinese AI; they do not stop it. And every year that system design dominates physics, the controls bite less.

2. Hyperscaler ASICs need more than good silicon to compete. The standard argument for why Google’s TPU or Amazon’s Trainium will eventually displace Nvidia is that they are tailored to specific workloads and built without Nvidia’s margin. The 17-30x ratio says that the silicon is the easy part. The hard part is the software that turns the silicon into actual tokens per watt: kernels[12], scheduling[13], networking[14], all the layers above transistor count. Nvidia engineers, when they get inside a partner’s stack, have publicly delivered 2x or more on a meaningful share of kernels and up to 4x throughput in joint optimisation efforts[15]. That gap is the dollar value of CUDA. An ASIC team that wants to match it has to rebuild twenty years of accumulated kernel-level optimisation expertise.

3. Vera Rubin and Feynman are the test of the thesis. These are the names of Nvidia’s next two chip generations[16], expected in 2026 and 2028. If they each deliver another 17-30x system-design advantage over physics alone, the moat widens with every cycle and the formula holds. If the architecture-level innovation flattens (because optimisation hits diminishing returns, because hyperscalers catch up, because Moore’s law itself stalls), Huang’s formula collapses. The thesis is empirically testable, and it will be settled within four years; a stylised sketch of the compounding follows below.
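A stylised sketch of that compounding, under loud assumptions: lithography gains are shared equally, Nvidia’s system-design residual holds at the low end of the Hopper-to-Blackwell figure, and a contender captures only a fraction of the equivalent gain each cycle. The parameters are illustrative, not forecasts; the point is the structure, in which any capture rate below 100% makes the gap grow geometrically.

```python
# Stylised compounding of a per-cycle system-design edge. All parameters
# are illustrative assumptions, not forecasts.

PHYSICS = 1.75        # lithography gain per generation, shared by everyone
NVIDIA_SYSTEM = 17.0  # low end of the Hopper-to-Blackwell residual

def relative_gap(capture: float, cycles: int) -> float:
    """Nvidia's tokens-per-watt lead over a contender after n generations."""
    nvidia = (PHYSICS * NVIDIA_SYSTEM) ** cycles
    rival = (PHYSICS * NVIDIA_SYSTEM * capture) ** cycles
    return nvidia / rival  # shared terms cancel: equals (1 / capture) ** cycles

for capture in (0.5, 0.8, 1.0):
    one, two = relative_gap(capture, 1), relative_gap(capture, 2)
    print(f"contender captures {capture:.0%} of system gains: "
          f"gap {one:.2f}x after Vera Rubin, {two:.2f}x after Feynman")

# contender captures 50% of system gains: gap 2.00x after Vera Rubin, 4.00x after Feynman
# contender captures 80% of system gains: gap 1.25x after Vera Rubin, 1.56x after Feynman
# contender captures 100% of system gains: gap 1.00x after Vera Rubin, 1.00x after Feynman
```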


The standard commoditisation argument about AI focuses on the top two layers of the stack, models and applications, where prices are visibly falling. The tokens-per-watt thesis says the action is two layers down, in the place where watts are turned into economic output. Whether Huang is right will be settled within four years, by Vera Rubin and Feynman.

For now, the formula is intact, and it is a useful starting point for anyone trying to think clearly about where AI value accumulates next.

The interview also offers several other observations on what kind of moat actually defends this rent, and on how Huang himself thinks about strategic decisions in environments with lock-in. We return to both in the entries that follow.


“The journey from watt to token has to be redone every year at a better cost.”

Jensen Huang, on what resists commoditisation in AI


Willy Braun

Footnotes

  1. [Context] Nvidia designs the GPU architecture, the NVLink networking fabric, and the CUDA software stack. It does not own fabs. Founded 1993. Market cap roughly $4T at the time of the interview. What Nvidia sends to TSMC is a GDSII file, the industry-standard digital description of a chip’s physical layout.

  2. [Context] Taiwan Semiconductor Manufacturing Company. Founded 1987. The world’s largest contract chip manufacturer, producing for Nvidia, Apple, AMD, Qualcomm and Broadcom. Runs the leading-edge nodes (N3, N2). Nvidia has been TSMC’s largest customer by revenue since 2024.

  3. [Context] SK Hynix. Korean memory maker. Dominant supplier of HBM3 and HBM3E to Nvidia. HBM (high-bandwidth memory) is stacked DRAM bonded directly to the GPU package, sitting next to the logic die rather than on a separate board.

  4. [Context] Micron. US memory maker. Third-largest supplier of HBM after SK Hynix and Samsung. Nvidia qualified Micron’s HBM3E in 2024 as part of a multi-sourcing strategy.

  5. [Context] ODM = Original Design Manufacturer. In the Nvidia supply chain, the four main ODMs (Foxconn, Wistron, Quanta, Inventec) are all based in Taiwan. They assemble GPUs, HBM, networking cards, power delivery and cooling into the finished rack, which then ships to hyperscalers or neoclouds.

  6. [Context] ASIC = Application-Specific Integrated Circuit. A chip designed for one workload (rather than a general-purpose GPU which can run many). In AI, the most prominent ASIC programmes are Google’s TPU, Amazon’s Trainium, Meta’s MTIA and Microsoft’s Maia. ASICs typically achieve better performance per watt on their target workload, in exchange for far less flexibility.

  7. [Context] SMIC = Semiconductor Manufacturing International Corporation. Shanghai-based, founded 2000, China’s largest contract chip foundry. Plays in China roughly the role TSMC plays globally, at one or two technology generations behind. Currently fabricates at 7nm using older DUV (deep ultraviolet) lithography with multi-patterning, since US export controls have blocked its access to ASML’s advanced EUV machines. Huawei’s most advanced AI chips, including the Ascend 910C, are made at SMIC.

  8. [Context] Hopper (H100, H200) is Nvidia’s GPU architecture released in 2022, named after Grace Hopper. It is the chip that trained most of the frontier models of 2023-2024 (GPT-4, Claude 3, Gemini). Blackwell (B100, B200, GB200), named after mathematician David Blackwell, is its 2024-2025 successor and the current top of Nvidia’s product line. The transition from one generation to the next is the basic unit on which Nvidia measures its own progress.

  9. [Context] A wafer is the round disc of silicon (typically 300mm in diameter today) onto which transistors are printed. A finished wafer is then cut into individual chips. Wafer cost, defect rates and processing time per layer are the three structural levers of chip economics. For more on the upstream physics of wafers and the constraints they impose on AI compute, see our deep dive: The Three Bottlenecks of AI Compute.

  10. [Context] Modern advanced chips are made in two coupled steps. First, ASML (Dutch company, near-monopoly) builds the lithography machines that print circuits onto silicon wafers using extreme ultraviolet light (EUV). Each machine costs $200-410 million. Second, TSMC (Taiwanese, around 90% of advanced foundry market) operates these machines inside its fabs and turns wafers into finished chips. Sub-7nm chips, including everything Nvidia ships today, require EUV. SMIC in China does not have access to ASML EUV machines (US export controls) and is therefore stuck on 7nm via older DUV multi-patterning techniques.

  11. [Caveat] “Recover most of the gap” should be read carefully. If Chinese system-design improvement velocity matches Nvidia’s, China still ends up with hardware that is roughly 2x worse on tokens per watt, because the lithographic gap (1.75x at Hopper-Blackwell pace) does not disappear, only becomes a smaller fraction of the total. The argument is not that the gap closes entirely. It is that the gap matters far less than the lithographic export controls assume.

  12. [Context] A kernel is a small piece of code that runs on the GPU and performs one specific computation, such as a matrix multiplication or an attention operation. Models are built by chaining thousands of these kernels. The same model running on the same hardware can be two to ten times faster or slower depending on how the kernels are written: how memory is accessed, how parallelism is exploited, how data flows between the GPU and the rest of the system.

  13. [Context] Scheduling is the layer that decides when and where each kernel runs on the GPU, how memory is shared between concurrent tasks, and how compute is overlapped with data transfers. A modern GPU has thousands of cores running in parallel. A bad scheduler leaves them idle, and a single training run can be 30% to 50% slower than necessary.

  14. [Context] Networking here means how dozens, hundreds or thousands of GPUs communicate when training or serving a model that does not fit on a single GPU. Inside a rack, GPUs talk over NVLink (Nvidia’s proprietary high-bandwidth bus). Between racks, they use InfiniBand or Spectrum-X (Nvidia’s Ethernet-based AI fabric). A 1000-GPU cluster can run at 30% or 90% of theoretical efficiency depending on networking quality.

  15. [Caveat] Huang’s “2x” in the interview is a confident average. Independent benchmarks suggest the real picture is more nuanced. A recent Cursor / Nvidia collaboration on Blackwell kernel optimisation (October 2025) reported a 38% geometric mean speedup across 235 production kernels, with 2x or more achieved on 19% of them. A separate vLLM / Nvidia collaboration delivered up to 4x throughput improvement at similar latency between Hopper and Blackwell. The 2x figure is therefore real on a meaningful share of workloads, rather than a guaranteed average across all of them; a short sketch after these footnotes shows how a geometric-mean speedup is computed.

  16. [Context] Vera Rubin (named after the astronomer who discovered evidence of dark matter) and Feynman (named after the physicist Richard Feynman) are Nvidia’s two next chip generations, expected in 2026 and 2028 respectively. Each one represents a roadmap commitment Nvidia has already made publicly to its supply chain and customers. The architecture-versus-physics ratio that defines the tokens-per-watt thesis will be empirically tested across these two cycles.
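A side note on footnote 15: the geometric mean is the standard way to average multiplicative ratios like speedups, computed as the exponential of the mean log ratio. A minimal sketch, with invented numbers rather than the Cursor / Nvidia data:

```python
import math

# Geometric mean of per-kernel speedups: the standard average for ratios.
# The speedups below are invented for illustration; they are not the
# Cursor / Nvidia benchmark data cited in footnote 15.
speedups = [1.1, 1.3, 2.4, 0.95, 1.6]

geomean = math.exp(sum(math.log(s) for s in speedups) / len(speedups))
share_2x = sum(s >= 2.0 for s in speedups) / len(speedups)

print(f"geometric mean speedup: {geomean:.2f}x")          # 1.39x
print(f"share of kernels at 2x or more: {share_2x:.0%}")  # 20%
```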

← Return to the series overview: Reading Nvidia: On strategy, defensibility, and capital in the AI era