What NVIDIA Didn’t Say

Written by:
Groq

Hi everyone,

We were captivated by Jensen Huang’s opening keynote last week at NVIDIA GTC. He did a masterful job, mixing in humor, a sly Taylor Swift reference, Michael Dell flyovers, and sci-fi references (Star Trek’s opening sequence and a callback to Silent Running’s droids).

And, of course, the Blackwell announcement. Let’s dig into that. What did Jensen say, and what didn’t he say? Here’s what we heard.  

What He Said

Jensen spent the first part of his talk describing how creating simulations and “digital twins” in the NVIDIA-hosted Omniverse can create wonderful new solutions. Many of these solutions, though, require ever-larger models and real-time performance in inference. Which leads to a problem: compute. General computing has “run out of steam.” We need a new approach.

What is this new approach? Big. Blackwell big: 20 petaflops (up to 40 petaflops at FP4), 208 billion transistors, multi-trillion parameter large language models (LLMs). To paraphrase Tiny Elvis, that chip is huge.  

Jensen went on to talk about all the different things big compute will enable, including a whole host of “co-pilot”, “conversational”, and “digital twin” solutions. These solutions demand superior inference performance, which has been hard to achieve when the model is too big to fit on a single chip. Hence, there is a tradeoff between throughput and speed, so businesses need to choose which to prioritize. 

Jensen’s answer, once again, was big. 

What He Didn’t Say

While Jensen talked about getting bigger, he didn’t say anything about getting smarter. Nothing about changing the underlying approach to running models. Nothing about how to improve performance or efficiency through redesigned system architecture. Nothing about how there is another way to achieve the inference performance, cost, and energy objectives of the industry. 

We call this other way the Groq LPU™ Inference Engine, which we architected from the ground up to run solutions like LLMs and other generative AI models with 10X better interactive performance and 10X better energy efficiency. 

How does the LPU achieve such superior specs? When a model is compiled to run on the Groq LPU, the compiler partitions the model into smaller chunks, which are spatially mapped onto multiple LPU chips. The result is like a compute assembly line. Each cluster of LPUs on this line is set up to run a particular compute stage, and it stores all of the data needed to perform that task in its local on-chip SRAM. The only data a cluster needs to retrieve from other chips is the intermediate output generated by either the previous compute stage or the current compute stage. This data transfer is entirely LPU to LPU, requiring no external HBM chips and no external router.
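As a rough illustration of the assembly line idea, here is a toy Python sketch. It is not Groq's compiler or API: the names `Stage`, `partition_layers`, and `run_pipeline` are hypothetical, and each "layer" is just a scalar weight applied to a single number. It only shows a model split into contiguous chunks, with each stage keeping its own weights locally and handing nothing but its intermediate output to the next stage.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Stage:
    """One hypothetical compute stage that holds its own weights locally."""
    weights: List[float]

    def run(self, activation: float) -> float:
        # Apply each local layer in order; nothing is fetched from outside the stage.
        for w in self.weights:
            activation = activation * w + 1.0
        return activation


def partition_layers(layer_weights: List[float], num_stages: int) -> List[Stage]:
    """Split the layers into contiguous, near-equal chunks, one chunk per stage."""
    if not 1 <= num_stages <= len(layer_weights):
        raise ValueError("num_stages must be between 1 and the number of layers")
    base, extra = divmod(len(layer_weights), num_stages)
    stages, start = [], 0
    for i in range(num_stages):
        size = base + (1 if i < extra else 0)  # earlier stages absorb the remainder
        stages.append(Stage(layer_weights[start:start + size]))
        start += size
    return stages


def run_pipeline(stages: List[Stage], x: float) -> float:
    """Only the intermediate output moves from one stage to the next."""
    for stage in stages:
        x = stage.run(x)
    return x
```

Splitting the same layers across one stage or several gives the same result; what changes is where the weights live and how little has to travel between stages.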

This efficient, assembly line architecture is only feasible because the Groq LPU Inference Engine is entirely deterministic. That means that from the moment a new workload is compiled to run on Groq, the system knows exactly what is happening at each stage on each chip at each moment. This perfect determinism enables the assembly line to work at peak efficiency.  

Contrast this with how GPUs work. They operate in small chip clusters, and each cluster executes every sequential compute stage required to generate a token. For each stage, the GPUs retrieve all data required to execute the stage (model parameters, cumulative output of previous stages) from high bandwidth memory (HBM) on another chip, and after they complete their task, the data goes back to the off-chip HBM. The architecture is non-deterministic, so all the data being shuttled around requires direction from a router, yet another external chip. This is inefficient and expensive.
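The difference in data movement described here can be put into a back-of-the-envelope model. The Python sketch below is only an illustration of the two descriptions in this post, not a measurement of any real GPU or LPU: the function names are hypothetical, the byte counts in the usage note are made up, and the model ignores batching, caching, and everything else real hardware does.

```python
from typing import Sequence


def offchip_bytes_fetch_weights(param_bytes_per_stage: Sequence[int],
                                activation_bytes: int) -> int:
    """Bytes moved per token when every stage fetches its data from off-chip memory.

    Assumption: each stage reads its parameters and its input activation,
    then writes its output activation back off-chip.
    """
    return sum(p + 2 * activation_bytes for p in param_bytes_per_stage)


def interchip_bytes_resident_weights(num_stages: int, activation_bytes: int) -> int:
    """Bytes moved per token when weights stay on-chip at each stage.

    Assumption: the only traffic is the intermediate output handed from
    one stage to the next, so there are num_stages - 1 handoffs.
    """
    if num_stages < 1:
        raise ValueError("num_stages must be at least 1")
    return (num_stages - 1) * activation_bytes
```

For example, with three stages of 100, 200, and 300 parameter bytes and a 10-byte activation (purely invented numbers), the first function returns 660 and the second returns 20. The gap comes entirely from whether the parameters have to move.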

The Model T & the LPU

Another thing Jensen didn’t talk about: the Ford Model T. In 1913 Henry Ford transformed the automotive industry, and all of manufacturing, by adopting the assembly line method for building cars. Before then, a car under construction was stationary, while the workers who were building it moved back and forth, often retrieving required parts from nearby storage. 

Under Ford’s new assembly line system, the emerging car moved on a conveyor belt and the workers stayed in place, each performing just one or two tasks with all the parts they needed right at hand. It took 84 stations to assemble 3,000 parts into a Model T. Sound familiar? It should, because this is analogous to how LPUs create tokens versus how GPUs do it. 

Short of replacing an outdated process, such as the GPU’s inference architecture, the only way to improve it is to supercharge it. For example, two of the highlights of the Blackwell announcement are its bigger HBM storage and bandwidth and its more powerful routers. They didn’t eliminate the need for separate HBM storage and router chips; they made bigger ones.

These improvements are part of how Blackwell can achieve, according to one of the graphs displayed during the keynote, a substantial inference performance improvement over current GPUs. That same graph also highlights the limitations of GPU architecture. It featured throughput (tokens / GPU / second) on the Y axis and interactivity (tokens / user / second) on the X axis, to demonstrate how there is always going to be a tradeoff between the two. This tradeoff doesn’t exist for LPUs. If you maximize efficiency of token generation via an LPU’s assembly line architecture, you can optimize both interactivity and throughput!  

Furthermore, the X axis in Jensen’s graph – interactivity – maxed out at 50 tokens / second. This may seem fast for a 1.8 trillion parameter model running on a GPU, which is what the graph represents, but it’s far too slow for the next generation of real-time AI solutions. To succeed, interactive co-pilots and conversational AI assistants will need to achieve human levels of fluidity in their responses and reasoning. 50 tokens / second won’t cut it.
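To make the two axes of that graph concrete, here is a small Python sketch of the arithmetic behind them. The function names are hypothetical, the 500-token response length in the usage note is an arbitrary example, and the throughput formula assumes every concurrent user receives tokens at the same rate.

```python
def seconds_to_generate(num_tokens: int, tokens_per_user_per_second: float) -> float:
    """Time one user waits for a response of num_tokens at a given interactivity."""
    if tokens_per_user_per_second <= 0:
        raise ValueError("tokens_per_user_per_second must be positive")
    return num_tokens / tokens_per_user_per_second


def throughput_per_chip(tokens_per_user_per_second: float,
                        concurrent_users: int,
                        num_chips: int) -> float:
    """Tokens / chip / second, assuming all users are served at the same rate."""
    if num_chips <= 0:
        raise ValueError("num_chips must be positive")
    return tokens_per_user_per_second * concurrent_users / num_chips
```

At 50 tokens / second, `seconds_to_generate(500, 50)` gives 10.0: a user waits ten seconds for a 500-token answer. The second function shows why the two axes pull against each other on a fixed set of chips: raising throughput per chip means serving more users at once, which on the graph comes at the cost of tokens per user.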

Then there’s energy consumption. All that data moving back and forth between GPUs and HBM chews up a lot of joules, and supercharging the various components doesn’t change that. Scaling up usually yields economies, and Blackwell is indeed much more efficient. But it’s still shuttling data between chips for every single compute task. Economies of scale can’t fix that. By comparison, the Groq LPU Inference Engine is already at least 10X more energy efficient than GPUs, because its assembly line approach minimizes off-chip data flow.  

Faster Horses

Henry Ford is often quoted as saying, “If I would have asked people what they wanted, they would have said faster horses.”

NVIDIA’s Blackwell isn’t just faster horses; it’s more of them, tied to more buggies, yoked together by an expanding network of harnesses. The scale is stupendous, the engineering remarkable, and it’s still a horse and buggy architecture. That’s another thing Jensen didn’t say.
