Instead of developing a small programmable core and replicating it hundreds of times, the Tensor Streaming Processor (TSP) architecture houses a single processor with hundreds of functional units. Our software compiles deep learning models into instruction streams, all orchestrated in advance. We moved control into the software stack so we could extract more silicon performance to deliver speedy results.
In most processors, 90% of the transistor effort goes into controlling and moving data. Our simple architecture means that 95% of transistors focus on actually solving your problem.
How did we do it? With an amazing team of innovators. But it’s more than innovation; it’s a commitment to helping other people innovate. It doesn’t matter whose idea it is. With teams, the result is always better. That’s our magic.
Determinism means predictable, repeatable performance with no tail latencies. All single-bit errors are corrected in hardware, which prevents uncorrectable multi-bit errors. The total execution time is known at compile time, so you know a workload’s exact runtime and order of operations, with zero variance. Whether it’s one GroqChip™ or a network of chips, workloads run identically the first time and the millionth time, with no performance variation. Better for planning, better for safety, better for budgeting, and better for your developers.
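To make "known at compile time" concrete, here is a minimal sketch of how a fixed schedule yields a runtime before anything executes. Everything in it is an assumption for illustration: the operation names, functional units, cycle counts, and clock rate are hypothetical and are not GroqChip™ specifications or GroqWare™ output.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    name: str         # hypothetical operation name
    unit: str         # hypothetical functional unit
    issue_cycle: int  # cycle at which the compiler schedules the operation
    latency: int      # fixed number of cycles the operation takes


def total_cycles(schedule):
    """Cycle at which the last scheduled operation finishes."""
    return max(i.issue_cycle + i.latency for i in schedule)


def runtime_us(schedule, clock_mhz):
    """Runtime in microseconds for a given clock rate in MHz."""
    return total_cycles(schedule) / clock_mhz


# A made-up schedule: every issue cycle and latency is fixed in advance,
# so the total is a property of the program, not of a particular run.
SCHEDULE = [
    Instr("load_weights", "memory", 0, 4),
    Instr("load_input", "memory", 1, 4),
    Instr("matmul", "matrix", 5, 20),
    Instr("relu", "vector", 25, 2),
    Instr("store", "memory", 27, 4),
]
CLOCK_MHZ = 1000.0  # assumed clock rate, for illustration only

if __name__ == "__main__":
    print(total_cycles(SCHEDULE), "cycles")
    print(runtime_us(SCHEDULE, CLOCK_MHZ), "microseconds")
```

Because nothing in the schedule depends on runtime events such as cache misses or arbitration, calling `total_cycles` twice on the same program always gives the same answer, which is the property the paragraph above describes.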
Determinism plus low latency equals cutting-edge predictability, enabling true QoS (Quality of Service) with Groq.
Groq offers the lowest-latency machine learning architecture on the market: ~18,800 IPS on ResNet-50 at batch 1, with a latency of ~0.05 ms. Unlike others, we don’t compromise latency for throughput; we maintain high throughput at batch 1. Our batch 1 performance works very differently from GPUs that rely on large batches for performance. We don’t wait to collect data and then send it to a processor. Give us one piece of data, and we will give you the result. The outcome? Lower latency, more accuracy, and less energy consumed. Image identification, segmentation, natural language processing, and machine vision rely on low latency, predictability, and accuracy so that identification and decisions are timely enough to matter. For your real-time, mission-critical applications, lower latency is the difference maker.
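The two figures above are consistent with each other, as a quick back-of-the-envelope check shows. The sketch below assumes one input is processed at a time with nothing else in flight (an assumption, not a statement about how the chip pipelines work), and the batch size and arrival rate in the second helper are hypothetical numbers chosen only to show why waiting to fill a batch adds latency.

```python
def per_item_latency_ms(results_per_second):
    """Latency per result in ms, assuming one input in flight at a time."""
    return 1000.0 / results_per_second


def batch_fill_wait_ms(batch_size, arrivals_per_second):
    """Time the first request waits while a batch of `batch_size` fills up."""
    return (batch_size - 1) * 1000.0 / arrivals_per_second


if __name__ == "__main__":
    # ~18,800 IPS at batch 1 implies roughly 0.053 ms per result.
    print(round(per_item_latency_ms(18_800), 4), "ms per result")

    # Hypothetical batching system: inputs arrive at 1,000 per second.
    for batch in (1, 8, 64):
        wait = batch_fill_wait_ms(batch, 1_000)
        print(f"batch {batch}: first request waits {wait} ms before compute starts")
```

At batch 1 the fill wait is zero, so the only latency left is the compute itself.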
At Groq, we’re fundamentally changing the developer experience. High performance, low power, and low latency are all wonderful technology features, but we’re also removing what limits how humans innovate, which happens to be the biggest bottleneck in the industry.
Our chips, nodes, and networks are deterministic. Here, determinism equals industry-best predictable performance, and that’s what speeds up developers. Within seconds of a developer hitting compile on their code, Groq’s static profiling provides an ultra-detailed performance report and a visualization of the entire chip’s compute and memory usage for the whole program. The developer has an immediate and accurate summary of performance metrics such as latency, utilization, bandwidth, power, memory, and FLOP occupancy. Suddenly, the slow, painful dynamic profiling process is gone. On competing platforms, configuration changes and noisy neighbors mean continuous debugging. Groq’s TSP architecture means workloads are executed identically every time, so you never have to debug the same program twice.
Uniquely, the GroqChip™ accelerator supports both floating-point (FP) and integer numerics, so you can train and deploy inference on the same device. With Groq TruePoint™ technology, you don’t have to sacrifice accuracy for performance, or performance for accuracy, in floating-point calculations across artificial intelligence (AI), machine learning (ML), deep learning (DL), and linear algebra computing. Most solutions train on one device and infer on another, using distinct hardware types with dissimilar numerics, which results in unforeseen compatibility issues and increased TCO. Groq’s Tensor Streaming Processor (TSP) architecture is more versatile, employing a single source of numerics. TruePoint enables accuracy at lower input bit sizes, which results in better performance, lower power, and less memory usage for your workloads. The benefit for large, complex iterative algorithms is less accumulated error; for example, a training algorithm can see faster convergence. A world with TruePoint means performance and accuracy, rather than a constant choice between the two.
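Why accumulated error matters with low-bit inputs can be shown with a generic numerical experiment. This is not TruePoint's implementation, which the text above does not describe; it is only a sketch, under the assumption that inputs are stored in IEEE 754 half precision, comparing a running sum kept in half precision against one kept in a wider format.

```python
import struct
from fractions import Fraction


def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def sum_narrow(values):
    """Accumulate in fp16: the running sum is rounded after every add."""
    acc = 0.0
    for v in values:
        acc = to_fp16(acc + v)
    return acc


def sum_wide(values):
    """Accumulate in a wider format (Python's 64-bit float)."""
    acc = 0.0
    for v in values:
        acc += v
    return acc


# Low-precision inputs: 10,000 copies of 0.1 rounded to fp16.
inputs = [to_fp16(0.1)] * 10_000

# Exact reference sum of those fp16 inputs, using rational arithmetic.
exact = float(sum(Fraction(v) for v in inputs))

err_narrow = abs(sum_narrow(inputs) - exact)
err_wide = abs(sum_wide(inputs) - exact)

if __name__ == "__main__":
    print("exact sum of fp16 inputs:", exact)
    print("error with fp16 accumulator:", err_narrow)
    print("error with wide accumulator:", err_wide)
```

The inputs are identical in both runs; only the accumulator width differs. Once the narrow running sum grows large enough, each new term is smaller than half the spacing between representable values and is rounded away, which is the kind of accumulated error that slows iterative algorithms.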
The biggest advantage of Groq’s deterministic architecture is that it scales. Our RealScale™ chip-to-chip network contains no switches or CPUs: GroqChip™ and GroqCard™ devices are all directly connected, orchestrated by the GroqWare™ suite (API and compiler). Networking is built into the chip, so racks are as efficient as a card across both inference and training. This in turn eliminates the need to move data with extra silicon, saving power and improving overall TCO. From HPC to ML, RealScale offers near-linear scaling from just one to thousands of connected devices.