
Groq Chip

Real-time AI & HPC.

Uncompromised Low Latency.

Legacy systems force an impossible choice – throughput or latency. An intolerable compromise for mission-critical workloads.

Groq products are built to revolutionize workloads in artificial intelligence (AI) and high performance computing (HPC), meeting the industry’s evolving demands while accelerating developer velocity.

Our innovative, deterministic single-core streaming architecture lays the foundation for the Groq compiler’s unique ability to predict the exact performance and compute time of any given workload. The result is uncompromised low latency and performance, delivering real-time AI and HPC.

How did we do this? We started from scratch with a completely original idea and, following our belief in first principles, we’re radically simplifying compute.

We designed a simple, elegant processor architecture, from the ground up, to accelerate complex workloads in artificial intelligence, machine learning, and high performance computing.

Instead of developing a small programmable core and replicating it hundreds of times, the Tensor Streaming Processor (TSP) architecture houses a single processor with hundreds of functional units. Our software compiles deep learning models into instruction streams, all orchestrated in advance. We moved control into the software stack so we could extract more silicon performance to deliver speedy results.

GroqChip™ Processor

Simplified single-core design enables compute performance

Competitor Chip

Complex multi-core design: processing data costs compute

In most processors, 90% of the transistor effort goes into controlling and moving data. Our simple architecture means that 95% of transistors focus on actually solving your problem.

How did we do it? With an amazing team of innovators. But it’s more than innovation, it’s committing to helping other people innovate. Doesn’t matter whose idea it is. With teams, the result is always better. That’s our magic.

And it works. Our software-driven solutions for compute-intensive applications have delivered industry-leading performance, accuracy and sub-millisecond latency. And we have shown that developer velocity is possible because of our software-defined compute.

Predictability

Your inferences - exactly on time, every time.

Determinism means predictable, repeatable performance with no tail latencies. The hardware also corrects all single-bit errors, preventing uncorrectable multi-bit errors. The total execution time is known at compile time, so you know a workload’s exact runtime and order of operations, with zero variance. Whether it’s one GroqChip™ or a network of chips, workloads run identically the first time and the millionth time, with no performance variation. Better for planning, better for safety, better for budgeting, and better for your developers.
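To make "known at compile time" concrete, here is a minimal toy sketch of a static schedule. It is not Groq's compiler or API: the `Op` type, the functional unit names, the cycle counts, and the 1 GHz clock are all assumptions chosen for illustration. The point is only that when every operation has a fixed, data-independent cost, the full timeline and the total runtime can be computed before anything executes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Op:
    name: str
    unit: str    # functional unit that runs the op (hypothetical names)
    cycles: int  # fixed, data-independent cost in clock cycles (assumed)


def plan(ops, clock_hz):
    """Lay ops out back to back; return (schedule, total runtime in seconds)."""
    schedule = []
    start = 0
    for op in ops:
        schedule.append((start, start + op.cycles, op))
        start += op.cycles
    return schedule, start / clock_hz


# Hypothetical program and clock rate, for illustration only.
PROGRAM = [
    Op("load_weights", "memory", 400),
    Op("matmul", "matrix", 1200),
    Op("relu", "vector", 150),
    Op("store_result", "memory", 250),
]
CLOCK_HZ = 1_000_000_000

schedule, total_seconds = plan(PROGRAM, CLOCK_HZ)
for begin, end, op in schedule:
    print(f"cycles {begin:>5}-{end:<5} {op.unit:<7} {op.name}")
print(f"planned runtime: {total_seconds * 1e6:.1f} microseconds")
```

Because nothing in the plan depends on the input data or on what else is running, planning the same program twice yields the same timeline, which is the property the paragraph above describes.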

Determinism plus low latency equals cutting-edge predictability, enabling true QoS (Quality of Service) with Groq.

[Figure: execution time per pass. Groq: consistent and predictable execution time. Others: variable execution times.]
Developers know total execution time at compile time, with no need for error checking.
Tail latencies and performance variation are a thing of the past.
Better for planning, safety, and budgeting.

Low Latency

All your inferences - right now.

Groq offers the lowest latency machine learning architecture on the market: ~18,800 IPS on ResNet-50 at batch 1, with a latency of ~0.05 ms. Unlike others, we don’t compromise latency for throughput; we maintain high throughput at batch 1. Our batch 1 performance works very differently from that of GPUs tuned for high batch sizes. We don’t wait to collect data and then send it to a processor. Give us one piece of data and we will give you the result. The outcome? Lower latency, more accuracy, and less energy consumed. Image identification, segmentation, natural language processing, and machine vision rely on low latency, predictability, and accuracy so that identification and decisions are timely enough to matter. For your real-time, mission-critical applications, lower latency is the difference maker.

[Figure: low batch size advantage, Groq vs. others.]
The lowest latency machine learning architecture on the market, enabling more complex algorithms.
No compromising latency for throughput.
Peak performance at batch 1, ideal for low latency, mission critical applications.
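The batch 1 figures quoted above can be sanity-checked with simple arithmetic. The sketch below assumes inferences run strictly back to back, so throughput is the reciprocal of latency; the function names and the 2 ms arrival interval are illustrative assumptions, not measurements.

```python
def throughput_ips(latency_ms, batch_size=1):
    """Inferences per second when batches run back to back."""
    return batch_size * 1000.0 / latency_ms


def latency_ms_from_ips(ips, batch_size=1):
    """Per-batch latency implied by a back-to-back throughput figure."""
    return batch_size * 1000.0 / ips


def batch_fill_wait_ms(batch_size, arrival_interval_ms):
    """Time the first input waits while the rest of its batch is collected."""
    return (batch_size - 1) * arrival_interval_ms


# ~0.05 ms per inference implies roughly 20,000 IPS, in line with ~18,800 IPS.
print(round(throughput_ips(0.05)))
# ~18,800 IPS implies roughly 0.053 ms per inference.
print(round(latency_ms_from_ips(18_800), 4))
# Assumed 2 ms between arriving inputs: a batch of 32 delays its first
# input by 62 ms before compute even starts; a batch of 1 adds no wait.
print(batch_fill_wait_ms(32, 2.0), batch_fill_wait_ms(1, 2.0))
```

The last line shows why batch size matters for real-time work: any system that collects inputs into a batch adds waiting time on top of compute time, while batch 1 adds none.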

Velocity

Insights to models - fast.

At Groq, we’re fundamentally changing the developer’s human experience. High performance, low power, low latency are all wonderful technology features, but we’re removing what limits how humans innovate—which happens to be the biggest bottleneck in the industry.

Our chips, nodes, and networks are deterministic. Here, determinism equals industry-best predictable performance, and that’s what speeds up developers. Within seconds of a developer hitting compile, Groq’s static profiling provides an ultra-detailed performance report and visualization of the entire chip’s compute and memory usage for the whole program. The developer gets an immediate, accurate summary of performance metrics such as latency, utilization, bandwidth, power, memory, and FLOP occupancy. Suddenly, the slow, painful dynamic profiling process is gone. On competing platforms, configuration changes and noisy neighbors require continuous debugging. Groq’s TSP architecture executes workloads identically every time, so you never have to debug the same program twice.

[Figure: Groq: determinism-enabled developer velocity. Others: slow, painful dynamic profiling process.]
Groq enables developer velocity with determinism, delivering insights to models faster than ever.
GroqView™ profiler provides performance reports and visualizations of the entire chip’s compute and memory usage for a program at compile time - say goodbye to the slow dynamic profiling process.
Determinism equals predictable execution - no debugging the same program twice, ever.

Accuracy

True numerical reality - consistently.

Uniquely, the GroqChip™ accelerator supports both floating point (FP) and integer numerics, so you can train and deploy inference on the same device. Groq TruePoint™ technology doesn’t trade accuracy for performance across artificial intelligence (AI), machine learning (ML), deep learning (DL), and linear algebra computing, because it works on floating point calculations. Most solutions train on one device and infer on another, using distinct hardware types with dissimilar numerics, which results in unforeseen compatibility issues and increased TCO. Groq’s Tensor Streaming Processor (TSP) architecture is more versatile, employing a single source of numerics. TruePoint enables accuracy at lower input bit sizes, which results in better performance, lower power, and less memory usage for your workloads. For large, complex iterative algorithms, the benefit is less accumulated error; a training algorithm, for example, can see faster convergence. A world with TruePoint means performance and accuracy, rather than a constant choice between the two.

TruePoint™ technology implements a single source of numerics and works on floating point calculations.
Training and inference deployment on the same device.
Delivers reproducible high accuracy and precision without sacrificing performance.
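The point about accumulated error can be illustrated generically. The sketch below is not a description of how TruePoint works internally. It only shows, using standard IEEE 754 half precision from Python's `struct` module, how the same low-bit inputs give very different results depending on whether the running total is also rounded to low precision. The helper names and the test data are assumptions made for this example.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def sum_narrow(values):
    """Half-precision inputs, half-precision accumulator (round every step)."""
    acc = 0.0
    for v in values:
        acc = to_fp16(acc + to_fp16(v))
    return acc


def sum_wide(values):
    """Same half-precision inputs, but a double-precision accumulator."""
    acc = 0.0
    for v in values:
        acc += to_fp16(v)
    return acc


N = 10_000
values = [0.1] * N           # illustrative data only
exact = to_fp16(0.1) * N     # true sum of the rounded inputs

narrow_error = abs(sum_narrow(values) - exact)
wide_error = abs(sum_wide(values) - exact)
print(f"narrow accumulator error: {narrow_error:.3f}")
print(f"wide accumulator error:   {wide_error:.3f}")
```

With a narrow accumulator, the running total eventually grows so large that each new input is smaller than half the gap between neighboring half-precision values, so the sum stops advancing. Keeping the inputs small but accumulating at higher precision avoids that drift, which is the kind of effect iterative algorithms are sensitive to.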

Scalability

Inference to training - one device, scaling near-linearly.

The biggest advantage of Groq’s deterministic architecture is that it scales. Our RealScale™ chip-to-chip network contains no switches or CPUs: GroqChip™ and GroqCard™ are all directly connected, orchestrated by the GroqWare™ suite (API & compiler). Networking is built into the chip, so racks are as efficient as a card across both inference and training. This in turn eliminates the need to move data with extra silicon, saving power and improving overall TCO. From HPC to ML, RealScale offers near-linear scaling from just one to thousands of connected devices.

[Figure: performance in TOps (axis from 0 to 50,000) for GroqChip (1 chip), GroqCard (2 chips), GroqNode (8 chips), and GroqRack (64 chips).]
RealScale™ chip-to-chip network contains no switches or CPUs - networking is built into GroqChip.
Groq hardware scales near-linearly from one to thousands of devices, from inference to training, with no efficiency trade-offs.
Less data movement means more power savings and improved TCO.
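"Near-linear" scaling can be expressed as a simple ratio: measured multi-device performance divided by the single-device figure times the device count. The sketch below uses made-up numbers purely to show the calculation; they are hypothetical placeholders, not Groq measurements, and the function name is an assumption.

```python
def scaling_efficiency(single_device_perf, n_devices, measured_total_perf):
    """Fraction of ideal linear scaling achieved (1.0 means perfectly linear)."""
    ideal = single_device_perf * n_devices
    return measured_total_perf / ideal


# Hypothetical numbers for illustration only: one device delivers 100 units
# of work per second, and a 64-device system delivers 6,080.
efficiency = scaling_efficiency(100.0, 64, 6080.0)
print(f"scaling efficiency: {efficiency:.0%}")
```

A value close to 100% across device counts is what near-linear scaling means in practice. Substitute your own measured figures to evaluate a real deployment.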