Announcements

Posts

We’re looking for creative problem solvers to help us build the next generation of #MachineLearning Systems! Apply here https://groq.com/careers/?gh_jid=5090081003

Insights

Accelerating Systems with Real-time AI Solutions

Built for leaders needing the fastest time-to-value

Groq offers comprehensive end-to-end acceleration solutions, from scalable, ultra-low-latency systems to generalized software. We improve results by orders of magnitude for customers who need to modernize underperforming systems. Groq increases performance and enables innovation unlike any other technology provider.

Our First Generation Solutions


GroqCard™

A single GroqChip™ with a PCIe Gen4 x16 interface and 230 MB of on-die memory delivers up to 750 TOPs (INT8) and 188 TFLOPs (FP16) at 900 MHz.


GroqNode™

Eight GroqCard™ accelerators in a rack-ready 4U server chassis support millions of parameters and form a scalable building block for a single-hop global network, delivering up to 6 POPs (INT8) and 1.5 PFLOPs (FP16) at 900 MHz.


GroqRack™

Eight interconnected GroqNode servers (plus one redundant node) deliver low system latency of ~1.6 µs and up to 48 POPs (INT8) and 12 PFLOPs (FP16) at 900 MHz.


GroqCloud™

GroqCloud delivers 216 POPs (INT8) and 54 PFLOPs (FP16), and is growing.

Groq offers deployment optionality to meet customers where they need performance most, while our architecture and software tools improve developer velocity and deliver industry-leading results.

Redefining the Developer Experience

  • Groq™ Compiler: Out-of-the-box applicability, serving the majority of industry-standard models
  • Groq API: Meets tailored solution needs with fine-grained control
  • Productivity Tools: GroqView™ Visualization and Profiler, Performance Estimator, and GroqFlow™ Tool Chain
  • Try Out Groq: Contact us for access to the GroqWare™ Developer Tools Package

Providing Value to Customers Today

Key Advantages and Differentiators

GroqCloud™

GroqCloud can be used today, from anywhere, to take advantage of our architectural innovation, providing synchronous, deterministic, ultra-low-latency compute with the same quality as our on-premises products. The advantages of the Tensor Streaming Processor (TSP) architecture include predictability, which gives customers finer control over usage, inference, and cost. Developers can reduce time-to-production using the generalized compiler and software solutions included in our cloud service, with no need to stand up their own racks or systems. Access via GroqCloud also means scalability as the workload or model demands it. Groq is focused on TCO, developer velocity, ease of use, and time-to-results, all of which are best facilitated via our cloud and co-cloud options.

Predictability

Your inferences - exactly on time, every time.

Determinism means predictable and repeatable performance, with tail latencies eliminated. All single-bit errors are corrected in hardware, preventing uncorrectable multi-bit errors. Total execution time is known at compile time, which means knowing a workload's exact runtime and order of operations, with zero variance. Whether it's one GroqChip™ or a network of chips, workloads run identically the first time and the millionth time, with no performance variation. Better for planning, better for safety, better for budgeting, and better for your developers.

Determinism plus low latency equals cutting-edge predictability, enabling true QoS (Quality of Service) with Groq.
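
To make the compile-time claim concrete, here is a minimal, purely illustrative sketch (ours, not Groq's toolchain): when every instruction has a fixed, known cycle cost and there is no dynamic arbitration, a compiler can sum the schedule and report the exact runtime before anything executes.

```python
# Illustrative toy model only (not Groq's toolchain): with fixed, known
# cycle costs per instruction and no dynamic arbitration (no caches,
# queues, or reordering), exact runtime is computable at compile time.

FIXED_CYCLE_COST = {"load": 4, "matmul": 16, "store": 4}  # hypothetical costs

def compile_time_runtime(program: list[str], clock_hz: float) -> float:
    """Return exact wall-clock runtime without executing the program."""
    total_cycles = sum(FIXED_CYCLE_COST[op] for op in program)
    return total_cycles / clock_hz

program = ["load", "matmul", "matmul", "store"]   # 40 cycles total
# At the 900 MHz quoted for GroqChip, 40 cycles is ~44.4 ns, and every
# run takes exactly this long: zero variance, no tail latency.
print(f"{compile_time_runtime(program, 900e6) * 1e9:.1f} ns")
```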

Consistent & predictable execution time Per Pass (execution time) Groq Variable execution times Per Pass (execution time) Others
one
  • Developers know total execution time at compile time, with no need for error checking.
  • Tail latencies and performance variation are a thing of the past.
  • Better for planning, safety, and budgeting.

Low Latency

All your inferences - right now.

Groq offers the lowest-latency machine learning architecture on the market: ~18,800 IPS on ResNet-50 at batch 1, with a latency of ~0.05 ms. And unlike others, we don't compromise latency for throughput; we maintain high throughput at batch 1. Our batch-1 performance works very differently from that of high-batch GPUs. We don't wait to collect data and then send it to a processor: give us one piece of data and we will give you the result. The response? Lower latency, more accuracy, and less energy consumed. Image identification, segmentation, natural language processing, and machine vision rely on low latency, predictability, and accuracy so that identifications and decisions arrive in time to matter. For your real-time, mission-critical applications, lower latency is the difference maker.
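
As a rough consistency check on those numbers (our arithmetic, not a published derivation): at batch 1, with requests handled one at a time, throughput is bounded by the reciprocal of latency.

```python
# Rough consistency check (our arithmetic, not a published figure):
# at batch 1 with requests handled one at a time, throughput <= 1/latency.

latency_s = 0.05e-3               # ~0.05 ms quoted ResNet-50 batch-1 latency
implied_ceiling = 1 / latency_s   # 20,000 inferences per second
quoted_ips = 18_800               # quoted ResNet-50 batch-1 throughput

print(f"ceiling: {implied_ceiling:,.0f} IPS, quoted: {quoted_ips:,} IPS")
# ~18,800 IPS sits just under the 20,000 IPS ceiling, i.e. the throughput
# here comes from low latency itself, not from batching requests.
```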

[Figure: throughput vs. batch size, showing Groq's low-batch-size advantage over others.]
  • The lowest-latency machine learning architecture on the market, enabling more complex algorithms.
  • No compromising latency for throughput.
  • Peak performance at batch 1, ideal for low-latency, mission-critical applications.

Velocity

Insights to models - fast.

At Groq, we're fundamentally changing the developer experience. High performance, low power, and low latency are all wonderful technology features, but we're removing what limits how humans innovate, which happens to be the biggest bottleneck in the industry.

Our chips, nodes, and networks are deterministic. Here, determinism equals industry-best predictable performance, and that's what speeds up developers. Within seconds of a developer hitting compile on their code, Groq's static profiling provides an ultra-detailed performance report and a visualization of the entire chip's compute and memory usage for the whole program. The developer gets an immediate and accurate summary of performance metrics such as latency, utilization and bandwidth, power, memory, and FLOP occupancy. Suddenly, the slow, painful dynamic-profiling process is gone. And where competitors' configuration changes and noisy neighbors demand continuous debugging, Groq's TSP architecture executes workloads identically every time, so you never have to debug the same program twice.
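
For flavor, here is a minimal sketch of that compile-and-run loop using the GroqFlow™ tool chain's groqit() entry point; treat the exact API surface as an assumption drawn from GroqFlow's public examples, and consult its documentation for specifics.

```python
# Minimal GroqFlow sketch; the API surface here is an assumption based
# on GroqFlow's public examples and may differ between versions.
import torch
from groqflow import groqit

class TinyModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = torch.nn.Linear(128, 64)

    def forward(self, x):
        return torch.relu(self.fc(x))

inputs = {"x": torch.randn(1, 128)}

# Build for GroqChip: because execution is statically scheduled, the
# toolchain can report performance at compile time instead of requiring
# a long dynamic-profiling run on hardware.
gmodel = groqit(TinyModel(), inputs)

outputs = gmodel(**inputs)   # deterministic: identical timing on every run
```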

[Figure: developer velocity, determinism-enabled process for Groq vs. slow, painful dynamic profiling for others.]
  • Groq enables developer velocity with determinism, delivering insights to models faster than ever.
  • The GroqView™ profiler provides performance reports and visualizations of the entire chip's compute and memory usage for a program at compile time - say goodbye to the slow dynamic profiling process.
  • Determinism equals predictable execution - no debugging the same program twice, ever.

Accuracy

True numerical reality - consistently.

Uniquely, the GroqChip™ accelerator contains both floating-point (FP) and integer compute, so you can train and deploy inference on the same device. Groq TruePoint™ technology doesn't sacrifice accuracy for performance, or vice versa, across artificial intelligence (AI), machine learning (ML), deep learning (DL), and linear algebra computing, because it works on floating-point calculations. Most solutions train on one device and infer on another, using distinct hardware types with dissimilar numerics, resulting in unforeseen compatibility issues and increased TCO. Groq's Tensor Streaming Processor (TSP) architecture is more versatile, employing a single source of numerics. TruePoint enables accuracy at lower input bit widths, which results in better performance, lower power, and less memory usage for your workloads. The benefit for large, complex, iterative algorithms is less accumulated error; for example, a training algorithm can see faster convergence. A world with TruePoint means performance and accuracy, rather than a constant decision between the two.
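
The accumulated-error benefit is easy to see in a generic example (ours, not TruePoint itself): keeping the accumulator in higher precision while inputs stay at a low bit width avoids the drift that pure low-precision arithmetic suffers.

```python
# Generic illustration (ours, not TruePoint itself): summing many small
# fp16 values in an fp16 accumulator drifts badly, while accumulating the
# same fp16 inputs in fp32 stays accurate; this is the general benefit of
# pairing low-bit-width inputs with higher-precision accumulation.
import numpy as np

x = np.full(100_000, 1e-4, dtype=np.float16)   # true sum: ~10.0

naive = np.float16(0.0)
for v in x:                        # fp16 accumulator: updates stall once
    naive = np.float16(naive + v)  # the addend falls below half an ulp

accurate = np.sum(x, dtype=np.float32)          # fp32 accumulator

print(f"fp16 accumulate: {float(naive):.4f}")    # stalls near 0.25
print(f"fp32 accumulate: {float(accurate):.4f}") # ~10.0
```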

  • TruePoint™ technology implements a single source of numerics, working on floating-point calculations.
  • Training and inference deployment on the same device.
  • Delivers reproducible high accuracy and precision without sacrificing performance.

Scalability

Inference to training - one device, scaling near-linearly.

The biggest advantage of Groq's deterministic architecture is that it scales. Our RealScale™ chip-to-chip network contains no switches or CPUs: GroqChip™ and GroqCard™ devices are all direct connections, orchestrated by the GroqWare™ suite (API and compiler). Networking is built into the chip, so a rack is as efficient as a card across both inference and training. This in turn eliminates the need to move data through extra silicon, saving power and improving overall TCO. From HPC to ML, RealScale offers near-linear scaling from just one to thousands of connected devices.
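
Read against the figures above, the near-linear claim checks out at the level of published peaks; the composition below uses this page's numbers, while the efficiency framing is ours.

```python
# Scaling check using this page's published INT8 figures; the efficiency
# framing (perf(N) / (N * perf(1)), where 1.0 is perfectly linear) is ours.

systems = {               # system: (chips, peak INT8 TOPs)
    "GroqChip": (1,     750),
    "GroqNode": (8,   6_000),   # 6 POPs
    "GroqRack": (64, 48_000),   # 48 POPs
}

chip_tops = systems["GroqChip"][1]
for name, (chips, tops) in systems.items():
    efficiency = tops / (chips * chip_tops)
    print(f"{name}: {chips:3d} chip(s), efficiency = {efficiency:.2f}")
# Published peaks compose exactly linearly; RealScale's switch- and
# CPU-free direct connections are what let real workloads approach this.
```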

[Figure: peak TOPs vs. system size, from GroqChip (1 chip) and GroqCard (2 chips) to GroqNode (8 chips) and GroqRack (64 chips).]
  • RealScale™ chip-to-chip network contains no switches or CPUs - networking is built into GroqChip.
  • Groq hardware scales near-linearly from one to thousands of devices, from inference to training, with no efficiency trade-offs.
  • Less data movement means more power savings and improved TCO.