One Chip. One Quadrillion Operations per Second.
Every aspect of the Tensor Streaming Processor is designed in pursuit of performance. Instead of replicating a small programmable core dozens or hundreds of times, the TSP is a single enormous processor with hundreds of functional units. This novel architecture greatly reduces instruction-decoding overhead and handles both integer and floating-point data, making best-in-class accuracy for inference and training a breeze.
PCIe Gen4 x16 Support
Up to 17x faster than our competition
Unparalleled Agility. All the time.
Every non-Groq accelerator forces a tradeoff between optimal responsiveness (i.e., minimum latency) and maximum performance. When even a small percentage of the workload requires millisecond responses, a traditional GPU’s performance quickly degrades, which means more racks, more power, and higher maintenance costs. With Groq, there is no tradeoff: responsiveness and performance go hand in hand, providing unparalleled agility. It takes only 10 two-card Groq servers, versus 24 of our competitor’s two-card servers, to deliver the same performance.
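The server-count claim above implies a simple density ratio (a back-of-the-envelope sketch using only the numbers quoted in this section, not an independent benchmark):

```python
# Sizing sketch from the figures quoted above: 10 two-card Groq servers
# match 24 two-card competitor servers at the same total performance.
groq_servers = 10
competitor_servers = 24

# Per-server throughput ratio at equal aggregate performance.
throughput_ratio = competitor_servers / groq_servers
print(f"~{throughput_ratio:.1f}x throughput per server")  # ~2.4x
```

In other words, each Groq server does the work of roughly 2.4 competitor servers, which is where the rack, power, and maintenance savings come from.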
Forget everything you know about chip architecture.
We’ve taken an entirely new architectural approach to accelerating neural networks. Our Tensor Streaming Processor, with a single-core SIMD engine, streaming registers, and unified memory, is driven by software-orchestrated execution.
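The idea behind software-orchestrated execution can be illustrated with a toy model (a hedged sketch only: the unit names and instruction format below are illustrative inventions, not Groq's actual ISA or toolchain). The compiler fixes, ahead of time, which functional unit fires on which cycle, so at runtime the chip simply plays the schedule back with no dynamic scheduling or arbitration hardware:

```python
# Toy model of software-orchestrated (statically scheduled) execution.
# The unit names ("MEM", "VXM") and the schedule format are hypothetical.

# The "compiler" emits a fixed schedule: (cycle, unit, op, operands).
schedule = [
    (0, "MEM",  "load",  ("a",)),
    (0, "MEM2", "load",  ("b",)),      # independent units fire in the same cycle
    (1, "VXM",  "mul",   ("a", "b")),  # vector multiply once operands stream in
    (2, "MEM",  "store", ("out",)),
]

def run(schedule, memory):
    """Play the schedule back deterministically: no stalls, no arbitration."""
    regs = {}
    for cycle, unit, op, args in schedule:
        if op == "load":
            regs[args[0]] = memory[args[0]]
        elif op == "mul":
            regs["out"] = regs[args[0]] * regs[args[1]]
        elif op == "store":
            memory[args[0]] = regs[args[0]]
    return memory

mem = run(schedule, {"a": 3, "b": 4, "out": None})
print(mem["out"])  # 12
```

Because every cycle's work is decided at compile time, execution is fully deterministic, which is what lets responsiveness and throughput coexist rather than trade off.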
1 PetaOp/second @ 1.25 GHz
250 TFLOPS @ 1.25 GHz
220 MB on-die
80 TB/s on-die memory bandwidth
PCIe Gen4 x16
31.5 GB/s in each direction
Up to 16 chip-to-chip interconnects
Open-source, simple software stack
End-to-end on-chip protection