One Chip. One Quadrillion Operations per Second.

Every aspect of the Tensor Streaming Processor is designed in pursuit of performance. Instead of creating a small programmable core and replicating it dozens or hundreds of times, the TSP houses a single enormous processor with hundreds of functional units. This novel architecture greatly reduces instruction-decoding overhead and handles both integer and floating-point data, making it a breeze to deliver the best accuracy for inference and training alike.

220 MB on-die SRAM

PCIe Gen4 x16 support

1,000 TOp/s

Up to 17x faster than our competition


Unparalleled Agility. All the time.

Every non-Groq accelerator forces a tradeoff between optimal responsiveness (i.e., minimum latency) and maximum performance. When even a small percentage of the workload requires millisecond responses, a traditional GPU’s performance quickly degrades, which means more racks, more power, and higher maintenance costs. With Groq, there is no tradeoff: responsiveness and performance go hand in hand, providing unparalleled agility. It takes only 10 two-card Groq servers to match the performance of 24 of our competitor’s two-card servers.

Why does batch size 1 matter?
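The question above can be made concrete with a toy model. All numbers below are hypothetical and for illustration only, not Groq or competitor measurements: an accelerator that relies on batching amortizes overhead across a batch, so its throughput climbs only as per-request latency climbs with it, while a design that sustains full throughput at batch size 1 avoids that tradeoff.

```python
# Toy model of the batching tradeoff. A batching accelerator must wait for
# requests to accumulate before launching a kernel, so each request pays a
# queuing delay on top of compute time. All numbers are made up.

def batched_latency_ms(batch_size, wait_per_slot_ms=2.0, compute_ms=5.0):
    """Worst-case latency for one request: waiting for the batch to fill,
    plus the batch's compute time (hypothetical figures)."""
    fill_wait = (batch_size - 1) * wait_per_slot_ms
    return fill_wait + compute_ms

def throughput_rps(batch_size, compute_ms=5.0):
    """Requests served per second once a batch is running, assuming the
    batch's compute time stays roughly flat (hypothetical figures)."""
    return batch_size / (compute_ms / 1000.0)

for b in (1, 8, 64):
    print(f"batch={b:3d}  latency={batched_latency_ms(b):6.1f} ms  "
          f"throughput={throughput_rps(b):8.0f} req/s")
```

In this sketch, pushing the batch from 1 to 64 multiplies throughput but also multiplies per-request latency many times over, which is why batch-size-1 performance matters for latency-sensitive inference.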


Forget everything you know about chip architecture.

We’ve taken an entirely new architectural approach to accelerating neural networks. Our Tensor Streaming Processor pairs a single-core SIMD engine with streaming registers and unified memory, all driven by software-orchestrated execution.

Read more in the Linley Group’s report


  • INT8 Performance

    1 PetaOp/s @ 1.25 GHz

  • FP16 Performance

    250 TFLOPS @ 1.25 GHz

  • Transistors

    26.8B transistors

  • Process


  • SRAM

    220 MB on-die

  • Memory Bandwidth

    80 TB/s on-die memory bandwidth

  • Host Interface

    PCIe Gen 4 x16
    31.5 GB/s in each direction

  • C2C Interface

    Up to 16 chip-to-chip interconnects

  • Driver

    Open source, simple stack

  • ECC

    End-to-end on-chip protection