Groq Adds Responsiveness to Inference Performance to Lower TCO

Written by:
Groq

Running inference at a batch size of one, meaning the model processes a single image or sample at a time, is a valuable option for many machine learning workflows – particularly those that require real-time responsiveness. However, small batch sizes, and batch size 1 in particular, introduce performance and responsiveness complications for machine learning applications, especially on conventional GPU-based inference platforms.

When used as the processing unit for machine learning applications, GPUs suffer significant performance declines when fed data in small batches, because gaps in the data flow stall GPU execution. The resulting latency can dramatically degrade real-time inference performance at batch size 1, with corresponding impacts on the total cost of ownership (TCO) of a machine learning platform investment.
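The effect is easy to observe directly. Below is a minimal timing sketch, not taken from the white paper, that assumes PyTorch, torchvision, and a CUDA-capable GPU; the model (ResNet-50 with random weights) and batch sizes are illustrative placeholders chosen only to show how per-image latency typically diverges between batch size 1 and a large batch on a GPU.

```python
# Minimal sketch (assumes PyTorch, torchvision, and a CUDA-capable GPU):
# compare per-image latency at batch size 1 versus a larger batch.
# Randomly initialized weights are fine for a timing-only comparison.
import time
import torch
import torchvision.models as models

model = models.resnet50().eval().cuda()

def per_image_latency_ms(batch_size, iters=50):
    x = torch.randn(batch_size, 3, 224, 224, device="cuda")
    with torch.no_grad():
        # Warm up so one-time CUDA initialization does not skew the timing.
        for _ in range(5):
            model(x)
        torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(iters):
            model(x)
        torch.cuda.synchronize()
        elapsed = time.perf_counter() - start
    return elapsed / (iters * batch_size) * 1000.0

# At batch size 1 the GPU is underutilized, so per-image latency is
# typically far worse than at large batch sizes.
print(f"batch=1 : {per_image_latency_ms(1):.2f} ms/image")
print(f"batch=64: {per_image_latency_ms(64):.2f} ms/image")
```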

Groq offers an alternative machine learning platform that is not bound by the performance limitations and long latency of GPU parallelism, giving machine learning engineers a superior inference platform with no performance penalty for small batch sizes – or for workloads of any size.

We’ve published a new white paper that examines the processing performance and responsiveness of multiple machine learning platforms, comparing clusters based on GPUs with clusters based on Groq’s Tensor Streaming Processor (TSP). The results are startling: Groq’s architecture is nearly 2.5 times faster than GPU-based platforms at large batch sizes, and 17.6 times faster at batch size 1.

Groq’s new, simpler processor architecture is designed specifically for the performance requirements of machine learning and other applications that demand accelerated computation, without the latency of conventional GPUs, which force a trade-off between responsiveness and performance: high responsiveness comes at the cost of low performance, and high performance at the cost of low responsiveness.

This performance and latency differential becomes especially meaningful when translated into purchasing decisions for compute cluster design: matching the inference performance of a small number of Groq-based servers requires investing in a much larger number of GPU-based servers.

In our white paper, we describe a compute cluster designed for ResNet-50 image classification, with a workload distribution that mixes highly responsive processing (latency of less than one millisecond) with other workloads that have looser or no latency requirements. Because Groq-based clusters process small-batch workloads far more efficiently than GPU-based clusters can, we demonstrate that deploying Groq-based nodes for the same image classification workloads yields a cluster with 75 percent fewer servers at the same throughput, thanks to an overall 16x performance disparity between NVIDIA V100 GPUs and Groq processors.
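To make the cluster-sizing arithmetic concrete, here is a back-of-the-envelope sketch. The throughput figures below are hypothetical placeholders, not numbers from the white paper (the actual result reflects a mixed workload); only the method is illustrated: divide the required cluster throughput by per-server throughput and round up, then compare server counts.

```python
# Back-of-the-envelope cluster sizing. All figures are hypothetical
# placeholders; only the sizing method itself is being illustrated.
import math

required_throughput = 1_000_000    # images/sec the cluster must sustain (hypothetical)
gpu_server_throughput = 10_000     # images/sec per GPU-based server (hypothetical)
groq_server_throughput = 40_000    # images/sec per Groq-based server (hypothetical)

gpu_servers = math.ceil(required_throughput / gpu_server_throughput)
groq_servers = math.ceil(required_throughput / groq_server_throughput)
reduction = 1 - groq_servers / gpu_servers

print(f"GPU-based servers needed:  {gpu_servers}")
print(f"Groq-based servers needed: {groq_servers}")
print(f"Server count reduction:    {reduction:.0%}")  # 75% with these placeholder figures
```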

This performance disparity has dramatic implications for infrastructure investment, with the potential to significantly reduce TCO for organizations purchasing machine learning platforms.  

To learn how Groq’s more efficient processing architecture can deliver higher responsiveness and lower TCO, download our new white paper The Challenge of Batch Size 1: Groq Adds Responsiveness to Inference Performance.
