Fast AI Inference

Industry-leading accuracy, throughput, latency, & cost with Groq® LPU™ AI inference technology

Multiple Modalities

  • Text: Llama 3.3 70B Spec Decode at 1,718 T/s
  • Audio: Distil-Whisper at 203 T/s
  • Image: Llama 3.2 11B (Vision) at 751 T/s

A New Class of Inference Solution

Historically, most AI investment has focused on training. We are now at an inflection point where trained models must move into production for inference: applying input data to solve real-world challenges at speeds that deliver instant results. To meet this moment, Groq created LPU AI inference technology, providing the speed developers need and the scalability industry requires.

Meet the LPU

Groq LPU AI inference technology is built for speed, affordability, and energy efficiency. Groq created the LPU (Language Processing Unit) as a new category of AI processor because demand for inference is accelerating and legacy technology can’t deliver the instant speed, scalability, and low latency the market requires.

How the LPU Is Different from a GPU

The LPU is fundamentally different from a GPU, which was originally designed for graphics processing.

  • The Groq compiler is in control, not secondary to the hardware
  • Compute and memory are co-located on the chip, eliminating resource bottlenecks
  • A kernel-less compiler makes compiling new models easy and fast
  • No caches or switches means seamless scalability

Groq Speed Is Instant

Independent benchmarks show Groq performance on models like Llama 3.1 8B and Llama 3.3 70B at context lengths from 10k to 100k tokens. Where others shorten context length to compete on speed, Groq never cuts corners on performance, quality, or cost.

Leading Openly-Available AI Models

Take advantage of fast AI inference performance for leading openly-available Large Language Models and Automatic Speech Recognition models (a usage sketch follows the list), including:

  • Llama 3 8B & 70B
  • Llama 3.1 8B
  • Llama 3.2 1B & 3B
  • Llama 3.2 11B (Vision) & 90B (Vision)
  • Llama 3.3 70B
  • Mixtral 8x7B
  • Gemma 2 9B
  • Whisper Large V3
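
As an illustration, here is a minimal sketch of calling two of these models through the Groq API. It assumes the groq Python client (pip install groq), a GROQ_API_KEY environment variable, and illustrative model IDs and file name (llama-3.3-70b-versatile, whisper-large-v3, meeting.m4a); check the GroqCloud console for current model IDs.

    import os

    from groq import Groq  # pip install groq

    # Assumes GROQ_API_KEY is set in the environment.
    client = Groq(api_key=os.environ["GROQ_API_KEY"])

    # Text: chat completion with a Llama model (model ID is illustrative).
    chat = client.chat.completions.create(
        model="llama-3.3-70b-versatile",
        messages=[{"role": "user", "content": "Summarize LPU inference in one sentence."}],
    )
    print(chat.choices[0].message.content)

    # Audio: speech-to-text with Whisper (file name is illustrative).
    with open("meeting.m4a", "rb") as audio:
        transcript = client.audio.transcriptions.create(
            file=("meeting.m4a", audio.read()),
            model="whisper-large-v3",
        )
    print(transcript.text)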

Fast Inference API for Half a Million Developers

Groq inference is available now to our community of over half a million developers. There is no waitlist for an API key, and we’re giving away five billion tokens per day for free.
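
To show what instant speed looks like in practice, here is a hedged sketch of streaming tokens as they are generated, again assuming the groq Python client and an illustrative model ID:

    import os

    from groq import Groq

    client = Groq(api_key=os.environ["GROQ_API_KEY"])

    # Streaming returns tokens incrementally instead of waiting for the full reply.
    stream = client.chat.completions.create(
        model="llama-3.3-70b-versatile",  # illustrative model ID
        messages=[{"role": "user", "content": "Why does inference speed matter?"}],
        stream=True,
    )
    for chunk in stream:
        # Each chunk carries a small delta of the response text.
        print(chunk.choices[0].delta.content or "", end="", flush=True)
    print()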

Groq Inference Resources

  • Project Media QA: Summarize and ask questions about online media content
  • Vectorize: A powerful RAG experimentation and pipeline platform
  • Real-time Inference for the Real World: Athena Intelligence
