Build Fast with Text-to-Speech

Written by:
Groq

Groq & PlayAI partner to bring Dialog, a leading TTS model, to GroqCloud™ for real-time voice applications

One of the most popular emerging applications of applied AI is generative voice systems that converse with customers for services like customer support or appointment scheduling. In this space, the responsiveness and emotional authenticity of the AI system are key to success. Delivering both requires fast AI inference and conversational AI interfaces that people can actually use, and historically there have been trade-offs between how well a system works and how much it costs.

That’s why Groq and PlayAI are partnering to leapfrog the world to the next generation of conversational human-AI interactions.

PlayAI is a leading provider of advanced LLM-based Text-to-Speech (TTS) voice AI models, which it offers to developers through an API and to creators through an AI voiceover studio. Groq is a leading AI infrastructure provider, making fast and affordable inference available to enterprises and developers globally with GroqCloud™.

Together, they are now running PlayAI’s Dialog, one of the most advanced TTS models on the market today, at up to 140 characters per second on GroqCloud. Today, Dialog powered by Groq offers both English and Arabic endpoints, with several additional languages coming soon. This is the first Arabic voice AI for the Middle East, and one that captures the nuance of Arabic as spoken in Saudi Arabia. The model runs in-region from data centers based in the Kingdom of Saudi Arabia.

This is an exciting step for GroqCloud as we expand into a new modality with the help of a leading provider in the TTS space. With this addition, developers can now build end-to-end voice applications fully powered by Groq. Builders can access Dialog on GroqCloud.
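For builders who want a feel for what calling Dialog might look like, here is a minimal sketch using only the Python standard library. The article does not give API details, so the endpoint URL, the model name, the voice name, and the payload fields below are all assumptions to check against the GroqCloud documentation before use. The helper names are hypothetical.

```python
import json
import urllib.request

# ASSUMPTIONS: none of these values come from the article.
# Verify each one against the GroqCloud documentation.
ASSUMED_ENDPOINT = "https://api.groq.com/openai/v1/audio/speech"
ASSUMED_MODEL = "playai-tts"
ASSUMED_VOICE = "Fritz-PlayAI"


def build_speech_request(text, api_key, model=ASSUMED_MODEL, voice=ASSUMED_VOICE):
    """Build (but do not send) an HTTP request for a TTS call."""
    if not text:
        raise ValueError("text must be non-empty")
    # Assumed payload shape: model, voice, input text, and output format.
    payload = {
        "model": model,
        "voice": voice,
        "input": text,
        "response_format": "wav",
    }
    return urllib.request.Request(
        ASSUMED_ENDPOINT,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def synthesize_to_file(text, api_key, out_path):
    """Send the request and write the returned audio bytes to out_path."""
    request = build_speech_request(text, api_key)
    with urllib.request.urlopen(request, timeout=30) as response:
        audio = response.read()
    with open(out_path, "wb") as f:
        f.write(audio)
```

Keeping request construction separate from sending makes the payload easy to inspect without a network call. With a valid API key, calling synthesize_to_file with some text and an output path would write the audio to disk, assuming the endpoint behaves as sketched.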

Performance

Based on initial internal testing, Groq delivers up to 140 characters/s on PlayAI’s Dialog model, a significant boost over the same model running on GPUs at 80 characters/s (about 1.75x). That means Dialog generates speech up to 10x faster than real time. This performance comes with a low Word Error Rate (WER) of just 2.15%, giving users the best of both worlds: fast and high-quality TTS.
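To make these numbers concrete, the short sketch below works through the arithmetic. The two throughput figures and the 10x real-time factor come from the testing described above. The 1,400-character script and the function names are made up for illustration, and the implied speaking rate is simply 140 divided by 10, not a figure reported by Groq or PlayAI.

```python
# Throughput figures quoted in the article (initial internal testing).
GROQ_CHARS_PER_SEC = 140.0
GPU_CHARS_PER_SEC = 80.0
REAL_TIME_FACTOR = 10.0  # "up to 10x faster than real time"


def synthesis_seconds(text: str, chars_per_sec: float) -> float:
    """Estimated time to synthesize `text` at a given throughput."""
    if chars_per_sec <= 0:
        raise ValueError("chars_per_sec must be positive")
    return len(text) / chars_per_sec


def speedup(fast_rate: float, slow_rate: float) -> float:
    """Ratio of two throughputs."""
    return fast_rate / slow_rate


# Hypothetical 1,400-character script (illustrative only).
script = "x" * 1400

groq_time = synthesis_seconds(script, GROQ_CHARS_PER_SEC)
gpu_time = synthesis_seconds(script, GPU_CHARS_PER_SEC)
ratio = speedup(GROQ_CHARS_PER_SEC, GPU_CHARS_PER_SEC)

# Speaking rate implied by the 10x figure (derived, not a reported number).
implied_speaking_rate = GROQ_CHARS_PER_SEC / REAL_TIME_FACTOR

print(f"Groq: {groq_time:.1f}s, GPU: {gpu_time:.1f}s, speedup: {ratio:.2f}x")
print(f"Implied speaking rate: {implied_speaking_rate:.0f} chars/s")
```

These are peak ("up to") figures, so real workloads may see lower throughput.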

Cost

On-demand pricing for this model is available to GroqCloud users at a competitive price of $50 per 1M characters. See the Groq pricing page for more.
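As a rough guide to what that price means in practice, here is a small cost estimator. It assumes billing is strictly linear in the number of input characters, with no minimums, rounding, or volume tiers. None of that is confirmed here, so treat the output as an estimate. The function name is hypothetical.

```python
from decimal import Decimal

# On-demand price quoted in the article: $50 per 1M characters.
PRICE_PER_MILLION_CHARS_USD = Decimal("50")
CHARS_PER_MILLION = Decimal(1_000_000)


def estimate_cost_usd(num_chars: int) -> Decimal:
    """Estimate cost assuming purely linear per-character billing."""
    if num_chars < 0:
        raise ValueError("num_chars must be non-negative")
    return PRICE_PER_MILLION_CHARS_USD * num_chars / CHARS_PER_MILLION


# Example inputs (illustrative sizes, not from the article).
for label, chars in [("short reply", 200), ("long narration", 30_000)]:
    print(f"{label}: {chars} chars -> ${estimate_cost_usd(chars)}")
```

Decimal is used instead of float so that small per-request amounts do not pick up binary rounding noise when summed over many calls.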

Why PlayAI Dialog

PlayAI Dialog is a voice model built for fluid, emotive conversation. The end-to-end AI speech model uses a conversation’s historical context to control prosody (intonation, pacing, and emotion) and deliver more natural-sounding speech, setting new standards for matching how humans speak in real-life situations. PlayAI Dialog helps create authentic conversational experiences such as narration, synthetic podcasts, and immersive, engaging 1:1 voice experiences with customers in business contexts.

PlayAI Dialog was trained on hundreds of millions of real-world conversations across more than 30 languages. Because Dialog was trained on both single-speaker and multi-speaker conversations, it closely matches human prosody, meaning it feels more real to those interacting with it.

In blind testing, PlayDialog beta was preferred over the leading competing models on the market by 3:1, with expressiveness ranking as the top factor behind that preference.

Context is Key

Unlike previous generations of speech models, Dialog understands the entire conversational context and how each sentence, or speaker, influences speech generation. PlayAI built a novel architecture that allows the model to use the full context and history of a conversation, meaning that every response isn’t just a standalone output; it’s enriched with appropriate rhythm, tone, and emotion that reflect the flow of the conversation. 

By capturing these nuances across entire conversations, generated speech is far more natural and human-sounding. This means that synthetic podcasts now sound like the speakers are in the same room and responding to each other, narration can sound exciting and engaging, and speech generated in voice agent applications can respond with emotion and pacing that match the conversation context.

Build fast with PlayAI Dialog, running on GroqCloud
