The Challenge
As AI solutions evolve toward a multimodal user experience (UX), consumers expect to provide input and receive output via text, images, audio, video, and various data formats. You might ask the AI a question via voice while uploading a graph or chart, and receive responses via audio, text, and web interfaces, all simultaneously.
While this is the AI dream, making it a reality is difficult. Developers must surmount a few challenges:
- Developers creating multimodal solutions must employ multiple agents and models, each optimized for a particular modality. For example, they may use one model for voice I/O and another for images. And that’s just data I/O; they may also need a separate model (or models) for the actual reasoning of the solution.
- Getting the UX right is straightforward in single-mode solutions: entering a text query into an AI chatbot and getting a text answer doesn’t require a lot of extra work. A multimodal UX demands far more orchestration. Delivering a human-quality voice interaction while simultaneously displaying images or updating a website is much harder, especially with multimodal input. Complex reasoning is required to make the experience feel seamless and natural: users may interrupt the app while it’s speaking, and the app must know where and how to pick up and what follow-up questions to ask. All of this must happen at “human speed.”
- The interaction must be fast, and multimodal outputs must be coordinated, to create a natural experience. People might wait for a slower, clunkier text response, but they are more demanding when speaking with and watching an AI solution.
- Building these systems is challenging and time-consuming due to their complexity, high performance demands, and the relative lack of toolkits purpose-built for multimodality.
The Solution
xRx, from 8090, is an open-source development framework that wraps multimodal input and output capabilities (the x in xRx) around a robust reasoning engine (the R) to support the creation and deployment of multimodal AI solutions. It includes agents that integrate audio, text, and other input/output modalities into the UX, reasoning agents whose components can be customized to fit project needs, and guardrails to ensure output quality.
xRx features a dynamic UI that adapts to the conversation in real time: based on the user’s input, the UI displays relevant widgets and handles click events, providing a more compelling and useful experience.
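To make the idea concrete, here is a minimal, purely illustrative sketch of how an agent reply might bundle text with a UI widget directive. The field names and schema below are assumptions for illustration only, not xRx’s actual message format.

```python
# Conceptual sketch only: the field names below are illustrative assumptions,
# not xRx's actual message schema.
import json

def build_agent_response(text: str, widget_type: str, payload: dict) -> str:
    """Bundle a spoken/text reply with a UI widget directive in one message."""
    return json.dumps({
        "text": text,                      # rendered as text and/or synthesized to speech
        "widget": {
            "type": widget_type,           # tells the client which component to render
            "details": payload,
        },
    })

# Example: answering a product question while showing a product card in the UI.
message = build_agent_response(
    text="The Margherita pizza is $12. Want to add it to your order?",
    widget_type="product-card",
    payload={"name": "Margherita", "price_usd": 12},
)
print(message)
```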
xRx runs on Groq® LPU™ AI inference technology, which delivers a natural, real-time UX. Every user interaction in an xRx-built multimodal solution requires multiple calls to various agents. Running these solutions on traditional, GPU-based inference engines would be so slow as to render them unusable – imagine waiting 10 seconds for a response while ordering a pizza online or filling out a form. Chances are, you won’t wait, which is why the performance of Groq LPU AI inference technology is an absolute requirement.
xRx Architecture Features
- Client: A front-end application that handles UI rendering and WebSocket communication
- Orchestrator: Manages data flow between AI and traditional software components (see the sketch after this list)
- STT (Speech-to-Text) and TTS (Text-to-Speech): Seamlessly convert between audio and text
- Agent: A collection of reasoning agents forming the core “brain” of xRx
- Guardrails Proxy: A safety measure for responsible AI use that implements an optional moderation layer to filter unsafe content
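As a rough mental model of that flow, here is a minimal Python sketch of a single conversational turn. The function names and bodies are placeholders for illustration, not xRx’s actual interfaces; in the real system, the client and orchestrator stream data over WebSockets and the components run as separate services.

```python
# Minimal conceptual sketch of the orchestration flow described above.
# All functions are placeholders, not xRx's real components.

def speech_to_text(audio: bytes) -> str:
    # Placeholder: a real implementation would call an STT service.
    return "what's the weather in Toronto?"

def reasoning_agent(transcript: str) -> str:
    # Placeholder: a real implementation would call the reasoning agent(s).
    return f"Here's what I found for: {transcript}"

def guardrails(reply: str) -> str:
    # Placeholder: an optional moderation layer can filter unsafe content here.
    return reply

def text_to_speech(reply: str) -> bytes:
    # Placeholder: a real implementation would synthesize audio for playback.
    return reply.encode("utf-8")

def handle_turn(audio_in: bytes) -> bytes:
    """One conversational turn: audio in, moderated reply audio out."""
    transcript = speech_to_text(audio_in)
    reply = guardrails(reasoning_agent(transcript))
    return text_to_speech(reply)

print(handle_turn(b"<audio frames>"))
```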
The Opportunity
There are numerous scenarios where a multimodal UX and a robust, flexible reasoning engine can combine to create compelling and valuable AI solutions. xRx demos include a HIPAA-compliant patient intake solution for healthcare providers and a customer service solution for quick-service restaurants. Each of these demos is notable for the diversity of voice, tonality, style, and reasoning.
xRx features several sample reasoning systems for developer use:
- Simple Tool Calling App: This app demonstrates basic functionality and provides access to tools like weather and time retrievers and stock price lookups (see the sketch after this list).
- Shopify Interaction App: This app uses Shopify APIs to demonstrate an intelligent voice assistant for e-commerce. It handles product inquiries, order placement, and customer service.
- Wolfram Assistant App: This app leverages Wolfram Alpha for mathematical and scientific queries.
- Patient Information App: This app collects and manages patient data in healthcare settings.
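To give a flavor of the pattern the Simple Tool Calling App demonstrates, here is a minimal, hedged sketch of tool calling against Groq using the groq Python SDK. It assumes the SDK is installed, GROQ_API_KEY is set in the environment, and a tool-calling-capable model is available; the model name and the tool itself are illustrative stand-ins, not the sample app’s actual code.

```python
# Sketch of basic tool calling on Groq, in the spirit of the Simple Tool
# Calling App. Assumes the `groq` Python SDK and a GROQ_API_KEY environment
# variable; the model name below is illustrative and may change.
import json
from groq import Groq

client = Groq()  # reads GROQ_API_KEY from the environment

def get_current_time(timezone: str) -> str:
    # Placeholder tool: a real version would look up the time for `timezone`.
    return json.dumps({"timezone": timezone, "time": "09:00"})

tools = [{
    "type": "function",
    "function": {
        "name": "get_current_time",
        "description": "Get the current time in a given timezone",
        "parameters": {
            "type": "object",
            "properties": {"timezone": {"type": "string"}},
            "required": ["timezone"],
        },
    },
}]

response = client.chat.completions.create(
    model="llama-3.3-70b-versatile",   # illustrative model name
    messages=[{"role": "user", "content": "What time is it in Toronto?"}],
    tools=tools,
)

# If the model chose to call the tool, run it and print the result.
for call in response.choices[0].message.tool_calls or []:
    args = json.loads(call.function.arguments)
    print(get_current_time(**args))
```

In a full xRx solution, the same pattern extends to the other sample apps, with Shopify APIs, Wolfram Alpha, or patient-intake data stores sitting behind the reasoning agents.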
xRx solutions aren’t just multimodal; each modality complements the others. xRx helps developers use audio, text, images, and video in complementary ways, producing a better overall UX that is customized to the current situation and leads to better business outcomes. xRx seamlessly integrates audio and on-screen action in sync with reasoning, creating a more natural and engaging conversational experience. Because xRx’s AI engine is aware of all modalities, it can understand and respond to both voice and text input as part of one cohesive experience.
None of this works without instant inference from Groq and its LPU. Groq offers superior inference speed and quality and makes xRx-built solutions feasible. Run on a GPU, these solutions would simply be too slow. xRx and Groq represent a leap forward in developing multimodal conversational AI, supporting developers in creating next-generation AI solutions.
Tune into the 8090 and Groq webinar on October 8th at 9am PT and check out the GitHub repo here to learn more.