Real-Time Speech-to-Speech (s2s) Voice AI


Whippy now supports real-time speech-to-speech AI, enabling natural, low-delay voice conversations powered by OpenAI’s realtime models.
How It Works
Speech-to-speech (s2s) models take audio as input and generate audio as output directly. Traditional setups chain separate speech-to-text (STT) and text-to-speech (TTS) systems around a text model; an s2s model handles the whole turn in one step, without handing text between separate systems. Fewer handoffs mean less delay, so conversations inside Whippy feel faster, smoother, and more natural.
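The difference between the two setups can be sketched in a few lines of Python. This is an illustration only: the function names (cascaded_turn, speech_to_speech_turn) and the stub stages are hypothetical and do not correspond to Whippy's or OpenAI's actual APIs. The stubs do no real audio processing; they only record which stages run during one conversational turn.

```python
from typing import Callable, List

Audio = bytes  # stand-in type for a chunk of raw audio


def cascaded_turn(
    audio_in: Audio,
    stt: Callable[[Audio], str],
    llm: Callable[[str], str],
    tts: Callable[[str], Audio],
) -> Audio:
    """Traditional setup: three separate systems, with text passed between them."""
    transcript = stt(audio_in)    # speech-to-text
    reply_text = llm(transcript)  # text model produces a reply
    return tts(reply_text)        # text-to-speech


def speech_to_speech_turn(audio_in: Audio, s2s: Callable[[Audio], Audio]) -> Audio:
    """Speech-to-speech setup: one model call, audio in and audio out."""
    return s2s(audio_in)


# Hypothetical stub stages that only log their own invocation.
calls: List[str] = []


def fake_stt(audio: Audio) -> str:
    calls.append("stt")
    return "hello"


def fake_llm(text: str) -> str:
    calls.append("llm")
    return "hi there"


def fake_tts(text: str) -> Audio:
    calls.append("tts")
    return text.encode("utf-8")


def fake_s2s(audio: Audio) -> Audio:
    calls.append("s2s")
    return b"reply-audio"
```

Running cascaded_turn with the stubs touches three stages in sequence, while speech_to_speech_turn touches one. Each extra stage in a real deployment is another system to integrate and another hop that adds delay.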
Whippy now supports three realtime models:
- gpt-4o-realtime – best for advanced, natural conversations
- gpt-realtime – balanced for general use
- gpt-4o-mini-realtime – lightweight, cost-efficient option
When setting up a voice agent, you can:
1. Select a realtime model based on performance vs. cost needs.
2. Pick a voice from OpenAI’s available options (male, female, neutral).
3. Start a live conversation directly in Whippy with real-time audio responses.
4. Reconfigure models or voices between sessions as needed.
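The setup steps above can be modeled as a small configuration object. The sketch below is hypothetical: VoiceAgentConfig and reconfigure are illustrative names, not part of Whippy's product or API, and the voice value is a placeholder rather than a real OpenAI voice name. The model names are the three listed in this article.

```python
from dataclasses import dataclass, replace
from typing import Optional

# Realtime model names as listed in this article.
SUPPORTED_MODELS = ("gpt-4o-realtime", "gpt-realtime", "gpt-4o-mini-realtime")


@dataclass(frozen=True)
class VoiceAgentConfig:
    """Hypothetical per-session settings; not Whippy's actual schema."""

    model: str
    voice: str  # placeholder label; real voice names come from OpenAI's list

    def __post_init__(self) -> None:
        # Reject anything outside the three supported realtime models.
        if self.model not in SUPPORTED_MODELS:
            raise ValueError(f"unsupported realtime model: {self.model!r}")
        if not self.voice:
            raise ValueError("a voice must be selected")


def reconfigure(
    config: VoiceAgentConfig,
    *,
    model: Optional[str] = None,
    voice: Optional[str] = None,
) -> VoiceAgentConfig:
    """Return a new config for the next session; the original is left unchanged."""
    return replace(
        config,
        model=model if model is not None else config.model,
        voice=voice if voice is not None else config.voice,
    )


# Steps 1 and 2: choose a model and a voice for the first session.
first_session = VoiceAgentConfig(model="gpt-realtime", voice="example-voice")

# Step 4: switch to the lighter model between sessions, keeping the same voice.
next_session = reconfigure(first_session, model="gpt-4o-mini-realtime")
```

Making the config immutable mirrors the idea that model and voice are fixed for the duration of a session and changed only between sessions.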
Why It Matters
- Low latency (~200 ms) – conversations feel smooth and responsive
- Natural voices – high-quality, pre-trained options ensure human-like tone and flow
- Unified experience – a single model handles the process, reducing complexity and integration overhead
This makes it easier to build real-time voice agents for sales calls, support lines, or interactive demos, all within Whippy and without relying on separate voice providers.