Real-Time Speech-to-Speech (s2s) Voice AI

19 Sep 2025

Elizabeth Long

Real-Time Speech-to-Speech Voice AI Added to Whippy

Whippy now supports real-time speech-to-speech AI models, enabling natural, instant voice conversations powered by OpenAI’s realtime models.

How It Works

Speech-to-speech (s2s) models take audio as input and generate audio output directly. Traditional setups chain separate speech-to-text (STT) and text-to-speech (TTS) systems, while an s2s model handles the whole exchange in one step. The result is a faster, smoother, and more natural conversation inside Whippy.
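To make the difference concrete, here is a minimal sketch that contrasts the two designs. It is only an illustration: the stage names (`stt`, `llm`, `tts`, `s2s`) are hypothetical stand-ins that tag their input rather than call any real service, and the text-model step in the middle of the traditional chain is an assumption about how such setups are usually wired.

```python
from typing import Callable, List

# Hypothetical stand-ins for real services: each stage only wraps its input
# in its own name, so the number of hops in a pipeline is visible.
Stage = Callable[[str], str]


def make_stage(name: str) -> Stage:
    """Return a fake processing stage that tags whatever it receives."""
    return lambda payload: f"{name}({payload})"


def run_pipeline(stages: List[Stage], audio_in: str) -> str:
    """Pass the input through every stage in order."""
    result = audio_in
    for stage in stages:
        result = stage(result)
    return result


# Traditional setup: three separate systems chained together (assumed layout).
cascaded = [make_stage("stt"), make_stage("llm"), make_stage("tts")]

# Speech-to-speech: a single model goes from audio in to audio out.
speech_to_speech = [make_stage("s2s")]

cascaded_out = run_pipeline(cascaded, "audio")
s2s_out = run_pipeline(speech_to_speech, "audio")
```

Each extra stage in the chained version is another system to integrate and another hand-off that adds delay, which is the overhead a single s2s model avoids.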

Whippy now supports three realtime models:

gpt-4o-realtime – best for advanced, natural conversations
gpt-realtime – balanced for general use
gpt-4o-mini-realtime – lightweight, cost-efficient option

When setting up a voice agent, you can:

1. Select a realtime model based on performance vs. cost needs.

2. Pick a voice from OpenAI’s available options (male, female, neutral).

3. Start a live conversation directly in Whippy with real-time audio responses.

4. Reconfigure models or voices between sessions as needed.
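The setup steps above can be pictured as a small configuration object. This is a sketch only, not Whippy's actual API: `VoiceAgentConfig` and `reconfigure` are hypothetical names, the model names are the three listed in this post, and `"alloy"` is used as an example of an OpenAI voice name.

```python
from dataclasses import dataclass
from typing import Optional

# Model names as listed in this post; everything else here is hypothetical.
SUPPORTED_MODELS = ("gpt-4o-realtime", "gpt-realtime", "gpt-4o-mini-realtime")


@dataclass(frozen=True)
class VoiceAgentConfig:
    """Illustrative settings for one voice agent session."""
    model: str
    voice: str

    def __post_init__(self) -> None:
        # Reject models outside the supported list and empty voice names.
        if self.model not in SUPPORTED_MODELS:
            raise ValueError(f"unsupported realtime model: {self.model!r}")
        if not self.voice:
            raise ValueError("a voice must be selected")


def reconfigure(
    config: VoiceAgentConfig,
    *,
    model: Optional[str] = None,
    voice: Optional[str] = None,
) -> VoiceAgentConfig:
    """Return a new config for the next session, keeping unchanged fields."""
    return VoiceAgentConfig(
        model=model if model is not None else config.model,
        voice=voice if voice is not None else config.voice,
    )


# Steps 1 and 2: pick a model and a voice.
config = VoiceAgentConfig(model="gpt-realtime", voice="alloy")

# Step 4: switch to the cost-efficient model between sessions.
cheaper = reconfigure(config, model="gpt-4o-mini-realtime")
```

Keeping the config immutable and returning a new one from `reconfigure` mirrors the idea that changes apply between sessions rather than in the middle of a live conversation.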

Why It Matters

Low latency (~200 ms): Conversations feel smooth and responsive.

Natural voices: High-quality, pre-trained options ensure human-like tone and flow.

Unified experience: A single model handles the process, reducing complexity and integration overhead.

This makes it easier to build real-time voice agents for sales calls, support lines, or interactive demos, all within Whippy and without relying on separate voice providers.