A recent benchmark of five voice AI stacks found that only two consistently responded under the critical 300 ms latency threshold. According to the author, voice-to-voice end-to-end models, which collapse STT, LLM, and TTS into a single process, significantly outperformed cascaded pipelines. Cascaded systems struggled to meet the threshold because their stages run serially, so speech-to-text time, LLM time-to-first-token, text-to-speech time, and network round-trip time all add up. The two fastest stacks were OpenAI's Realtime API with GPT-4o and LiveKit Agents with Google's Gemini 2.0 Flash.
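To make the serial-latency argument concrete, here is a minimal sketch of a latency budget check. The stage names mirror the ones in the summary, but every per-stage number below is a hypothetical placeholder chosen for illustration, not a figure from the benchmark; the helper names (`Stage`, `total_latency_ms`, `within_budget`) are likewise assumptions.

```python
from dataclasses import dataclass

BUDGET_MS = 300  # latency threshold cited in the summary


@dataclass(frozen=True)
class Stage:
    name: str
    latency_ms: float


def total_latency_ms(stages):
    """Stages run one after another, so their latencies simply add."""
    return sum(stage.latency_ms for stage in stages)


def within_budget(stages, budget_ms=BUDGET_MS):
    """True when the summed latency is strictly under the budget."""
    return total_latency_ms(stages) < budget_ms


# Hypothetical numbers for illustration only (NOT measured values).
cascaded = [
    Stage("stt", 120),
    Stage("llm_ttft", 200),
    Stage("tts", 100),
    Stage("network_rtt", 60),
]

# A voice-to-voice model replaces the three model stages with one.
end_to_end = [
    Stage("voice_to_voice_model", 200),
    Stage("network_rtt", 60),
]

for label, stages in (("cascaded", cascaded), ("end_to_end", end_to_end)):
    verdict = "under" if within_budget(stages) else "over"
    print(f"{label}: {total_latency_ms(stages)} ms ({verdict} {BUDGET_MS} ms)")
```

With these placeholder values the cascaded pipeline overshoots the budget even though no single stage is slow on its own, which is the structural point the summary makes about serial processing.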
IMPACT End-to-end voice models offer a path to significantly lower latency, improving user experience and enabling more natural conversational AI interactions.
RANK_REASON The article presents an independent benchmark and analysis of existing voice AI technologies, rather than a new release or product launch.
- AssemblyAI
- Cartesia Sonic
- Claude Sonnet 4.6
- Coqui XTTS
- Deepgram
- ElevenLabs Turbo v2.5
- Gemini 2.0 Flash
- GPT-4o
- LiveKit Agents
- Llama 3.3 70B
- OpenAI
- Pipecat
- Retell
- Whisper Large v3 Turbo
AI-generated summary · Google Gemini · from 1 source.