AssemblyAI has released benchmarks for real-time speech-to-text (STT) latency, arguing that the lowest latency does not necessarily make for the best voice agent. The company's position is that "fast enough plus accurate" beats "fastest but wrong": a voice agent has to balance speed against accuracy, because a quick but incorrect transcript can cause it to misinterpret crucial information. AssemblyAI highlights key metrics such as Time to First Token (TTFT) and Time to Complete Turn (TTCT), and stresses that P95 latency matters more than median (P50) latency in production environments. Its Universal-3.5 Pro Realtime model reportedly achieves a competitive 6.99% word error rate on real-world voice agent audio benchmarks.
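To illustrate why P95 can tell a different story than P50, here is a minimal Python sketch using the nearest-rank percentile method. The `percentile` helper and the latency values are hypothetical and for illustration only; they are not AssemblyAI's data or methodology.

```python
def percentile(samples, pct):
    """Nearest-rank percentile for an integer pct in 1..100.

    Returns the smallest sample such that at least pct% of all
    samples are less than or equal to it.
    """
    if not samples:
        raise ValueError("samples must be non-empty")
    ordered = sorted(samples)
    # Integer ceiling of pct * n / 100, which avoids float rounding issues.
    rank = max(1, (pct * len(ordered) + 99) // 100)
    return ordered[rank - 1]


# Hypothetical per-turn latencies in milliseconds (illustrative only).
latencies_ms = [280, 290, 300, 305, 310, 315, 320, 330, 340, 350,
                355, 360, 365, 370, 380, 390, 400, 420, 950, 1400]

p50 = percentile(latencies_ms, 50)
p95 = percentile(latencies_ms, 95)
print(f"P50 = {p50} ms, P95 = {p95} ms")
```

With these made-up samples the median is 350 ms while P95 is 950 ms: the typical turn looks fast, but roughly one turn in twenty is slow enough for a caller to notice, which is the kind of tail behavior a median hides.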
IMPACT Highlights the critical balance between speed and accuracy for voice agents, influencing STT model selection.
RANK_REASON Product benchmark release from a non-frontier lab.
AI-generated summary · Google Gemini · from 1 source.