Hugging Face has released EVA-Bench Data 2.0, an expanded benchmark for evaluating voice agents. This new version broadens its scope to three enterprise domains: Airline Customer Service Management, Enterprise IT Service Management, and Healthcare HR Service Delivery. The updated dataset includes 213 scenarios across 121 tools, a significant increase from its previous iteration, and has been validated against leading models like GPT-5.4, Gemini 3.1 Pro, and Claude Opus 4.6.
AI IMPACT Provides a more comprehensive and realistic evaluation framework for voice agents, pushing development towards better handling of complex enterprise tasks.
RANK_REASON Release of a new version of an evaluation benchmark dataset with expanded scope and scenarios.
- Anthropic
- Claude Opus 4.6
- EVA-Bench
- Gemini 3.1 Pro
- GPT-5.4
- Hugging Face
- OpenAI
- Anthropic Claude Opus 4.6
- Google Gemini 3.1 Pro
- OpenAI GPT-5.4
- ServiceNow
AI-generated summary · Google Gemini · from 2 sources.