Brief

last 24h

[2/2] 222 sources

Multi-source AI news clustered, deduplicated, and scored 0–100 across authority, cluster strength, headline signal, and time decay.

TOOL · Hugging Face Blog English(EN) · 3h

EVA-Bench Data 2.0: 3 Domains, 121 Tools, 213 Scenarios

Hugging Face has released EVA-Bench Data 2.0, an expanded benchmark for evaluating voice agents. The new version covers three domains: Airline Customer Service Management, Enterprise IT Service Management, and Healthcare HR Service Delivery, with 213 scenarios across 121 tools. This represents a fourfold increase in coverage compared to the original release. The benchmark was validated against leading models such as OpenAI's GPT-5.4, Google's Gemini 3.1 Pro, and Anthropic's Claude Opus 4.6 to check its rigor and fairness.

IMPACT Provides a more comprehensive evaluation suite for voice agents, pushing frontier models to improve across diverse enterprise scenarios.
- Anthropic
- OpenAI
- Hugging Face
- Google
- GPT-5.4
- Gemini 3.1 Pro
- Claude Opus 4.6
- EVA-Bench

TOOL · arXiv cs.AI English(EN) · 3w

EVA-Bench: A New End-to-end Framework for Evaluating Voice Agents

Researchers have introduced EVA-Bench, a new framework designed to comprehensively evaluate voice agents. This system addresses key challenges by generating realistic simulated conversations and measuring quality across voice-specific failure modes. EVA-Bench incorporates metrics for task completion, audio fidelity, and conversational experience, enabling cross-architecture comparisons. The framework includes numerous scenarios, robustness tests for accents and noise, and provides insights into system performance variations.

IMPACT Provides a standardized method for assessing voice agent capabilities, potentially accelerating development and deployment of more reliable conversational AI.
- EVA-Bench
- voice agents
