Hugging Face expands voice agent benchmark with 3 domains, 121 tools

By PulseAugur Editorial · [1 source] · 2026-06-04 12:24

Hugging Face has released EVA-Bench Data 2.0, an expanded benchmark for evaluating voice agents. The new version covers three domains (Airline Customer Service Management, Enterprise IT Service Management, and Healthcare HR Service Delivery) with 213 scenarios and 121 tools. This represents a fourfold increase in coverage compared to the original release. The benchmark was validated against leading models, including OpenAI's GPT-5.4, Google's Gemini 3.1 Pro, and Anthropic's Claude Opus 4.6, a step intended to check its rigor and fairness.

IMPACT Provides a more comprehensive evaluation suite for voice agents, pushing frontier models to improve across diverse enterprise scenarios.

AI-generated summary · Google Gemini · from 1 source.

COVERAGE [1]

Hugging Face Blog TIER_1 English(EN) · 2026-06-04 12:24

EVA-Bench Data 2.0: 3 Domains, 121 Tools, 213 Scenarios
