Researchers have introduced HEALTHDIAL, a new large-scale, multilingual dataset designed for developing and evaluating retrieval-augmented generation (RAG) systems in spoken dialogue. The dataset includes 6,000 information-seeking dialogues across Arabic, Chinese, English, and Spanish, grounded in World Health Organization (WHO) content. It also features 163 hours of recorded speech from native speakers and detailed demographic and sociolinguistic annotations. Initial benchmark results indicate performance disparities across languages, even for those considered high-resource.
AI IMPACT Enables development and evaluation of multilingual spoken dialogue systems, potentially improving access to health information.
RANK_REASON The cluster describes the release of a new academic dataset for AI research.
AI-generated summary · Google Gemini · from 2 sources.