Researchers have introduced MeDial-Speech, a new dataset designed to train and evaluate AI models for medical consultations. The dataset comprises over 111 hours of speech data from robot-patient and doctor-patient dialogues, covering four specific health conditions. It also includes a benchmark for sentence selection, which was used to test three leading LLMs: GPT-5 mini, DeepSeek-V3, and Claude Sonnet 4. Results indicated that Claude Sonnet 4 performed best on sentence selection, though all tested LLMs exhibited overconfidence in their predictions.
AI IMPACT This dataset and benchmark could accelerate the development and evaluation of AI systems for medical dialogue, potentially improving patient care and consultation efficiency.
RANK_REASON The cluster describes a new academic paper introducing a dataset and benchmark for evaluating LLMs in a specific domain.
AI-generated summary · Google Gemini · from 1 source.