EHRNote-ChatQA: A Benchmark for Evidence-Grounded Multi-Turn Clinical Question Answering over Longitudinal Discharge Summaries
Researchers have introduced EHRNote-ChatQA, a benchmark for evaluating multi-turn clinical question answering over longitudinal patient discharge summaries. Derived from de-identified MIMIC-IV data, it contains over 16,000 expert-verified question-answer pairs across 967 patient-level samples. Initial evaluations of 22 LLMs reveal substantial difficulties with evidence grounding and with errors compounding across turns, suggesting that performance on single-turn clinical QA does not reliably transfer to this more complex setting.
IMPACT: Establishes a new evaluation standard for clinical LLM applications, highlighting current limitations in evidence grounding and multi-turn reasoning.