PulseAugur

New benchmarks and frameworks enhance LLM reliability in clinical settings

Researchers have developed new benchmarks and frameworks to improve the reliability and safety of large language models (LLMs) in clinical decision-making. EHRBench and MedCase-Structured aim to evaluate LLMs on realistic electronic health record data, with EHRBench generating nearly one million question-answer items for diagnosis, treatment, and prognosis tasks. JMedEthicBench addresses the need for multi-turn conversational safety evaluations in Japanese, while SafeMed-R1 focuses on clinician-audited safety and ethics alignment. Additionally, MoBayes proposes a modular Bayesian framework to separate probabilistic reasoning from language generation for more reliable clinical decision support.

IMPACT These advancements aim to improve the safety, reliability, and equitable deployment of LLMs in healthcare by providing better evaluation tools and methods.

RANK_REASON Multiple research papers introducing new benchmarks, datasets, and frameworks for evaluating and improving LLMs in clinical settings.

AI-generated summary · Google Gemini · from 10 sources.

COVERAGE [10]

  1. arXiv cs.AI TIER_1 English(EN) · Yuzhang Xie, Keqi Han, Yunpeng Xiao, Hejie Cui, Guanchen Wu, Ziyang Zhang, Kai Shu, Jiaying Lu, Xiao Hu, Carl Yang ·

    EHRBench: An Automated and Reliable EHR-based Benchmark for Clinical Decision Making with LLMs

    arXiv:2605.30637v1 Announce Type: new Abstract: Clinical decision-making (CDM) is central to real-world clinical workflows, where clinicians infer diagnoses, select treatments, or anticipate future health outcomes under incomplete evidence. LLMs are increasingly used to support t…

  2. arXiv cs.AI TIER_1 English(EN) · Valentina Bui Muti, Eugénie Dulout, Ziquan Fu ·

    MedCase-Structured: A Text-to-FHIR Dataset for Benchmarking Diagnostic Reasoning in Clinically Realistic EHR Settings

    arXiv:2605.30295v1 Announce Type: cross Abstract: Large language models (LLMs) show promise for clinical reasoning and decision support, but evaluation in realistic, electronic health record-congruent settings remains limited. Existing benchmarks often rely on static datasets or …

  3. arXiv cs.AI TIER_1 English(EN) · Ziquan Fu ·

    MedCase-Structured: A Text-to-FHIR Dataset for Benchmarking Diagnostic Reasoning in Clinically Realistic EHR Settings

    Large language models (LLMs) show promise for clinical reasoning and decision support, but evaluation in realistic, electronic health record-congruent settings remains limited. Existing benchmarks often rely on static datasets or unstructured inputs that do not reflect the struct…

  4. arXiv cs.AI TIER_1 English(EN) · Junyu Liu, Zirui Li, Qian Niu, Zequn Zhang, Yue Xun, Wenlong Hou, Shujun Wang, Yusuke Iwasawa, Yutaka Matsuo, Kan Hatakeyama-Sato ·

    JMedEthicBench: A Multi-Turn Conversational Benchmark for Evaluating Medical Safety in Japanese Large Language Models

    arXiv:2601.01627v3 Announce Type: replace-cross Abstract: As Large Language Models (LLMs) are increasingly deployed in the healthcare field, it becomes essential to carefully evaluate their medical safety before clinical use. However, existing safety benchmarks remain predominantly E…

  5. arXiv cs.AI TIER_1 English(EN) · Chao Ding, Mouxiao Bian, Tianbin Li, Minjia Yuan, Yidong Jiang, Yankai Jiang, Jinru Ding, Jiayuan Chen, Zhuangzhi Gao, Pengcheng Chen, Zhao He, Rongzhao Zhang, Meiling Liu, Luyi Jiang, Jie Xu ·

    SafeMed-R1: Clinician-Audited Safety and Ethics Alignment for Medical Large Language Models

    arXiv:2605.28338v1 Announce Type: new Abstract: Large language models (LLMs) increasingly match expert performance on licensing examinations, yet routine clinical use remains limited because governance requires auditable reasoning, safety and ethics alignment, and resilience to ad…

  6. arXiv cs.AI TIER_1 English(EN) · Bushi Xiao, Sarvesh Soni, Daisy Zhe Wang ·

    Reverse Probing: Supervised Token-level Uncertainty Quantification for Large Language Models in Clinical Text

    arXiv:2605.28740v1 Announce Type: cross Abstract: As large language models are increasingly deployed for clinical text, ensuring they can reliably signal their own uncertainty becomes critical. Most existing uncertainty quantification (UQ) methods are designed for open-domain gen…

  7. arXiv cs.AI TIER_1 English(EN) · Daisy Zhe Wang ·

    Reverse Probing: Supervised Token-level Uncertainty Quantification for Large Language Models in Clinical Text

    As large language models are increasingly deployed for clinical text, ensuring they can reliably signal their own uncertainty becomes critical. Most existing uncertainty quantification (UQ) methods are designed for open-domain generation and cannot localize uncertainty at the tok…

  8. arXiv cs.CL TIER_1 English(EN) · Fan Gao, Sherry T. Tong, Jiwoong Sohn, Jiahao Huang, Junfeng Jiang, Ding Xia, Piyalitt Ittichaiwong, Kanyakorn Veerakanjana, Hyunjae Kim, Qingyu Chen, Edison Marrese Taylor, Kazuma Kobayashi, Akiko Aizawa, Irene Li ·

    Med-CoReasoner: Reducing Language Disparities in Medical Reasoning via Language-Informed Co-Reasoning

    arXiv:2601.08267v3 Announce Type: replace Abstract: While reasoning-enhanced large language models perform strongly on English medical tasks, a persistent multilingual gap remains, with substantially weaker reasoning in local languages, limiting equitable global medical deploymen…

  9. arXiv cs.AI TIER_1 English(EN) · Yuhao Shen, Lang Cao, Simo Du, Yuqing Wang, Juexiao Zhou, Hao Peng, Yue Guo ·

    MedGuideX: Internalizing Decision Logic from Executable Guidelines into Large Language Models for Clinical Reasoning

    arXiv:2605.26567v1 Announce Type: new Abstract: Clinical practice guidelines (CPGs) encode evidence-based decision logic that clinicians apply by evaluating patient variables, conditional criteria, and recommendation rules. However, existing methods often use CPGs as free-text tr…

  10. arXiv cs.AI TIER_1 English(EN) · Yusuf Kesmen, Fay Elhassan, Jiayi Ma, Julien Stalhandske, Yena Chang, David Sasu, Alexandra Kulinkina, Akhil Arora, Lars Klein, Mary-Anne Hartley ·

    MoBayes: A Modular Bayesian Framework for Separating Reasoning from Language in Conversational Clinical Decision Support

    arXiv:2604.20022v3 Announce Type: replace-cross Abstract: Large language models (LLMs) are increasingly used for conversational clinical decision support, yet they conflate next token prediction with probabilistic decision making. We argue that this conflation reflects an archite…