New benchmarks and frameworks strengthen LLM reliability in clinical settings

By the PulseAugur editorial team · [10 sources] · 2026-05-26 04:00

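Researchers have developed new benchmarks and frameworks to improve the reliability and safety of large language models (LLMs) in clinical decision-making. EHRBench and MedCase-Structured are designed to evaluate LLMs on real electronic health record (EHR) data; EHRBench generates nearly one million question-answer items covering diagnosis, treatment, and prognosis tasks. JMedEthicBench addresses the need for multi-turn conversational safety evaluation in Japanese, while SafeMed-R1 focuses on clinician-audited safety and ethics alignment. In addition, MoBayes proposes a modular Bayesian framework that separates probabilistic reasoning from language generation to provide more reliable clinical decision support.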

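Impact: These advances aim to make LLM deployment in healthcare safer, more reliable, and more equitable by providing better evaluation tools and methods.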

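Ranking rationale: Multiple research papers introduce new benchmarks, datasets, and frameworks for evaluating and improving LLMs in clinical settings.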

AI-generated summary · Google Gemini · from 10 sources.

Sources [10]

arXiv cs.AI TIER_1 English(EN) · Yuzhang Xie, Keqi Han, Yunpeng Xiao, Hejie Cui, Guanchen Wu, Ziyang Zhang, Kai Shu, Jiaying Lu, Xiao Hu, Carl Yang · 2026-06-01 04:00

EHRBench: An Automated and Reliable EHR-Based Benchmark for LLM Clinical Decision-Making

arXiv:2605.30637v1 Announce Type: new Abstract: Clinical decision-making (CDM) is central to real-world clinical workflows, where clinicians infer diagnoses, select treatments, or anticipate future health outcomes under incomplete evidence. LLMs are increasingly used to support t…
arXiv cs.AI TIER_1 English(EN) · Valentina Bui Muti, Eugénie Dulout, Ziquan Fu · 2026-05-29 04:00

MedCase-Structured: A Text-to-FHIR Dataset for Benchmarking Diagnostic Reasoning in Clinically Realistic Electronic Health Record Settings

arXiv:2605.30295v1 Announce Type: cross Abstract: Large language models (LLMs) show promise for clinical reasoning and decision support, but evaluation in realistic, electronic health record-congruent settings remains limited. Existing benchmarks often rely on static datasets or …
arXiv cs.AI TIER_1 English(EN) · Ziquan Fu · 2026-05-28 17:42

MedCase-Structured: A Text-to-FHIR Dataset for Benchmarking Diagnostic Reasoning in Clinically Realistic Electronic Health Record Settings

Large language models (LLMs) show promise for clinical reasoning and decision support, but evaluation in realistic, electronic health record-congruent settings remains limited. Existing benchmarks often rely on static datasets or unstructured inputs that do not reflect the struct…
arXiv cs.AI TIER_1 English(EN) · Junyu Liu, Zirui Li, Qian Niu, Zequn Zhang, Yue Xun, Wenlong Hou, Shujun Wang, Yusuke Iwasawa, Yutaka Matsuo, Kan Hatakeyama-Sato · 2026-05-28 04:00

JMedEthicBench: A Multi-Turn Conversational Benchmark for Evaluating the Medical Safety of Japanese Large Language Models

arXiv:2601.01627v3 Announce Type: replace-cross Abstract: As Large Language Models (LLMs) are increasingly deployed in healthcare field, it becomes essential to carefully evaluate their medical safety before clinical use. However, existing safety benchmarks remain predominantly E…
arXiv cs.AI TIER_1 English(EN) · Chao Ding, Mouxiao Bian, Tianbin Li, Minjia Yuan, Yidong Jiang, Yankai Jiang, Jinru Ding, Jiayuan Chen, Zhuangzhi Gao, Pengcheng Chen, Zhao He, Rongzhao Zhang, Meiling Liu, Luyi Jiang, Jie Xu · 2026-05-28 04:00

SafeMed-R1: Clinician-Audited Safety and Ethics Alignment for Medical Large Language Models

arXiv:2605.28338v1 Announce Type: new Abstract: Large language models (LLMs) increasingly match expert performance on licensing examinations, yet routine clinical use remains limited because governance requires auditable reasoning, safety and ethics alignment, and resilience to ad…
arXiv cs.AI TIER_1 English(EN) · Bushi Xiao, Sarvesh Soni, Daisy Zhe Wang · 2026-05-28 04:00

Reverse Probing: Supervised Token-Level Uncertainty Quantification for Large Language Models in Clinical Text

arXiv:2605.28740v1 Announce Type: cross Abstract: As large language models are increasingly deployed for clinical text, ensuring they can reliably signal their own uncertainty becomes critical. Most existing uncertainty quantification (UQ) methods are designed for open-domain gen…
arXiv cs.AI TIER_1 English(EN) · Daisy Zhe Wang · 2026-05-27 17:01

Reverse Probing: Supervised Token-Level Uncertainty Quantification for Large Language Models in Clinical Text

As large language models are increasingly deployed for clinical text, ensuring they can reliably signal their own uncertainty becomes critical. Most existing uncertainty quantification (UQ) methods are designed for open-domain generation and cannot localize uncertainty at the tok…
arXiv cs.CL TIER_1 English(EN) · Fan Gao, Sherry T. Tong, Jiwoong Sohn, Jiahao Huang, Junfeng Jiang, Ding Xia, Piyalitt Ittichaiwong, Kanyakorn Veerakanjana, Hyunjae Kim, Qingyu Chen, Edison Marrese Taylor, Kazuma Kobayashi, Akiko Aizawa, Irene Li · 2026-05-27 04:00

Med-CoReasoner: Reducing Language Disparities in Medical Reasoning via Language-Informed Co-Reasoning

arXiv:2601.08267v3 Announce Type: replace Abstract: While reasoning-enhanced large language models perform strongly on English medical tasks, a persistent multilingual gap remains, with substantially weaker reasoning in local languages, limiting equitable global medical deploymen…
arXiv cs.AI TIER_1 English(EN) · Yuhao Shen, Lang Cao, Simo Du, Yuqing Wang, Juexiao Zhou, Hao Peng, Yue Guo · 2026-05-27 04:00

MedGuideX: Internalizing Decision Logic from Executable Guidelines into Large Language Models for Clinical Reasoning

arXiv:2605.26567v1 Announce Type: new Abstract: Clinical practice guidelines (CPGs) encode evidence-based decision logic that clinicians apply by evaluating patient variables, conditional criteria, and recommendation rules. However, existing methods often use CPGs as free-text tr…
arXiv cs.AI TIER_1 English(EN) · Yusuf Kesmen, Fay Elhassan, Jiayi Ma, Julien Stalhandske, Yena Chang, David Sasu, Alexandra Kulinkina, Akhil Arora, Lars Klein, Mary-Anne Hartley · 2026-05-26 04:00

MoBayes: A Modular Bayesian Framework for Separating Reasoning from Language in Conversational Clinical Decision Support

arXiv:2604.20022v3 Announce Type: replace-cross Abstract: Large language models (LLMs) are increasingly used for conversational clinical decision support, yet they conflate next token prediction with probabilistic decision making. We argue that this conflation reflects an archite…
