PulseAugur
English(EN) Evaluating Large Language Models in Dynamic Clinical Decision-Making with Standardized Patient Cases

New benchmarks reveal the limitations of LLMs in dynamic clinical decision-making

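Researchers have developed new benchmarks to evaluate the capabilities of large language models (LLMs) in dynamic clinical decision-making scenarios. MedSP1000, derived from standardized patient cases, assesses how well LLMs manage patient care over time and shows that even top models such as GPT-5.5 meet only about 60% of expert standards. Similarly, the multimodal LLM BreastGPT was evaluated on BreastStage-Bench for breast cancer care, showing promise but highlighting the need for workflow-aligned data. ClinicalMC offers another benchmark for multi-course clinical decision-making, evaluating a range of LLMs in both static and dynamic settings.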

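Impact: These new benchmarks highlight the current limitations of LLMs in complex, dynamic medical scenarios, suggesting they are not yet ready for direct integration into clinical practice.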

Ranking rationale: Multiple research papers introduce new benchmarks and models for evaluating LLMs in clinical settings.


AI-generated summary · Google Gemini · From 7 sources.

Sources [7]

  1. arXiv cs.CL TIER_1 English(EN) · Cheng Liang, Pengcheng Qiu, Ya Zhang, Yanfeng Wang, Chaoyi Wu, Weidi Xie ·

    Evaluating Large Language Models in Dynamic Clinical Decision-Making with Standardized Patient Cases

    arXiv:2606.05112v1 Announce Type: new Abstract: Large language models (LLMs) are increasingly proposed as clinical agents, yet static, single-turn benchmarks cannot capture how a model dynamically delivers care across an encounter: gathering information, planning treatment, and a…

  2. arXiv cs.CL TIER_1 English(EN) · Yang Liu, Jiajin Zhang, Danyang Tu, Yaojun Hu, Jiao Qu, Jiuyu Zhang, Yu Shi, Wei Fang, Shi Gu, Ling Zhang, Yingda Xia ·

    BreastGPT: A Multimodal Large Language Model for the Full Spectrum of Breast Cancer Clinical Routine

    arXiv:2606.04911v1 Announce Type: cross Abstract: Breast cancer remains a leading cause of cancer-related mortality among women. Its clinical management requires multimodal reasoning across a clinical workflow that spans screening, diagnosis and treatmen…

  3. arXiv cs.CL TIER_1 English(EN) · Weidi Xie ·

    Evaluating Large Language Models in Dynamic Clinical Decision-Making with Standardized Patient Cases

    Large language models (LLMs) are increasingly proposed as clinical agents, yet static, single-turn benchmarks cannot capture how a model dynamically delivers care across an encounter: gathering information, planning treatment, and adapting longitudinal management across successiv…

  4. Hugging Face Daily Papers TIER_1 English(EN) ·

    BreastGPT: A Multimodal Large Language Model for the Full Spectrum of Breast Cancer Clinical Routine

    Breast cancer remains a leading cause of cancer-related mortality among women. Its clinical management requires multimodal reasoning across a clinical workflow that spans screening, diagnosis and treatment planning, where each stage involves distinct im…

  5. arXiv cs.CL TIER_1 English(EN) · Yingda Xia ·

    BreastGPT: A Multimodal Large Language Model for the Full Spectrum of Breast Cancer Clinical Routine

    Breast cancer remains a leading cause of cancer-related mortality among women. Its clinical management requires multimodal reasoning across a clinical workflow that spans screening, diagnosis and treatment planning, where each stage involves distinct im…

  6. arXiv cs.AI TIER_1 English(EN) · Ruihui Hou, Siyi Zhu, Ziyue Huai, Guangya Yu, Yongqi Fan, Chunming Wang, Tong Ruan ·

    ClinicalMC: A Multi-Course Clinical Decision-Making Benchmark for Large Language Models

    arXiv:2606.03157v1 Announce Type: new Abstract: Large language models (LLMs) have been widely adopted in healthcare, yet they still encounter significant challenges in complex clinical decision-making scenarios. Existing benchmarks primarily assess LLM performance in single-cours…

  7. Hugging Face Daily Papers TIER_1 English(EN) ·

    Evaluating Large Language Models in Dynamic Clinical Decision-Making with Standardized Patient Cases

    MedSP1000 introduces an interactive benchmark derived from standardized patients to evaluate clinical agents' dynamic performance across encounters, revealing limitations of current large language models in medical applications.