Clinical Reasoning Graphs: Structured Evaluation of LLM Diagnostic Reasoning Reveals Competence Without Consistency
New frameworks reveal flaws in LLM clinical reasoning despite acceptable diagnostic accuracy
By the PulseAugur editorial team · [6 sources]
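Two new research papers introduce frameworks for evaluating the clinical reasoning of large language models (LLMs). The first, CLExEval, uses a human-in-the-loop approach with progressive information masking to expose failure modes such as redundancy bias and reasoning-to-output mismatch in models including GPT-4o-mini. The second, Clinical Reasoning Graphs, uses structured graph representations to analyze LLM diagnostic trajectories, showing that models demonstrate diagnostic competence but do not reason consistently across similar cases. Both studies stress that process-level evaluation, beyond simple accuracy metrics, is needed before LLMs can be applied reliably in clinical settings.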
AI
arXiv cs.CL
William Philipp, Finn Fassbender, Thorsten Langer, Martje Pauly, Rebecca Herzog, Alexander Baumann, Markus Hobert, Theresa Paulus, Ip Chi Wang, Lukas Goede, Johanna Reimer, Sebastian Löns, Ronald Böck, Sebastian Fudickar
Open-response evaluation provides stronger clinical validity than multiple-choice benchmarks but creates a scoring bottleneck that motivates automated LLM-as-a-Judge approaches. Whether such evaluators replicate clinical calibration and caution, however, remains untested. We intro…
arXiv cs.CL
Ajmal M., Abin Roy, Afthab Salam Kanniyan, Jawadh Abdul Kabeer, Jerin James, Preslav Nakov, Zhuohan Xie
arXiv:2606.31608v1 (announce type: new). Abstract: Large Language Models (LLMs) achieve strong results on many medical benchmarks, but their clinical reasoning remains difficult to evaluate reliably. A central risk is an evaluation illusion: fluent and well-structured explanations can appear clinically convincing even when the fi…
arXiv cs.AI
Nisarg A. Patel (University of California, San Francisco)
arXiv:2606.29876v1 (announce type: cross). Abstract: Modern large language models (LLMs) reach 60-70% diagnostic accuracy on complex clinical case benchmarks, but accuracy alone cannot distinguish stable clinically-grounded reasoning from pattern matching. We introduce clinical reasoning graphs, structured graph representations ext…