Clinical Reasoning Graphs: Structured Evaluation of LLM Diagnostic Reasoning Reveals Competence Without Consistency

New frameworks reveal flaws in LLM clinical reasoning despite acceptable diagnostic accuracy

By PulseAugur Editorial Team · [6 sources] · 2026-06-29 07:16

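Two new research papers introduce frameworks for evaluating the clinical reasoning of large language models (LLMs). The first, CLExEval, uses a human-in-the-loop approach with progressive information masking to expose failure modes such as redundancy bias and reasoning-to-output mismatch in models including GPT-4o-mini. The second, Clinical Reasoning Graphs, uses structured graph representations to analyze LLM diagnostic trajectories, and finds that models show diagnostic competence but do not reason consistently across similar cases. Both studies stress that process-level evaluation, beyond simple accuracy metrics, is needed to ensure LLMs can be applied reliably in clinical settings.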
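To make "consistent reasoning across similar cases" concrete, here is a minimal sketch of one way such consistency could be scored. It is not the method from either paper: the edge-set representation, the Jaccard metric, the function names, and the two example cases are all illustrative assumptions.

```python
from itertools import combinations

# Hypothetical representation: a reasoning graph is a set of directed
# (source, target) edges, e.g. finding -> hypothesis -> diagnosis.
Edge = tuple[str, str]


def jaccard(a: set[Edge], b: set[Edge]) -> float:
    """Jaccard similarity of two edge sets (defined as 1.0 when both are empty)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def mean_pairwise_consistency(graphs: list[set[Edge]]) -> float:
    """Average Jaccard similarity over all pairs of graphs."""
    pairs = list(combinations(graphs, 2))
    if not pairs:
        return 1.0
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Two made-up cases that reach the same diagnosis through partly different paths.
case_a = {("chest pain", "ACS"), ("ST elevation", "ACS"), ("ACS", "STEMI")}
case_b = {("chest pain", "ACS"), ("troponin rise", "ACS"), ("ACS", "STEMI")}

print(mean_pairwise_consistency([case_a, case_b]))  # 2 shared edges / 4 total = 0.5
```

Under this toy metric both cases count as "accurate" because they end at the same diagnosis, yet they share only half of their reasoning edges. That gap is the kind that accuracy alone would not reveal.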

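Impact: Highlights key limitations in LLM clinical reasoning, suggests that current evaluation methods may overestimate reliability, and cautions against deploying these models in healthcare without verification.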

Ranking rationale: Two academic papers introducing new evaluation frameworks for LLMs.


AI-generated summary · Google Gemini · from 6 sources.

Sources [6]

arXiv cs.CL TIER_1 English(EN) · William Philipp, Finn Fassbender, Thorsten Langer, Martje Pauly, Rebecca Herzog, Alexander Baumann, Markus Hobert, Theresa Paulus, Ip Chi Wang, Lukas Goede, Johanna Reimer, Sebastian Löns, Ronald Böck, Sebastian Fudickar · 2026-07-02 04:00

Clinician-Level Agreement Without Clinical Caution: LLM Evaluator Limits in Medical AI Benchmarking

arXiv:2607.01103v1 Announce Type: new Abstract: Open-response evaluation provides stronger clinical validity than multiple-choice benchmarks but creates a scoring bottleneck that motivates automated LLM-as-a-Judge approaches. Whether such evaluators replicate clinical calibration …
arXiv cs.CL TIER_1 English(EN) · Sebastian Fudickar · 2026-07-01 15:55

Clinician-Level Agreement Without Clinical Caution: LLM Evaluator Limits in Medical AI Benchmarking

Open-response evaluation provides stronger clinical validity than multiple-choice benchmarks but creates a scoring bottleneck that motivates automated LLM-as-a-Judge approaches. Whether such evaluators replicate clinical calibration and caution, however, remains untested. We intro…
arXiv cs.CL TIER_1 English(EN) · Ajmal M., Abin Roy, Afthab Salam Kanniyan, Jawadh Abdul Kabeer, Jerin James, Preslav Nakov, Zhuohan Xie · 2026-07-01 04:00

CLExEval: A Human-in-the-Loop Framework for Qualitative Evaluation of LLM Clinical Reasoning

arXiv:2606.31608v1 Announce Type: new Abstract: Large Language Models (LLMs) achieve strong results on many medical benchmarks, but their clinical reasoning remains difficult to evaluate reliably. A central risk is an evaluation illusion: fluent and well-structured explanations c…
arXiv cs.CL TIER_1 English(EN) · Zhuohan Xie · 2026-06-30 12:56

CLExEval: A Human-in-the-Loop Framework for Qualitative Evaluation of LLM Clinical Reasoning

Large Language Models (LLMs) achieve strong results on many medical benchmarks, but their clinical reasoning remains difficult to evaluate reliably. A central risk is an evaluation illusion: fluent and well-structured explanations can appear clinically convincing even when the fi…
arXiv cs.AI TIER_1 English(EN) · Nisarg A. Patel (University of California, San Francisco) · 2026-06-30 04:00

Clinical Reasoning Graphs: Structured Evaluation of LLM Diagnostic Reasoning Reveals Competence Without Consistency

arXiv:2606.29876v1 Announce Type: cross Abstract: Modern large language models (LLMs) reach 60-70% diagnostic accuracy on complex clinical case benchmarks, but accuracy alone cannot distinguish stable clinically-grounded reasoning from pattern matching. We introduce clinical reas…
arXiv cs.CL TIER_1 English(EN) · Nisarg A. Patel · 2026-06-29 07:16

Clinical Reasoning Graphs: Structured Evaluation of LLM Diagnostic Reasoning Reveals Competence Without Consistency

Modern large language models (LLMs) reach 60-70% diagnostic accuracy on complex clinical case benchmarks, but accuracy alone cannot distinguish stable clinically-grounded reasoning from pattern matching. We introduce clinical reasoning graphs, structured graph representations ext…

