New Benchmarks and Frameworks Emerge for Evaluating Large Language Models in Healthcare

By PulseAugur Editorial · [22 sources] · 2024-04-19 00:00

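Researchers have developed new benchmarks and frameworks for evaluating the performance of large language models (LLMs) in the medical domain, addressing limitations of existing datasets. Google Research introduced AfriMed-QA, a comprehensive dataset for African health question answering, along with a scalable framework that evaluates health LLMs using Adaptive Precise Boolean rubrics. In addition, new research explores entity-centric data engineering for multimodal LLMs and the creation of a large-scale Dutch medical language corpus.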

Impact: New benchmarks and evaluation frameworks are emerging to improve the reliability and generalization of medical LLMs.

Ranking rationale: Multiple research papers and datasets propose ways to evaluate LLMs in the medical domain.


AI-generated summary · Google Gemini · from 22 sources.

Sources [22]

Google AI / Research TIER_1 English(EN) · 2025-09-24 19:11

AfriMed-QA: Benchmarking Large Language Models for Global Health

Generative AI
Google AI / Research TIER_1 English(EN) · 2025-08-26 12:34

A Scalable Framework for Evaluating Health Language Models

Generative AI
Hugging Face Blog TIER_1 English(EN) · 2024-04-19 00:00

The Open Medical-LLM Leaderboard: Benchmarking Large Language Models in Healthcare
arXiv cs.CL TIER_1 English(EN) · Rebecca Soskin Hicks, Mikhail Trofimov, Dominick Lim, Rahul K. Arora, Foivos Tsimpourlas, Preston Bowman, Michael Sharman, Chi Tong, Kavin Karthik, Arnav Dugar, Akshay Jagadeesh, Khaled Saab, Johannes Heidecke, Ashley Alexander, Nate Gross, Karan Singhal · 2026-05-01 04:00

HealthBench Professional: Evaluating Large Language Models on Real Clinician Chats

arXiv:2604.27470v1 Announce Type: new Abstract: Millions of clinicians use ChatGPT to support clinical care, but evaluations of the most common use cases in model-clinician conversations are limited. We introduce HealthBench Professional, an open benchmark for evaluating large la…
arXiv cs.AI TIER_1 English(EN) · Sukesh Subaharan, Venkatesan VS, Murugadasan P, Sivakumar D, Gautham N, Ganeshkumar M · 2026-05-01 04:00

Modeling Clinical Concern Trajectories in Language Model Agents

arXiv:2604.27872v1 Announce Type: new Abstract: Large language model (LLM) agents deployed in clinical settings often exhibit abrupt, threshold-driven behavior, offering little visibility into accumulating risk prior to escalation. In real-world care, however, clinicians act on g…
arXiv cs.AI TIER_1 English(EN) · Ganeshkumar M · 2026-04-30 13:53

Modeling Clinical Concern Trajectories in Language Model Agents

Large language model (LLM) agents deployed in clinical settings often exhibit abrupt, threshold-driven behavior, offering little visibility into accumulating risk prior to escalation. In real-world care, however, clinicians act on gradually rising concern rather than instantaneou…
arXiv cs.CL TIER_1 English(EN) · Karan Singhal · 2026-04-30 06:13

HealthBench Professional: Evaluating Large Language Models on Real Clinician Chats

Millions of clinicians use ChatGPT to support clinical care, but evaluations of the most common use cases in model-clinician conversations are limited. We introduce HealthBench Professional, an open benchmark for evaluating large language models on real tasks that clinicians brin…
arXiv cs.CL TIER_1 English(EN) · Wenting Chen, Guo Yu, Yiu-Fai Cheung, Meidan Ding, Jie Liu, Zizhan Ma, Wenxuan Wang, Linlin Shen · 2026-04-30 04:00

Beyond the Leaderboard: Rethinking Medical Benchmarks for Large Language Models

arXiv:2508.04325v2 Announce Type: replace Abstract: Large language models (LLMs) show significant potential in healthcare, prompting numerous benchmarks to evaluate their capabilities. However, concerns persist regarding the reliability of these benchmarks, which often lack clini…
arXiv cs.CL TIER_1 English(EN) · Pierre Epron (HeKA | U1346, DIG), Adrien Coulet (HeKA | U1346), Mehwish Alam (IP Paris, DIG) · 2026-04-30 04:00

Analyzing Lightweight Large Language Models for Biomedical Named Entity Recognition Across Output Formats

arXiv:2604.25920v1 Announce Type: new Abstract: Despite their strong linguistic capabilities, Large Language Models (LLMs) are computationally demanding and require substantial resources for fine-tuning, which is unadapted to privacy and budget constraints of many healthcare sett…
arXiv cs.CL TIER_1 English(EN) · Manar Aljohani, Brandon Ho, Kenneth McKinley, Dennis Ren, Xuan Wang · 2026-04-30 04:00

Domain-Adapted Small Language Models for Reliable Clinical Triage

arXiv:2604.26766v1 Announce Type: new Abstract: Accurate and consistent Emergency Severity Index (ESI) assignment remains a persistent challenge in emergency departments, where highly variable free-text triage documentation contributes to mistriage and workflow inefficiencies. Th…
arXiv cs.CL TIER_1 English(EN) · Md Biplob Hosen, Md Alomgeer Hussein, Md Akmol Masud, Omar Faruque, Tera L Reynolds, Lujie Karen Chen · 2026-04-30 04:00

HealthNLP_Retrievers at ArchEHR-QA 2026: A Cascaded LLM Pipeline for Grounded Clinical Question Answering

arXiv:2604.26880v1 Announce Type: new Abstract: Patient portals now give individuals direct access to their electronic health records (EHRs), yet access alone does not ensure patients understand or act on the complex clinical information contained in these records. The ArchEHR-QA…
arXiv cs.CL TIER_1 English(EN) · Lujie Karen Chen · 2026-04-29 16:47

HealthNLP_Retrievers at ArchEHR-QA 2026: A Cascaded LLM Pipeline for Grounded Clinical Question Answering

Patient portals now give individuals direct access to their electronic health records (EHRs), yet access alone does not ensure patients understand or act on the complex clinical information contained in these records. The ArchEHR-QA 2026 shared task addresses this challenge by fo…
arXiv cs.CL TIER_1 English(EN) · Xuan Wang · 2026-04-29 15:00

Domain-Adapted Small Language Models for Reliable Clinical Triage

Accurate and consistent Emergency Severity Index (ESI) assignment remains a persistent challenge in emergency departments, where highly variable free-text triage documentation contributes to mistriage and workflow inefficiencies. This study evaluates whether open-source small lan…
arXiv cs.CL TIER_1 English(EN) · B. van Es · 2026-04-29 04:00

A Language Corpus for the Dutch Medical Domain

arXiv:2604.25374v1 Announce Type: new Abstract: Background: Dutch medical corpora are scarce, limiting NLP development. Methods: We translated English datasets, identified medical text in generic corpora, and extracted open Dutch medical resources. …
arXiv cs.CL TIER_1 English(EN) · Jianghang Lin, Haihua Yang, Deli Yu, Kai Wu, Kai Ye, Jinghao Lin, Zihan Wang, Yuhang Wu, Liujuan Cao · 2026-04-29 04:00

Learning from Medical Entity Trees: An Entity-Centric Medical Data Engineering Framework for MLLMs

arXiv:2604.25296v1 Announce Type: new Abstract: Multimodal Large Language Models (MLLMs) have shown transformative potential in medical applications, yet their performance is hindered by conventional data curation strategies that rely on coarse-grained partitioning by modality or…
arXiv cs.AI TIER_1 English(EN) · Ian M. Campbell · 2026-04-28 13:09

Health-System-Scale Semantic Search over Unstructured Clinical Notes

Introduction: Semantic search, which retrieves documents based on conceptual similarity rather than keyword matching, offers substantial advantages for retrieval of clinical information. However, deploying semantic search across entire health systems, comprising hundreds of milli…
arXiv cs.CL TIER_1 English(EN) · B. van Es · 2026-04-28 08:38

A Language Corpus for the Dutch Medical Domain

Background: Dutch medical corpora are scarce, limiting NLP development. Methods: We translated English datasets, identified medical text in generic corpora, and extracted open Dutch medical resources. Results: The resulting corpus comprises ± …
arXiv cs.CL TIER_1 English(EN) · Liujuan Cao · 2026-04-28 07:05

Learning from Medical Entity Trees: An Entity-Centric Medical Data Engineering Framework for MLLMs

Multimodal Large Language Models (MLLMs) have shown transformative potential in medical applications, yet their performance is hindered by conventional data curation strategies that rely on coarse-grained partitioning by modality or department. Such fragmented approaches fail to …
arXiv cs.CL TIER_1 Italiano(IT) · Francesco Andrea Causio, Vittorio De Vita, Olivia Riccomi, Michele Ferramola, Federico Felizzi, Alessandro Tosi, Antonio Cristiano, Lorenzo De Mori, Chiara Battipaglia, Melissa Sawaya, Luigi De Angelis, Marcello Di Pumpo, Alessandra Piscitelli, Pietro Eri · 2026-04-27 04:00

EuropeMedQA Study Protocol: A Multilingual, Multimodal Medical Examination Dataset for Language Model Evaluation

arXiv:2604.14306v2 Announce Type: replace Abstract: While Large Language Models (LLMs) have demonstrated high proficiency on English-centric medical examinations, their performance often declines when faced with non-English languages and multimodal diagnostic tasks. This study pr…
arXiv cs.LG TIER_1 English(EN) · Hojjat Karami, David Atienza, Jean-Philippe Thiran, Anisoara Ionescu · 2026-04-27 04:00

FeatEHR-LLM: Leveraging Large Language Models for Feature Engineering in Electronic Health Records

arXiv:2604.22534v1 Announce Type: new Abstract: Feature engineering for Electronic Health Records (EHR) is complicated by irregular observation intervals, variable measurement frequencies, and structural sparsity inherent to clinical time series. Existing automated methods either…
arXiv cs.AI TIER_1 English(EN) · Anisoara Ionescu · 2026-04-24 13:21

FeatEHR-LLM: Leveraging Large Language Models for Feature Engineering in Electronic Health Records

Feature engineering for Electronic Health Records (EHR) is complicated by irregular observation intervals, variable measurement frequencies, and structural sparsity inherent to clinical time series. Existing automated methods either lack clinical domain awareness or assume clean,…
Databricks Blog TIER_1 English(EN) · 2026-04-28 19:06

From Months to Minutes: Building Real-Time Clinical Data Pipelines with Natural Language

This post was co-written by Assunta Carey-Saylor (Senior Product Marketing at Redox)...
