New benchmarks and frameworks emerge for evaluating LLMs in healthcare

By PulseAugur Editorial · [22 sources] · 2024-04-19 00:00

Researchers have developed new benchmarks and frameworks to evaluate the performance of large language models (LLMs) in the medical domain, addressing limitations in existing datasets. Google Research introduced AfriMed-QA, a comprehensive dataset for African health question answering, and a scalable framework using adaptive precise boolean rubrics for evaluating health LLMs. Additionally, new research explores entity-centric data engineering for multimodal LLMs and the creation of large-scale Dutch medical language corpora.

IMPACT New benchmarks and evaluation frameworks are emerging to improve the reliability and generalizability of medical LLMs.

RANK_REASON Multiple research papers and datasets are presented for evaluating LLMs in the medical domain.

AI-generated summary · Google Gemini · from 22 sources.

COVERAGE [22]

Google AI / Research TIER_1 English(EN) · 2025-09-24 19:11

AfriMed-QA: Benchmarking large language models for global health

Generative AI
Google AI / Research TIER_1 English(EN) · 2025-08-26 12:34

A scalable framework for evaluating health language models

Generative AI
Hugging Face Blog TIER_1 English(EN) · 2024-04-19 00:00

The Open Medical-LLM Leaderboard: Benchmarking Large Language Models in Healthcare
arXiv cs.CL TIER_1 English(EN) · Rebecca Soskin Hicks, Mikhail Trofimov, Dominick Lim, Rahul K. Arora, Foivos Tsimpourlas, Preston Bowman, Michael Sharman, Chi Tong, Kavin Karthik, Arnav Dugar, Akshay Jagadeesh, Khaled Saab, Johannes Heidecke, Ashley Alexander, Nate Gross, Karan Singhal · 2026-05-01 04:00

HealthBench Professional: Evaluating Large Language Models on Real Clinician Chats

arXiv:2604.27470v1 Announce Type: new Abstract: Millions of clinicians use ChatGPT to support clinical care, but evaluations of the most common use cases in model-clinician conversations are limited. We introduce HealthBench Professional, an open benchmark for evaluating large la…
arXiv cs.AI TIER_1 English(EN) · Sukesh Subaharan, Venkatesan VS, Murugadasan P, Sivakumar D, Gautham N, Ganeshkumar M · 2026-05-01 04:00

Modeling Clinical Concern Trajectories in Language Model Agents

arXiv:2604.27872v1 Announce Type: new Abstract: Large language model (LLM) agents deployed in clinical settings often exhibit abrupt, threshold-driven behavior, offering little visibility into accumulating risk prior to escalation. In real-world care, however, clinicians act on g…
arXiv cs.AI TIER_1 English(EN) · Ganeshkumar M · 2026-04-30 13:53

Modeling Clinical Concern Trajectories in Language Model Agents

Large language model (LLM) agents deployed in clinical settings often exhibit abrupt, threshold-driven behavior, offering little visibility into accumulating risk prior to escalation. In real-world care, however, clinicians act on gradually rising concern rather than instantaneou…
arXiv cs.CL TIER_1 English(EN) · Karan Singhal · 2026-04-30 06:13

HealthBench Professional: Evaluating Large Language Models on Real Clinician Chats

Millions of clinicians use ChatGPT to support clinical care, but evaluations of the most common use cases in model-clinician conversations are limited. We introduce HealthBench Professional, an open benchmark for evaluating large language models on real tasks that clinicians brin…
arXiv cs.CL TIER_1 English(EN) · Wenting Chen, Guo Yu, Yiu-Fai Cheung, Meidan Ding, Jie Liu, Zizhan Ma, Wenxuan Wang, Linlin Shen · 2026-04-30 04:00

Beyond the Leaderboard: Rethinking Medical Benchmarks for Large Language Models

arXiv:2508.04325v2 Announce Type: replace Abstract: Large language models (LLMs) show significant potential in healthcare, prompting numerous benchmarks to evaluate their capabilities. However, concerns persist regarding the reliability of these benchmarks, which often lack clini…
arXiv cs.CL TIER_1 English(EN) · Pierre Epron (HeKA | U1346, DIG), Adrien Coulet (HeKA | U1346), Mehwish Alam (IP Paris, DIG) · 2026-04-30 04:00

Analysing Lightweight Large Language Models for Biomedical Named Entity Recognition on Diverse Output Formats

arXiv:2604.25920v1 Announce Type: new Abstract: Despite their strong linguistic capabilities, Large Language Models (LLMs) are computationally demanding and require substantial resources for fine-tuning, which is unadapted to privacy and budget constraints of many healthcare sett…
arXiv cs.CL TIER_1 English(EN) · Manar Aljohani, Brandon Ho, Kenneth McKinley, Dennis Ren, Xuan Wang · 2026-04-30 04:00

Domain-Adapted Small Language Models for Reliable Clinical Triage

arXiv:2604.26766v1 Announce Type: new Abstract: Accurate and consistent Emergency Severity Index (ESI) assignment remains a persistent challenge in emergency departments, where highly variable free-text triage documentation contributes to mistriage and workflow inefficiencies. Th…
arXiv cs.CL TIER_1 English(EN) · Md Biplob Hosen, Md Alomgeer Hussein, Md Akmol Masud, Omar Faruque, Tera L Reynolds, Lujie Karen Chen · 2026-04-30 04:00

HealthNLP_Retrievers at ArchEHR-QA 2026: Cascaded LLM Pipeline for Grounded Clinical Question Answering

arXiv:2604.26880v1 Announce Type: new Abstract: Patient portals now give individuals direct access to their electronic health records (EHRs), yet access alone does not ensure patients understand or act on the complex clinical information contained in these records. The ArchEHR-QA…
arXiv cs.CL TIER_1 English(EN) · Lujie Karen Chen · 2026-04-29 16:47

HealthNLP_Retrievers at ArchEHR-QA 2026: Cascaded LLM Pipeline for Grounded Clinical Question Answering

Patient portals now give individuals direct access to their electronic health records (EHRs), yet access alone does not ensure patients understand or act on the complex clinical information contained in these records. The ArchEHR-QA 2026 shared task addresses this challenge by fo…
arXiv cs.CL TIER_1 English(EN) · Xuan Wang · 2026-04-29 15:00

Domain-Adapted Small Language Models for Reliable Clinical Triage

Accurate and consistent Emergency Severity Index (ESI) assignment remains a persistent challenge in emergency departments, where highly variable free-text triage documentation contributes to mistriage and workflow inefficiencies. This study evaluates whether open-source small lan…
arXiv cs.CL TIER_1 English(EN) · B. van Es · 2026-04-29 04:00

Language corpora for the Dutch medical domain

arXiv:2604.25374v1 Announce Type: new Abstract: Background: Dutch medical corpora are scarce, limiting NLP development. Methods: We translated English datasets, identified medical text in generic corpora, and extracted open Dutch medical resources. …
arXiv cs.CL TIER_1 English(EN) · Jianghang Lin, Haihua Yang, Deli Yu, Kai Wu, Kai Ye, Jinghao Lin, Zihan Wang, Yuhang Wu, Liujuan Cao · 2026-04-29 04:00

Learning from Medical Entity Trees: An Entity-Centric Medical Data Engineering Framework for MLLMs

arXiv:2604.25296v1 Announce Type: new Abstract: Multimodal Large Language Models (MLLMs) have shown transformative potential in medical applications, yet their performance is hindered by conventional data curation strategies that rely on coarse-grained partitioning by modality or…
arXiv cs.AI TIER_1 English(EN) · Ian M. Campbell · 2026-04-28 13:09

Health System Scale Semantic Search Across Unstructured Clinical Notes

Introduction: Semantic search, which retrieves documents based on conceptual similarity rather than keyword matching, offers substantial advantages for retrieval of clinical information. However, deploying semantic search across entire health systems, comprising hundreds of milli…
arXiv cs.CL TIER_1 English(EN) · B. van Es · 2026-04-28 08:38

Language corpora for the Dutch medical domain

Background: Dutch medical corpora are scarce, limiting NLP development. Methods: We translated English datasets, identified medical text in generic corpora, and extracted open Dutch medical resources. Results: The resulting corpus comprises ± …
arXiv cs.CL TIER_1 English(EN) · Liujuan Cao · 2026-04-28 07:05

Learning from Medical Entity Trees: An Entity-Centric Medical Data Engineering Framework for MLLMs

Multimodal Large Language Models (MLLMs) have shown transformative potential in medical applications, yet their performance is hindered by conventional data curation strategies that rely on coarse-grained partitioning by modality or department. Such fragmented approaches fail to …
arXiv cs.CL TIER_1 Italian(IT) · Francesco Andrea Causio, Vittorio De Vita, Olivia Riccomi, Michele Ferramola, Federico Felizzi, Alessandro Tosi, Antonio Cristiano, Lorenzo De Mori, Chiara Battipaglia, Melissa Sawaya, Luigi De Angelis, Marcello Di Pumpo, Alessandra Piscitelli, Pietro Eri · 2026-04-27 04:00

EuropeMedQA Study Protocol: A Multilingual, Multimodal Medical Examination Dataset for Language Model Evaluation

arXiv:2604.14306v2 Announce Type: replace Abstract: While Large Language Models (LLMs) have demonstrated high proficiency on English-centric medical examinations, their performance often declines when faced with non-English languages and multimodal diagnostic tasks. This study pr…
arXiv cs.LG TIER_1 English(EN) · Hojjat Karami, David Atienza, Jean-Philippe Thiran, Anisoara Ionescu · 2026-04-27 04:00

FeatEHR-LLM: Leveraging Large Language Models for Feature Engineering in Electronic Health Records

arXiv:2604.22534v1 Announce Type: new Abstract: Feature engineering for Electronic Health Records (EHR) is complicated by irregular observation intervals, variable measurement frequencies, and structural sparsity inherent to clinical time series. Existing automated methods either…
arXiv cs.AI TIER_1 English(EN) · Anisoara Ionescu · 2026-04-24 13:21

FeatEHR-LLM: Leveraging Large Language Models for Feature Engineering in Electronic Health Records

Feature engineering for Electronic Health Records (EHR) is complicated by irregular observation intervals, variable measurement frequencies, and structural sparsity inherent to clinical time series. Existing automated methods either lack clinical domain awareness or assume clean,…
Databricks Blog TIER_1 English(EN) · 2026-04-28 19:06

From months to minutes: Building real-time clinical data pipelines with natural language

This post was co-written by Assunta Carey-Saylor (Senior Product Marketing at Redox)...
