PulseAugur
research · [22 sources] ·

New benchmarks and frameworks emerge for evaluating LLMs in healthcare

Researchers have released new benchmarks and frameworks for evaluating large language models (LLMs) in the medical domain, addressing limitations in existing datasets. Google Research introduced AfriMed-QA, a dataset for African health question answering, and a scalable framework that uses adaptive precise boolean rubrics to evaluate health LLMs. Other work explores entity-centric data engineering for multimodal LLMs and the construction of large-scale Dutch medical language corpora.

Summary written by gemini-2.5-flash-lite from 22 sources.

IMPACT New benchmarks and evaluation frameworks are emerging to improve the reliability and generalizability of medical LLMs.

RANK_REASON Multiple research papers and datasets are presented for evaluating LLMs in the medical domain.


COVERAGE [22]

  1. Google AI / Research TIER_1 ·

    AfriMed-QA: Benchmarking large language models for global health

    Generative AI

  2. Google AI / Research TIER_1 ·

    A scalable framework for evaluating health language models

    Generative AI

  3. Hugging Face Blog TIER_1 ·

    The Open Medical-LLM Leaderboard: Benchmarking Large Language Models in Healthcare

  4. arXiv cs.CL TIER_1 · Rebecca Soskin Hicks, Mikhail Trofimov, Dominick Lim, Rahul K. Arora, Foivos Tsimpourlas, Preston Bowman, Michael Sharman, Chi Tong, Kavin Karthik, Arnav Dugar, Akshay Jagadeesh, Khaled Saab, Johannes Heidecke, Ashley Alexander, Nate Gross, Karan Singhal ·

    HealthBench Professional: Evaluating Large Language Models on Real Clinician Chats

    arXiv:2604.27470v1 Announce Type: new Abstract: Millions of clinicians use ChatGPT to support clinical care, but evaluations of the most common use cases in model-clinician conversations are limited. We introduce HealthBench Professional, an open benchmark for evaluating large la…

  5. arXiv cs.AI TIER_1 · Sukesh Subaharan, Venkatesan VS, Murugadasan P, Sivakumar D, Gautham N, Ganeshkumar M ·

    Modeling Clinical Concern Trajectories in Language Model Agents

    arXiv:2604.27872v1 Announce Type: new Abstract: Large language model (LLM) agents deployed in clinical settings often exhibit abrupt, threshold-driven behavior, offering little visibility into accumulating risk prior to escalation. In real-world care, however, clinicians act on g…

  6. arXiv cs.AI TIER_1 · Ganeshkumar M ·

    Modeling Clinical Concern Trajectories in Language Model Agents

    Large language model (LLM) agents deployed in clinical settings often exhibit abrupt, threshold-driven behavior, offering little visibility into accumulating risk prior to escalation. In real-world care, however, clinicians act on gradually rising concern rather than instantaneou…

  7. arXiv cs.CL TIER_1 · Karan Singhal ·

    HealthBench Professional: Evaluating Large Language Models on Real Clinician Chats

    Millions of clinicians use ChatGPT to support clinical care, but evaluations of the most common use cases in model-clinician conversations are limited. We introduce HealthBench Professional, an open benchmark for evaluating large language models on real tasks that clinicians brin…

  8. arXiv cs.CL TIER_1 · Wenting Chen, Guo Yu, Yiu-Fai Cheung, Meidan Ding, Jie Liu, Zizhan Ma, Wenxuan Wang, Linlin Shen ·

    Beyond the Leaderboard: Rethinking Medical Benchmarks for Large Language Models

    arXiv:2508.04325v2 Announce Type: replace Abstract: Large language models (LLMs) show significant potential in healthcare, prompting numerous benchmarks to evaluate their capabilities. However, concerns persist regarding the reliability of these benchmarks, which often lack clini…

  9. arXiv cs.CL TIER_1 · Pierre Epron (HeKA | U1346, DIG), Adrien Coulet (HeKA | U1346), Mehwish Alam (IP Paris, DIG) ·

    Analysing Lightweight Large Language Models for Biomedical Named Entity Recognition on Diverse Output Formats

    arXiv:2604.25920v1 Announce Type: new Abstract: Despite their strong linguistic capabilities, Large Language Models (LLMs) are computationally demanding and require substantial resources for fine-tuning, which is unadapted to privacy and budget constraints of many healthcare sett…

  10. arXiv cs.CL TIER_1 · Manar Aljohani, Brandon Ho, Kenneth McKinley, Dennis Ren, Xuan Wang ·

    Domain-Adapted Small Language Models for Reliable Clinical Triage

    arXiv:2604.26766v1 Announce Type: new Abstract: Accurate and consistent Emergency Severity Index (ESI) assignment remains a persistent challenge in emergency departments, where highly variable free-text triage documentation contributes to mistriage and workflow inefficiencies. Th…

  11. arXiv cs.CL TIER_1 · Md Biplob Hosen, Md Alomgeer Hussein, Md Akmol Masud, Omar Faruque, Tera L Reynolds, Lujie Karen Chen ·

    HealthNLP_Retrievers at ArchEHR-QA 2026: Cascaded LLM Pipeline for Grounded Clinical Question Answering

    arXiv:2604.26880v1 Announce Type: new Abstract: Patient portals now give individuals direct access to their electronic health records (EHRs), yet access alone does not ensure patients understand or act on the complex clinical information contained in these records. The ArchEHR-QA…

  12. arXiv cs.CL TIER_1 · Lujie Karen Chen ·

    HealthNLP_Retrievers at ArchEHR-QA 2026: Cascaded LLM Pipeline for Grounded Clinical Question Answering

    Patient portals now give individuals direct access to their electronic health records (EHRs), yet access alone does not ensure patients understand or act on the complex clinical information contained in these records. The ArchEHR-QA 2026 shared task addresses this challenge by fo…

  13. arXiv cs.CL TIER_1 · Xuan Wang ·

    Domain-Adapted Small Language Models for Reliable Clinical Triage

    Accurate and consistent Emergency Severity Index (ESI) assignment remains a persistent challenge in emergency departments, where highly variable free-text triage documentation contributes to mistriage and workflow inefficiencies. This study evaluates whether open-source small lan…

  14. arXiv cs.CL TIER_1 · B. van Es ·

    Language corpora for the Dutch medical domain

    arXiv:2604.25374v1 Announce Type: new Abstract: Background: Dutch medical corpora are scarce, limiting NLP development. Methods: We translated English datasets, identified medical text in generic corpora, and extracted open Dutch medical resources. …

  15. arXiv cs.CL TIER_1 · Jianghang Lin, Haihua Yang, Deli Yu, Kai Wu, Kai Ye, Jinghao Lin, Zihan Wang, Yuhang Wu, Liujuan Cao ·

    Learning from Medical Entity Trees: An Entity-Centric Medical Data Engineering Framework for MLLMs

    arXiv:2604.25296v1 Announce Type: new Abstract: Multimodal Large Language Models (MLLMs) have shown transformative potential in medical applications, yet their performance is hindered by conventional data curation strategies that rely on coarse-grained partitioning by modality or…

  16. arXiv cs.AI TIER_1 · Ian M. Campbell ·

    Health System Scale Semantic Search Across Unstructured Clinical Notes

    Introduction: Semantic search, which retrieves documents based on conceptual similarity rather than keyword matching, offers substantial advantages for retrieval of clinical information. However, deploying semantic search across entire health systems, comprising hundreds of milli…

  17. arXiv cs.CL TIER_1 · B. van Es ·

    Language corpora for the Dutch medical domain

    Background: Dutch medical corpora are scarce, limiting NLP development. Methods: We translated English datasets, identified medical text in generic corpora, and extracted open Dutch medical resources. Results: The resulting corpus comprises ± …

  18. arXiv cs.CL TIER_1 · Liujuan Cao ·

    Learning from Medical Entity Trees: An Entity-Centric Medical Data Engineering Framework for MLLMs

    Multimodal Large Language Models (MLLMs) have shown transformative potential in medical applications, yet their performance is hindered by conventional data curation strategies that rely on coarse-grained partitioning by modality or department. Such fragmented approaches fail to …

  19. arXiv cs.CL TIER_1 Italiano(IT) · Francesco Andrea Causio, Vittorio De Vita, Olivia Riccomi, Michele Ferramola, Federico Felizzi, Alessandro Tosi, Antonio Cristiano, Lorenzo De Mori, Chiara Battipaglia, Melissa Sawaya, Luigi De Angelis, Marcello Di Pumpo, Alessandra Piscitelli, Pietro Eri ·

    EuropeMedQA Study Protocol: A Multilingual, Multimodal Medical Examination Dataset for Language Model Evaluation

    arXiv:2604.14306v2 Announce Type: replace Abstract: While Large Language Models (LLMs) have demonstrated high proficiency on English-centric medical examinations, their performance often declines when faced with non-English languages and multimodal diagnostic tasks. This study pr…

  20. arXiv cs.LG TIER_1 · Hojjat Karami, David Atienza, Jean-Philippe Thiran, Anisoara Ionescu ·

    FeatEHR-LLM: Leveraging Large Language Models for Feature Engineering in Electronic Health Records

    arXiv:2604.22534v1 Announce Type: new Abstract: Feature engineering for Electronic Health Records (EHR) is complicated by irregular observation intervals, variable measurement frequencies, and structural sparsity inherent to clinical time series. Existing automated methods either…

  21. arXiv cs.AI TIER_1 · Anisoara Ionescu ·

    FeatEHR-LLM: Leveraging Large Language Models for Feature Engineering in Electronic Health Records

    Feature engineering for Electronic Health Records (EHR) is complicated by irregular observation intervals, variable measurement frequencies, and structural sparsity inherent to clinical time series. Existing automated methods either lack clinical domain awareness or assume clean,…

  22. Databricks Blog TIER_1 ·

    From months to minutes: Building real-time clinical data pipelines with natural language

    This post was co-written by Assunta Carey-Saylor (Senior Product Marketing at Redox)...