PulseAugur

MDIA agent achieves high scores on HealthBench Professional benchmark

Researchers have developed MDIA, a Multi-agent Diagnostic Intelligence Agent that uses a 7-node clinical reasoning graph, and report strong performance on the HealthBench Professional benchmark. When evaluated using OpenAI's GPT-5.4-2026-03-05, MDIA scored 0.6272, surpassing ChatGPT for Clinicians by 3.72 percentage points. The study indicates that agentic performance depends significantly on architectural design, including specialty routing and context preservation, and not on prompt engineering alone. The choice of grading model also introduces variability: MDIA scored 0.6585 when graded by Gemini 2.5 Pro, which highlights the need for multi-grader evaluations.
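The figures in the summary can be cross-checked with a few lines of arithmetic. The sketch below is illustrative only: it assumes the scores are on a 0 to 1 scale and that one percentage point equals 0.01 on that scale. The function names and the derived baseline value are not taken from the paper; they are computed here from the numbers quoted above.

```python
# Scores quoted in the summary (a 0-1 scale is an assumption).
MDIA_SCORE_GPT_GRADER = 0.6272      # evaluated with GPT-5.4-2026-03-05
MDIA_SCORE_GEMINI_GRADER = 0.6585   # graded by Gemini 2.5 Pro
MARGIN_OVER_BASELINE_PP = 3.72      # lead over ChatGPT for Clinicians


def implied_baseline(score: float, margin_pp: float) -> float:
    """Back out the comparison score implied by a percentage-point margin."""
    return score - margin_pp / 100.0


def gap_in_points(score_a: float, score_b: float) -> float:
    """Difference between two scores, expressed in percentage points."""
    return (score_b - score_a) * 100.0


# Hypothetical derived values, rounded to absorb floating-point noise.
baseline = round(implied_baseline(MDIA_SCORE_GPT_GRADER, MARGIN_OVER_BASELINE_PP), 4)
grader_gap = round(gap_in_points(MDIA_SCORE_GPT_GRADER, MDIA_SCORE_GEMINI_GRADER), 2)

print(f"Implied ChatGPT for Clinicians score: {baseline:.4f}")
print(f"Shift from changing the grader: {grader_gap:.2f} points")
```

Under these assumptions, the shift caused by swapping the grader is of the same order as the reported lead over the baseline, which is the reason the summary calls for multi-grader evaluation.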

IMPACT Demonstrates architectural improvements in AI agents can significantly boost performance on clinical benchmarks, suggesting a path beyond prompt engineering.

RANK_REASON Academic paper detailing a new AI system and its performance on a benchmark.

AI-generated summary · Google Gemini · from 3 sources.

COVERAGE [3]

  1. arXiv cs.AI TIER_1 English(EN) · Raffael Theiler, Ludovico Comito, David Leko, Leandro Von Krannichfeldt, Lev Telyatnikov, Olga Fink

    From paper to benchmark: agentic, framework-based reproduction of under-specified methods in machine health intelligence

    arXiv:2605.28371v1. Abstract: Industrial Prognostics and Health Management (PHM) provides a representative case study for a broader challenge in applied machine learning: translating published papers into executable, benchmark-ready implementations. Reproducing …

  2. arXiv cs.LG TIER_1 English(EN) · Olga Fink

    From paper to benchmark: agentic, framework-based reproduction of under-specified methods in machine health intelligence

    Industrial Prognostics and Health Management (PHM) provides a representative case study for a broader challenge in applied machine learning: translating published papers into executable, benchmark-ready implementations. Reproducing under-specified methods in PHM is particularly d…

  3. arXiv cs.AI TIER_1 English(EN) · Roberto Cruz, David Rey-Blanco

    MDIA: A Multi-Agent Diagnostic Intelligence Pipeline on HealthBench Professional

    arXiv:2605.24699v1. Abstract: Most reported gains on agentic-LLM clinical benchmarks are often attributed to prompt engineering, yet our results suggest that larger improvements can come from architectural and engine-level design. We present MDIA, a Multi-agent …