New benchmark OpenBioRQ tests AI agents' ability to verify biomedical research sources

By PulseAugur Editorial · [1 source] · 2026-06-20 00:00

A new benchmark, OpenBioRQ, evaluates whether AI agents can verify sources and avoid fabricating citations. It consists of 12,553 unsolved biomedical research questions across 12 domains and is designed to test retrieval-grounded reasoning and tool use without relying on answer keys. Initial testing found that while current agents rarely fabricate citations, a significant percentage link to incorrect papers, and some agents exhibit 'agentic collapse,' ceasing to use tools on more difficult questions. The frontier agents tested performed in the 29-60% range on the hardest subset of questions.

IMPACT This benchmark could drive improvements in AI's ability to accurately retrieve and cite information, crucial for reliable research assistance.

RANK_REASON The cluster describes a new academic benchmark paper.

paper
safety

AI-generated summary · Google Gemini · from 1 source.

COVERAGE [1]

Hugging Face Daily Papers TIER_1 English(EN) · 2026-06-20 00:00

OpenBioRQ: Unsolved Biomedical Research Questions for Agents

A new biomedical benchmark evaluates agentic models' ability to verify sources and avoid false citations by testing unsolved research questions with no answer keys, revealing significant failures in retrieval-grounded reasoning and tool usage.
