OpenBioRQ is a new benchmark that evaluates whether AI agents verify their sources and avoid fabricating citations. It consists of 12,553 unsolved biomedical research questions across 12 domains and is designed to test retrieval-grounded reasoning and tool use without relying on answer keys. Initial testing found that current agents rarely fabricate citations, but a significant percentage of their citations link to the wrong paper. Some agents also exhibit "agentic collapse": they stop using tools on more difficult questions. The frontier agents tested ranged from 29% to 60% on the hardest subset of questions.
AI IMPACT: This benchmark could drive improvements in how accurately AI agents retrieve and cite information, which is crucial for reliable research assistance.
Source: Hugging Face Daily Papers.
AI-generated summary · Google Gemini · from 1 source.