Researchers have introduced GR-Ben, a new benchmark designed to evaluate the error-detection capabilities of process reward models (PRMs) across reasoning tasks beyond mathematics. The benchmark covers scientific and logical reasoning domains, addressing the limitation that existing PRMs focus primarily on mathematical errors. Experiments with 22 models revealed that current PRMs and large language models (LLMs) are significantly weaker at detecting errors in non-mathematical domains: PRMs struggle with knowledge-based errors, while LLMs struggle with computational ones.
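As a rough illustration of the task GR-Ben targets, the sketch below shows how step-level error detection is typically evaluated: a PRM scores each step of a reasoning trace, and the judgment counts as correct if the first step it flags matches the ground-truth first error. All names here (score_step, first_error_index, the sample trace) are hypothetical stand-ins; the summary does not specify GR-Ben's actual data format or scoring interface.

```python
# Hypothetical sketch of step-level error detection, the kind of task GR-Ben evaluates.
from typing import List, Tuple

# A reasoning trace: each step paired with a ground-truth label
# (True = the step is correct, False = it contains an error).
REASONING_TRACE: List[Tuple[str, bool]] = [
    ("Water boils at 100 degrees C at sea level.", True),
    ("Air pressure increases with altitude.", False),  # knowledge-based error
    ("So water boils at a higher temperature on a mountain.", False),
]

def score_step(step: str) -> float:
    """Stand-in for a PRM: returns a correctness score in [0, 1].
    A real PRM would be a trained model scoring each step in context."""
    return 0.5  # placeholder

def first_error_index(trace: List[Tuple[str, bool]], threshold: float = 0.5) -> int:
    """Index of the first step the PRM flags as erroneous (score below
    threshold), or -1 if no step is flagged."""
    for i, (step, _) in enumerate(trace):
        if score_step(step) < threshold:
            return i
    return -1

def judged_correctly(trace: List[Tuple[str, bool]]) -> bool:
    """A trace is judged correctly if the PRM's first flagged step matches
    the ground-truth first error (or both report no error)."""
    true_first = next((i for i, (_, ok) in enumerate(trace) if not ok), -1)
    return first_error_index(trace) == true_first

print("trace judged correctly:", judged_correctly(REASONING_TRACE))
```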
Summary written by gemini-2.5-flash-lite from 1 source.
IMPACT GR-Ben aims to drive improvements in the general reasoning and error-detection capabilities of LLMs and PRMs across diverse domains.
RANK_REASON This is a research paper introducing a new benchmark for evaluating AI models.