Brief · PulseAugur

TOOL · arXiv cs.CL English(EN) · 1w

Time to REFLECT: Can We Trust LLM Judges for Evidence-based Research Agents?

Researchers have introduced REFLECT, a benchmark for testing how reliable Large Language Models (LLMs) are when used as judges of deep research agents. These agents automate complex information-seeking tasks, and evaluating their outputs at scale often relies on LLM judges to assess accuracy and reasoning quality. The judges themselves prove unreliable: top models achieve less than 55% accuracy when assessing reasoning, tool use, and report quality, and they struggle most with evidence verification. REFLECT provides a detailed taxonomy of failure modes and applies controlled interventions to agent execution traces, creating verifiable instances against which judges can be validated. The findings offer guidance for building more robust evaluation pipelines.
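
The controlled-intervention idea can be illustrated with a small sketch. This is not REFLECT's actual code or taxonomy: the `Step` and `Instance` types, the single "unsupported evidence" fault, and the toy `keyword_judge` are all hypothetical stand-ins. The point it shows is that when a fault is injected deliberately, the correct label is known, so a judge's verdicts can be scored directly.

```python
from dataclasses import dataclass, replace
from typing import Callable, List


@dataclass(frozen=True)
class Step:
    claim: str     # what the agent asserts at this step
    evidence: str  # text the agent cites in support (hypothetical field)


@dataclass(frozen=True)
class Instance:
    trace: List[Step]
    has_fault: bool  # ground truth, known because we control the intervention


def inject_unsupported_evidence(trace: List[Step], index: int, bogus: str) -> Instance:
    """Controlled intervention: swap one step's evidence for unrelated text."""
    steps = list(trace)
    steps[index] = replace(steps[index], evidence=bogus)
    return Instance(trace=steps, has_fault=True)


def judge_accuracy(judge: Callable[[List[Step]], bool], instances: List[Instance]) -> float:
    """Fraction of instances where the judge's verdict matches the known label."""
    correct = sum(judge(inst.trace) == inst.has_fault for inst in instances)
    return correct / len(instances)


def keyword_judge(trace: List[Step]) -> bool:
    """Toy stand-in for an LLM judge: flags a trace when some step's
    evidence shares no word with its claim."""
    for step in trace:
        claim_words = set(step.claim.lower().split())
        evidence_words = set(step.evidence.lower().split())
        if not claim_words & evidence_words:
            return True
    return False


clean_trace = [
    Step("Paris is the capital of France", "France's capital city is Paris"),
    Step("The Seine flows through Paris", "The Seine river crosses Paris"),
]

instances = [
    Instance(trace=clean_trace, has_fault=False),
    inject_unsupported_evidence(clean_trace, 1, "Bananas are rich in potassium"),
]

print(judge_accuracy(keyword_judge, instances))
```

In a real pipeline the `judge` callable would wrap a call to an LLM, and the set of interventions would cover many failure modes rather than one. The scoring step stays the same: compare each verdict with the label fixed by the intervention.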

IMPACT Highlights the unreliability of current LLM judges for evaluating AI agents, necessitating new benchmarks for trustworthy AI development.