Hardening Agent Benchmarks with Adversarial Hacker-Fixer Loops
Researchers have developed a "hacker-fixer loop" to make AI agent benchmarks more robust against reward hacking. In this adversarial process, three LLM agents iteratively find and patch vulnerabilities in benchmark verifiers, so that agents cannot achieve high scores without genuinely solving the tasks. The method significantly reduced hack success rates and even allowed weaker agents to defend against stronger ones. The researchers have also released a new dataset and tools for future research.
IMPACT: Enhances the reliability of AI agent evaluations, which is crucial for advancing research and development in multi-agent systems.
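
To make the idea of a hacker-fixer loop concrete, here is a minimal toy sketch in Python. It is not the researchers' implementation: the summary does not describe the roles of the three LLM agents, so this sketch models only two roles (a hacker and a fixer) and uses plain functions in place of LLMs. The task (sorting a list), the exploit list, the candidate patches, and all names such as `hacker`, `fixer`, and `harden` are hypothetical, chosen for illustration only.

```python
from collections import Counter
from typing import Callable, List, Optional, Tuple

# Hypothetical toy task: the agent must return TASK_INPUT in sorted order.
TASK_INPUT = [3, 1, 2]
Check = Callable[[List[int]], bool]


def is_sorted(out: List[int]) -> bool:
    return all(a <= b for a, b in zip(out, out[1:]))


def same_length(out: List[int]) -> bool:
    return len(out) == len(TASK_INPUT)


def same_items(out: List[int]) -> bool:
    return Counter(out) == Counter(TASK_INPUT)


def verifier_accepts(checks: List[Check], out: List[int]) -> bool:
    """The verifier passes a submission only if every check passes."""
    return all(check(out) for check in checks)


# Assumed stand-ins for what LLM agents would propose.
EXPLOITS: List[List[int]] = [[], [0, 0, 0], [1, 1, 1]]  # none is a real solution
PATCHES: List[Check] = [same_length, same_items]


def hacker(checks: List[Check]) -> Optional[List[int]]:
    """Return the first bogus submission the verifier wrongly accepts."""
    for out in EXPLOITS:
        if verifier_accepts(checks, out):
            return out
    return None


def fixer(checks: List[Check], exploit: List[int]) -> Optional[Check]:
    """Return an unused check that rejects the exploit, if one exists."""
    for patch in PATCHES:
        if patch not in checks and not patch(exploit):
            return patch
    return None


def harden(checks: List[Check], max_rounds: int = 5) -> Tuple[List[Check], List[List[int]]]:
    """Alternate hacking and fixing until no exploit succeeds."""
    found: List[List[int]] = []
    for _ in range(max_rounds):
        exploit = hacker(checks)
        if exploit is None:
            break  # verifier resists every known exploit
        found.append(exploit)
        patch = fixer(checks, exploit)
        if patch is None:
            break  # no available patch closes this hole
        checks = checks + [patch]
    return checks, found


# Start from a weak verifier that only checks ordering.
hardened, found = harden([is_sorted])
print("exploits found:", found)
print("checks after hardening:", [c.__name__ for c in hardened])
```

In this toy setting the loop stops once the hacker can no longer find a bogus submission that passes, while the genuine solution `sorted(TASK_INPUT)` is still accepted. A real system would replace the fixed exploit and patch lists with LLM-generated candidates.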