How Hard is it to Rig a Benchmark? A Social Choice Analysis of Leaderboard Robustness
Researchers analyzed how susceptible machine learning benchmarks are to manipulation by modeling a leaderboard as an election, with datasets as voters and models as candidates. They found that choosing which benchmark datasets to include in a model's training set so that the model reaches the top leaderboard rank is an NP-hard problem, akin to election bribery. The study introduces 'instance-level robustness' to quantify the minimum number of datasets needed for manipulation and evaluates it on the MMLU and BIG-Bench Hard leaderboards.
AI IMPACT: Highlights the potential for manipulation in ML leaderboards and urges caution when interpreting benchmark results.
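
To make the voting analogy concrete, the sketch below computes a toy version of the "minimum number of datasets needed for manipulation" by brute force. It is only an illustration under stated assumptions, not the study's method: the Borda aggregation rule, the idea that contaminating a dataset moves the target model to first place on that dataset, and the names `borda_scores`, `promote`, and `min_datasets_to_rig` are all hypothetical choices made for this example. The search tries every subset of datasets, which takes exponential time in the worst case and is only practical for tiny inputs.

```python
from itertools import combinations


def borda_scores(rankings):
    """Aggregate per-dataset rankings (best model first) with the Borda rule.

    Assumption: the leaderboard uses Borda counting; a real leaderboard
    may aggregate scores differently.
    """
    scores = {}
    for ranking in rankings:
        m = len(ranking)
        for pos, model in enumerate(ranking):
            scores[model] = scores.get(model, 0) + (m - 1 - pos)
    return scores


def promote(ranking, target):
    """Assumed effect of contamination: the target moves to first place,
    and the other models keep their relative order."""
    return [target] + [model for model in ranking if model != target]


def min_datasets_to_rig(rankings, target):
    """Smallest number of datasets to contaminate so that `target` is the
    unique Borda winner, found by exhaustive search over subsets."""
    n = len(rankings)
    for k in range(n + 1):
        for chosen in combinations(range(n), k):
            rigged = [
                promote(r, target) if i in chosen else r
                for i, r in enumerate(rankings)
            ]
            scores = borda_scores(rigged)
            if all(scores[target] > s for m, s in scores.items() if m != target):
                return k
    return None


# Toy leaderboard: four datasets ("voters") ranking three models ("candidates").
example_rankings = [
    ["A", "B", "C"],
    ["A", "C", "B"],
    ["B", "A", "C"],
    ["A", "B", "C"],
]
print(min_datasets_to_rig(example_rankings, "C"))  # prints 3
```

In this toy leaderboard, model C starts last and needs three of the four datasets contaminated to become the unique winner, while model A already leads and needs none. A larger value means the leaderboard is harder to rig in favor of that model under these assumptions.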