New research reveals ML benchmarks are vulnerable to manipulation

By PulseAugur Editorial · [3 sources] · 2026-05-22 13:40

Researchers have analyzed how susceptible machine learning benchmarks are to manipulation, treating datasets as voters and models as candidates. They found that strategically including benchmark data in a model's training set to achieve a top leaderboard rank is an NP-hard problem, akin to election bribery. The study introduces 'instance-level robustness' to quantify the minimum number of datasets needed for manipulation and evaluates it on the MMLU and BIG-Bench Hard leaderboards.
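To make the voting analogy concrete, here is a minimal sketch of how such a robustness number could be computed on a toy leaderboard. It is not the paper's method: the Borda-style aggregation, the assumption that contaminating a dataset moves the target model to first place on it, and all names (`borda_scores`, `promote`, `min_datasets_to_rig`, the toy data) are illustrative assumptions. The search is brute force over dataset subsets, which is only feasible for tiny examples.

```python
from itertools import combinations


def borda_scores(rankings, models):
    """Aggregate per-dataset rankings with a Borda count (an assumed rule).

    rankings maps each dataset name to a list of models, best first.
    """
    n = len(models)
    scores = {m: 0 for m in models}
    for order in rankings.values():
        for position, model in enumerate(order):
            scores[model] += n - 1 - position
    return scores


def is_unique_winner(rankings, models, target):
    """True if target has strictly the highest aggregate score."""
    scores = borda_scores(rankings, models)
    return all(scores[target] > s for m, s in scores.items() if m != target)


def promote(order, target):
    """Assumed effect of contamination: target moves to first place."""
    return [target] + [m for m in order if m != target]


def min_datasets_to_rig(rankings, models, target):
    """Smallest number of datasets to manipulate so target wins outright.

    Tries every subset in order of increasing size, so the running time
    grows exponentially with the number of datasets.
    """
    datasets = list(rankings)
    for k in range(len(datasets) + 1):
        for subset in combinations(datasets, k):
            rigged = {
                d: promote(order, target) if d in subset else order
                for d, order in rankings.items()
            }
            if is_unique_winner(rigged, models, target):
                return k
    return None  # no subset works, e.g. an empty benchmark


# Hypothetical toy leaderboard: four datasets, three models.
toy_models = ["A", "B", "C"]
toy_rankings = {
    "d1": ["A", "B", "C"],
    "d2": ["A", "C", "B"],
    "d3": ["B", "A", "C"],
    "d4": ["A", "B", "C"],
}

for model in toy_models:
    print(model, min_datasets_to_rig(toy_rankings, toy_models, model))
```

On this toy data the current leader A needs no manipulation, B needs two datasets, and the last-place model C needs three. A larger number means the leaderboard is harder to rig in favour of that model under these assumptions.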

IMPACT Highlights potential for manipulation in ML leaderboards, urging caution in interpreting benchmark results.

RANK_REASON The cluster contains an academic paper analyzing machine learning benchmarks.

paper
safety

AI-generated summary · Google Gemini · from 3 sources.

COVERAGE [3]

arXiv cs.AI TIER_1 English(EN) · Adil Amin · 2026-05-26 04:00

The Growing Pains of Frontier Models: When Leaderboards Stop Separating and What to Measure Next

arXiv:2605.18840v2 Announce Type: replace-cross Abstract: Leaderboards rank frontier models on independent axes but do not reveal whether capabilities reinforce or trade off across releases -- and at the frontier, this interaction is the more informative signal. We decompose pair…

arXiv cs.LG TIER_1 English(EN) · Polina Gordienko, Georg Schollmeyer, Frauke Kreuter, Christoph Jansen · 2026-05-25 04:00

How Hard is it to Rig a Benchmark? A Social Choice Analysis of Leaderboard Robustness

arXiv:2605.23628v1 Announce Type: new Abstract: Multi-task benchmarks have become a central pillar of machine learning research, yet their growing influence has incentivised benchmark gaming -- strategic actions taken to improve the leaderboard rank of a specific model. Treating …

arXiv cs.LG TIER_1 English(EN) · Christoph Jansen · 2026-05-22 13:40

How Hard is it to Rig a Benchmark? A Social Choice Analysis of Leaderboard Robustness

Multi-task benchmarks have become a central pillar of machine learning research, yet their growing influence has incentivised benchmark gaming -- strategic actions taken to improve the leaderboard rank of a specific model. Treating datasets as voters and models as candidates, we …
