PulseAugur

Study finds global LLM leaderboards misleading, proposes portfolio rankings

A new research paper argues that global leaderboards for large language models (LLMs) are misleading because user preferences vary substantially across languages and tasks. The study analyzes roughly 89,000 pairwise comparisons in 116 languages among 52 LLMs from Arena and finds that a single global ranking obscures distinct subpopulations of users with differing preferences. To address this, the researchers propose $(\lambda, \nu)$-portfolios: small sets of models designed to cover a specified fraction of user preferences with bounded prediction error.
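To make the heterogeneity argument concrete, here is a minimal sketch of how a pooled Bradley-Terry (BT) fit can hide a subgroup that prefers a different model. It is not the paper's method or data: the win counts, the group names, and the helpers `fit_bradley_terry` and `ranking` are all invented for illustration, and the fit uses the standard MM (Zermelo) update for BT strengths.

```python
from collections import Counter


def fit_bradley_terry(wins, iters=200):
    """Fit Bradley-Terry strengths with the classic MM (Zermelo) update.

    wins[(i, j)] is the number of times model i was preferred over model j.
    Returns a dict of strengths normalized to sum to 1.
    Assumes every model has at least one win and one loss (true for the toy data).
    """
    models = sorted({m for pair in wins for m in pair})
    p = {m: 1.0 / len(models) for m in models}
    for _ in range(iters):
        updated = {}
        for i in models:
            total_wins = sum(c for (winner, _), c in wins.items() if winner == i)
            denom = 0.0
            for j in models:
                if j != i:
                    n_ij = wins.get((i, j), 0) + wins.get((j, i), 0)
                    denom += n_ij / (p[i] + p[j])
            updated[i] = total_wins / denom
        norm = sum(updated.values())
        p = {m: v / norm for m, v in updated.items()}
    return p


def ranking(strengths):
    """Models sorted from strongest to weakest."""
    return sorted(strengths, key=strengths.get, reverse=True)


# Hypothetical win counts: a large user group that prefers model A ...
group_major = {("A", "B"): 80, ("B", "A"): 20,
               ("A", "C"): 70, ("C", "A"): 30,
               ("B", "C"): 60, ("C", "B"): 40}
# ... and a smaller group that prefers model B.
group_minor = {("B", "A"): 30, ("A", "B"): 10,
               ("B", "C"): 30, ("C", "B"): 10,
               ("A", "C"): 20, ("C", "A"): 20}

# Pooling simply adds the counts, as a single global leaderboard would.
pooled = Counter(group_major) + Counter(group_minor)

print("pooled top model:     ", ranking(fit_bradley_terry(pooled))[0])
print("minor-group top model:", ranking(fit_bradley_terry(group_minor))[0])
```

In this toy setup the pooled fit and the smaller group disagree about the top model, which is the kind of mismatch a small portfolio of models, rather than a single ranking, is meant to address.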

Impact: Challenges the validity of current LLM evaluation metrics and suggests a more nuanced approach to model comparison.

Ranking rationale: Academic paper analyzing LLM leaderboards and proposing a new framework.


AI-generated summary · Google Gemini · from 3 sources.


Sources [3]

  1. arXiv cs.LG TIER_1 English(EN) · Jai Moondra, Ayela Chughtai, Bhargavi Lanka, Swati Gupta ·

    Why Global LLM Leaderboards Are Misleading: Small Portfolios for Heterogeneous Supervised ML

    arXiv:2605.06656v1. Ranking LLMs via pairwise human feedback underpins current leaderboards for open-ended tasks, such as creative writing and problem-solving. We analyze ~89K comparisons in 116 languages from 52 LLMs from Arena, and show that the best…

  2. arXiv cs.LG TIER_1 English(EN) · Swati Gupta ·

    Why Global LLM Leaderboards Are Misleading: Small Portfolios for Heterogeneous Supervised ML

    Ranking LLMs via pairwise human feedback underpins current leaderboards for open-ended tasks, such as creative writing and problem-solving. We analyze ~89K comparisons in 116 languages from 52 LLMs from Arena, and show that the best-fit global Bradley-Terry (BT) ranking is mislea…

  3. Hugging Face Daily Papers TIER_1 English(EN) ·

    Why Global LLM Leaderboards Are Misleading: Small Portfolios for Heterogeneous Supervised ML

    Ranking LLMs via pairwise human feedback underpins current leaderboards for open-ended tasks, such as creative writing and problem-solving. We analyze ~89K comparisons in 116 languages from 52 LLMs from Arena, and show that the best-fit global Bradley-Terry (BT) ranking is mislea…