Why Global LLM Leaderboards Are Misleading: Small Portfolios for Heterogeneous Supervised ML

Study finds global LLM leaderboards misleading, proposes portfolio-based ranking

By the PulseAugur editorial team · [3 sources] · 2026-05-07 17:57

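A new research paper argues that current large language model (LLM) leaderboards are misleading because user preferences vary substantially across languages and tasks. The study analyzes about 89,000 comparisons across 52 LLMs on Arena and finds that a global ranking often masks subgroups of users with distinct opinions. To address this, the researchers propose a $(\lambda, \nu)$-portfolio framework: a small set of models designed to cover a specified fraction of user preferences with bounded prediction error.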
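According to the source abstract, the global ranking in question is a best-fit Bradley-Terry (BT) ranking. The sketch below is a toy illustration, not the paper's method or data: it fits BT strengths with the standard minorize-maximize updates on made-up win counts for three hypothetical models (`A`, `B`, `C`) and two hypothetical user groups, to show how a pooled ranking can differ from every subgroup's ranking. All names and numbers are assumptions chosen for illustration.

```python
def fit_bradley_terry(wins, models, iters=500):
    """Fit Bradley-Terry strengths with the standard MM updates.

    wins[(i, j)] is the number of times model i beat model j.
    Assumes every model has at least one win and one loss.
    """
    p = {m: 1.0 for m in models}
    for _ in range(iters):
        new_p = {}
        for i in models:
            total_wins = sum(wins.get((i, j), 0) for j in models if j != i)
            denom = sum(
                (wins.get((i, j), 0) + wins.get((j, i), 0)) / (p[i] + p[j])
                for j in models
                if j != i
            )
            new_p[i] = total_wins / denom
        norm = sum(new_p.values())
        p = {m: v / norm for m, v in new_p.items()}
    return p


def ranking(strengths):
    """Return model names sorted from strongest to weakest."""
    return sorted(strengths, key=strengths.get, reverse=True)


def pool(*groups):
    """Sum win counts across groups, as a single global fit would."""
    pooled = {}
    for group in groups:
        for pair, count in group.items():
            pooled[pair] = pooled.get(pair, 0) + count
    return pooled


MODELS = ["A", "B", "C"]

# Hypothetical win counts: group 1 strongly prefers A, group 2 prefers B.
group_1 = {("A", "B"): 80, ("B", "A"): 20, ("A", "C"): 70,
           ("C", "A"): 30, ("B", "C"): 40, ("C", "B"): 60}
group_2 = {("A", "B"): 10, ("B", "A"): 90, ("A", "C"): 40,
           ("C", "A"): 60, ("B", "C"): 70, ("C", "B"): 30}

rank_1 = ranking(fit_bradley_terry(group_1, MODELS))
rank_2 = ranking(fit_bradley_terry(group_2, MODELS))
rank_pooled = ranking(fit_bradley_terry(pool(group_1, group_2), MODELS))

print("group 1:", rank_1)       # ['A', 'C', 'B']
print("group 2:", rank_2)       # ['B', 'C', 'A']
print("pooled: ", rank_pooled)  # ['B', 'A', 'C']
```

In this toy data the pooled ranking puts `B` first and `A` second, even though group 1 ranks `B` last and group 2 ranks `A` last. A small set such as `{A, B}` would serve both groups better than any single global winner, which is the intuition behind reporting a portfolio of models rather than one leaderboard.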

Impact: Challenges the validity of current LLM evaluation metrics and proposes a more nuanced approach to comparing models.

Ranking rationale: Academic paper that analyzes LLM leaderboards and proposes a new framework.


AI-generated summary · Google Gemini · from 3 sources.

Sources [3]

arXiv cs.LG TIER_1 English(EN) · Jai Moondra, Ayela Chughtai, Bhargavi Lanka, Swati Gupta · 2026-05-08 04:00

Why Global LLM Leaderboards Are Misleading: Small Portfolios for Heterogeneous Supervised ML

arXiv:2605.06656v1 Announce Type: new Abstract: Ranking LLMs via pairwise human feedback underpins current leaderboards for open-ended tasks, such as creative writing and problem-solving. We analyze ~89K comparisons in 116 languages from 52 LLMs from Arena, and show that the best…
arXiv cs.LG TIER_1 English(EN) · Swati Gupta · 2026-05-07 17:57

Why Global LLM Leaderboards Are Misleading: Small Portfolios for Heterogeneous Supervised ML

Ranking LLMs via pairwise human feedback underpins current leaderboards for open-ended tasks, such as creative writing and problem-solving. We analyze ~89K comparisons in 116 languages from 52 LLMs from Arena, and show that the best-fit global Bradley-Terry (BT) ranking is mislea…
Hugging Face Daily Papers TIER_1 English(EN) · 2026-05-07 17:57

Why Global LLM Leaderboards Are Misleading: Small Portfolios for Heterogeneous Supervised ML

Ranking LLMs via pairwise human feedback underpins current leaderboards for open-ended tasks, such as creative writing and problem-solving. We analyze ~89K comparisons in 116 languages from 52 LLMs from Arena, and show that the best-fit global Bradley-Terry (BT) ranking is mislea…
