PulseAugur

CS Researchers Distrust LLM Leaderboards But Still Use Them

A new study finds that computer science researchers hold a paradoxical view of large language model (LLM) leaderboards. Despite widespread distrust of the leaderboards' reliability and robustness, researchers continue to use them as informal guides for decision-making. Peer networks, rather than leaderboards, are the primary mechanism for selecting models, and human-voting leaderboards are favored over static benchmark ones. The influence of leaderboards also varies significantly by subfield: Natural Language Processing researchers feel more pressure to compare against state-of-the-art models than those in HCI or Systems/Privacy. Most researchers identify cost transparency as a key missing feature.

IMPACT Highlights how AI evaluation tools influence research practices, suggesting a need for more transparent and practical metrics.

RANK_REASON The cluster contains an academic paper discussing research methodology and findings.


AI-generated summary · Google Gemini · from 1 source.


COVERAGE [1]

  1. arXiv cs.CL · TIER_1 · English (EN) · Pouya Sadeghi, Anamaria Crisan, Jimmy Lin

    The Trust Paradox: How CS Researchers Engage LLM Leaderboards

    arXiv:2605.28966v1. Abstract: Large language model (LLM) leaderboards rank AI models using standardized benchmarks and have become highly visible across computer science, despite known limitations in their reliability and robustness. Yet how they shape researche…