A new study reveals that computer science researchers hold a paradoxical view of Large Language Model (LLM) leaderboards. Despite widespread distrust of their reliability and robustness, researchers continue to use these leaderboards as informal guides for decision-making. Peer networks, rather than leaderboards, are the primary mechanism for selecting models, and human-voting leaderboards are favored over static benchmark ones. The influence of leaderboards also varies significantly by subfield: Natural Language Processing researchers feel more pressure to compare against state-of-the-art models than those in HCI or Systems/Privacy. A key missing feature identified by most researchers is cost transparency.
AI IMPACT Highlights how AI evaluation tools influence research practices, suggesting a need for more transparent and practical metrics.
RANK_REASON The cluster contains an academic paper discussing research methodology and findings.
- computer science
- HCI
- Large Language Model
- LLM leaderboards
- NLP researchers
- Systems/Privacy researchers
AI-generated summary · Google Gemini · from 1 source.