
HuggingFace updates OpenLLM Leaderboard with new benchmarks to better evaluate models

HuggingFace has launched version 2 of its OpenLLM Leaderboard, featuring a revised set of six benchmarks including MMLU-Pro and GPQA. The update was necessitated by existing models reaching performance plateaus on the older benchmarks, which is why absolute scores drop significantly on the new version. The leaderboard aims to provide a standardized and reproducible evaluation of open-source LLM performance, with over 7,500 models already assessed. The episode also discussed the limitations of AI arenas and of using LLMs as judges, highlighting issues like user bias and potential mode collapse in evaluations.

Summary written by gemini-2.5-flash-lite from 1 source.
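The leaderboard's scores are generated with EleutherAI's lm-evaluation-harness, so v2-style numbers can in principle be reproduced locally. Below is a minimal Python sketch, assuming the lm-eval package is installed; the task names (leaderboard_mmlu_pro, leaderboard_gpqa) and the example model are illustrative assumptions, not values taken from the episode:

# Minimal sketch: run leaderboard-style benchmarks locally with
# EleutherAI's lm-evaluation-harness (pip install lm-eval).
# Task names and the example model are assumptions for illustration.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",  # HuggingFace transformers backend
    model_args="pretrained=HuggingFaceH4/zephyr-7b-beta",  # swap in the model to score
    tasks=["leaderboard_mmlu_pro", "leaderboard_gpqa"],    # assumed v2 task names
    batch_size=8,
)

# Each task reports its aggregate metrics in results["results"].
for task, metrics in results["results"].items():
    print(task, metrics)

Running all six v2 benchmarks this way is compute-heavy, which is one reason the hosted leaderboard's standardized, centralized evaluation matters.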

Rank reason: Launch of a new version of an open-source LLM evaluation leaderboard with updated benchmarks.



Coverage [1]

  1. Latent Space Podcast (Tier 1) · Latent.Space

    Benchmarks 201: Why Leaderboards > Arenas >> LLM-as-Judge

    The first AI Engineer World's Fair talks from OpenAI (https://x.com/aiDotEngineer/status/1811093507463074018) and Cognition (https://x.com/aiDotEngineer/status/1811458198920151536) …