Apple research: LLM judges suffer from correlated errors, reducing evaluation effectiveness

By PulseAugur Editorial · [1 source] · 2026-06-23 00:00

A new paper from Apple Machine Learning Research reports that panels of multiple Large Language Models (LLMs) used as evaluation judges are less effective than expected because the judges' errors are correlated. The study found that a panel of nine LLM judges carried the informational equivalent of only two independent votes, because the models tended to make the same mistakes on the same items. This leaves the panel markedly less accurate than independent voting would make it, and in some cases a single strong judge outperformed the entire panel.
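To make the "nine judges, two effective votes" idea concrete, here is a minimal Python sketch. It is not the paper's framework, which the summary does not describe. It assumes the textbook design-effect formula for equally correlated votes, n_eff = n / (1 + (n - 1) * rho), and a toy mixture model in which, with probability rho, every judge copies one shared draw. The function names, the per-judge accuracy p = 0.7, and the correlation rho = 0.4375 are all illustrative assumptions; rho is simply the value that makes nine votes count as two under this formula, not a figure reported by the study.

```python
from math import comb


def effective_votes(n: int, rho: float) -> float:
    """Effective number of independent votes for n equally correlated judges.

    Assumption: the standard design-effect formula, not the paper's own metric.
    """
    return n / (1 + (n - 1) * rho)


def majority_accuracy_independent(n: int, p: float) -> float:
    """Probability that a strict majority of n independent judges is correct.

    Each judge is correct with probability p; n is assumed odd (no ties).
    """
    need = n // 2 + 1
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(need, n + 1))


def majority_accuracy_correlated(n: int, p: float, rho: float) -> float:
    """Panel accuracy under a toy mixture model (hypothetical).

    With probability rho all judges share a single draw (accuracy p);
    otherwise they vote independently. Under this model the pairwise
    correlation between judges' correctness equals rho.
    """
    return rho * p + (1 - rho) * majority_accuracy_independent(n, p)


if __name__ == "__main__":
    n, p, rho = 9, 0.7, 0.4375  # illustrative values, not from the paper
    print("effective votes:", effective_votes(n, rho))
    print("independent panel accuracy:", majority_accuracy_independent(n, p))
    print("correlated panel accuracy:", majority_accuracy_correlated(n, p, rho))
```

In this toy model every judge has the same accuracy, so the correlated panel sits between a single judge and the independent panel. Reproducing the reported case where one strong judge beats the whole panel would require judges of unequal accuracy, which this sketch does not model.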

IMPACT Highlights a critical flaw in current LLM evaluation practices, suggesting a need for more diverse and independent evaluation methods.

RANK_REASON The cluster contains a research paper detailing a new methodology and findings in LLM evaluation.



AI-generated summary · Google Gemini · from 1 source.


COVERAGE [1]

Apple Machine Learning Research TIER_1 English(EN) · 2026-06-23 00:00

Nine Judges, Two Effective Votes: Correlated Errors Undermine LLM Evaluation Panels

LLM-as-a-judge panels aggregate votes from multiple models, with the expectation that diverse models yield more reliable evaluations. We develop a framework to measure the true informational value of such panels and quantify how far their reliability falls short of the independen…

