Researchers have developed BT-sigma, a method for assessing the reliability of Large Language Models (LLMs) used as judges in comparative evaluations. The approach extends the Bradley-Terry model with a discriminator parameter for each LLM judge, so that item rankings and judge reliability are inferred jointly from pairwise comparisons, even without human supervision. In experiments on benchmark datasets, BT-sigma significantly outperforms traditional averaging methods. Its learned discriminators also correlate well with independent measures of LLM judgment consistency, so they effectively act as an unsupervised calibration mechanism.
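The idea of fitting item skills and per-judge discriminators together can be illustrated with a small sketch. The summary does not state the exact likelihood, so the form used here is an assumption: judge k prefers item i over item j with probability sigmoid(a_k * (s_i - s_j)), where s is the item skill and a_k > 0 is the judge's discriminator. The function names, the synthetic data, and the plain gradient-ascent fit are all hypothetical choices for illustration and are not taken from the paper.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def fit_bt_with_discriminators(comparisons, n_items, n_judges, steps=300, lr=2.0):
    """Jointly fit item skills and judge discriminators (assumed model).

    comparisons: list of (judge, i, j, y), y = 1 if the judge preferred i over j.
    """
    s = [0.0] * n_items    # item skills
    b = [0.0] * n_judges   # log-discriminators, a_k = exp(b_k) stays positive
    n = len(comparisons)
    for _ in range(steps):
        grad_s = [0.0] * n_items
        grad_b = [0.0] * n_judges
        for k, i, j, y in comparisons:
            a = math.exp(b[k])
            d = s[i] - s[j]
            g = y - sigmoid(a * d)   # derivative of the log-likelihood w.r.t. a*d
            grad_s[i] += a * g
            grad_s[j] -= a * g
            grad_b[k] += a * d * g
        s = [v + lr * gv / n for v, gv in zip(s, grad_s)]
        b = [v + lr * gv / n for v, gv in zip(b, grad_b)]
        # Remove the scale ambiguity (geometric mean of a_k fixed to 1)
        # and the shift ambiguity (skills centred at 0).
        m = sum(b) / n_judges
        b = [v - m for v in b]
        s = [v * math.exp(m) for v in s]
        mean_s = sum(s) / n_items
        s = [v - mean_s for v in s]
    return s, [math.exp(v) for v in b]


def simulate(s_true, a_true, n_per_judge, seed=0):
    # Synthetic pairwise judgments drawn from the assumed model.
    rng = random.Random(seed)
    data = []
    for k, a in enumerate(a_true):
        for _ in range(n_per_judge):
            i, j = rng.sample(range(len(s_true)), 2)
            y = 1 if rng.random() < sigmoid(a * (s_true[i] - s_true[j])) else 0
            data.append((k, i, j, y))
    return data


s_true = [-2.0, -1.2, -0.4, 0.4, 1.2, 2.0]
a_true = [3.0, 1.0, 0.2]   # a reliable, an average, and a noisy judge
data = simulate(s_true, a_true, n_per_judge=1000)
s_hat, a_hat = fit_bt_with_discriminators(data, len(s_true), len(a_true))
ranking = sorted(range(len(s_hat)), key=lambda i: s_hat[i])
print("ranking (worst to best):", ranking)
print("discriminators:", [round(a, 2) for a in a_hat])
```

In this sketch a judge whose choices track the consensus ranking closely receives a larger discriminator, and its comparisons then weigh more heavily in the skill estimates. That is the sense in which such a parameter could serve as an unsupervised reliability signal, in contrast to averaging all judges equally.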
AI-generated summary · Google Gemini · from 1 source.