A study by Nexus Labs found that large language models (LLMs) used as judges exhibit significant position bias, favoring the first answer presented in 18% of comparisons. The bias was observed across models such as GPT-4o and Claude 3.5 Sonnet, and smaller models showed a more pronounced effect. To mitigate it, Nexus Labs implemented a dual-pass scoring method: each pair of responses is evaluated in both orders, and only unanimous verdicts are counted, which reduced the flip rate to under 4%.
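The dual-pass idea described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Nexus Labs' implementation: the `Judge` callable, its `"first"`/`"second"` return convention, and the names `dual_pass_verdict` and `disagreement_rate` are all hypothetical. The summary does not define "flip rate", so the second function shows only one plausible reading (the share of pairs whose verdict changes when the order is swapped).

```python
from typing import Callable, Optional, Sequence, Tuple

# Hypothetical judge interface (an assumption, not from the study):
# takes (prompt, first_answer, second_answer) and returns "first" or "second".
Judge = Callable[[str, str, str], str]


def _check(verdict: str) -> str:
    """Reject anything other than the two verdicts this sketch assumes."""
    if verdict not in ("first", "second"):
        raise ValueError(f"unexpected judge verdict: {verdict!r}")
    return verdict


def dual_pass_verdict(
    judge: Judge, prompt: str, answer_a: str, answer_b: str
) -> Optional[str]:
    """Judge the pair in both orders; return "A" or "B" only if both passes agree."""
    forward = _check(judge(prompt, answer_a, answer_b))  # A shown first
    reverse = _check(judge(prompt, answer_b, answer_a))  # B shown first

    # Map positional verdicts back to the underlying answers.
    forward_winner = "A" if forward == "first" else "B"
    reverse_winner = "B" if reverse == "first" else "A"

    # A non-unanimous pair is discarded (None) rather than counted.
    return forward_winner if forward_winner == reverse_winner else None


def disagreement_rate(
    judge: Judge, comparisons: Sequence[Tuple[str, str, str]]
) -> float:
    """Share of (prompt, answer_a, answer_b) triples whose verdict flips with order."""
    if not comparisons:
        return 0.0
    flips = sum(
        1 for prompt, a, b in comparisons
        if dual_pass_verdict(judge, prompt, a, b) is None
    )
    return flips / len(comparisons)
```

A judge that always prefers whichever answer appears first would yield `None` for every pair here, so its verdicts would never be counted. The cost of this design is two judge calls per comparison, plus some discarded pairs.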
IMPACT Highlights a critical flaw in LLM evaluation that could skew benchmark results and impact model development.
RANK_REASON The item details a research finding about LLM evaluation methodology.
AI-generated summary · Google Gemini · from 1 source.