LLM judges show 18% position bias; dual-pass scoring cuts error rate

By PulseAugur Editorial · [1 source] · 2026-06-25 06:31

A study by Nexus Labs found that large language models (LLMs) used as judges exhibit significant position bias: swapping the order of the two answers alone flipped the verdict in 18% of pairwise comparisons, with the judge tending to favor whichever answer it read first. The bias was observed across models such as GPT-4o and Claude 3.5 Sonnet, and was more pronounced in smaller models. To mitigate it, Nexus Labs adopted dual-pass scoring, in which each pair of responses is evaluated in both orders and only unanimous verdicts are counted, reducing the flip rate to under 4%.
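
The dual-pass idea described above can be sketched as follows. This is a minimal illustration, not the Nexus Labs harness: the `Judge` signature, the `"first"`/`"second"` return values, and the two toy judges are assumptions made up for the example.

```python
from typing import Callable, Optional

# Hypothetical judge interface (an assumption, not from the source):
# takes (prompt, first_answer, second_answer) and returns "first" or
# "second" for whichever displayed answer it prefers.
Judge = Callable[[str, str, str], str]


def dual_pass_verdict(
    judge: Judge, prompt: str, answer_a: str, answer_b: str
) -> Optional[str]:
    """Return "A" or "B" only if both orderings agree, else None."""
    # Pass 1: A is displayed first, so "first" means A.
    pass_1 = "A" if judge(prompt, answer_a, answer_b) == "first" else "B"
    # Pass 2: B is displayed first, so "first" now means B.
    pass_2 = "B" if judge(prompt, answer_b, answer_a) == "first" else "A"
    # Keep the verdict only when it is unanimous across both orders.
    return pass_1 if pass_1 == pass_2 else None


# Toy judges standing in for real model calls (illustrative only).
def always_first(prompt: str, first: str, second: str) -> str:
    # Fully position-biased: always prefers whatever it reads first.
    return "first"


def longer_wins(prompt: str, first: str, second: str) -> str:
    # Order-independent whenever the two lengths differ.
    return "first" if len(first) >= len(second) else "second"


biased = dual_pass_verdict(always_first, "q", "x", "yy")
stable = dual_pass_verdict(longer_wins, "q", "x", "yy")
```

A fully position-biased judge never produces a unanimous verdict, so its pairs are discarded, while an order-independent judge returns the same winner in both passes. The cost of this design is two judge calls per pair plus some share of pairs left without a verdict.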

IMPACT Highlights a critical flaw in LLM evaluation that could skew benchmark results and impact model development.

RANK_REASON The item details a research finding about LLM evaluation methodology.

AI-generated summary · Google Gemini · from 1 source.

COVERAGE [1]

dev.to — LLM tag TIER_1 English(EN) · Marcus Chen · 2026-06-25 06:31

Position bias in LLM-as-judge flipped 18% of our verdicts

TL;DR: Position bias in LLM-as-judge means the model favors whichever answer it reads first. We measured an 18% verdict flip rate from swapping order alone, and dual-pass scoring brought it under 4%. Our pairwise evaluation harness at Nexus Labs scored …
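
One way to measure a verdict flip rate of the kind quoted above is to judge every pair in both orders and count how often the winner changes. The sketch below assumes a hypothetical judge callable returning `"first"` or `"second"`; the source excerpt does not show how Nexus Labs actually computed its figure, so the function and the toy data are illustrative only.

```python
from typing import Callable, List, Tuple

# Hypothetical judge interface (an assumption, not from the source).
Judge = Callable[[str, str, str], str]


def flip_rate(judge: Judge, pairs: List[Tuple[str, str, str]]) -> float:
    """Fraction of (prompt, a, b) pairs whose winner changes on reordering."""
    if not pairs:
        return 0.0
    flips = 0
    for prompt, a, b in pairs:
        # Map each positional verdict back to the underlying answer.
        forward = "A" if judge(prompt, a, b) == "first" else "B"
        backward = "B" if judge(prompt, b, a) == "first" else "A"
        if forward != backward:
            flips += 1
    return flips / len(pairs)


def tie_prone_judge(prompt: str, first: str, second: str) -> str:
    # Toy judge: prefers the longer answer, but falls back to the
    # first-displayed answer on ties, which is where position bias shows.
    return "first" if len(first) >= len(second) else "second"


toy_pairs = [
    ("q1", "long answer", "short"),  # clear winner, stable under reordering
    ("q2", "ab", "cd"),              # tie, so the verdict follows position
]
rate = flip_rate(tie_prone_judge, toy_pairs)
```

Here one of the two toy pairs flips, giving a rate of 0.5. With a real judge the same loop would run over the evaluation set, at the cost of two model calls per pair.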
