PulseAugur

Developer finds LLM-as-a-Judge systems are unreliable and biased

A developer built an LLM-based grader, following the approach known as "LLM-as-a-Judge," to score responses from other language models, then checked it against human preferences using data from the LMSYS Chatbot Arena. The experiment exposed two failures. First, the judge's scores were unstable and confined to a narrow range, rarely deviating from 7 or 8, and therefore offered little resolution. Second, the judge agreed with human preferences only 43% of the time when ties were counted as misses, indicating that it often failed to distinguish correct from incorrect answers; it sometimes even favored confident but wrong responses.
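To make the two measurements concrete, here is a minimal Python sketch of how a ties-as-misses agreement rate and a score spread could be computed. It is not the developer's code: the function names (`judge_pick`, `agreement_rate`, `score_spread`), the toy data, and the assumption that the judge scores each response independently and that equal scores count as a tie are all hypothetical, chosen only for illustration.

```python
from statistics import pstdev


def judge_pick(score_a, score_b):
    """Turn two judge scores into a preference: "A", "B", or "tie"."""
    if score_a > score_b:
        return "A"
    if score_b > score_a:
        return "B"
    return "tie"


def agreement_rate(score_pairs, human_picks):
    """Fraction of comparisons where the judge's pick equals the human pick.

    Assumption: human_picks contains only "A" or "B", so a judge "tie"
    can never match and is counted as a miss.
    """
    if len(score_pairs) != len(human_picks):
        raise ValueError("score_pairs and human_picks must have the same length")
    if not human_picks:
        raise ValueError("need at least one comparison")
    hits = sum(judge_pick(a, b) == h for (a, b), h in zip(score_pairs, human_picks))
    return hits / len(human_picks)


def score_spread(scores):
    """Summarize how much of the scale the judge actually uses."""
    return {"min": min(scores), "max": max(scores), "stdev": pstdev(scores)}


if __name__ == "__main__":
    # Toy data (invented): judge scores for (response A, response B).
    pairs = [(8, 7), (7, 7), (7, 8), (8, 8), (8, 7)]
    humans = ["A", "A", "A", "B", "B"]
    print(agreement_rate(pairs, humans))
    print(score_spread([s for pair in pairs for s in pair]))
```

A judge that only ever answers 7 or 8 produces many ties under this scheme, which is one way a narrow score range can drag the agreement rate down.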

IMPACT Highlights potential unreliability and bias in automated LLM evaluation, suggesting caution for developers relying on such systems.

RANK_REASON Developer's personal experiment and analysis of LLM-as-a-Judge systems.

AI-generated summary · Google Gemini · from 1 source.

COVERAGE [1]

  1. dev.to — LLM tag TIER_1 English(EN) · Suman Nath ·

    LLM-as-a-Judge: I Built One From Scratch, Then Checked It Against Humans

    In Part 1 (https://dev.to/sumanpro/96-accuracy-was-a-lie-building-an-llm-eval-harness-from-scratch-idi) the model's job was to pick one of 77 labels, so I could check it with ==. But most real LLM output isn't like that: it's a paragraph, a summary, a…