LLM-as-a-Judge: I Built One From Scratch, Then Checked It Against Humans

Developer Finds LLM-as-a-Judge System Unreliable and Biased

By PulseAugur Editorial · [1 source] · 2026-06-29 08:05

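A developer built an LLM-based scoring system, an "LLM-as-a-Judge", to evaluate the responses of other language models. The system used data from the LMSYS Chatbot Arena and was tested against human preferences. The experiment revealed two key failures. First, the judge's scores were unstable and confined to a narrow range, rarely moving away from 7 or 8, so they had little power to discriminate between responses. Second, with ties counted as misses, the judge agreed with human preferences only 43% of the time, which suggests it often cannot tell correct answers from incorrect ones and sometimes favors responses that are confident but wrong.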
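The two checks described in the summary can be sketched roughly as follows. This is a minimal illustration, not the author's code: the function names `agreement_rate` and `score_spread`, the `"a"`/`"b"`/`"tie"` verdict labels, and the toy data are all hypothetical, and it assumes that a judge verdict of "tie" is what gets counted as a miss.

```python
from statistics import pstdev


def agreement_rate(judge_verdicts, human_verdicts):
    """Share of comparisons where the judge picks the same winner as the human.

    Assumption: a judge verdict of "tie" never counts as agreement (a miss).
    """
    if len(judge_verdicts) != len(human_verdicts):
        raise ValueError("verdict lists must have the same length")
    if not judge_verdicts:
        raise ValueError("need at least one comparison")
    hits = sum(
        1
        for judge, human in zip(judge_verdicts, human_verdicts)
        if judge != "tie" and judge == human
    )
    return hits / len(judge_verdicts)


def score_spread(scores):
    """Summarize how much of the scale the judge actually uses."""
    return {
        "min": min(scores),
        "max": max(scores),
        "stdev": pstdev(scores),  # population standard deviation
    }


# Made-up toy data for illustration only; not taken from the article.
judge = ["a", "b", "tie", "a", "b"]
human = ["a", "a", "b", "a", "b"]
print(agreement_rate(judge, human))

toy_scores = [7, 8, 7, 8, 7, 7]
print(score_spread(toy_scores))
```

A judge whose scores all land on 7 or 8 yields a very small spread, which is one way to see the lack of discrimination the summary describes. Note that on a two-way comparison, random guessing would already agree about half the time, so an agreement rate should be read against that baseline.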

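Impact: Highlights the potential unreliability and bias of automated LLM evaluation, and advises developers who rely on such systems to proceed with caution.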

Ranking rationale: A developer's personal experiment with, and analysis of, an LLM-as-a-Judge system.


AI-generated summary · Google Gemini · from 1 source.

Sources [1]

dev.to — LLM tag · TIER_1 · English (EN) · Suman Nath · 2026-06-29 08:05

LLM-as-a-Judge: I Built One From Scratch, Then Checked It Against Humans

In Part 1 (https://dev.to/sumanpro/96-accuracy-was-a-lie-building-an-llm-eval-harness-from-scratch-idi) the model's job was to pick one of 77 labels, so I could check it with ==. But most real LLM output isn't like that — it's a paragraph, a summary, a…

