LLM safety benchmarks show high sensitivity to judge configuration choices

By PulseAugur Editorial Team · [2 sources] · 2026-04-27 05:59

A new research paper reports that AI safety benchmark results vary substantially with judge configuration choices. The study found that changing only the judge prompt wording, with the judge model held constant, could shift measured harmful response rates by as much as 24.2 percentage points. This sensitivity also affects the stability of model safety rankings, and category-level variation reaches up to 39.6 percentage points. The research underscores that the wording of the prompts given to LLM judges is a critical, under-examined factor in safety evaluations.
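To make the "percentage point" figures concrete, here is a minimal sketch of how such a spread could be computed when one judge model is run with several prompt wordings over the same set of responses. The function names, prompt labels, and verdict lists are hypothetical and purely illustrative; they are not taken from the paper and do not reproduce its data or method.

```python
from typing import Dict, List


def harmful_rate(verdicts: List[bool]) -> float:
    """Percentage of responses that a judge labelled harmful."""
    return 100.0 * sum(verdicts) / len(verdicts)


def spread_pp(verdicts_by_prompt: Dict[str, List[bool]]) -> float:
    """Max minus min harmful rate across judge prompts, in percentage points."""
    rates = [harmful_rate(v) for v in verdicts_by_prompt.values()]
    return max(rates) - min(rates)


# Toy verdicts (assumed, not real results): the same four responses,
# judged by the same model under three different prompt wordings.
toy = {
    "prompt_a": [True, False, False, False],
    "prompt_b": [True, True, False, False],
    "prompt_c": [True, True, True, False],
}

print(spread_pp(toy))  # 50.0
```

In this toy case the measured harmful rate ranges from 25% to 75%, so the spread is 50 percentage points even though the responses being judged never changed.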

Impact: Suggests that current AI safety benchmarks may be unreliable because of judge-prompt sensitivity, which calls for more robust evaluation methods.

Ranking rationale: Academic paper detailing new findings on AI safety benchmark methodology.

Read on arXiv cs.CL →

AI-generated summary · Google Gemini · from 2 sources.

Sources [2]

arXiv cs.CL TIER_1 English(EN) · Xinran Zhang · 2026-04-28 04:00

How Sensitive Are Safety Benchmarks to Judge Configuration Choices?

arXiv:2604.24074v1 Announce Type: new Abstract: Safety benchmarks such as HarmBench rely on LLM judges to classify model responses as harmful or safe, yet the judge configuration, namely the combination of judge model and judge prompt, is typically treated as a fixed implementati…
arXiv cs.CL TIER_1 English(EN) · Xinran Zhang · 2026-04-27 05:59

How Sensitive Are Safety Benchmarks to Judge Configuration Choices?

Safety benchmarks such as HarmBench rely on LLM judges to classify model responses as harmful or safe, yet the judge configuration, namely the combination of judge model and judge prompt, is typically treated as a fixed implementation detail. We show this assumption is problemati…
