A new research paper reports that AI safety benchmark results vary significantly with judge configuration choices. The study found that altering the prompt wording alone, while keeping the judge model constant, could shift measured harmful response rates by as much as 24.2 percentage points. This sensitivity affects the stability of model safety rankings, with category-level variation of up to 39.6 percentage points. The research underscores that the specific wording of prompts used with LLM judges is a critical, under-examined factor in safety evaluations.
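To make the "percentage points" figure concrete, here is a minimal sketch of how such a spread could be computed when the same responses are judged under different prompt wordings. The function names, prompt labels, and verdict lists are hypothetical and chosen only for illustration; they are not taken from the paper, and the paper's actual method may differ.

```python
from typing import Dict, List


def harmful_rate(verdicts: List[bool]) -> float:
    """Return the percentage of responses a judge labeled harmful."""
    if not verdicts:
        raise ValueError("verdicts must not be empty")
    return 100.0 * sum(verdicts) / len(verdicts)


def prompt_spread(verdicts_by_prompt: Dict[str, List[bool]]) -> float:
    """Return the max-minus-min harmful rate across judge prompts,
    in percentage points."""
    rates = [harmful_rate(v) for v in verdicts_by_prompt.values()]
    return max(rates) - min(rates)


# Hypothetical verdicts for the same 10 responses under two judge
# prompt wordings (illustrative data, not results from the paper).
example = {
    "strict_wording": [True, True, False, True, False,
                       False, True, False, False, False],
    "lenient_wording": [True, False, False, False, False,
                        False, True, False, False, False],
}

spread = prompt_spread(example)
print(f"Spread across judge prompts: {spread:.1f} percentage points")
```

In this toy data the same judge model yields a 40% harmful rate under one wording and 20% under the other, a 20-point spread that comes entirely from the prompt.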
Impact: Reveals that current AI safety benchmarks may be unreliable due to prompt sensitivity, necessitating more robust evaluation methods.
Ranking rationale: Academic paper detailing new findings on AI safety benchmark methodology.
AI-generated summary · Google Gemini · from 2 sources.