How Sensitive Are Safety Benchmarks to Judge Configuration Choices?

LLM safety benchmarks are highly sensitive to judge configuration choices

By PulseAugur Editorial · [2 sources] · 2026-04-27 05:59

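A new research paper finds that AI safety benchmark results vary substantially with the choice of judge configuration. Changing only the wording of the judge prompt, while keeping the judge model fixed, can shift the measured harmful response rate by as much as 24.2 percentage points. This sensitivity undermines the stability of model safety rankings, with category-level differences reaching 39.6 percentage points. The study argues that the specific wording of the prompt given to an LLM judge is a critical but overlooked factor in safety evaluation.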
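To make the "percentage point" figures concrete, the sketch below shows one way such a spread could be computed: score the same set of responses under several judge prompts, then take the gap between the highest and lowest harmful response rate. This is an illustration only, not the paper's method; the function names, prompt labels, and verdict lists are all hypothetical toy values.

```python
def harmful_rate(verdicts):
    """Percentage of responses that a judge labeled harmful (True)."""
    return 100.0 * sum(verdicts) / len(verdicts)


def spread_pp(verdicts_by_prompt):
    """Return (max rate - min rate, per-prompt rates), in percentage points."""
    rates = {name: harmful_rate(v) for name, v in verdicts_by_prompt.items()}
    return max(rates.values()) - min(rates.values()), rates


# Hypothetical toy verdicts: the SAME 10 responses, judged by the SAME
# judge model under three differently worded judge prompts.
toy = {
    "prompt_a": [True, False, False, False, True, False, False, False, False, False],
    "prompt_b": [True, True, False, False, True, False, True, False, False, False],
    "prompt_c": [True, False, False, False, True, False, True, False, False, False],
}

spread, rates = spread_pp(toy)
print(rates)
print(f"spread: {spread:.1f} percentage points")
```

In this toy case the measured rate moves between 20% and 40% purely because the prompt wording changed, which is a 20 percentage point spread, not a 20 percent relative change.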

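Impact: The findings suggest that current AI safety benchmarks may be unreliable because of prompt sensitivity, and that more robust evaluation methods are needed.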

Ranking rationale: Academic paper detailing new findings on AI safety benchmark methodology.


AI-generated summary · Google Gemini · from 2 sources.

Sources [2]

arXiv cs.CL TIER_1 English(EN) · Xinran Zhang · 2026-04-28 04:00

How Sensitive Are Safety Benchmarks to Judge Configuration Choices?

arXiv:2604.24074v1 Announce Type: new Abstract: Safety benchmarks such as HarmBench rely on LLM judges to classify model responses as harmful or safe, yet the judge configuration, namely the combination of judge model and judge prompt, is typically treated as a fixed implementati…
arXiv cs.CL TIER_1 English(EN) · Xinran Zhang · 2026-04-27 05:59

How Sensitive Are Safety Benchmarks to Judge Configuration Choices?

Safety benchmarks such as HarmBench rely on LLM judges to classify model responses as harmful or safe, yet the judge configuration, namely the combination of judge model and judge prompt, is typically treated as a fixed implementation detail. We show this assumption is problemati…

