How Reliable Is Your Jailbreak Judge? Calibration and Adversarial Robustness of Automated ASR Scoring

New study finds automated LLM jailbreak judges are unreliable

By the PulseAugur editorial team · [7 sources] · 2026-06-19 16:43

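Researchers are questioning the reliability of the automated scoring systems used to evaluate jailbreaks of large language models (LLMs). A new study finds that dedicated classifiers tend to over-flag attacks, while LLM-based judges show inconsistent recall, so the reported attack-success rate (ASR) varies widely depending on which judge is used. The automated judges are also vulnerable to adversarial attack: simple text manipulation can significantly shift their scores, while the dedicated classifiers are more robust but can still be broken by white-box attacks. The findings suggest that many reported attack-success rates may be unreliable because of the limitations of these automated evaluation methods.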
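To make the judge-dependence concrete, here is a minimal sketch of how the same set of model responses can yield very different ASR values under different judges. It is not the study's method or data: the `score_judge` helper, the `JudgeReport` type, and all labels below are hypothetical, invented only to illustrate an over-flagging classifier and a low-recall LLM judge measured against human labels.

```python
from dataclasses import dataclass


@dataclass
class JudgeReport:
    asr: float        # fraction of responses the judge flags as successful attacks
    precision: float  # of the judge's flags, the fraction humans also flagged
    recall: float     # of the human-flagged successes, the fraction the judge caught


def score_judge(human: list[bool], judge: list[bool]) -> JudgeReport:
    """Compare one judge's verdicts with human labels on the same responses."""
    if len(human) != len(judge) or not human:
        raise ValueError("label lists must be non-empty and the same length")
    true_pos = sum(h and j for h, j in zip(human, judge))
    flagged = sum(judge)
    positives = sum(human)
    return JudgeReport(
        asr=flagged / len(judge),
        precision=true_pos / flagged if flagged else 0.0,
        recall=true_pos / positives if positives else 0.0,
    )


# Toy labels, invented for illustration only. True means "attack succeeded".
human_labels    = [True, True, False, False, False, True, False, False, False, False]
classifier_says = [True, True, True,  True,  False, True, True,  False, False, False]
llm_judge_says  = [True, False, False, False, False, False, False, False, False, False]

human_asr = sum(human_labels) / len(human_labels)
clf = score_judge(human_labels, classifier_says)
llm = score_judge(human_labels, llm_judge_says)

print(f"human ASR:      {human_asr:.2f}")
print(f"classifier ASR: {clf.asr:.2f} (precision {clf.precision:.2f}, recall {clf.recall:.2f})")
print(f"LLM judge ASR:  {llm.asr:.2f} (precision {llm.precision:.2f}, recall {llm.recall:.2f})")
```

On this toy data the human-labelled ASR is 0.30, the over-flagging classifier reports 0.60 with perfect recall but only 0.50 precision, and the low-recall LLM judge reports 0.10. Reporting precision and recall against a human-labelled subset alongside the headline ASR is one way to show how much of the number comes from the judge rather than the attack.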

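Impact: The findings underscore the need for stronger, more reliable evaluation metrics in LLM safety research, which could change how model safety is assessed.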

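Ranking rationale: This cluster contains research papers that discuss the limitations and evaluation of automated systems used to assess LLM jailbreaks and automatic speech recognition (ASR) errors.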


AI-generated summary · Google Gemini · from 7 sources.

Sources [7]

arXiv cs.CL · Mohammad Aref Jafari-Raddani · 2026-06-25 04:00

Error-Aware TF-IDF Retrieval-Augmented Generation for ASR Error Correction

arXiv:2606.24915v1. Abstract: End-to-end automatic speech recognition systems frequently hallucinate rare entities and domain-specific terms, especially in low-resource languages. While retrieval-augmented generation frameworks can mitigate these errors using la…

arXiv cs.CL · Pratik Rakesh Singh, Mohammadi Zaki, Aneesh Mukkamala, Pankaj Wasnik · 2026-06-25 04:00

Graph-Based Phonetic Error Correction of Noisy ASR

arXiv:2606.24889v1. Abstract: Automatic speech recognition (ASR) systems, despite low overall word error rates, produce residual lexical errors that disproportionately affect semantically critical tokens such as named entities, negations, and sentiment-bearing w…

arXiv cs.CL · Yang Gao (Veyon Solutions) · 2026-06-25 04:00

How Reliable Is Your Jailbreak Judge? Calibration and Adversarial Robustness of Automated ASR Scoring

arXiv:2606.25487v1. Abstract: Almost every paper on LLM jailbreaks and prompt injection reports an attack-success rate (ASR), and that number is assigned not by people but by an automated judge: either a safety classifier trained for the task, or a general chat …

Hugging Face Daily Papers · 2026-06-24 07:14

How Reliable Is Your Jailbreak Judge? Calibration and Adversarial Robustness of Automated ASR Scoring

Almost every paper on LLM jailbreaks and prompt injection reports an attack-success rate (ASR), and that number is assigned not by people but by an automated judge: either a safety classifier trained for the task, or a general chat model prompted to grade. The judge is rarely che…

arXiv cs.CL · Yang Gao · 2026-06-24 07:14

How Reliable Is Your Jailbreak Judge? Calibration and Adversarial Robustness of Automated ASR Scoring

Almost every paper on LLM jailbreaks and prompt injection reports an attack-success rate (ASR), and that number is assigned not by people but by an automated judge: either a safety classifier trained for the task, or a general chat model prompted to grade. The judge is rarely che…

arXiv cs.IR (Information Retrieval) · Mohammad Aref Jafari-Raddani · 2026-06-19 16:43

Error-Aware TF-IDF Retrieval-Augmented Generation for ASR Error Correction

End-to-end automatic speech recognition systems frequently hallucinate rare entities and domain-specific terms, especially in low-resource languages. While retrieval-augmented generation frameworks can mitigate these errors using large language models, current architectures face …

Towards AI · Dmitriy Nikultsev · 2026-06-23 16:01

Why Word Error Rate Is Not Enough: A Semantic Decomposition of ASR Errors

A feasible framework for evaluating ASR models across semantic categories instead of a single aggregate metric. [Figure: introduction image showing the decomposition of general WER into semantic categories, such as people and geographic names] …
