Off-Distribution Voices: Fanfiction Subgenres as Universal Vernacular Jailbreaks for Aligned LLMs

Fanfiction Subgenres Used to Jailbreak Aligned LLMs

By the PulseAugur editorial team · [2 sources] · 2026-06-03 06:01

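Researchers have developed a novel jailbreak technique against aligned large language models that exploits fanfiction subgenres. The method embeds harmful behaviors in passages drawn from twelve distinct Archive of Our Own (AO3) subgenres, bypassing conventional defenses. Across eight LLMs, the attack raises the attack success rate (ASR) from 0.278 to 0.731, which indicates that its effectiveness comes from writing style rather than prompt structure. The proposed defenses were found to be ineffective, which suggests that attention needs to shift to register-based attacks.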
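To make the reported numbers concrete, here is a minimal sketch of how an attack success rate and the gain between two conditions are commonly computed. The function names `attack_success_rate` and `compare_asr` are hypothetical and do not come from the paper; only the values 0.278 and 0.731 are taken from the summary. How the paper judges a success or aggregates over the eight models is not stated here, so the sketch assumes a plain ratio of successful attempts to total attempts.

```python
def attack_success_rate(successes: int, attempts: int) -> float:
    """Fraction of attempts judged successful (assumed definition of ASR)."""
    if attempts <= 0:
        raise ValueError("attempts must be positive")
    if not 0 <= successes <= attempts:
        raise ValueError("successes must lie between 0 and attempts")
    return successes / attempts


def compare_asr(baseline: float, attacked: float) -> dict:
    """Absolute and relative change between two ASR values."""
    if baseline <= 0:
        raise ValueError("baseline must be positive to compute a ratio")
    return {
        "absolute_gain": attacked - baseline,
        "relative_gain": attacked / baseline,
    }


# Values reported in the summary: baseline ASR 0.278, attack ASR 0.731.
reported = compare_asr(0.278, 0.731)
print(f"absolute gain: {reported['absolute_gain']:.3f}")
print(f"relative gain: {reported['relative_gain']:.2f}x")
```

Under these assumptions, the reported shift is an absolute gain of about 0.453, or roughly 2.6 times the baseline rate.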

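Impact: This research highlights a new vulnerability in LLM safety training that may require new defense mechanisms beyond simple prompt filtering.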

Ranking rationale: This cluster contains a paper detailing a new method for jailbreaking LLMs.


AI-generated summary · Google Gemini · from 2 sources.

Sources [2]

arXiv cs.CL TIER_1 English(EN) · Zhongze Luo, Ruihe Shi, Zhenshuai Yin, Haoyue Liu, Weixuan Wan, Xiaoying Tang · 2026-06-04 04:00

Off-Distribution Voices: Fanfiction Subgenres as Universal Vernacular Jailbreaks for Aligned LLMs

arXiv:2606.04483v1 Announce Type: new Abstract: Existing jailbreaks against aligned LLMs are discrete artifacts whose surface forms are easy to fingerprint and patch. We argue that the real failure mode is not any specific prompt, but an entire register of natural human writing t…
arXiv cs.CL TIER_1 English(EN) · Xiaoying Tang · 2026-06-03 06:01

Off-Distribution Voices: Fanfiction Subgenres as Universal Vernacular Jailbreaks for Aligned LLMs

Existing jailbreaks against aligned LLMs are discrete artifacts whose surface forms are easy to fingerprint and patch. We argue that the real failure mode is not any specific prompt, but an entire register of natural human writing that safety training has under-covered. Building …
