Attention Is Where You Attack

New Attack Redirects LLM Attention to Bypass Safety Alignment

By PulseAugur Editorial Team · [1 source] · 2026-05-05 04:00

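Researchers have developed a new white-box adversarial attack, the Attention Redistribution Attack (ARA), which targets the internal attention mechanisms of safety-aligned large language models. The attack crafts non-semantic tokens that divert attention away from safety-critical components, bypassing alignment more effectively than earlier methods. The study finds that removing specific attention heads has little effect on the model, whereas redirecting their attention significantly degrades safety behavior in models such as LLaMA-3 and Mistral-7B, which suggests that safety emerges from attention routing rather than from localized components.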
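The attention-diversion idea in the summary above can be illustrated with a toy calculation. The sketch below is not the authors' method: it uses a single hand-built attention head in plain Python, and the vectors, the "safety-critical" position, and the suffix keys are all hypothetical values chosen for illustration. A real attack would optimize actual tokens against a real model. The sketch only shows why appended tokens can starve another position of attention: softmax weights sum to 1, so mass captured by the suffix is mass taken from everything else.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(query, keys):
    """Scaled dot-product attention weights of one query over a list of keys."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    return softmax(scores)

# Hypothetical single-head setup (assumed values, not taken from the paper).
query = [1.0, 0.0]
keys = [
    [2.0, 0.0],  # position 0: stand-in for a "safety-critical" token
    [0.5, 0.5],  # position 1: ordinary prompt token
    [0.0, 1.0],  # position 2: ordinary prompt token
]

# Hand-picked suffix keys that align strongly with the query. In a real
# attack these would come from optimizing non-semantic tokens, not by hand.
suffix_keys = [[4.0, 0.0], [4.0, 0.0]]

baseline = attention_weights(query, keys)
attacked = attention_weights(query, keys + suffix_keys)

safety_mass_before = baseline[0]
safety_mass_after = attacked[0]

print(f"attention on safety token before suffix: {safety_mass_before:.3f}")
print(f"attention on safety token after suffix:  {safety_mass_after:.3f}")
```

In this toy setup the safety-critical position keeps its key and value unchanged; only the share of attention it receives shrinks once the suffix is appended. That mirrors the summary's distinction between removing a component and rerouting attention around it, though the toy says nothing about how strong the effect is in a real model.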

Impact: Introduces a new attack vector that can inform future LLM safety research and red-teaming.

Ranking rationale: A research paper detailing a novel adversarial attack on LLM safety mechanisms.


AI-generated summary · Google Gemini · from 1 source.

Sources [1]

arXiv cs.AI · Aviral Srivastava, Sourav Panda · 2026-05-05 04:00

Attention Is Where You Attack

arXiv:2605.00236v1 Announce Type: cross Abstract: Safety-aligned large language models rely on RLHF and instruction tuning to refuse harmful requests, yet the internal mechanisms implementing safety behavior remain poorly understood. We introduce the Attention Redistribution Atta…
