Safety Mirage: How Spurious Correlations Undermine VLM Safety Fine-Tuning and Can Be Mitigated by Machine Unlearning

Study finds spurious-correlation flaw in VLM safety training

By the PulseAugur editorial team · [1 source] · 2026-06-02 04:00

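Researchers have identified a significant flaw in the current safety training of vision language models (VLMs), which they call the "safety mirage." The cause is that models learn spurious correlations between superficial textual patterns and safety responses rather than a genuine understanding of harm. As a result, these VLMs are easily fooled by simple word substitutions, which either bypass the safety measures or cause benign queries to be refused unnecessarily. The study proposes machine unlearning (MU) as a more effective approach to safety alignment, reporting a reduction in attack success rate of up to 60% and a reduction in unnecessary refusals of more than 84%.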
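The two figures quoted above are rates over sets of model responses. As a minimal sketch of how such rates might be computed, the Python below uses a keyword-based refusal check. The marker list, the function names, and the rule that any non-refusal to an unsafe query counts as a successful attack are all simplifying assumptions made for illustration; they are not taken from the paper, which may use a different judge.

```python
from typing import Sequence

# Hypothetical refusal markers, chosen for illustration only.
# A real evaluation would likely rely on a stronger judge than substring matching.
REFUSAL_MARKERS = ("i'm sorry", "i cannot", "i can't", "i am unable")


def is_refusal(response: str) -> bool:
    """Return True if the response contains any of the assumed refusal markers."""
    text = response.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)


def attack_success_rate(responses_to_unsafe: Sequence[str]) -> float:
    """Fraction of responses to unsafe queries that were NOT refused.

    Assumption: a non-refusal is counted as a successful attack.
    """
    if not responses_to_unsafe:
        return 0.0
    successes = sum(not is_refusal(r) for r in responses_to_unsafe)
    return successes / len(responses_to_unsafe)


def unnecessary_refusal_rate(responses_to_benign: Sequence[str]) -> float:
    """Fraction of responses to benign queries that WERE refused."""
    if not responses_to_benign:
        return 0.0
    refusals = sum(is_refusal(r) for r in responses_to_benign)
    return refusals / len(responses_to_benign)
```

Under these assumptions, one could run the same query set through a model before and after a safety intervention, and again after a one-word substitution in each query, then compare the resulting rates.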

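Impact: Highlights a critical vulnerability in VLM safety training and could shift alignment strategies toward more robust methods such as machine unlearning.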

Ranking rationale: Academic paper detailing new findings on VLM safety and a proposed mitigation.


AI-generated summary · Google Gemini · from 1 source.

Sources [1]

arXiv cs.AI TIER_1 English(EN) · Yiwei Chen, Yuguang Yao, Yihua Zhang, Bingquan Shen, Gaowen Liu, Sijia Liu · 2026-06-02 04:00

Safety Mirage: How Spurious Correlations Undermine VLM Safety Fine-Tuning and Can Be Mitigated by Machine Unlearning

arXiv:2503.11832v5 Announce Type: replace Abstract: Recent vision language models (VLMs) have made remarkable strides in generative modeling with multimodal inputs, particularly text and images. However, their susceptibility to generating harmful content when exposed to unsafe qu…

