Sparse Autoencoders as Plug-and-Play Firewalls for Adversarial Attack Detection in VLMs

New SAEgis framework detects adversarial attacks on vision-language models

By the PulseAugur editorial team · [1 source] · 2026-05-08 08:53

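Researchers have developed a new framework, SAEgis, for detecting adversarial attacks on vision-language models (VLMs). The method uses sparse autoencoders (SAEs) as plug-and-play modules, requires no additional adversarial training, and adds minimal overhead. SAEgis identifies perturbed inputs from the sparse latent features the SAEs have learned. It performs strongly across a range of attacks and domain settings, and it improves markedly on existing methods in cross-domain generalization.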
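To make the idea concrete, here is a minimal sketch of detection over SAE latents. The summary does not describe how SAEgis actually scores an input, so everything below is an illustrative assumption rather than the paper's method: the layer sizes, the random placeholder encoder weights, the synthetic activations, the uniform shift standing in for an attack, and the mean-difference detector with a midpoint threshold.

```python
import numpy as np

rng = np.random.default_rng(0)
D_MODEL, D_LATENT = 16, 64  # hypothetical sizes, chosen for illustration

# Placeholder encoder weights. A real SAE would be trained on activations
# taken from a frozen VLM; random weights only keep the sketch runnable.
W_enc = rng.normal(scale=0.1, size=(D_MODEL, D_LATENT))
b_enc = np.zeros(D_LATENT)


def sae_encode(h):
    """Map activations (n, D_MODEL) to non-negative latents (n, D_LATENT)."""
    return np.maximum(h @ W_enc + b_enc, 0.0)


def fit_detector(z_clean, z_adv):
    """Assumed detector: mean-difference direction plus a midpoint threshold."""
    w = z_adv.mean(axis=0) - z_clean.mean(axis=0)
    threshold = 0.5 * ((z_clean @ w).mean() + (z_adv @ w).mean())
    return w, threshold


def flag_adversarial(h, w, threshold):
    """Return a boolean array: True where the latent score exceeds the threshold."""
    return sae_encode(h) @ w > threshold


def sample(n, attacked):
    """Synthetic stand-in for VLM activations; the +2.0 shift is a toy 'attack'."""
    h = rng.normal(size=(n, D_MODEL))
    return h + 2.0 if attacked else h


# Fit on one synthetic batch, then evaluate on fresh samples.
w, threshold = fit_detector(sae_encode(sample(200, False)),
                            sae_encode(sample(200, True)))
clean_rate = flag_adversarial(sample(200, False), w, threshold).mean()
adv_rate = flag_adversarial(sample(200, True), w, threshold).mean()
print(f"flagged clean: {clean_rate:.2f}, flagged attacked: {adv_rate:.2f}")
```

The base model is never modified in this sketch: the detector only reads activations and encodes them, which is one plausible reading of "plug-and-play" with no adversarial training of the VLM itself.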

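Impact: By offering a practical defense against adversarial attacks, the work improves the safety and reliability of vision-language models in real-world applications.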

Ranking rationale: An academic paper proposing a novel method for adversarial attack detection in VLMs.

Read on arXiv cs.CL →

AI-generated summary · Google Gemini · from 1 source.

Sources [1]

arXiv cs.CL TIER_1 English(EN) · Daisuke Kawahara · 2026-05-08 08:53

Sparse Autoencoders as Plug-and-Play Firewalls for Adversarial Attack Detection in VLMs

Vision-language models (VLMs) have advanced rapidly and are increasingly deployed in real-world applications, especially with the rise of agent-based systems. However, their safety has received relatively limited attention. Even the latest proprietary and open-weight VLMs remain …

