PulseAugur

New SAEgis framework detects adversarial attacks on vision-language models

Researchers have developed SAEgis, a framework for detecting adversarial attacks on vision-language models (VLMs). The method attaches sparse autoencoders (SAEs) as a plug-and-play module, requires no additional adversarial training, and adds minimal overhead. SAEgis identifies perturbed inputs from learned sparse latent features and performs strongly across a range of attack and domain settings, with notable gains in cross-domain generalization over existing methods.
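To make the idea of detecting perturbed inputs from sparse latent features concrete, here is a minimal sketch in plain Python. It is not the SAEgis method: the summary does not say how the framework scores latents, so the z-score rule, the threshold, the toy encoder weights, and every name below (`sae_encode`, `fit_clean_profile`, `anomaly_score`, `is_adversarial`) are assumptions made for illustration. In a real setting the encoder would be an SAE trained on a VLM's internal activations rather than a hand-written 3x2 matrix.

```python
# Illustrative sketch only: hypothetical weights and a simple z-score rule,
# not the scoring method used by SAEgis.

def sae_encode(x, weights, biases):
    """Encoder half of a sparse autoencoder: z = ReLU(W x + b)."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(weights, biases)]

def fit_clean_profile(clean_latents, eps=1e-6):
    """Per-feature mean and standard deviation over latents of clean inputs."""
    n = len(clean_latents)
    dims = len(clean_latents[0])
    means = [sum(z[j] for z in clean_latents) / n for j in range(dims)]
    stds = [(sum((z[j] - means[j]) ** 2 for z in clean_latents) / n) ** 0.5 + eps
            for j in range(dims)]
    return means, stds

def anomaly_score(z, means, stds):
    """Mean absolute z-score of the latent features against the clean profile."""
    return sum(abs(v - m) / s for v, m, s in zip(z, means, stds)) / len(z)

def is_adversarial(x, weights, biases, means, stds, threshold):
    """Flag an input whose latent features deviate strongly from clean ones."""
    z = sae_encode(x, weights, biases)
    return anomaly_score(z, means, stds) > threshold

# Toy encoder (assumed): the third feature stays silent on clean inputs.
WEIGHTS = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
BIASES = [0.0, 0.0, 0.0]

# Stand-ins for activations of clean inputs (assumed data).
CLEAN = [[1.0, 0.1], [0.9, 0.2], [1.1, 0.0], [1.0, 0.15]]
MEANS, STDS = fit_clean_profile([sae_encode(x, WEIGHTS, BIASES) for x in CLEAN])
THRESHOLD = 3.0  # assumed; in practice tuned on held-out clean data

# A perturbed input that activates the normally silent third feature.
PERTURBED = [-0.8, -0.4]

for name, x in [("clean", CLEAN[0]), ("perturbed", PERTURBED)]:
    flagged = is_adversarial(x, WEIGHTS, BIASES, MEANS, STDS, THRESHOLD)
    print(name, "flagged" if flagged else "passed")
```

The detector in this sketch needs only clean inputs to fit its profile, which mirrors the summary's point that no adversarial training is required. The encoder is read-only, so it can sit beside an existing model without changing it.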

Impact: Enhances the safety and reliability of vision-language models in real-world applications by providing a practical defense against adversarial attacks.

Ranking rationale: Academic paper proposing a novel method for adversarial attack detection in VLMs.

Read on arXiv cs.CL →

AI-generated summary · Google Gemini · from 1 source. How we write summaries →

Sources [1]

  1. arXiv cs.CL · TIER_1 · English (EN) · Daisuke Kawahara

    Sparse Autoencoders as Plug-and-Play Firewalls for Adversarial Attack Detection in VLMs

    Vision-language models (VLMs) have advanced rapidly and are increasingly deployed in real-world applications, especially with the rise of agent-based systems. However, their safety has received relatively limited attention. Even the latest proprietary and open-weight VLMs remain …