Researchers have developed SAEgis, a framework for detecting adversarial attacks on vision-language models (VLMs). It uses sparse autoencoders (SAEs) as a plug-and-play module, requires no additional adversarial training, and adds minimal overhead. SAEgis identifies perturbed inputs from learned sparse latent features and performs strongly across a range of attack and domain settings, with notable gains in cross-domain generalization over existing methods.
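To make the idea of detecting perturbed inputs from sparse latent features concrete, here is a minimal sketch in Python with NumPy. It is not the SAEgis method: the summary does not say how the latents are scored, so this stand-in assumes a ReLU SAE encoder and flags inputs whose latent activations deviate strongly from statistics collected on clean data. The names `sae_encode`, `LatentDetector`, `w_enc`, `b_enc`, and `quantile` are all hypothetical.

```python
import numpy as np


def sae_encode(x, w_enc, b_enc):
    """Assumed SAE encoder: sparse latents z = ReLU(x @ w_enc + b_enc)."""
    return np.maximum(0.0, x @ w_enc + b_enc)


class LatentDetector:
    """Illustrative detector: flags inputs whose SAE latents look unlike clean data.

    The scoring rule (mean absolute z-score over latents) is an assumption
    chosen for simplicity, not something stated in the source.
    """

    def __init__(self, w_enc, b_enc, quantile=0.99):
        self.w_enc = w_enc
        self.b_enc = b_enc
        self.quantile = quantile

    def _score(self, z):
        # Average per-latent deviation from the clean-data statistics.
        return np.abs((z - self.mean) / self.std).mean(axis=1)

    def fit(self, clean_x):
        # Only clean embeddings are used; no adversarial examples are needed.
        z = sae_encode(clean_x, self.w_enc, self.b_enc)
        self.mean = z.mean(axis=0)
        self.std = z.std(axis=0) + 1e-6  # avoid division by zero
        self.threshold = float(np.quantile(self._score(z), self.quantile))
        return self

    def predict(self, x):
        # True marks an input as suspected adversarial.
        z = sae_encode(x, self.w_enc, self.b_enc)
        return self._score(z) > self.threshold
```

In this sketch the frozen model's embeddings would be passed in as `x`, so the detector sits beside the model rather than changing it, which is one plausible reading of "plug-and-play".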
Impact: Enhances the safety and reliability of vision-language models in real-world applications by providing a practical defense against adversarial attacks.
Ranking rationale: Academic paper proposing a novel method for adversarial attack detection in VLMs.
AI-generated summary · Google Gemini · from 1 source.