PulseAugur

New Defenses and Attacks Target LLM Jailbreaks and Prompt Injection

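Researchers are developing new methods both for defending large language models against prompt injection and jailbreak attacks and for strengthening those attacks. GuardNet uses an ensemble of shallow neural networks for efficient detection, while SlotGCG optimizes where the attack is placed within a prompt to exploit positional vulnerabilities. NeuroArmor provides a runtime defense that balances safety and helpfulness by comparing a prompt with its safe variants, and CRI proposes a framework that strengthens jailbreak attacks by exploiting compliance directions in the model's activation space.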

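Impact: This research aims to improve the safety and reliability of large language models, making them safer to deploy widely and reducing the risks associated with malicious use.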

Ranking rationale: Several academic papers detail new methods in LLM safety and security research.


AI-generated summary · Google Gemini · from 4 sources.

Sources [4]

  1. arXiv cs.AI TIER_1 English(EN) · Paulo Ricardo Ferreira Neves, Edson Rodrigues da Cruz Filho, Paulo Henrique Eleuterio Falsetti, João Vitor Pavan, Ian Degaspari, Henrique Vieira Laturrague, Patrick Vieira Laturrague, Guilherme Nielsen Dias, Marccello Wilson Perez Berto, Gustavo Voltan… ·

    GuardNet: An Ensemble Strategy of Shallow Neural Networks for Robust Prompt Injection and Jailbreak Detection

    arXiv:2606.05566v1 Announce Type: new Abstract: Large Language Models (LLMs) have transformed natural language processing, but they remain vulnerable to Prompt Injection (PI) and Jailbreak (JB) attacks. In addition, benchmark evaluations may be affected by contamination and parti…

  2. arXiv cs.LG TIER_1 English(EN) · Seungwon Jeong, Jiwoo Jeong, Hyeonjin Kim, Yunseok Lee, Woojin Lee ·

    SlotGCG: Exploiting Positional Vulnerabilities in LLMs for Jailbreak Attacks

    arXiv:2606.05609v1 Announce Type: cross Abstract: As large language models (LLMs) are widely deployed, identifying their vulnerability through jailbreak attacks becomes increasingly critical. Optimization-based attacks like Greedy Coordinate Gradient (GCG) have focused on inserti…

  3. arXiv cs.AI TIER_1 English(EN) · Zhongyang Lin, Ziran Zhao, Feifei Zhai, Pengyuan Liu ·

    NeuroArmor: Safe-Variant-Guided Representation Consistency for Selective Re-Anchoring in Jailbreak Defense

    arXiv:2606.03486v1 Announce Type: cross Abstract: Large language models remain vulnerable to jailbreak attacks that hide harmful intent behind seemingly ordinary requests such as role-play, translation, encoding, adversarial suffixes, and multi-turn buildup. Existing defenses sti…

  4. arXiv cs.LG TIER_1 English(EN) · Amit Levi, Rom Himelstein, Yaniv Nemcovsky, Avi Mendelson, Chaim Baskin ·

    Jailbreak Attack Initializations as Extractors of Compliance Directions

    arXiv:2502.09755v4 Announce Type: replace-cross Abstract: Safety-aligned LLMs respond to prompts with either compliance or refusal, each corresponding to distinct directions in the model's activation space. Recent works show that initializing attacks via self-transfer from other …