PulseAugur
English(EN) Cross-Generational Transfer of Adversarial Attacks Reveals Non-Monotonic Safety Alignment in LLMs

AI agents discover advanced LLM attack methods, revealing non-monotonic safety gains

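AI agents can discover novel adversarial attack algorithms against large language models that outperform existing methods. One study reports that these AI-discovered attacks achieved success rates of up to 80% on specific queries against a safeguarded GPT model, and 100% against Meta's adversarially robust model. A second paper finds that the safety alignment of Google's Gemma models does not improve steadily from one generation to the next: Gemma 3 showed a markedly higher attack success rate than both its predecessor and its successor.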

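Impact: Highlights the escalating arms race in AI safety and security, which calls for adaptive evaluation methods that go beyond static benchmarks.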

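Ranking rationale: Two research papers detail new methods for discovering adversarial attacks on LLMs and analyze non-monotonic safety alignment across LLM generations.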

Read on arXiv cs.NE (Neural & Evolutionary) →

AI-generated summary · Google Gemini · from 4 sources.


Sources [4]

  1. arXiv cs.MA (Multiagent) TIER_1 English(EN) · Tiantian Zhu

    ZERO-APT: A Closed-Loop Adversarial Framework for LLM-Driven Automated Penetration Testing under Intelligent Defense

    LLM-driven automated penetration testing agents are typically evaluated against static targets that neither detect nor respond to attacks, so their behavior under intelligent defense remains untested. The causal consistency of multi-step attack chains likewise hinges on unstable …

  2. arXiv cs.AI TIER_1 English(EN) · Alexander Panfilov, Peter Romov, Igor Shilov, Yves-Alexandre de Montjoye, Jonas Geiping, Maksym Andriushchenko

    Claudini: Autoresearch Discovers State-of-the-Art Adversarial Attack Algorithms for LLMs

    arXiv:2603.24511v2 Announce Type: replace-cross Abstract: We show that AI agents are capable of discovering novel algorithms for adversarial attacks against LLMs, advancing the state of the art on white-box jailbreaking and prompt injection evaluations. We deploy frontier agents,…

  3. arXiv cs.CL TIER_1 English(EN) · Subhadip Mitra

    Cross-Generational Transfer of Adversarial Attacks Reveals Non-Monotonic Safety Alignment in LLMs

    arXiv:2606.00813v1 Announce Type: cross Abstract: Safety alignment in LLMs does not improve monotonically across model generations. Studying four generations of Google's Gemma family (7B-31B) with quality-diversity evolution (MAP-Elites) as an automated red-teaming probe, we find…

  4. arXiv cs.NE (Neural & Evolutionary) TIER_1 English(EN) · Subhadip Mitra

    Cross-Generational Transfer of Adversarial Attacks Reveals Non-Monotonic Safety Alignment in LLMs

    Safety alignment in LLMs does not improve monotonically across model generations. Studying four generations of Google's Gemma family (7B-31B) with quality-diversity evolution (MAP-Elites) as an automated red-teaming probe, we find that Gemma 3 (12B) exhibits 68.7% +/- 5.7% attack…