PulseAugur
English(EN) Quality-Diversity Evolution for Discovering Diverse Vulnerabilities in LLM Safety

New Evolutionary Framework Uncovers LLM Safety Vulnerabilities

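Researchers have developed a new quality-diversity evolutionary framework for identifying vulnerabilities in large language models. Built on the MAP-Elites algorithm, the method generates interpretable attack strategies rather than raw token sequences, producing an archive of attacks that is diverse across behavioral dimensions. Experiments on models including GPT-4o-mini, Claude 3.5 Sonnet, and Gemini 2.0 Flash revealed distinct, model-specific weaknesses, offering actionable insights for strengthening LLM safety.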
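
For readers unfamiliar with MAP-Elites, its core loop can be sketched as below. This is a generic toy version on a numeric problem, not the paper's implementation: the function names, the bin count, the 50-sample warm-up, and the toy fitness are all illustrative assumptions. In the setting described above, a solution would be an attack strategy and the descriptors would be its behavioral dimensions.

```python
import random


def map_elites(fitness, descriptor, mutate, random_solution,
               bins=5, iterations=2000, warmup=50, seed=0):
    """Toy MAP-Elites: keep the best solution found in each descriptor cell."""
    rng = random.Random(seed)
    archive = {}  # cell index (tuple of ints) -> (fitness, solution)

    def cell_of(solution):
        # Descriptors are assumed to lie in [0, 1); discretise each one.
        return tuple(min(int(d * bins), bins - 1) for d in descriptor(solution))

    for i in range(iterations):
        if archive and i >= warmup:
            # Pick a random elite from the archive and vary it.
            _, parent = archive[rng.choice(list(archive))]
            child = mutate(parent, rng)
        else:
            # Warm-up phase: sample fresh random solutions.
            child = random_solution(rng)
        cell, score = cell_of(child), fitness(child)
        # A child replaces the current elite only if its cell is empty
        # or it scores higher than the incumbent.
        if cell not in archive or score > archive[cell][0]:
            archive[cell] = (score, child)
    return archive


# Hypothetical toy problem: solutions are points in the unit square,
# the descriptor is the point itself, and fitness peaks at the centre.
def toy_fitness(s):
    return -sum((x - 0.5) ** 2 for x in s)


def toy_descriptor(s):
    return s


def toy_mutate(s, rng):
    return tuple(min(max(x + rng.gauss(0, 0.1), 0.0), 0.999) for x in s)


def toy_random(rng):
    return (rng.random(), rng.random())


archive = map_elites(toy_fitness, toy_descriptor, toy_mutate, toy_random)
print(f"{len(archive)} of 25 cells filled")
```

The result is a grid of elites, one per behavioral cell, rather than a single best solution. That archive structure is what makes the discovered set diverse by construction.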

Impact: Offers a novel, reproducible method for evaluating LLM safety and identifying model-specific weaknesses.

Ranking rationale: This cluster contains an academic paper detailing a new research method for LLM safety.

Read on arXiv cs.NE (Neural & Evolutionary) →

AI-generated summary · Google Gemini · from 3 sources.

Sources [3]

  1. arXiv cs.AI TIER_1 English(EN) · Saad Hossain, Tom Tseng, Punya Syon Pandey, Samanvay Vajpayee, Matthew Kowal, Nayeema Nonta, Samuel Simko, Stephen Casper, Zhijing Jin, Kellin Pelrine, Sirisha Rambhatla ·

    TamperBench: Systematically Stress-Testing LLM Safety Under Fine-Tuning and Tampering

    arXiv:2602.06911v2 Announce Type: replace-cross Abstract: As increasingly capable open-weight large language models (LLMs) are deployed, improving their tamper resistance against unsafe modifications, whether accidental or intentional, becomes critical to minimize risks. However,…

  2. arXiv cs.CL TIER_1 English(EN) · Subhadip Mitra ·

    Quality-Diversity Evolution for Discovering Diverse Vulnerabilities in LLM Safety

    arXiv:2606.00801v1 Announce Type: cross Abstract: Current approaches to LLM adversarial testing suffer from coverage gaps: manual red-teaming does not scale, LLM-as-attacker methods exhibit mode collapse, and gradient-based approaches produce uninterpretable gibberish. We introdu…

  3. arXiv cs.NE (Neural & Evolutionary) TIER_1 English(EN) · Subhadip Mitra ·

    Quality-Diversity Evolution for Discovering Diverse Vulnerabilities in LLM Safety

    Current approaches to LLM adversarial testing suffer from coverage gaps: manual red-teaming does not scale, LLM-as-attacker methods exhibit mode collapse, and gradient-based approaches produce uninterpretable gibberish. We introduce a quality-diversity evolutionary framework that…