AI agents discover advanced LLM attack methods, revealing non-monotonic safety gains

By PulseAugur Editorial · [4 sources] · 2026-05-30 17:07

AI agents can discover novel adversarial attack algorithms that outperform existing methods against large language models. One study reported that these agent-discovered attacks reached an attack success rate of up to 80% on specific queries against a safeguarded GPT model and 100% against an adversarially robust Meta model. Another paper found that safety alignment in Google's Gemma models does not improve consistently across generations: Gemma 3 showed significantly higher attack success rates than both its predecessor and its successor.

IMPACT Highlights the escalating arms race in AI safety and security, necessitating adaptive evaluation methods beyond static benchmarks.

RANK_REASON Two research papers detailing novel methods for discovering adversarial attacks against LLMs and analyzing the non-monotonic safety alignment of LLM generations.

paper
safety

AI-generated summary · Google Gemini · from 4 sources.

COVERAGE [4]

arXiv cs.MA (Multiagent) TIER_1 English(EN) · Tiantian Zhu · 2026-06-04 01:28

ZERO-APT: A Closed-Loop Adversarial Framework for LLM-Driven Automated Penetration Testing under Intelligent Defense

LLM-driven automated penetration testing agents are typically evaluated against static targets that neither detect nor respond to attacks, so their behavior under intelligent defense remains untested. The causal consistency of multi-step attack chains likewise hinges on unstable …

arXiv cs.AI TIER_1 English(EN) · Alexander Panfilov, Peter Romov, Igor Shilov, Yves-Alexandre de Montjoye, Jonas Geiping, Maksym Andriushchenko · 2026-06-02 04:00

Claudini: Autoresearch Discovers State-of-the-Art Adversarial Attack Algorithms for LLMs

arXiv:2603.24511v2. Abstract: We show that AI agents are capable of discovering novel algorithms for adversarial attacks against LLMs, advancing the state of the art on white-box jailbreaking and prompt injection evaluations. We deploy frontier agents,…

arXiv cs.CL TIER_1 English(EN) · Subhadip Mitra · 2026-06-02 04:00

Cross-Generational Transfer of Adversarial Attacks Reveals Non-Monotonic Safety Alignment in LLMs

arXiv:2606.00813v1. Abstract: Safety alignment in LLMs does not improve monotonically across model generations. Studying four generations of Google's Gemma family (7B-31B) with quality-diversity evolution (MAP-Elites) as an automated red-teaming probe, we find…

arXiv cs.NE (Neural & Evolutionary) TIER_1 English(EN) · Subhadip Mitra · 2026-05-30 17:07

Cross-Generational Transfer of Adversarial Attacks Reveals Non-Monotonic Safety Alignment in LLMs

Safety alignment in LLMs does not improve monotonically across model generations. Studying four generations of Google's Gemma family (7B-31B) with quality-diversity evolution (MAP-Elites) as an automated red-teaming probe, we find that Gemma 3 (12B) exhibits 68.7% +/- 5.7% attack…
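
For readers unfamiliar with MAP-Elites, the quality-diversity method named in the abstract above, here is a minimal generic sketch on a toy numeric problem. It is not the paper's red-teaming setup: the objective `toy_fitness`, the descriptor `toy_descriptor`, the bin count, and the mutation scale are all hypothetical choices made for illustration. The core idea it shows is an archive that keeps the best solution found in each behavior cell, rather than a single global best.

```python
import random

def map_elites(fitness, descriptor, bins=10, iterations=2000, seed=0):
    """Toy MAP-Elites: keep the best solution found in each descriptor bin."""
    rng = random.Random(seed)
    archive = {}  # bin index -> (fitness, solution)
    for _ in range(iterations):
        if archive and rng.random() < 0.9:
            # Mutate a randomly chosen elite from the archive.
            _, parent = rng.choice(list(archive.values()))
            child = [min(1.0, max(0.0, x + rng.gauss(0, 0.1))) for x in parent]
        else:
            # Otherwise sample a fresh random solution in the unit square.
            child = [rng.random(), rng.random()]
        # Map the behavior descriptor (assumed to lie in [0, 1]) to a cell.
        cell = min(bins - 1, int(descriptor(child) * bins))
        score = fitness(child)
        # Replace the cell's elite only if the child is strictly better.
        if cell not in archive or score > archive[cell][0]:
            archive[cell] = (score, child)
    return archive

# Hypothetical toy objective and behavior descriptor, for illustration only.
def toy_fitness(x):
    return -((x[0] - 0.5) ** 2 + (x[1] - 0.5) ** 2)

def toy_descriptor(x):
    return x[0]

archive = map_elites(toy_fitness, toy_descriptor)
```

The resulting `archive` holds at most one elite per descriptor bin, so it spreads the search across behaviors instead of collapsing onto one optimum.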
