PulseAugur

Metis framework learns to jailbreak LLMs with 89.2% success rate

Researchers have developed Metis, a framework that reformulates LLM jailbreaking as inference-time policy optimization. It uses a self-evolving metacognitive loop to diagnose defense logic and refine its attack strategy, producing more interpretable reasoning traces. Metis achieved an 89.2% average attack success rate across 10 models, significantly outperforming traditional methods on resilient frontier models while reducing token costs by an average of 8.2x.

Impact: Highlights vulnerabilities in current LLM defenses and the need for more robust, dynamic safety mechanisms.

Ranking rationale: The cluster describes a new academic paper detailing a novel framework for LLM security research.

Read on arXiv cs.AI →

AI-generated summary · Google Gemini · from 1 source.

Sources [1]

  1. arXiv cs.AI · English (EN) · Xuelong Li

    Metis: Learning to Jailbreak LLMs via Self-Evolving Metacognitive Policy Optimization

    Red teaming is critical for uncovering vulnerabilities in Large Language Models (LLMs). While automated methods have improved scalability, existing approaches often rely on static heuristics or stochastic search, rendering them brittle against advanced safety alignment. To addres…