PulseAugur
EN
LIVE 11:41:45

New benchmarks tackle AI agent safety in complex environments

Researchers are developing new benchmarks to address the safety risks of AI agents, particularly in multi-agent and interactive environments. GT-HarmBench evaluates frontier models in game-theoretic scenarios, revealing significant failures in high-stakes situations. Boiling the Frog and AgentThreatBench focus on incremental attacks and indirect prompt injections that traditional benchmarks miss, assessing both task utility and security. These efforts aim to create more robust evaluations for AI systems operating beyond simple text generation. AI

IMPACT These new benchmarks are crucial for ensuring the safe deployment of increasingly capable AI agents in real-world, multi-agent scenarios.

RANK_REASON Multiple research papers introducing new benchmarks for AI agent safety.

Read on arXiv cs.AI →

AI-generated summary · Google Gemini · from 11 sources. How we write summaries →

New benchmarks tackle AI agent safety in complex environments

COVERAGE [11]

  1. arXiv cs.AI TIER_1 English(EN) · Jingwei Sun, Jianing Zhu, Yuanyi Li, Tongliang Liu, Xia HU, Bo Han ·

    AgentHijack: Benchmarking Computer Use Agent Robustness to Common Environment Corruptions

    arXiv:2605.25707v1 Announce Type: new Abstract: Autonomous computer use agents that powered by multimodal large language models (MLLMs) are emerging as capable assistants for completing complex digital workflows. However, real-world execution environments are far from ideal: pop-…

  2. arXiv cs.AI TIER_1 English(EN) · Jinhu Qi, Muzhi Li, Jiahong Liu, Yuqin Shu, Dianzhi Yu, Shicheng Ma, Wenqian Cui, Yiyang Zhao, Yiyi Chen, Ruoxi Jiang, Irwin King, Zenglin Xu ·

    Towards trustworthy agentic AI: a comprehensive survey of safety, robustness, privacy, and system security

    arXiv:2605.23989v1 Announce Type: new Abstract: Agentic AI systems -- Large Language Models (LLMs) augmented with planning, tool use, memory, and long-horizon interactions -- can execute complex tasks autonomously, but their multi-step trajectories introduce new failure modes tha…

  3. arXiv cs.AI TIER_1 English(EN) · Bo Han ·

    AgentHijack: Benchmarking Computer Use Agent Robustness to Common Environment Corruptions

    Autonomous computer use agents that powered by multimodal large language models (MLLMs) are emerging as capable assistants for completing complex digital workflows. However, real-world execution environments are far from ideal: pop-ups, resolution changes, and competing applicati…

  4. arXiv cs.LG TIER_1 English(EN) · Jonathan N\"other, Adish Singla, Goran Radanovic ·

    MaMa: A Game-Theoretic Approach for Designing Safe Agentic Systems

    arXiv:2602.04431v2 Announce Type: replace Abstract: LLM-based multi-agent systems have demonstrated impressive capabilities, but they also introduce significant safety risks when individual agents fail or behave adversarially. In this work, we study the automated design of agenti…

  5. arXiv cs.AI TIER_1 English(EN) · Pepijn Cobben, Xuanqiang Angelo Huang, Thao Amelia Pham, Isabel Dahlgren, Terry Jingchen Zhang, Zhijing Jin ·

    GT-HarmBench: Benchmarking AI Safety Risks Through the Lens of Game Theory

    arXiv:2602.12316v2 Announce Type: replace Abstract: Frontier AI systems are increasingly capable and deployed in high-stakes multi-agent environments. However, existing AI safety benchmarks largely evaluate single agents, leaving multi-agent risks such as coordination failure and…

  6. arXiv cs.CL TIER_1 English(EN) · Piercosma Bisconti, Matteo Prandi, Federico Pierucci, Federico Sartore, Enrico Panai, Laura Caroli, Yue Zhu, Adam Leon Smith, Luca Nannini, Marcello Galisai, Susanna Cifani, Francesco Giarrusso, Marcantonio Bracale Syrnikov, Daniele Nardi ·

    Boiling the Frog: A Multi-Turn Benchmark for Agentic Safety

    arXiv:2605.22643v1 Announce Type: new Abstract: Background. Traditional safety benchmarks for language models evaluate generated text: whether a model outputs toxic language, reproduces bias, or follows harmful instructions. When models are deployed as agents, the safety-relevant…

  7. Hugging Face Daily Papers TIER_1 English(EN) ·

    Boiling the Frog: A Multi-Turn Benchmark for Agentic Safety

    Background. Traditional safety benchmarks for language models evaluate generated text: whether a model outputs toxic language, reproduces bias, or follows harmful instructions. When models are deployed as agents, the safety-relevant object shifts from what the system says to what…

  8. arXiv cs.CL TIER_1 English(EN) · Daniele Nardi ·

    Boiling the Frog: A Multi-Turn Benchmark for Agentic Safety

    Background. Traditional safety benchmarks for language models evaluate generated text: whether a model outputs toxic language, reproduces bias, or follows harmful instructions. When models are deployed as agents, the safety-relevant object shifts from what the system says to what…

  9. arXiv cs.AI TIER_1 English(EN) · Ahmad-Reza Sadeghi ·

    Measuring Security Without Fooling Ourselves: Why Benchmarking Agents Is Hard

    The benchmarks used to evaluate AI agents in security-critical roles suffer from crucial weaknesses. Building on recent empirical evidence, we characterize three core challenges that undermine security evaluations: benchmark vulnerabilities, temporal staleness, and runtime uncert…

  10. dev.to — LLM tag TIER_1 English(EN) · pueding ·

    Boiling the Frog Paper: Multi-Turn Norm Erosion vs Single-Prompt Agent Safety

    <p><strong>What:</strong> The <strong>Boiling the Frog</strong> benchmark is a stateful multi-turn safety eval for tool-using AI agents — it walks a scenario from benign edits to risk-bearing actions and scores whether the agent accepts the escalated final turn.</p> <p><strong>Wh…

  11. dev.to — LLM tag TIER_1 English(EN) · Vaishnavi Gudur ·

    AgentThreatBench: The First OWASP Agentic Top 10 Security Benchmark

    <p>The AI safety community has a blind spot. We have excellent benchmarks for measuring whether an LLM will output harmful content (like toxicity or jailbreaks), and we have benchmarks for measuring whether an agent can successfully complete a task (like SWE-bench or WebArena).</…