New benchmarks tackle AI agent safety in complex environments

By PulseAugur Editorial · [11 sources] · 2026-05-19 23:40

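Researchers are developing new benchmarks to address the safety risks of AI agents, particularly in multi-agent and interactive environments. GT-HarmBench evaluates frontier models in game-theoretic scenarios and reveals significant shortcomings in high-stakes situations. Boiling the Frog and AgentThreatBench focus on gradual attacks and indirect prompt injection, which traditional benchmarks overlook, and evaluate both task utility and safety. These efforts aim to create more robust evaluation methods for AI systems that go beyond simple text generation.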

Impact: These new benchmarks are critical to ensuring that increasingly capable AI agents are deployed safely in real-world multi-agent scenarios.

Ranking rationale: Multiple research papers introducing new benchmarks for AI agent safety.

AI-generated summary · Google Gemini · from 11 sources.

Coverage sources [11]

arXiv cs.AI TIER_1 English(EN) · Jingwei Sun, Jianing Zhu, Yuanyi Li, Tongliang Liu, Xia HU, Bo Han · 2026-05-26 04:00

AgentHijack: Benchmarking Computer Use Agent Robustness to Common Environment Corruptions

arXiv:2605.25707v1. Abstract: Autonomous computer use agents that are powered by multimodal large language models (MLLMs) are emerging as capable assistants for completing complex digital workflows. However, real-world execution environments are far from ideal: pop-…
arXiv cs.AI TIER_1 English(EN) · Jinhu Qi, Muzhi Li, Jiahong Liu, Yuqin Shu, Dianzhi Yu, Shicheng Ma, Wenqian Cui, Yiyang Zhao, Yiyi Chen, Ruoxi Jiang, Irwin King, Zenglin Xu · 2026-05-26 04:00

Towards trustworthy agentic AI: a comprehensive survey of safety, robustness, privacy, and system security

arXiv:2605.23989v1. Abstract: Agentic AI systems -- Large Language Models (LLMs) augmented with planning, tool use, memory, and long-horizon interactions -- can execute complex tasks autonomously, but their multi-step trajectories introduce new failure modes tha…
arXiv cs.AI TIER_1 English(EN) · Bo Han · 2026-05-25 11:09

AgentHijack: Benchmarking Computer Use Agent Robustness to Common Environment Corruptions

Autonomous computer use agents that are powered by multimodal large language models (MLLMs) are emerging as capable assistants for completing complex digital workflows. However, real-world execution environments are far from ideal: pop-ups, resolution changes, and competing applicati…
arXiv cs.LG TIER_1 English(EN) · Jonathan Nöther, Adish Singla, Goran Radanovic · 2026-05-25 04:00

MaMa: A Game-Theoretic Approach for Designing Safe Agentic Systems

arXiv:2602.04431v2. Abstract: LLM-based multi-agent systems have demonstrated impressive capabilities, but they also introduce significant safety risks when individual agents fail or behave adversarially. In this work, we study the automated design of agenti…
arXiv cs.AI TIER_1 English(EN) · Pepijn Cobben, Xuanqiang Angelo Huang, Thao Amelia Pham, Isabel Dahlgren, Terry Jingchen Zhang, Zhijing Jin · 2026-05-25 04:00

GT-HarmBench: Benchmarking AI Safety Risks Through the Lens of Game Theory

arXiv:2602.12316v2. Abstract: Frontier AI systems are increasingly capable and deployed in high-stakes multi-agent environments. However, existing AI safety benchmarks largely evaluate single agents, leaving multi-agent risks such as coordination failure and…
arXiv cs.CL TIER_1 English(EN) · Piercosma Bisconti, Matteo Prandi, Federico Pierucci, Federico Sartore, Enrico Panai, Laura Caroli, Yue Zhu, Adam Leon Smith, Luca Nannini, Marcello Galisai, Susanna Cifani, Francesco Giarrusso, Marcantonio Bracale Syrnikov, Daniele Nardi · 2026-05-22 04:00

Boiling the Frog: A Multi-Turn Benchmark for Agentic Safety

arXiv:2605.22643v1. Abstract: Background. Traditional safety benchmarks for language models evaluate generated text: whether a model outputs toxic language, reproduces bias, or follows harmful instructions. When models are deployed as agents, the safety-relevant…
Hugging Face Daily Papers TIER_1 English(EN) · 2026-05-21 15:50

Boiling the Frog: A Multi-Turn Benchmark for Agentic Safety

Background. Traditional safety benchmarks for language models evaluate generated text: whether a model outputs toxic language, reproduces bias, or follows harmful instructions. When models are deployed as agents, the safety-relevant object shifts from what the system says to what…
arXiv cs.CL TIER_1 English(EN) · Daniele Nardi · 2026-05-21 15:50

Boiling the Frog: A Multi-Turn Benchmark for Agentic Safety

Background. Traditional safety benchmarks for language models evaluate generated text: whether a model outputs toxic language, reproduces bias, or follows harmful instructions. When models are deployed as agents, the safety-relevant object shifts from what the system says to what…
arXiv cs.AI TIER_1 English(EN) · Ahmad-Reza Sadeghi · 2026-05-21 14:47

Measuring Security Without Fooling Ourselves: Why Benchmarking Agents Is Hard

The benchmarks used to evaluate AI agents in security-critical roles suffer from crucial weaknesses. Building on recent empirical evidence, we characterize three core challenges that undermine security evaluations: benchmark vulnerabilities, temporal staleness, and runtime uncert…
dev.to — LLM tag TIER_1 English(EN) · pueding · 2026-05-25 11:30

Boiling the Frog Paper: Multi-Turn Norm Erosion vs Single-Prompt Agent Safety

What: The Boiling the Frog benchmark is a stateful multi-turn safety eval for tool-using AI agents — it walks a scenario from benign edits to risk-bearing actions and scores whether the agent accepts the escalated final turn. Wh…
dev.to — LLM tag TIER_1 English(EN) · Vaishnavi Gudur · 2026-05-19 23:40

AgentThreatBench: The First OWASP Agentic Top 10 Security Benchmark

The AI safety community has a blind spot. We have excellent benchmarks for measuring whether an LLM will output harmful content (like toxicity or jailbreaks), and we have benchmarks for measuring whether an agent can successfully complete a task (like SWE-bench or WebArena).…
