AI agents face new prompt injection and backdoor attacks

arXiv cs.AI TIER_1 English(EN) · Jun He, Deying Yu · 2026-06-11 04:00

Sovereign Assurance Boundary: Certificate-Bound Admission for Agentic Infrastructure

arXiv:2606.11632v1 Announce Type: cross Abstract: Agentic infrastructure introduces a critical control-plane authorization problem: non-deterministic reasoning systems can propose high-stakes mutations to production resources, yet existing security mechanisms -- such as identity …

arXiv cs.MA (Multiagent) TIER_1 English(EN) · Deying Yu · 2026-06-10 03:49

Sovereign Assurance Boundary: Certificate-Bound Admission for Agentic Infrastructure

Agentic infrastructure introduces a critical control-plane authorization problem: non-deterministic reasoning systems can propose high-stakes mutations to production resources, yet existing security mechanisms -- such as identity and access management (IAM), policy engines, conse…

arXiv cs.AI TIER_1 English(EN) · Bijaya Dangol · 2026-06-08 04:00

From Privacy to Workflow Integrity: Communication-Graph Metadata in Autonomous Agent Interoperability

arXiv:2606.07150v1 Announce Type: cross Abstract: Agent-interoperability protocols such as A2A and MCP standardize what agents say to one another, but assume address-based transport over HTTP(S). Such transports protect message content, increasingly with end-to-end encryption. Wh…

arXiv cs.AI TIER_1 English(EN) · Thamilvendhan Munirathinam · 2026-06-06 04:00

Will the Agent Recuse Itself? Measuring LLM-Agent Compliance with In-Band Access-Deny Signals

arXiv:2606.06460v1 Announce Type: cross Abstract: As autonomous LLM agents increasingly hold real credentials and operate infrastructure without a human in the loop, operators have no standard way to tell an agent that a resource is off-limits. Access controls either let the agen…

arXiv cs.AI TIER_1 English(EN) · Hanna Foerster, Tom Blanchard, Kristina Nikolić, Ilia Shumailov, Cheng Zhang, Robert Mullins, Nicolas Papernot, Florian Tramèr, Yiren Zhao · 2026-06-06 04:00

CaMeLs Can Use Computers Too: System-level Security for Computer Use Agents

arXiv:2601.09923v3 Announce Type: replace Abstract: AI agents are vulnerable to prompt injection attacks, where malicious content hijacks agent behavior. Among proposed defenses, architectural isolation provides the strongest guarantees by strictly separating trusted task plannin…

arXiv cs.AI TIER_1 English(EN) · Charlie Summers, Eugene Wu · 2026-06-06 04:00

Data Flow Control: Data Safety Policies for AI Agents

arXiv:2606.05679v1 Announce Type: cross Abstract: Agents increasingly generate SQL, orchestrate pipelines, and automate data analysis on behalf of users. While recent work improves query correctness, correctness is not safety. A query may be semantically valid yet violate regulat…

arXiv cs.AI TIER_1 English(EN) · Rufat Asadli, Benjamin Hoffman, Ioannis Protogeros, Laurent Vanbever · 2026-06-06 04:00

Evaluating Agentic Configuration Repair for Computer Networks

arXiv:2606.06212v1 Announce Type: new Abstract: Misconfigurations in computer networks remain a major source of critical Internet outages. Research is turning to Large Language Models (LLMs) to automate the complex, error-prone task of network configuration. However, even state-o…

arXiv cs.MA (Multiagent) TIER_1 English(EN) · Bijaya Dangol · 2026-06-05 11:07

From Privacy to Workflow Integrity: Communication-Graph Metadata in Autonomous Agent Interoperability

Agent-interoperability protocols such as A2A and MCP standardize what agents say to one another, but assume address-based transport over HTTP(S). Such transports protect message content, increasingly with end-to-end encryption. What they leave in the clear is the communication gr…

arXiv cs.CL TIER_1 English(EN) · Nicholas Saban · 2026-06-05 04:00

Domain-Conditioned Safety in Frontier Computer-Using Agents: A 793-Episode Browser Benchmark, a Coding-Domain Cross-Reference, and a Reproducibility Audit of Recent Red-Teaming

arXiv:2606.05233v1 Announce Type: cross Abstract: Recent computer-using-agent (CUA) red-teaming papers report prompt-injection attack success rates (ASR) of 42-98%, but these headline numbers cluster on retired models and on the most-vulnerable model in each paper's panel. We ask…

arXiv cs.AI TIER_1 English(EN) · Thamilvendhan Munirathinam · 2026-06-04 17:50

Will the Agent Recuse Itself? Measuring LLM-Agent Compliance with In-Band Access-Deny Signals

As autonomous LLM agents increasingly hold real credentials and operate infrastructure without a human in the loop, operators have no standard way to tell an agent that a resource is off-limits. Access controls either let the agent in (it has valid credentials) or hard-fail it (i…

arXiv cs.AI TIER_1 English(EN) · Laurent Vanbever · 2026-06-04 14:20

Evaluating Agentic Configuration Repair for Computer Networks

Misconfigurations in computer networks remain a major source of critical Internet outages. Research is turning to Large Language Models (LLMs) to automate the complex, error-prone task of network configuration. However, even state-of-the-art models fail to resolve misconfiguratio…

Hugging Face Daily Papers TIER_1 English(EN) · 2026-06-04 04:01

Data Flow Control: Data Safety Policies for AI Agents

Agents increasingly generate SQL, orchestrate pipelines, and automate data analysis on behalf of users. While recent work improves query correctness, correctness is not safety. A query may be semantically valid yet violate regulatory, privacy, or business constraints that govern …

arXiv cs.AI TIER_1 English(EN) · Pritam Dash, Tongyu Ge, Aditi Jain, Tanmay Shah, Zhiwei Shang · 2026-06-04 04:00

From Untrusted Input to Trusted Memory: A Systematic Study of Memory Poisoning Attacks in LLM Agents

arXiv:2606.04329v1 Announce Type: cross Abstract: Memory is a core component of AI agents, enabling them to accumulate knowledge across interactions and improve performance. However, persistent memory introduces the risk of memory poisoning, where a single adversarial memory writ…

arXiv cs.CL TIER_1 English(EN) · Aradhye Agarwal, Gurdit Siyan, Yash Pandya, Joykirat Singh, Akshay Nambi, Ahmed Awadallah · 2026-06-04 04:00

Learning When to Act or Refuse: Guarding Agentic Reasoning Models for Safe Multi-Step Tool Use

arXiv:2603.03205v2 Announce Type: replace Abstract: Agentic language models operate in a fundamentally different safety regime than chat models: they must plan, call tools, and execute long-horizon actions where a single misstep, such as accessing files or entering credentials, c…

arXiv cs.AI TIER_1 English(EN) · Yiqi Wang, Jiaqi Zhang, Taotao Cai, Zirui Liu, Qingqiang Sun, Zequn Sun, Zhangkai Wu, Mingkai Zhang, Yanming Zhu · 2026-06-04 04:00

From Agent Traces to Trust: Evidence Tracing and Execution Provenance in LLM Agents

arXiv:2606.04990v1 Announce Type: cross Abstract: Large language model (LLM)-based agents increasingly solve complex tasks by interacting with external tools, retrieval systems, memory modules, environments, and other agents. These capabilities expand agent autonomy, but also mak…

arXiv cs.AI TIER_1 English(EN) · Tianneng Shi, Robin Rheem, Dongwei Jiang, Mona Wang, Francisco De La Riega, Zhun Wang, Jingzhi Jiang, Alexander Cheung, Sean Tai, Jonah Cha, Jianhong Tu, Gabriel Han, Chenguang Wang, Jingxuan He, Wenbo Guo, Dawn Song · 2026-06-04 04:00

CyberGym-E2E: Scalable Real-World Benchmark for AI Agents' End-to-End Cybersecurity Capabilities

arXiv:2606.04460v1 Announce Type: cross Abstract: AI has the potential to transform cybersecurity by enabling systems that can autonomously detect, analyze, and remediate software vulnerabilities. However, existing cybersecurity evaluations of AI systems are limited in scale or s…

arXiv cs.AI TIER_1 English(EN) · Yuanbo Xie, Tianyun Liu, Yingjie Zhang, Suchen Liu, Yulin Li, Liya Su, Tingwen Liu · 2026-06-04 04:00

What If Prompt Injection Never Left? Exploring Cross-Session Stored Prompt Injection in Agentic Systems

arXiv:2606.04425v1 Announce Type: cross Abstract: Modern agentic systems transform LLMs from session-bounded assistants into stateful systems that persist and evolve shared world state across sessions through memories, filesystems, tools, and other long-lived contextual artifacts…

arXiv cs.AI TIER_1 English(EN) · Juan Figuera · 2026-06-04 04:00

Notarized Agents: Receiver-Attested Confidential Receipts for AI Agent Actions

arXiv:2606.04193v1 Announce Type: cross Abstract: Current AI agent observability is structurally compromised: the entity producing the activity log is the same entity whose activity is being logged. A compromised or buggy agent can omit, alter, or fabricate its own traces, and th…

arXiv cs.AI TIER_1 English(EN) · Yanming Zhu · 2026-06-03 15:12

From Agent Traces to Trust: Evidence Tracing and Execution Provenance in LLM Agents

Large language model (LLM)-based agents increasingly solve complex tasks by interacting with external tools, retrieval systems, memory modules, environments, and other agents. These capabilities expand agent autonomy, but also make agent behavior harder to verify, debug, and audi…

arXiv cs.AI TIER_1 English(EN) · Jinliang Xu · 2026-06-03 04:00

OpenAgenet/OAN: Technical Architecture for Trust-Governed Agent Identity and Discovery

arXiv:2606.03163v1 Announce Type: cross Abstract: This paper describes the technical architecture of OpenAgenet / OAN. OAN is a protocol-neutral trust layer for open Agent interconnection. It specifies the role architecture, identity objects, registration workflow, Root-governed …

arXiv cs.AI TIER_1 English(EN) · Eliot Krzysztof Jones, Mateusz Dziemian, Matt Fredrikson, J Zico Kolter · 2026-06-03 04:00

A New Framework for Cybersecurity Refusals in AI Agents

arXiv:2606.02644v1 Announce Type: cross Abstract: Agentic scaffolds have dramatically improved LLM performance on complex, long-horizon tasks, yielding both broad benefits and amplified risks in domains like cybersecurity. Existing benchmarks for AI agents in cybersecurity focus …

arXiv cs.AI TIER_1 English(EN) · Jinliang Xu · 2026-06-03 04:00

OpenAgenet/OAN: Open Infrastructure for Trusted Agent Interconnection

arXiv:2606.03161v1 Announce Type: cross Abstract: OpenAgenet, abbreviated as OAN, is an open infrastructure project for trusted Agent interconnection. It addresses a problem that becomes visible when Agents move from isolated applications into open, multi-operator networks: befor…

arXiv cs.MA (Multiagent) TIER_1 English(EN) · Jinliang Xu · 2026-06-02 05:18

OpenAgenet/OAN: Technical Architecture for Trust-Governed Agent Identity and Discovery

This paper describes the technical architecture of OpenAgenet / OAN. OAN is a protocol-neutral trust layer for open Agent interconnection. It specifies the role architecture, identity objects, registration workflow, Root-governed lifecycle, Root-verified package model, authorizat…

arXiv cs.MA (Multiagent) TIER_1 English(EN) · Jinliang Xu · 2026-06-02 05:14

OpenAgenet/OAN: Open Infrastructure for Trusted Agent Interconnection

OpenAgenet, abbreviated as OAN, is an open infrastructure project for trusted Agent interconnection. It addresses a problem that becomes visible when Agents move from isolated applications into open, multi-operator networks: before an Agent can safely discover, select, and invoke…

arXiv cs.AI TIER_1 English(EN) · Riddhi Mohan Sharma · 2026-06-02 04:00

Ethical Hyper-Velocity (EHV): A Hardware-Rooted Zero-Trust Runtime Enforcement Architecture for Agentic AI Systems

arXiv:2605.17909v2 Announce Type: replace Abstract: As autonomous agentic systems scale across regulated critical infrastructures, the lack of mechanistic, hardware-rooted enforcement for high-frequency policy updates presents a fundamental safety gap. We present Ethical Hyper-Ve…

arXiv cs.AI TIER_1 English(EN) · Ismail Hossain, Sai Puppala, Zhuoran Lu, Sajedul Talukder, Nan Jiang · 2026-06-02 04:00

Benchmarking Security Risk Detection and Verification in Open Agentic Skill Ecosystems

arXiv:2606.00925v1 Announce Type: cross Abstract: Open agent platforms allow community contributors to publish reusable skills that agents can invoke at runtime. This extensibility also creates a supply-chain risk: malicious contributors can hide harmful behavior inside skills th…

arXiv cs.AI TIER_1 English(EN) · Florian Holzbauer, David Schmidt, Gabriel Gegenhuber, Sebastian Schrittwieser, Johanna Ullrich · 2026-06-02 04:00

Context Matters: Repository-Aware Security Analysis of the Agent Skill Ecosystem

arXiv:2603.16572v2 Announce Type: replace-cross Abstract: Agent skills extend local AI agents, such as Claude Code and OpenClaw, with additional functionality. Their growing popularity has led to dedicated marketplaces resembling mobile app stores, as well as automated scanners t…

arXiv cs.AI TIER_1 English(EN) · Yi Liu, Zhihao Chen, Yanjun Zhang, Gelei Deng, Yuekang Li, Jianting Ning, Leo Yu Zhang · 2026-06-02 04:00

"Do Not Mention This to the User": Detecting and Understanding Malicious Agent Skills

arXiv:2602.06547v3 Announce Type: replace-cross Abstract: LLM-based coding agents increasingly rely on third-party extensions called skills, which bundle natural language instructions and helper scripts that execute with full user privileges. Community registries have emerged to …

arXiv cs.AI TIER_1 English(EN) · Jeremy Tien, Abishek Anand, Yu-Rou Tuan, Yuchen Shen, J. Zico Kolter, Aran Nayebi · 2026-06-02 04:00

ROGUE: Misaligned Agent Behavior Arising from Ordinary Computer Use

arXiv:2606.00341v1 Announce Type: cross Abstract: As AI agents are increasingly deployed in real personal and corporate settings (email accounts, development workflows, company databases, etc.), safety considerations surrounding these agents become paramount. Although much work h…

arXiv cs.CL TIER_1 English(EN) · Soham Roy, Sarthakbrata Halder, Arya Bharaty, Vaibhav Bhaskar, Yash Sinha, Dhruv Kumar, Srikant Panda, Murari Mandal · 2026-06-02 04:00

"I Strongly Suspect This Website Is a Scam": Benchmarking PII Leakage and Detection without Defense in Autonomous Web Agents

arXiv:2606.00497v1 Announce Type: cross Abstract: Deceptive web content, widely instantiated across the internet and commonly known as social-engineering attacks, manipulates autonomous web agents into submitting users' personally identifiable information (PII) to attack…

arXiv cs.CL TIER_1 English(EN) · Yunhao Feng, Yifan Ding, Xiaohu Du, Ming Wen, Xinhao Deng, Yanming Guo, Yuxiang Xie, Baihui Zheng, Yingshui Tan, Yige Li, Yutao Wu, Yixu Wang, Kerui Cao, Wenke Huang, Xingjun Ma, Yu-Gang Jiang · 2026-06-02 04:00

BraveGuard: From Open-World Threats to Safer Computer-Use Agents

arXiv:2606.01166v1 Announce Type: cross Abstract: Computer-use agents extend language models from text generation to sustained interaction with files, terminals, browsers, and external tools. This shift creates safety risks that are difficult to detect from isolated prompts or fi…

arXiv cs.AI TIER_1 English(EN) · Yoshinari Fujinuma, Varun Gangal, Traian Rebedea, Makesh Narasimhan Sreedhar, Prasoon Varshney, Rebecca Qian, Anand Kannappan · 2026-06-02 04:00

Defenses & Enablers For Skill Injection Attacks on Terminal Based Agents

arXiv:2606.01567v1 Announce Type: cross Abstract: Large language model (LLM) agents increasingly rely on reusable skills, i.e., documents describing task-specific procedures. However, this introduces a new attack surface for agents to manage. We study two complementary directions f…

arXiv cs.AI TIER_1 English(EN) · Hao Cheng, Changtao Miao, Tianle Song, Yin Wu, He Liu, Erjia Xiao, Junchi Chen, Xiaoyu Shi, Yichi Wang, Jing Yang, Taowen Wang, Jinhao Duan, Mengshu Sun, Peiyan Dong, Xuan Shen, Yang Cao, Renjing Xu, Kaidi Xu, Jindong Gu, Bo Zhang, Jize Zhang, Chenhao Li… · 2026-06-02 04:00

SeClaw: Spec-Driven Security Task Synthesis for Evaluating Autonomous Agents

arXiv:2606.02302v1 Announce Type: cross Abstract: Autonomous LLM agents increasingly operate in stateful environments where they access tools, files, memory, and external services. While such capabilities enable complex real-world workflows, they also introduce security risks tha…

Hugging Face Daily Papers TIER_1 English(EN) · 2026-06-02 00:00

BraveGuard: From Open-World Threats to Safer Computer-Use Agents

BraveGuard is a self-evolving defense framework that trains guard models using open-world threat signals and realistic agent trajectories to improve safety detection in computer-use agents.

arXiv cs.AI TIER_1 English(EN) · Chao Shen · 2026-06-01 14:23

SeClaw: Spec-Driven Security Task Synthesis for Evaluating Autonomous Agents

Autonomous LLM agents increasingly operate in stateful environments where they access tools, files, memory, and external services. While such capabilities enable complex real-world workflows, they also introduce security risks that are difficult to capture with existing evaluatio…

arXiv cs.AI TIER_1 English(EN) · Brian Crawford, Patrick McClure · 2026-06-01 04:00

Investigating Detection and Obfuscation of Prompt Injection Attacks Against Software Reverse Engineering AI Agents

arXiv:2605.30677v1 Announce Type: cross Abstract: Agentic software reverse engineering systems are vulnerable to prompt injection attacks placed into the source code of executable binary files. This research demonstrates defensive tactics for detecting the presence of prompt inj…

arXiv cs.AI TIER_1 English(EN) · Xianzhen Luo, Jingyuan Zhang, Shiqi Zhou, Jinyang Huang, Chuan Xiao, Qingfu Zhu, Zhiyuan Ma, Xing Yue, Yang Yue, Wencong Zeng, Wanxiang Che · 2026-06-01 04:00

CVE-Factory: Scaling Expert-Level Agentic Tasks for Code Security Vulnerability

arXiv:2602.03012v3 Announce Type: replace-cross Abstract: Evaluating and improving the security capabilities of code agents requires high-quality, executable vulnerability tasks. However, existing works rely on costly, unscalable manual reproduction and suffer from outdated data …

arXiv cs.AI TIER_1 English(EN) · Brian Crawford, Justin Phillips, Patrick McClure · 2026-06-01 04:00

Automatically Attacking Software Reverse Engineering AI Agents

arXiv:2605.30667v1 Announce Type: cross Abstract: Software tools for reverse engineering executable binary files, such as Ghidra, enable malware analysts to safely conduct robust static analysis without having access to original source code. Coupled with the analytic power of lar…

arXiv cs.AI TIER_1 English(EN) · Jiejun Tan, Zhicheng Dou, Xinyu Yang, Yuyang Hu, Yiruo Cheng, Xiaoxi Li, Ji-Rong Wen · 2026-06-01 04:00

From Prompt Injection to Persistent Control: Defending Agentic Harness Against Trojan Backdoors

arXiv:2605.31042v1 Announce Type: cross Abstract: LLM agents are evolving from conversational chatbots to operational tools in real-world workspaces. In local agentic harnesses, an LLM can read and write files, call tools, and reuse workspace state across sessions. While such cap…

arXiv cs.CL TIER_1 English(EN) · Ji-Rong Wen · 2026-05-29 09:19

From Prompt Injection to Persistent Control: Defending Agentic Harness Against Trojan Backdoors

LLM agents are evolving from conversational chatbots to operational tools in real-world workspaces. In local agentic harnesses, an LLM can read and write files, call tools, and reuse workspace state across sessions. While such capabilities enhance utility, they also expose a new …

arXiv cs.AI TIER_1 English(EN) · Suliu Qin, Haomin Zhuang, Yujun Zhou, Yufei Han, Xiangliang Zhang · 2026-05-29 04:00

AIRGuard: Guarding Agent Actions with Runtime Authority Control

arXiv:2605.28914v1 Announce Type: cross Abstract: Tool-using language agents turn model decisions into external side effects: they read files, run scripts, call APIs, send messages, and invoke Model Context Protocol tools. This makes agent attacks different from jailbreaks. The h…

arXiv cs.AI TIER_1 English(EN) · Dongrui Liu, Yu Li, Zhonghao Yang, Peng Wang, Guanxu Chen, Yuejin Xie, Qinghua Mao, Wanying Qu, Yanxu Zhu, Tianyi Zhou, Leitao Yuan, Zhijie Zheng, Qihao Lin, Yimin Wang, Haoyu Luo, Shuai Shao, Chen Qian, Qingyu Liu, Ling Tang, Ruiyang Qin, Qihan Ren, Jun… · 2026-05-29 04:00

AgentDoG 1.5: A Lightweight and Scalable Alignment Framework for AI Agent Safety and Security

arXiv:2605.29801v1 Announce Type: new Abstract: Modern open-world agents such as OpenClaw exhibit powerful cross-environment execution capabilities yet introduce broad new safety risk sources. Meanwhile, advanced frontier AI models drastically lower attack barriers, rendering cur…

arXiv cs.AI TIER_1 English(EN) · Galip Tolga Erdem · 2026-05-29 04:00

How Reliable Are AI Attackers Against a Fixed Vulnerable Target? A 400-Run Empirical Study of LLM Penetration Testing Consistency

arXiv:2605.30096v1 Announce Type: cross Abstract: Large language models (LLMs) can autonomously conduct multi-stage cyber attacks, but the consistency of their offensive behavior under repeated trials remains unstudied. This work presents the first large-scale empirical measureme…

arXiv cs.AI TIER_1 Svenska(SV) · Yunhao Feng, Yifan Ding, Yingshui Tan, Boren Zheng, Yanming Guo, Xiaolong Li, Kun Zhai, Yishan Li, Wenke Huang · 2026-05-29 04:00

SkillTrojan: Backdoor Attacks on Skill-Based Agent Systems

arXiv:2604.06811v2 Announce Type: replace-cross Abstract: Skill-based agent systems tackle complex tasks by composing reusable skills, improving modularity and scalability while introducing a largely unexamined security attack surface. We propose SkillTrojan, a backdoor attack th…

Hugging Face Daily Papers TIER_1 English(EN) · 2026-05-29 00:00

From Prompt Injection to Persistent Control: Defending Agentic Harness Against Trojan Backdoors

Multi-step trojan attacks in local LLM agents can bypass existing defenses by embedding malicious prompts across multiple operations, requiring new detection methods like DASGuard for effective protection.

arXiv cs.AI TIER_1 English(EN) · Galip Tolga Erdem · 2026-05-28 15:39

How Reliable Are AI Attackers Against a Fixed Vulnerable Target? A 400-Run Empirical Study of LLM Penetration Testing Consistency

Large language models (LLMs) can autonomously conduct multi-stage cyber attacks, but the consistency of their offensive behavior under repeated trials remains unstudied. This work presents the first large-scale empirical measurement of LLM attack consistency: 400 autonomous penet…

arXiv cs.CL TIER_1 English(EN) · Xia Hu · 2026-05-28 11:48

AgentDoG 1.5: A Lightweight and Scalable Alignment Framework for AI Agent Safety and Security

Modern open-world agents such as OpenClaw exhibit powerful cross-environment execution capabilities yet introduce broad new safety risk sources. Meanwhile, advanced frontier AI models drastically lower attack barriers, rendering current agent alignment frameworks inadequate for r…

arXiv cs.AI TIER_1 Svenska(SV) · Chang Jin, An Wang, Zeming Wei, Kai Wang, Biaojie Zeng, Qiaosheng Zhang, Chao Yang, Jingjing Qu, Xia Hu, Xingcheng Xu · 2026-05-28 04:00

SkillSafetyBench: Evaluating Agent Safety under Skill-Facing Attack Surfaces

arXiv:2605.12015v2 Announce Type: replace-cross Abstract: Reusable skills are becoming a common interface for extending large language model agents, packaging procedural guidance with access to files, tools, memory, and execution environments. However, this modularity introduces …

arXiv cs.LG TIER_1 English(EN) · Meghana Bhange, Ulrich Aïvodji, Elliot Creager · 2026-05-28 04:00

Test-Time Collective Action: Proxy-Based Perturbations for Correcting Algorithmic Harms

arXiv:2605.27689v1 Announce Type: new Abstract: When machine learning systems under-perform for particular subgroups, affected users typically have no way to correct these disparities without relying on platform-level fixes. Existing approaches to algorithmic fairness rely on pro…

arXiv cs.AI TIER_1 English(EN) · Yaoyu Zhao, Yichen Xu, Oliver Bračevac, Cao Nguyen Pham, Frank Zhengqing Wu, Martin Odersky · 2026-05-28 04:00

LACUNA: Safe Agents as Recursive Program Holes

arXiv:2605.28617v1 Announce Type: new Abstract: LLM agents increasingly act by writing code, yet a split persists between the runtime that drives the agent and the code the model writes. The runtime owns the loop, context, and control flow, and the model has little say over any o…

arXiv cs.CL TIER_1 English(EN) · Jiaqian Li, Yanshu Li, Boxuan Zhang, Ruixiang Tang, Kuan-Hao Huang · 2026-05-28 04:00

TRACES: Proactive Safety Auditing for Multi-Turn LLM Agents via Trajectory-State Modeling

arXiv:2605.27690v1 Announce Type: new Abstract: LLM agents increasingly operate through multi-turn tool use and environment interaction, where safety risks often emerge from intermediate steps long before they surface in the final outcome. Reactive auditing is therefore insuffici…

Hugging Face Daily Papers TIER_1 English(EN) · 2026-05-28 00:00

AgentDoG 1.5: A Lightweight and Scalable Alignment Framework for AI Agent Safety and Security

A lightweight and scalable agent safety alignment framework is proposed to address emerging threats from advanced AI models, featuring taxonomy-guided training with minimal samples and efficient deployment in real-world scenarios.

arXiv cs.AI TIER_1 English(EN) · Martin Odersky · 2026-05-27 15:27

LACUNA: Safe Agents as Recursive Program Holes

LLM agents increasingly act by writing code, yet a split persists between the runtime that drives the agent and the code the model writes. The runtime owns the loop, context, and control flow, and the model has little say over any of them. Letting model-written code shape the run…

arXiv cs.AI TIER_1 English(EN) · Changyue Jiang, Wenqi Zhang, Xudong Pan, Geng Hong, Min Yang · 2026-05-27 04:00

Think Twice Before You Act: Enhancing Agent Behavioral Safety with Thought Correction

arXiv:2505.11063v3 Announce Type: replace Abstract: LLM-based agents solve complex tasks through iterative reasoning, tool use, and environment interaction, where each intermediate thought directly shapes subsequent actions. Small deviations in these thoughts can therefore propag…

arXiv cs.AI TIER_1 English(EN) · Yige Li, Yunhao Feng, Jun Sun · 2026-05-27 04:00

Position: AI Safety Requires Effective Controllability

arXiv:2605.27117v1 Announce Type: new Abstract: AI safety is still largely framed as alignment: training models to follow human preferences, safety policies, and normative constraints. That framing has improved the behavior of modern language models, but aligned behavior does not…

arXiv cs.AI TIER_1 English(EN) · Yinghan Hou, Zongyou Yang, Zaihu Pang, Xiujun Ma · 2026-05-27 04:00

SkillSieve: A Hierarchical Triage Framework for Detecting Malicious AI Agent Skills

arXiv:2604.06550v2 Announce Type: replace-cross Abstract: OpenClaw's ClawHub marketplace hosts tens of thousands of community-contributed agent skills (49,592 in our 2026-04-04 snapshot), and recent audits report that 13-26% contain security vulnerabilities. Regex scanners miss o…

Hugging Face Daily Papers TIER_1 English(EN) · 2026-05-27 00:00

LACUNA: Safe Agents as Recursive Program Holes

LACUNA is a programming model that enables LLM agents to write code that shapes the runtime while maintaining safety through type checking and controlled execution.

arXiv cs.AI TIER_1 English(EN) · Jun Sun · 2026-05-26 14:53

Position: AI Safety Requires Effective Controllability

AI safety is still largely framed as alignment: training models to follow human preferences, safety policies, and normative constraints. That framing has improved the behavior of modern language models, but aligned behavior does not by itself guarantee that a deployed agent can b…

arXiv cs.AI TIER_1 English(EN) · Jingwei Sun, Jianing Zhu, Yuanyi Li, Tongliang Liu, Xia HU, Bo Han · 2026-05-26 04:00

AgentHijack: Benchmarking Computer Use Agent Robustness to Common Environment Corruptions

arXiv:2605.25707v1 Announce Type: new Abstract: Autonomous computer use agents that are powered by multimodal large language models (MLLMs) are emerging as capable assistants for completing complex digital workflows. However, real-world execution environments are far from ideal: pop-…

arXiv cs.AI TIER_1 English(EN) · Jinhu Qi, Muzhi Li, Jiahong Liu, Yuqin Shu, Dianzhi Yu, Shicheng Ma, Wenqian Cui, Yiyang Zhao, Yiyi Chen, Ruoxi Jiang, Irwin King, Zenglin Xu · 2026-05-26 04:00

Towards trustworthy agentic AI: a comprehensive survey of safety, robustness, privacy, and system security

arXiv:2605.23989v1 Announce Type: new Abstract: Agentic AI systems -- Large Language Models (LLMs) augmented with planning, tool use, memory, and long-horizon interactions -- can execute complex tasks autonomously, but their multi-step trajectories introduce new failure modes tha…

arXiv cs.AI TIER_1 English(EN) · Bo Han · 2026-05-25 11:09

AgentHijack: Benchmarking Computer Use Agent Robustness to Common Environment Corruptions

Autonomous computer use agents that are powered by multimodal large language models (MLLMs) are emerging as capable assistants for completing complex digital workflows. However, real-world execution environments are far from ideal: pop-ups, resolution changes, and competing applicati…

arXiv cs.AI TIER_1 English(EN) · Pepijn Cobben, Xuanqiang Angelo Huang, Thao Amelia Pham, Isabel Dahlgren, Terry Jingchen Zhang, Zhijing Jin · 2026-05-25 04:00

GT-HarmBench: Benchmarking AI Safety Risks Through the Lens of Game Theory

arXiv:2602.12316v2 Announce Type: replace Abstract: Frontier AI systems are increasingly capable and deployed in high-stakes multi-agent environments. However, existing AI safety benchmarks largely evaluate single agents, leaving multi-agent risks such as coordination failure and…

arXiv cs.LG TIER_1 English(EN) · Jonathan Nöther, Adish Singla, Goran Radanovic · 2026-05-25 04:00

MaMa: A Game-Theoretic Approach for Designing Safe Agentic Systems

arXiv:2602.04431v2 Announce Type: replace Abstract: LLM-based multi-agent systems have demonstrated impressive capabilities, but they also introduce significant safety risks when individual agents fail or behave adversarially. In this work, we study the automated design of agenti…

Hugging Face Daily Papers TIER_1 English(EN) · 2026-05-25 00:00

AgentHijack: Benchmarking Computer Use Agent Robustness to Common Environment Corruptions

Computer-use agents powered by multimodal large language models face significant challenges in real-world environments due to dynamic disruptions, necessitating robustness evaluation and improved framework designs.

arXiv cs.CL TIER_1 English(EN) · Piercosma Bisconti, Matteo Prandi, Federico Pierucci, Federico Sartore, Enrico Panai, Laura Caroli, Yue Zhu, Adam Leon Smith, Luca Nannini, Marcello Galisai, Susanna Cifani, Francesco Giarrusso, Marcantonio Bracale Syrnikov, Daniele Nardi · 2026-05-22 04:00

Boiling the Frog: A Multi-Turn Benchmark for Agentic Safety

arXiv:2605.22643v1 Announce Type: new Abstract: Background. Traditional safety benchmarks for language models evaluate generated text: whether a model outputs toxic language, reproduces bias, or follows harmful instructions. When models are deployed as agents, the safety-relevant…

Hugging Face Daily Papers TIER_1 English(EN) · 2026-05-21 15:50

Boiling the Frog: A Multi-Turn Benchmark for Agentic Safety

Background. Traditional safety benchmarks for language models evaluate generated text: whether a model outputs toxic language, reproduces bias, or follows harmful instructions. When models are deployed as agents, the safety-relevant object shifts from what the system says to what…

arXiv cs.CL TIER_1 English(EN) · Daniele Nardi · 2026-05-21 15:50

Boiling the Frog: A Multi-Turn Benchmark for Agentic Safety

Background. Traditional safety benchmarks for language models evaluate generated text: whether a model outputs toxic language, reproduces bias, or follows harmful instructions. When models are deployed as agents, the safety-relevant object shifts from what the system says to what…

arXiv cs.AI TIER_1 English(EN) · Ahmad-Reza Sadeghi · 2026-05-21 14:47

Measuring Security Without Fooling Ourselves: Why Benchmarking Agents Is Hard

The benchmarks used to evaluate AI agents in security-critical roles suffer from crucial weaknesses. Building on recent empirical evidence, we characterize three core challenges that undermine security evaluations: benchmark vulnerabilities, temporal staleness, and runtime uncert…

METR (Model Evaluation & Threat Research) TIER_1 中文(ZH) · 2026-01-29 22:12

Frontier AI Safety Regulations: A Reference Guide for AI Company Employees

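View the English PDF version: https://metr.org/frontier-ai-regulations.pdf. Frontier AI developers such as OpenAI, Google, Anthropic, and xAI, as well as some Chinese AI developers, are already required to comply with a number of safety and security obligations. The main sources include California's SB 53, New York's RAISE Act, and the provisions of the EU Artificial Intelligence Act concerning frontier AI …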

METR (Model Evaluation & Threat Research) TIER_1 Español(ES) · 2026-01-29 22:12

Frontier AI Safety Regulation: A Reference for Lab Personnel

View as PDF: https://metr.org/frontier-ai-regulations.pdf. Frontier AI developers such as OpenAI, Google, Anthropic, xAI, and others have safety obligations…

LessWrong (AI tag) TIER_1 English(EN) · Koby Lewis · 2026-05-28 23:04

A Call for Better Type Hints in AI Safety Tooling

Good type hints lead to code that is more maintainable, easier to understand (https://link.springer.com/article/10.1007/s10664-013-9289-1), and with …

AWS Machine Learning Blog TIER_1 English(EN) · Bharathi Srinivasan · 2026-06-01 17:54

Secure AI agents with Policy and Lambda interceptors in Amazon Bedrock AgentCore gateway

In this post, we use a lakehouse data agent to demonstrate how you can use Policy for deterministic access control and Lambda interceptors for dynamic validation. We then show how to combine Lambda interceptors and Policy to implement a geography-based access control which requir…

Forbes — Innovation TIER_1 English(EN) · Anand Oswal, CommunityVoice · 2026-06-11 13:45

Securing The Agentic Enterprise

As organizations move beyond simple chatbots toward autonomous "compound systems" of agents, the traditional tech landscape has shifted.

Forbes — Innovation TIER_1 English(EN) · Suman Sharma, Forbes Councils Member · 2026-06-08 10:15

Why Consumer AI Agents Need Runtime Security, Not Just Governance

Without the right controls, consumer-facing AI agents can expose organizations to regulatory violations, privacy breaches, eroded trust and reputational damage.

Forbes — Innovation TIER_1 English(EN) · Lydia Zhang, Forbes Councils Member · 2026-06-04 11:30

Beyond The AI Hype: Why Continuous Security Validation Matters More Than Ever

Continuous testing matters because infrastructure changes constantly.

Forbes — Innovation TIER_1 English(EN) · Robert Bobel, Forbes Councils Member · 2026-06-03 12:00

The AI Agent Identity Is Redefining Governance And Expanding Your Attack Surface

When an AI-driven process performs an action, accountability can span multiple teams, leaving no single point of responsibility.

Forbes — Innovation TIER_1 English(EN) · Arti Raman, Forbes Councils Member · 2026-05-29 14:30

Eliminating The Dangerous Enterprise AI Blind Spot

Organizations are confronting the growing gap between AI hype and measurable business impact. This is exposing major blind spots in governance, usage visibility and operational oversight.

Forbes — Innovation TIER_1 English(EN) · Tom Kellermann, Forbes Councils Member · 2026-05-28 13:30

Cyber Vigilance In An Era Of AI

Threat detection and response must be accelerated across your entire digital estate.

MarkTechPost TIER_1 English(EN) · Sana Hassan · 2026-05-31 20:07

An Implementation of the Microsoft Agent Governance Toolkit for Safe AI Agent Tool Use with Policies, Approvals, Audit Logs, and Risk Controls

In this tutorial, we build a governed AI-agent workflow using Microsoft’s Agent Governance Toolkit as the reference point. We create a Colab-ready implementation where agents do not directly execute tools; instead, every action first passes through a governance layer that chec…

dev.to — Claude Code tag TIER_1 English(EN) · Marcus Rowe · 2026-05-28 22:19

SymJack: The Supply Chain Attack That Turns Your AI Coding Agent Against You

Your AI coding agent just became an attack vector. That's the short version of what Adversa AI published this week. The research team disclosed a technique called SymJack — a symlink hijacking attack that turns AI coding assistants into supply chain attack delivery syst…

dev.to — MCP tag TIER_1 English(EN) · Manveer Chawla · 2026-06-09 20:50

AI agent governance and runtime compliance framework for CISOs

AI agents are now in production across healthcare, financial services, and critical SaaS systems. They mutate data, trigger workflows, and call external APIs on behalf of real users. These are autonomous actors, not the read-only recommendation engines that security teams alre…

dev.to — MCP tag TIER_1 English(EN) · Baris Sozen · 2026-06-04 06:55

A judge or the math: two trust models for autonomous agent settlement

When an AI agent settles a trade with no human watching, something has to make that trade trustworthy. There are two serious ways to do it, and they are not the same. One puts a judge in the loop. The other replaces the judge with math. Most of the current debate about "trust …

Medium — MCP tag TIER_1 English(EN) · yunwei37 · 2026-06-02 11:11

ACRFence: Preventing Semantic Rollback Attacks in Agent Checkpoint-Restore

https://medium.com/@yunwei356/acrfence-preventing-semantic-rollback-attacks-in-agent-checkpoint-restore-b0d00f5e8b7b?source=rss------mcp-5

dev.to — MCP tag TIER_1 English(EN) · 云微 · 2026-06-02 11:11

ACRFence: Preventing Semantic Rollback Attacks in Agent Checkpoint-Restore

AI agent frameworks are bringing checkpoint/restore, time travel, and rewind into everyday developer workflows. If an agent makes a mistake, it can go back to a checkpoint. If a user wants to explore another path, the agent can branch from an earlier state. This is useful for …

dev.to — MCP tag TIER_1 English(EN) · Baris Sozen · 2026-06-02 06:09

Atomic across what? The asterisk hiding in agent settlement

"Atomic" is having a moment. It is showing up in funding announcements, in launch threads, in agent-commerce pitch decks. This week a team raised $25M for an atomic OTC desk built on HTLCs and Bitcoin Taproot, with no custodian holding the assets mid-trade. Th…

dev.to — Anthropic tag TIER_1 한국어(KO) · AI OpenFree · 2026-05-30 18:24

Claude AI Threat Prevention Design: Internalizing Values Instead of Filters

Preventing Claude from being used for blackmail and making sure Claude does not blackmail of its own accord are entirely different problems. How Anthropic designed Claude's self-censorship, and why this is not simply a story about filters …

Medium — AI coding tag TIER_1 English(EN) · Anna Jey · 2026-05-29 07:58

AI Coding Agent Configuration Audit: How to Find Risky Repos Before Agents Run

https://medium.com/toward-next-ai/ai-coding-agent-configuration-audit-how-to-find-risky-repos-before-agents-run-98c0b34ed7b9?source=rss------ai_coding-5

dev.to — LLM tag TIER_1 English(EN) · WonderLab · 2026-06-05 10:04

Agent Series (13): Agent Security and Defense — Prompt Injection, Tool Abuse, and Data Leakage

An Agent's Attack Surface Is Bigger Than You Think. A plain LLM application has one attack surface: user input → LLM output. Add tools to the mix, and it triples: User …

Mastodon — fosstodon.org TIER_1 English(EN) · [email protected] · 2026-06-05 02:16

"No Attack Required: Semantic Fuzzing for Specification Violations in Agent Skills" LLM-powered agents can silently delete documents, leak credentials, or trans

"No Attack Required: Semantic Fuzzing for Specification Violations in Agent Skills" LLM-powered agents can silently delete documents, leak credentials, or transfer funds on a routine user request, not because the agent was attacked, but because the skill it invoked broke its own …

LINKS arxiv.org/…/2605.13044

dev.to — LLM tag TIER_1 English(EN) · Vaishnavi Gudur · 2026-06-02 17:25

Your AI Agent Has a Memory Problem: How Attackers Can Permanently Hijack It

Last week, I ran a simple experiment: I poisoned my own AI agent's memory with 3 lines of code. The result? The agent started leaking user data to an attacker-controlled endpoint, and it had no idea. The Attack. Here's what memory poisoning looks like in prac…

dev.to — LLM tag TIER_1 English(EN) · Falcons Edge · 2026-06-01 17:47

AI Agent Security: Securing Autonomous Agents in Production

Autonomous AI agents are moving from research labs into production environments at speed. Unlike chatbots that respond to single prompts, agents plan, reason, execute multi-step tasks, call external tools, and delegate sub-tasks to child agents. With each of these capabilities…

dev.to — LLM tag TIER_1 English(EN) · Loïc Fontaine · 2026-06-01 13:20

Catch prompt injection (and leaked secrets) in your AI agent's outgoing messages

AI agents now send email, post messages, and call tools on their own. We spend a lot of energy guarding the input — the user's prompt. We spend almost none on the output: what the agent is actually about to send. That's the …

r/MachineLearning TIER_1 English(EN) · /u/TheAchraf99 · 2026-06-01 08:15

[P] Free AI Agent Security Assessment [P]

Hey everyone, We’re building Antitech, a security layer for AI agents and LLM-powered workflows. We’re opening a small number of free early-access assessments for teams/builders working on AI agents. If you g…

Mastodon — fosstodon.org TIER_1 English(EN) · [email protected] · 2026-05-30 21:16

🚨 Prompt Injection RCE in AI Agent Frameworks Critical vulnerabilities allowing remote code execution through prompt injection in AI agent systems. Key findings

🚨 Prompt Injection RCE in AI Agent Frameworks Critical vulnerabilities allowing remote code execution through prompt injection in AI agent systems. Key findings: • Which frameworks are affected • Attack chain to RCE • Real-world scenarios • Defensive recommendations Analysis → ht…

LINKS cyber.murati.net

Mastodon — fosstodon.org TIER_1 English(EN) · [email protected] · 2026-05-30 19:19

🚨 Prompt Injection Flaws Enable RCE in Popular AI Agent Frameworks Critical vulnerability allowing remote code execution through prompt injection in AI agent sy

🚨 Prompt Injection Flaws Enable RCE in Popular AI Agent Frameworks Critical vulnerability allowing remote code execution through prompt injection in AI agent systems. Full technical analysis → https://cyber.murati.net #cybersecurity #infosec #AI #promptinjection #RCE #CVE

LINKS cyber.murati.net

Mastodon — fosstodon.org TIER_1 English(EN) · [email protected] · 2026-05-29 07:05

The Calculator Discipline — A Taxonomy and Pre-Send Filter for AI-Assisted Vulnerability Disclosure Hallucinations - Paper and Tool by Stuart Thomas, independen

The Calculator Discipline: A Taxonomy and Pre-Send Filter for AI-Assisted Vulnerability Disclosure Hallucinations. Paper and Tool by Stuart Thomas, independent Security Researcher #Infosec #LLM #AI https://stuart-thomas.com/research/calculator-discipline/

LINKS stuart-thomas.com/…/calculator-discipline

dev.to — LLM tag TIER_1 English(EN) · Gian Paolo · 2026-05-27 07:08

Critical Flaw: Millions of AI Agents at Risk

The Silent Threat: When Your AI Turns Against You. Your AI agent is sorting through a thousand new customer support emails, summarizing key issues and drafting responses. It has access to your company's private knowledge base, customer data, and internal APIs. It’s a…

dev.to — LLM tag TIER_1 English(EN) · pueding · 2026-05-25 11:30

Boiling the Frog Paper: Multi-Turn Norm Erosion vs Single-Prompt Agent Safety

What: The Boiling the Frog benchmark is a stateful multi-turn safety eval for tool-using AI agents — it walks a scenario from benign edits to risk-bearing actions and scores whether the agent accepts the escalated final turn. Wh…

dev.to — LLM tag TIER_1 English(EN) · Vaishnavi Gudur · 2026-05-19 23:40

AgentThreatBench: The First OWASP Agentic Top 10 Security Benchmark

The AI safety community has a blind spot. We have excellent benchmarks for measuring whether an LLM will output harmful content (like toxicity or jailbreaks), and we have benchmarks for measuring whether an agent can successfully complete a task (like SWE-bench or WebArena).…

Mastodon — mastodon.social TIER_1 Français(FR) · [email protected] · 2026-05-30 22:00

"Claw Patrol": an open-source firewall designed specifically for AI agents. The basic idea is solid — autonomous agents have a diff attack surface

"Claw Patrol": an open-source firewall designed specifically for AI agents. The basic idea is solid: autonomous agents have a different attack surface from classic apps, namely tool calls, prompt chains, and external access. Having a dedicated control layer, …

LINKS deno.com/…/clawpatrol

r/Anthropic TIER_1 English(EN) · /u/theonejvo · 2026-05-30 16:24

PolyRange: Contamination-resistant offensive-AI benchmark for web targets

Author here. The short version of why I built this: Cyber-AI evaluation is converging on the same diagnosis from multiple labs. Anthropic's Claude Mythos system card this year: their cyber ranges "lack many features often present in r…

r/cursor TIER_2 English(EN) · /u/Few-Ad-1358 · 2026-05-31 10:03

Where should trust checks happen for AI coding agents?

I’ve been using and studying AI coding agents more, and the part I keep getting stuck on is not whether they can write code. They obviously can. The harder question is where trust is supposed to enter the workflow. If an agent touches files outsi…

COVERAGE [104]
