AI Agents Advance with New RAG, Simulation, and Compliance Tools

Google AI / Research TIER_1 English(EN) · 2026-06-05 11:26

Reliable Responses with Agentic RAG on the Gemini Enterprise Agent Platform

Data Management

Hugging Face Blog TIER_1 English(EN) · 2026-06-17 00:00

Agentic Resource Discovery: Let agents search

Hugging Face Blog TIER_1 English(EN) · 2026-06-05 22:18

千词木: Running a Multi-Agent Economy on a 3B Model

Qwen tech blog TIER_1 Deutsch(DE) · QwenTeam · 2026-06-01 02:00

Today we introduce Qwen3.7-Plus, a multimodal agent model that unifies vision and language into a single, versatile agent foundation. Building on …

arXiv cs.AI TIER_1 English(EN) · Emmanuel Aboah Boateng, Kyle MacDonald, Amardeep Kumar, Siddharth Kodwani, Sudeep Das · 2026-06-18 04:00

Decoupling Search from Reasoning: A Provider-Independent Grounding Architecture for LLM Agents

arXiv:2606.18947v1 Announce Type: new Abstract: Production LLM agents increasingly depend on real-time search, yet native search grounding bundles retrieval policy, provider choice, evidence injection, cost, latency, and generation behavior behind a single model-provider boundary…

arXiv cs.LG TIER_1 English(EN) · Caleb Chang, Davin Win Kyi, Natasha Jaques, Karen Leung · 2026-06-18 04:00

When in Rome: Learning Common Behaviors from Heterogeneous Populations

arXiv:2606.18537v1 Announce Type: new Abstract: Humans often acquire new skills by observing others, since observed behaviors implicitly reveal how to act in an environment. However, observations drawn from a heterogeneous population introduce conflicting behavioral signals, maki…

arXiv cs.CL TIER_1 English(EN) · Shuang Xie, Yunan Lu, Han Li, Lingyun Wang · 2026-06-18 04:00

EARS: Explanatory Abstention for Reliable Sub-Agent Modeling in Large-Scale Multi-Agent Systems

arXiv:2606.18668v1 Announce Type: cross Abstract: In large-scale enterprise settings, centralized multi-agent systems (MAS) are increasingly adopted, in which a coordinator delegates user requests to lightweight, domain-specialized sub-agents. While this architecture improves mod…

arXiv cs.CL TIER_1 English(EN) · Leyang Shen, Yang Zhang, Xiaoyan Zhao, Chun Kai Ling, Tat-Seng Chua · 2026-06-18 04:00

Enhancing Decision-Making with Large Language Models via Multi-Agent Fictitious Play

arXiv:2606.19308v1 Announce Type: new Abstract: Large language model (LLM)-based multi-agent systems (MAS) have demonstrated great potential in solving tasks with execution complexity, by distributing subtasks across cooperative agents. However, this divide-and-conquer paradigm f…

arXiv cs.CL TIER_1 English(EN) · Paresh Dashore, Shreyas Kulkarni, Uttam Gurram, Nadia Bathaee, Kartik Balasubramaniam, Genta Indra Winata, Sambit Sahu, Shi-Xiong Zhang · 2026-06-18 04:00

Scalable Customization and Deployment of Embodied Multi-Agent Systems for Enterprise Applications

arXiv:2606.18502v1 Announce Type: new Abstract: Large language model (LLM)-based multi-agent systems demonstrate strong performance on complex reasoning and task execution, enabling broad enterprise applications. However, production deployment remains challenging due to domain-sp…

arXiv cs.CL TIER_1 (CA) · Ziyan Jiang, Li An, Yujian Liu, Jiabao Ji, Qiucheng Wu, Jacob Andreas, Yang Zhang, Shiyu Chang · 2026-06-18 04:00

VISUALSKILL: Multimodal Skills for Computer-Use Agents

arXiv:2606.18448v1 Announce Type: new Abstract: Computer-use agents (CUAs) approach human-level performance on standardised benchmarks but still struggle on long-horizon tasks and unseen software. Existing skill libraries address this with reusable skills, but represent the skill…

arXiv cs.AI TIER_1 English(EN) · Xiaowen Ma, Yunpu Ma, Chenyang Lin, Sikuan Yan, Jinhe Bi, Zixuan Cao, Yijun Tian, Volker Tresp, Hinrich Schuetze · 2026-06-18 04:00

Self-Evolving Multi-Agent Systems via Textual Backpropagation

arXiv:2506.09046v3 Announce Type: replace-cross Abstract: Leveraging multiple Large Language Models (LLMs) has proven effective for addressing complex, high-dimensional tasks, but current approaches often rely on static, manually engineered multi-agent configurations. To overcome…

arXiv cs.AI TIER_1 English(EN) · Fanqi Kong, Jiayi Zhang, Mingyi Deng, Chenglin Wu, Yuyu Luo, Bang Liu · 2026-06-18 04:00

InfoPO: Information-Driven Policy Optimization for User-Centric Agents

arXiv:2603.00656v2 Announce Type: replace Abstract: Real-world user requests to LLM agents are often underspecified. Agents must interact to acquire missing information and make correct downstream decisions. However, current multi-turn GRPO-based methods often rely on trajectory-…

arXiv cs.AI TIER_1 English(EN) · Myung Ho Kim · 2026-06-18 04:00

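Structured Cognitive Loop for Behavioral Intelligence in Large Language Model Agents (Extended Revision: From Behavioral Architecture to Cognitive Accountability)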

arXiv:2510.05107v5 Announce Type: replace Abstract: The central challenge for AI agents is not only performance but accountability. Agents that act through opaque prompt sequences may produce correct outputs, but they provide little basis for verifying why an action was permitted…

arXiv cs.AI TIER_1 English(EN) · Linus Sander, Habtom Kahsay Gidey, Alexander Lenz, Alois Knoll · 2026-06-18 04:00

A Technical Taxonomy of Communication Protocols for Large Language Model Agents

arXiv:2606.19135v1 Announce Type: cross Abstract: As large language models (LLMs) advance and multi-agent systems aim to overcome the limits of standalone agents, robust communication protocols are becoming essential infrastructure for distributed agent networks. Nonetheless, the…

arXiv cs.AI TIER_1 English(EN) · Marco Becattini, Niccolò Caselli, Matteo Minin, Roberto Verdecchia, Enrico Vicario · 2026-06-18 04:00

CAPRA: Scaling Feedback on Software Architecture Deliverables with Multi-Agent LLM Systems

arXiv:2606.18976v1 Announce Type: cross Abstract: Automated assessment in software engineering education has advanced significantly for code grading and essay scoring. However, reviewing software architecture deliverables, which requires analyzing structural completeness and requ…

arXiv cs.AI TIER_1 Svenska(SV) · Hehai Lin, Qi Yang, Chengwei Qin · 2026-06-18 04:00

Skill-MAS: Evolving Meta-Skills for Automatic Multi-Agent Systems

arXiv:2606.18837v1 Announce Type: cross Abstract: Large Language Model (LLM)-based automatic Multi-Agent Systems (MAS) generation has become a crucial frontier for tackling complex tasks. However, existing methods face a dilemma between model capability and experience retention. …

arXiv cs.AI TIER_1 English(EN) · Haoyang Fang, Wei Zhu, Boran Han, Alex Zhang, Zhenyu Pan, Shuo Yang, Shuai Zhang, Jiading Gai, Peng Tang, Cuixiong Hu, Xuan Zhu, Huzefa Rangwala, George Karypis, Bernie Wang · 2026-06-18 04:00

LLMZero: Discovering Adaptive Training Strategies for RL Post-Training with LLM Agents

arXiv:2606.18388v1 Announce Type: cross Abstract: RL post-training strategies are dataset-dependent and reveal a recurring empirical pattern: capacity parameters accumulate monotonically across stages, while regularization parameters predominantly oscillate in response to shiftin…

arXiv cs.AI TIER_1 English(EN) · Eranga Bandara, Ross Gore, Ravi Mukkamala, Asanga Gunaratna, Safdar H. Bouk, Xueping Liang, Peter Foytik, Abdul Rahman, Sachini Rajapakse, Isurunima Kularathna, Pramoda Karunarathna, Chalani Rajapakse, Ng Wee Keong, Kasun De Zoysa, Tharaka Hewa, Amin Has… · 2026-06-18 04:00

Toward an Agent-First Web: Redesigning the Web for AI Agents

arXiv:2606.19116v1 Announce Type: new Abstract: The World Wide Web was built on an assumption held for three decades: the primary consumer of web content is a human being. This permeates every layer; its access model presumes human visitors, its economics rest on human attention,…

arXiv cs.AI TIER_1 English(EN) · Yuchuan Tian, Mengyu Zheng, Haocheng Mei, Ye Yuan, Chao Xu, Xinghao Chen, Hanting Chen, Yu Wang · 2026-06-18 04:00

SafeClawBench: Separating Semantics, Audit Evidence, and Sandbox Harm for Tool-Using LLM Agents

arXiv:2606.18356v1 Announce Type: cross Abstract: Tool-using language-model agents introduce security failures that go beyond unsafe text: they can disclose protected objects, write persistent memory, send messages, modify databases, or trigger harmful code and tool effects. Exis…

arXiv cs.AI TIER_1 English(EN) · Ruishan Fang, Siyuan Lu, Chenyi Zhuang, Tao Lin · 2026-06-18 04:00

RODS: Reward-Driven Online Data Synthesis for Multi-Turn Tool-Use Agents

arXiv:2606.19047v1 Announce Type: new Abstract: Multi-turn tool-use RL is bottlenecked by the rapid depletion of informative samples in static datasets. We observe that the gradient signal in GRPO concentrates on tasks with the highest rollout reward variance, a consequence of th…

arXiv cs.AI TIER_1 English(EN) · Yehang Zhang, Jianchong Su, Haojian Huang, Yifan Chang, Tianhao Zhou, Xinli Xu, Yingjie Xu, Yinchuan Li, Zexi Li, Ying-Cong Chen · 2026-06-18 04:00

WorldLines: Benchmarking and Modeling Long-Horizon Stateful Embodied Agents

arXiv:2606.18847v1 Announce Type: new Abstract: To assist humans over extended periods in real homes, embodied agents must remember user routines, world states, and past interactions. Existing long-term memory benchmarks mainly evaluate language-centric retrieval and question ans…

arXiv cs.AI TIER_1 English(EN) · Zhimin Fan, Hongwei Yu, Yeqing Shen, Haolong Yan, Guozhen Peng, Tianhao Peng, Yudong Zhang, Xiaowen Zhang, Kaijun Tan, Zheng Ge, Xiangyu Zhang, Daxin Jiang · 2026-06-18 04:00

Skill-Guided Continuation Distillation for GUI Agents

arXiv:2606.18890v1 Announce Type: new Abstract: Improving GUI agents typically relies on behavior cloning on expert trajectories. However, as the current policy deviates from the expert policy, it inevitably encounters policy-induced off-trajectory states during closed-loop execu…

arXiv cs.MA (Multiagent) TIER_1 English(EN) · Tat-Seng Chua · 2026-06-17 17:31

Enhancing Decision-Making with Large Language Models via Multi-Agent Fictitious Play

Large language model (LLM)-based multi-agent systems (MAS) have demonstrated great potential in solving tasks with execution complexity, by distributing subtasks across cooperative agents. However, this divide-and-conquer paradigm falls short on decision-making tasks that are als…

Hugging Face Daily Papers TIER_1 English(EN) · 2026-06-17 17:31

Enhancing Decision-Making with Large Language Models via Multi-Agent Fictitious Play

Large language model (LLM)-based multi-agent systems (MAS) have demonstrated great potential in solving tasks with execution complexity, by distributing subtasks across cooperative agents. However, this divide-and-conquer paradigm falls short on decision-making tasks that are als…

arXiv cs.MA (Multiagent) TIER_1 English(EN) · Alois Knoll · 2026-06-17 14:45

A Technical Taxonomy of Communication Protocols for Large Language Model Agents

As large language models (LLMs) advance and multi-agent systems aim to overcome the limits of standalone agents, robust communication protocols are becoming essential infrastructure for distributed agent networks. Nonetheless, the fragmented protocol landscape presents a signific…

Hugging Face Daily Papers TIER_1 English(EN) · 2026-06-17 14:31

Toward an Agent-First Web: Redesigning the Web for AI Agents

The World Wide Web was built on an assumption held for three decades: the primary consumer of web content is a human being. This permeates every layer; its access model presumes human visitors, its economics rest on human attention, and its content targets human perception. The r…

arXiv cs.AI TIER_1 English(EN) · Sachin Shetty · 2026-06-17 14:31

Toward an Agent-First Web: Redesigning the Web for AI Agents

The World Wide Web was built on an assumption held for three decades: the primary consumer of web content is a human being. This permeates every layer; its access model presumes human visitors, its economics rest on human attention, and its content targets human perception. The r…

arXiv cs.AI TIER_1 English(EN) · Tao Lin · 2026-06-17 13:13

RODS: Reward-Driven Online Data Synthesis for Multi-Turn Tool-Use Agents

Multi-turn tool-use RL is bottlenecked by the rapid depletion of informative samples in static datasets. We observe that the gradient signal in GRPO concentrates on tasks with the highest rollout reward variance, a consequence of the Popoviciu upper bound. Consequently, samples n…

arXiv cs.AI TIER_1 English(EN) · Enrico Vicario · 2026-06-17 12:00

CAPRA: Scaling Feedback on Software Architecture Deliverables with Multi-Agent LLM Systems

Automated assessment in software engineering education has advanced significantly for code grading and essay scoring. However, reviewing software architecture deliverables, which requires analyzing structural completeness and requirements traceability, has not yet been fully auto…

arXiv cs.IR (Information Retrieval) TIER_1 English(EN) · Sudeep Das · 2026-06-17 11:30

Decoupling Search from Reasoning: A Provider-Independent Grounding Architecture for LLM Agents

Production LLM agents increasingly depend on real-time search, yet native search grounding bundles retrieval policy, provider choice, evidence injection, cost, latency, and generation behavior behind a single model-provider boundary. This coupling makes grounding hard to inspect,…

arXiv cs.AI TIER_1 English(EN) · Daxin Jiang · 2026-06-17 10:07

Skill-Guided Continuation Distillation for GUI Agents

Improving GUI agents typically relies on behavior cloning on expert trajectories. However, as the current policy deviates from the expert policy, it inevitably encounters policy-induced off-trajectory states during closed-loop execution, i.e., states that fall outside the expert …

arXiv cs.AI TIER_1 English(EN) · Ying-Cong Chen · 2026-06-17 09:26

WorldLines: Benchmarking and Modeling Long-Horizon Stateful Embodied Agents

To assist humans over extended periods in real homes, embodied agents must remember user routines, world states, and past interactions. Existing long-term memory benchmarks mainly evaluate language-centric retrieval and question answering, while embodied benchmarks often focus on…

arXiv cs.MA (Multiagent) TIER_1 Svenska(SV) · Chengwei Qin · 2026-06-17 09:12

Skill-MAS: Evolving Meta-Skills for Automatic Multi-Agent Systems

Large Language Model (LLM)-based automatic Multi-Agent Systems (MAS) generation has become a crucial frontier for tackling complex tasks. However, existing methods face a dilemma between model capability and experience retention. Inference-time MAS leverages frozen frontier LLMs …

Hugging Face Daily Papers TIER_1 English(EN) · 2026-06-17 04:07

EARS: Explanatory Abstention for Reliable Sub-Agent Modeling in Large-Scale Multi-Agent Systems

In large-scale enterprise settings, centralized multi-agent systems (MAS) are increasingly adopted, in which a coordinator delegates user requests to lightweight, domain-specialized sub-agents. While this architecture improves modularity, scalability, and cost efficiency, its rel…

arXiv cs.MA (Multiagent) TIER_1 English(EN) · Lingyun Wang · 2026-06-17 04:07

EARS: Explanatory Abstention for Reliable Sub-Agent Modeling in Large-Scale Multi-Agent Systems

In large-scale enterprise settings, centralized multi-agent systems (MAS) are increasingly adopted, in which a coordinator delegates user requests to lightweight, domain-specialized sub-agents. While this architecture improves modularity, scalability, and cost efficiency, its rel…

arXiv cs.CL TIER_1 English(EN) · Xueping Gao · 2026-06-17 04:00

Compositional Skill Routing in LLM Agents: Decompose, Retrieve, and Compose

arXiv:2606.18051v1 Announce Type: new Abstract: LLM agents increasingly rely on external skills -- reusable tool specifications -- but real-world tasks often require composing multiple skills, not just selecting one. We formalize this as the Compositional Skill Routing problem: g…

arXiv cs.CL TIER_1 English(EN) · Tongxu Luo, Rongsheng Wang, Jiaxi Bi, Chenming Xu, Zhengyang Tang, Jianlong Chen, Juhao Liang, Ke Ji, Shuqi Guo, Yuhao Du, Fan Bu, Wenyu Du, Xiaotong Zhang, Kyle Li, Shaobo Wang, Linfeng Zhang, Yuxuan Liu, Xin Lai, Chenxin Li, Yiduo Guo, Zhexin Zhang, Xi… · 2026-06-17 04:00

GameCraft-Bench: Can Agents Build Playable Games End-to-End in Real Game Engines?

arXiv:2606.17861v1 Announce Type: new Abstract: Game generation is an emerging application of coding agents, requiring models to transform natural-language specifications into playable interactive systems. Unlike traditional coding tasks, game generation takes place within a game…

arXiv cs.CL TIER_1 English(EN) · Rean Clive Fernandes, Lukas Fehring, Theresa Eimer, Marius Lindauer, Matthias Feurer · 2026-06-17 04:00

Environment-Guided Automated Prompt Optimization for LLM Game Agents

arXiv:2606.17838v1 Announce Type: new Abstract: LLM agents in interactive environments are highly sensitive to their prompts, yet prompt engineering remains a manual, task-specific process. We introduce an automated prompt optimization framework for LLM agents that decomposes the…

arXiv cs.CL TIER_1 English(EN) · Chao Chen, Chengzu Li, Zhiwei Li, Yinhong Liu, Zhijiang Guo · 2026-06-17 04:00

From Trainee to Trainer: LLM Training Environments Designed for Reinforcement Learning with Multi-Agent Reasoning

arXiv:2606.17682v1 Announce Type: new Abstract: Reinforcement learning pipelines for Large Language Model (LLM) training often rely on manually redesigned environments between stages, requiring practitioners to heuristically infer which configuration will best improve the current…

arXiv cs.CL TIER_1 English(EN) · Guibin Zhang, Xun Xu, Yanwei Yue, Zikun Su, Wangchunshu Zhou, Xiaobin Hu, Shuicheng Yan · 2026-06-17 04:00

OPD-Evolver: Cultivating a Holistic Agent Evolver via On-Policy Distillation

arXiv:2606.17628v1 Announce Type: new Abstract: Memory has become a standard substrate for self-evolving agents, yet retaining experience is not the same as learning how to evolve through it. Existing memory agents can store trajectories, retrieve reflections, or accumulate skill…

arXiv cs.AI TIER_1 English(EN) · Muhammad Bilal, Jon Crowcroft, Ruizhi Wang, Xiaolong Xu, Schahram Dustdar · 2026-06-17 04:00

Large Language Models for Intelligent Network Operations and AIOps: Architecture, Evaluation, and Security

arXiv:2605.12729v2 Announce Type: replace-cross Abstract: Large language models are increasingly being used to support network operations (NetOps) and artificial intelligence for IT operations (AIOps), including incident investigation, root-cause analysis, configuration synthesis…

arXiv cs.AI TIER_1 English(EN) · Xiaojun Jia, Jie Liao, Simeng Qin, Jindong Gu, Wenqi Ren, Xiaochun Cao, Yang Liu, Philip Torr · 2026-06-17 04:00

SkillJect: Effectively Automating Skill-Based Prompt Injection for Skill-Enabled Agents

arXiv:2602.14211v3 Announce Type: replace-cross Abstract: Agent skills extend LLM agents with task-specific instructions, executable scripts, and auxiliary resources, improving reusability but creating a new supply-chain attack surface. A malicious or compromised skill can be rep…

arXiv cs.AI TIER_1 English(EN) · Ankita Samaddar, Sandeep Neema, Daniel Balasubramanian, Xenofon Koutsoukos · 2026-06-17 04:00

Learning Red Agent Strategies from Observations for Neurosymbolic Autonomous Cyber Agents

arXiv:2606.18223v1 Announce Type: cross Abstract: With sophisticated cyber-attacks becoming increasingly prevalent, modern networks require intelligent autonomous cyber-defense agents trained via Reinforcement Learning (RL). These agents employ neurosymbolic approaches such as be…

arXiv cs.AI TIER_1 English(EN) · Maksim Shaposhnikov, Nicolas Fortuin, Simon Stipcich, Maria I. Gorinova, Amy Heineike, Rob Willoughby · 2026-06-17 04:00

A Framework for Evaluating Agentic Skills at Scale

arXiv:2606.17819v1 Announce Type: cross Abstract: Agent skills -- structured, reusable knowledge artifacts that augment LLM agent capabilities -- have been rapidly adopted in industry, yet their cross-domain impact and use across commercial and open-source models remain under-stu…

arXiv cs.AI TIER_1 English(EN) · Ander Alvarez, Santhiya Rajan, Samuel Mugel, Román Orús · 2026-06-17 04:00

ProvenanceGuard: Source-Aware Factuality Verification for MCP-Based LLM Agents

arXiv:2606.18037v1 Announce Type: new Abstract: Tool-using LLM agents increasingly use the Model Context Protocol (MCP) to answer from heterogeneous evidence sources, including search, APIs, databases, clinical records, and formulary tools. Standard factuality metrics usually tes…

arXiv cs.AI TIER_1 English(EN) · Bojie Li · 2026-06-17 04:00

PreAct: Faster Computer-Use Agents on Repeated Tasks

arXiv:2606.17929v1 Announce Type: new Abstract: Computer-using agents drive real software through the screen -- clicking and typing -- but they solve every task from scratch: asked to repeat a task, an agent re-reads the screen, re-reasons every tap, and pays the full cost again.…

arXiv cs.AI TIER_1 English(EN) · Congjie Zheng, Chuanyi Xue, Bin Liang, Jun Yang, Changshui Zhang · 2026-06-17 04:00

SEAGym: An Evaluation Environment for Self-Evolving LLM Agents

arXiv:2606.17546v1 Announce Type: new Abstract: Self-evolving LLM-based agents improve mainly by changing their agent harness: the structured execution layer around a base model, including prompts, memory, tools, middleware, runtime state, and the model-tool interaction loop. Exi…

arXiv cs.AI TIER_1 English(EN) · Gaurav Gupta, Vatshank Chaturvedi, Jun Huan, Anoop Deoras · 2026-06-17 04:00

Profiling Model Behavior Through Agent Trajectories

arXiv:2606.17454v1 Announce Type: new Abstract: AI agent performance is not just a modeling problem, it is fundamentally a systems problem. The advanced capabilities of models are realized through agent harnesses. Therefore, a gap between model assumptions and harness behavior ca…

arXiv cs.AI TIER_1 English(EN) · Shengli Zhang, Deen Ma, Zibin Lin, Taotao Wang · 2026-06-17 04:00

Distributed General-Purpose Agent Networks: Architecture, Key Mechanisms, and Prototype

arXiv:2606.17368v1 Announce Type: new Abstract: Large language models have accelerated the transition from passive conversational assistants to autonomous agents that can understand goals, plan actions, invoke tools, and execute multi-step tasks. Yet the capability of a single ag…

arXiv cs.AI TIER_1 English(EN) · Sidhaarth Murali, João Coelho, Jingjie Ning, João Magalhães, Bruno Martins, Chenyan Xiong · 2026-06-17 04:00

Beyond Parallel Sampling: Diverse Query Initialization for Agentic Search

arXiv:2606.17209v1 Announce Type: new Abstract: Test-time scaling for agentic search typically increases depth (i.e., more turns and tokens per trajectory) or breadth (i.e., more parallel rollouts). Here we focus on breadth scaling, showing that standard parallel sampling yields …

arXiv cs.MA (Multiagent) TIER_1 English(EN) · Xiuxiu Qi · 2026-06-17 03:03

PersonalPlan: Planning Multi-Agent Systems for Personalized Programming Learning

Effective programming education requires personalized instruction adapted to diverse learner backgrounds. However, while LLM-based multi-agent systems (MAS) excel at complex planning, existing planners often lack profile-grounding and pedagogical scaffolding, thereby undermining …

Hugging Face Daily Papers TIER_1 English(EN) · 2026-06-17 00:00

RODS: Reward-Driven Online Data Synthesis for Multi-Turn Tool-Use Agents

RODS addresses sample depletion in multi-turn tool-use reinforcement learning by dynamically synthesizing new data based on reward variance to maintain informative training samples.

arXiv cs.CL TIER_1 English(EN) · Shi-Xiong Zhang · 2026-06-16 21:30

Scalable Customization and Deployment of Embodied Multi-Agent Systems for Enterprise Applications

Large language model (LLM)-based multi-agent systems demonstrate strong performance on complex reasoning and task execution, enabling broad enterprise applications. However, production deployment remains challenging due to domain-specific customization requirements and high laten…

arXiv cs.CL TIER_1 (CA) · Shiyu Chang · 2026-06-16 19:57

VISUALSKILL: Multimodal Skills for Computer-Use Agents

Computer-use agents (CUAs) approach human-level performance on standardised benchmarks but still struggle on long-horizon tasks and unseen software. Existing skill libraries address this with reusable skills, but represent the skill artifact as text only, despite the visual natur…

arXiv cs.MA (Multiagent) TIER_1 English(EN) · Bernie Wang · 2026-06-16 18:33

LLMZero: Discovering Adaptive Training Strategies for RL Post-Training with LLM Agents

RL post-training strategies are dataset-dependent and reveal a recurring empirical pattern: capacity parameters accumulate monotonically across stages, while regularization parameters predominantly oscillate in response to shifting training dynamics. This distinction matters beca…

arXiv cs.AI TIER_1 English(EN) · Xenofon Koutsoukos · 2026-06-16 17:50

Learning Red Agent Strategies from Observations for Neurosymbolic Autonomous Cyber Agents

With sophisticated cyber-attacks becoming increasingly prevalent, modern networks require intelligent autonomous cyber-defense agents trained via Reinforcement Learning (RL). These agents employ neurosymbolic approaches such as behavior trees with learning-enabled components (LEC…

arXiv cs.CL TIER_1 English(EN) · Xueping Gao · 2026-06-16 15:27

Compositional Skill Routing in LLM Agents: Decompose, Retrieve, and Compose

LLM agents increasingly rely on external skills -- reusable tool specifications -- but real-world tasks often require composing multiple skills, not just selecting one. We formalize this as the Compositional Skill Routing problem: given a complex user query and a large skill libr…

arXiv cs.MA (Multiagent) TIER_1 English(EN) · Román Orús · 2026-06-16 15:10

ProvenanceGuard: Source-Aware Factuality Verification for MCP-Based LLM Agents

Tool-using LLM agents increasingly use the Model Context Protocol (MCP) to answer from heterogeneous evidence sources, including search, APIs, databases, clinical records, and formulary tools. Standard factuality metrics usually test whether an answer is supported by pooled evide…

arXiv cs.AI TIER_1 English(EN) · Bojie Li · 2026-06-16 13:40

PreAct: Faster Computer-Use Agents on Repeated Tasks

Computer-using agents drive real software through the screen -- clicking and typing -- but they solve every task from scratch: asked to repeat a task, an agent re-reads the screen, re-reasons every tap, and pays the full cost again. We present PreAct, which lets such an agent get…

arXiv cs.CL TIER_1 English(EN) · Benyou Wang · 2026-06-16 12:34

GameCraft-Bench: Can AI Agents Build Playable Games End-to-End in Real Game Engines?

Game generation is an emerging application of coding agents, requiring models to transform natural-language specifications into playable interactive systems. Unlike traditional coding tasks, game generation takes place within a game engine, where scripts, scenes, assets, renderin…

arXiv cs.CL TIER_1 English(EN) · Matthias Feurer · 2026-06-16 12:06

Environment-Guided Automated Prompt Optimization for LLM Game Agents

LLM agents in interactive environments are highly sensitive to their prompts, yet prompt engineering remains a manual, task-specific process. We introduce an automated prompt optimization framework for LLM agents that decomposes the observation-to-action pipeline into a goal-cond…

arXiv cs.CL TIER_1 English(EN) · Rob Willoughby · 2026-06-16 11:46

A Framework for Evaluating Agentic Skills at Scale

Agent skills -- structured, reusable knowledge artifacts that augment LLM agent capabilities -- have been rapidly adopted in industry, yet their cross-domain impact and use across commercial and open-source models remain under-studied, and no reusable methodology exists for evalu…

arXiv cs.CL TIER_1 English(EN) · Zhijiang Guo · 2026-06-16 08:48

From Trainee to Trainer: LLM Training Environments Designed for Reinforcement Learning with Multi-Agent Reasoning

Reinforcement learning pipelines for Large Language Model (LLM) training often rely on manually redesigned environments between stages, requiring practitioners to heuristically infer which configuration will best improve the current policy. To automate this process, we propose th…

arXiv cs.CL TIER_1 English(EN) · Shuicheng Yan · 2026-06-16 07:33

OPD-Evolver: Cultivating a Holistic Agent Evolver via On-Policy Distillation

Memory has become a standard substrate for self-evolving agents, yet retaining experience is not the same as learning how to evolve through it. Existing memory agents can store trajectories, retrieve reflections, or accumulate skills, but often lack the holistic competence to sel…

arXiv cs.AI TIER_1 English(EN) · Wasi Uddin Ahmad, Nikolai Ludwig, Somshubra Majumdar, Boris Ginsburg · 2026-06-16 04:00

Open-SWE-Traces: Advancing Bimodal Multilingual Distillation for Software Engineering Agents

arXiv:2606.16038v1 Announce Type: cross Abstract: The path toward autonomous software engineering is currently bottlenecked by a severe deficit of diverse, large-scale trajectory data. We address this by introducing Open-SWE-Traces, an expansive dataset of 207,489 agentic trajectorie…

arXiv cs.LG TIER_1 English(EN) · Faramarz Jabbarvaziri · 2026-06-16 04:00

Remember, Don't Re-Read: Stateful ReAct Agents for Token-Efficient Autonomous Experimentation

arXiv:2606.14945v1 Announce Type: new Abstract: The autoresearch pattern enables autonomous experimentation by having a large language model (LLM) iteratively modify code to optimize a target metric. Its stateless design, however, reconstructs experimental context from scratch at…

arXiv cs.CL TIER_1 English(EN) · Lawrence Keunho Jang, Andrew Keunwoo Jang, Jing Yu Koh, Ruslan Salakhutdinov · 2026-06-16 04:00

MyPCBench: A Benchmark for Personal Computer-Use Agents

arXiv:2606.16748v1 Announce Type: cross Abstract: Current benchmarks for computer-use agents evaluate models in impersonal environments. This leaves a gap between evaluation and deployment where personal assistants are expected to work across a user's whole digital life, includin…

arXiv cs.CL TIER_1 English(EN) · Haoqin Tu, Jianwen Chen, Zijun Wang, Siwei Han, Juncheng Wu, Hardy Chen, Haonian Ji, Kaiwen Xiong, Jiaqi Liu, Peng Xia, Jieru Mei, Hongliang Fei, Jason Eshraghian, Zeyu Zheng, Yuyin Zhou, Huaxiu Yao, Cihang Xie · 2026-06-16 04:00

VisualClaw: Real-Time Personalized Agents for the Physical World

arXiv:2606.16295v1 Announce Type: cross Abstract: Vision language models are serving as general-purpose interfaces for complex multimodal tasks. However, deployment still faces three gaps: VLMs typically incur high latency and cost when processing dense video frames and long prom…

arXiv cs.CL TIER_1 English(EN) · Peiyang Xu, Bangzheng Li, Sijia Liu, Karthik R. Narasimhan, Pramod Viswanath, Prateek Mittal, Xingyu Fu · 2026-06-16 04:00

Context-Aware Reinforcement Learning for Agentic and Multimodal Large Models

arXiv:2606.17053v1 Announce Type: new Abstract: Large language models (LLMs) often fail when answering requires identifying a small but decisive piece of evidence within a long or complex context, such as a single line in a tool trace or a subtle detail in an image. We propose Co…

arXiv cs.CL TIER_1 English(EN) · Qiao Xiao, Haochen Shi, Yisen Gao, Wenbin Hu, Huihao Jing, Tianshi Zheng, Baixuan Xu, Ziheng Zhang, Weiqi Wang, Haoran Li, Jiaxin Bai, Yangqiu Song · 2026-06-16 04:00

SING: Synthetic Intent Graphs for Scalable Proactive Tool Discovery in LLM Agents

arXiv:2606.16591v1 Announce Type: new Abstract: Large language model (LLM) agents increasingly rely on agent harnesses that manage context, tools, and multi-turn execution, making tools a central interface for acting in realistic digital environments. As harness-connected tool ec…

arXiv cs.CL TIER_1 English(EN) · Reef Menaged, Gili Lior, Shauli Ravfogel, Roee Aharoni, Gabriel Stanovsky · 2026-06-16 04:00

Can Large Language Model Agents Infer World Models? Evidence from Agentic Automata Learning

arXiv:2606.16576v1 Announce Type: new Abstract: We propose agentic automata learning to evaluate the extent to which tool-calling LLM agents can uncover hidden environments through interaction. In our setup, an agent should uncover a hidden deterministic finite automaton (DFA) by…

arXiv cs.CL TIER_1 Dansk(DA) · Dingcheng Huang, Yuda Ding, Bingshuo Liu, Qingbin Liu, Xi Chen, Jiang Bian, Hongliang Sun, Zhiying Tu, Dianhui Chu, Xiaoyan Yu, Dianbo Sui · 2026-06-16 04:00

SkillWiki: A Living Knowledge Infrastructure for Agent Skills

arXiv:2606.16523v1 Announce Type: new Abstract: While knowledge is managed through Wikipedia and software through GitHub, agent skills still lack an infrastructure for large-scale production, governance, and evolution. SkillWiki is a living knowledge infrastructure that supports …

arXiv cs.CL TIER_1 English(EN) · Junyi Li, Xiaowei Qian, Yingyi Zhang, Wenlin Zhang, Guojing Li, Sheng Zhang, Xiao Han, Yichao Wang, Xiangyu Zhao · 2026-06-16 04:00

Toward Pareto-Optimal Tool-Integrated Agents with Pareto-Ranking Policy Optimization

arXiv:2606.16111v1 Announce Type: new Abstract: Recent advances in tool-integrated language agents have significantly improved their ability to solve complex reasoning tasks. However, existing alignment methods predominantly focus on maximizing task accuracy, while overlooking au…

arXiv cs.CL TIER_1 English(EN) · Aleksandr Tsymbalov, Danis Zaripov, Artem Epifanov, Anastasya Palienko · 2026-06-16 04:00

GRACE-DS: Guarded Reward-Guided Agent Correction Environment in Data Science

arXiv:2606.16000v1 Announce Type: new Abstract: We introduce GRACE-DS, a Guarded Reward-guided Agent Correction Environment in Data Science for pre-deployment evaluation of LLM-powered AutoML agents. GRACE-DS is a set of evaluation metrics in an isolated environment that can be a…

arXiv cs.CL TIER_1 English(EN) · Sina Hajimiri, Masih Aminbeidokhti, Jose Dolz, Ismail Ben Ayed, Issam H. Laradji, Spandana Gella, Nicolas Gontier · 2026-06-16 04:00

Are Online Skill and Memory Modules Always Worth the Cost? A Budget-Constrained Study of Web Agents

arXiv:2606.15017v1 Announce Type: new Abstract: Online web agents often augment a base actor with memory, workflow, or skill modules. These modules can improve performance, but they also consume test-time tokens, a cost rarely reported alongside the actor's inference cost. We stu…

arXiv cs.AI TIER_1 English(EN) · Hongyi Liu, Haoyan Yang, Tao Jiang, Bo Tang, Feiyu Xiong, Yuyu Luo, Zhiyu Li · 2026-06-16 04:00

SkillsVote: Lifecycle Governance of Agent Skills from Collection and Recommendation to Evolution

arXiv:2605.18401v2 Announce Type: replace-cross Abstract: Long-horizon LLM agents generate traces that could become reusable experience, but raw trajectories are noisy, local, and hard to govern. Agent Skills offer a structured artifact for combining procedural guidance, executab…

arXiv cs.AI TIER_1 English(EN) · Hongwei Yao, Yiming Liu, Yiling He, Bingrun Yang · 2026-06-16 04:00

Red-Teaming Agent Execution Contexts: Open-World Security Evaluation on OpenClaw

arXiv:2605.11047v2 Announce Type: replace-cross Abstract: Agentic language-model systems increasingly rely on mutable execution contexts, including files, memory, tools, skills, and auxiliary artifacts, creating security risks beyond explicit user prompts. This paper presents Dee…

arXiv cs.AI TIER_1 English(EN) · Yifan Sui, Han Zhao, Rui Ma, Zhiyuan He, Hao Wang, Jianxun Li, Kaiqiang Xu, Kai Chen, Yuqing Yang · 2026-06-16 04:00

Parallelizing Tool Execution and LLM Generation for Low-Latency Agent Serving

arXiv:2603.18897v2 Announce Type: replace-cross Abstract: LLM-powered agents execute tasks through a sequential loop of model generation and tool execution. Today's serving systems serialize this loop, leaving tool latency exposed on the task critical path. This paper presents PA…

arXiv cs.AI TIER_1 English(EN) · Wei Gao, Yuheng Zhao, Tianyuan Wu, Shaopan Xiong, Weixun Wang, Dakai An, Lunxi Cao, Dilxat Muhtar, Zichen Liu, Haizhou Zhao, Ju Huang, Siran Yang, Yongbin Li, Wenbo Su, Jiamang Wang, Lin Qu, Bo Zheng, Wei Wang · 2026-06-16 04:00

RollArt: Disaggregated Multi-Task Agentic RL Training at Scale

arXiv:2512.22560v2 Announce Type: replace-cross Abstract: Agentic Reinforcement Learning (RL) trains LLMs through multi-turn interactions with environments, producing workloads that mix compute-bound prefill, bandwidth-bound decoding, CPU-heavy environment execution, and bursty r…

arXiv cs.AI TIER_1 English(EN) · Asif Shahriar, Md Nafiu Rahman, Sadif Ahmed, Farig Sadeque, Md Rizwan Parvez · 2026-06-16 04:00

A Survey of Agentic Security: Applications, Threats, and Defenses

arXiv:2510.06445v3 Announce Type: replace-cross Abstract: LLM-based agents are now used throughout cybersecurity. While these agents facilitate powerful and autonomous security applications, their autonomy opens up new attack surfaces, and the security community is actively build…

arXiv cs.AI TIER_1 Deutsch(DE) · Shawn Li, Chenxiao Yu, Han Wang, Wei Yang, Ryan Rossi, Franck Dernoncourt, Xiyang Hu, Philip Yu, Chaowei Xiao, Huan Zhang, Yue Zhao · 2026-06-16 04:00

FORTIS: Benchmarking Over-Privilege in Agent Skills

arXiv:2605.09163v3 Announce Type: replace Abstract: Large language model agents increasingly operate through an intermediate skill layer that mediates between user intent and concrete task execution. This layer is widely treated as an organizational abstraction, but we argue it i…

arXiv cs.AI TIER_1 English(EN) · Xiangyi Li, Yimin Liu, Wenbo Chen, Bingran You, Zonglin Di, Yifeng He, Shenghan Zheng, Kyoung Whan Choe, Jiankai Sun, Shuyi Wang, Chujun Tao, Binxu Li, Xuandong Zhao, Hejia Geng, Xiaojun Wu, Junwei Zhou, Xiaokun Chen, Hanwen Xing, Yubo Li, Qunhong Zeng, … · 2026-06-16 04:00

SkillsBench: Evaluating the Effectiveness of Agent Skills Across Diverse Tasks

arXiv:2602.12670v4 Announce Type: replace Abstract: Agent Skills are structured packages of procedural knowledge that augment large language model (LLM) agents at inference time. Despite rapid adoption, there is no standard way to measure whether they actually help. We present Sk…

arXiv cs.AI TIER_1 English(EN) · So Kuroki, Yingtao Tian, Kou Misaki, Takashi Ikegami, Takuya Akiba, Yujin Tang · 2026-06-16 04:00

Shachi: A Modular, Controllable Framework for Modeling Emergent Collective Behavior with LLM-Based Agents

arXiv:2509.21862v3 Announce Type: replace Abstract: How collective behaviors emerge from the interactions of individual LLM-driven agents is a central question in artificial life, yet controlled study of these emergent dynamics has been hindered by the lack of a principled simula…

arXiv cs.AI TIER_1 English(EN) · Buqiang Xu, Zirui Xue, Dianmou Chen, Chenyang Fu, Chiyu Wu, Caiying Huang, Chen Jiang, Jizhan Fang, Xinle Deng, Yijun Chen, Yunzhi Yao, Xuehai Wang, Jin Shang, Gong Yu, Ningyu Zhang · 2026-06-16 04:00

TokenPilot: Cache-Efficient Context Management for LLM Agents

arXiv:2606.17016v1 Announce Type: cross Abstract: As LLM agents are deployed in long-horizon sessions, context accumulation drives up inference costs. Existing approaches utilize text pruning or dynamic memory eviction to minimize token footprints; however, their unconstrained se…

arXiv cs.AI TIER_1 English(EN) · Jiajie Jin, Zhao Yang, Wenle Liao, Yuyang Hu, Guanting Dong, Xiaoxi Li, Yutao Zhu, Zhicheng Dou · 2026-06-16 04:00

VeriGraph: Towards Verifiable Data Analysis Agents

arXiv:2606.16603v1 Announce Type: cross Abstract: LLM-based agents have demonstrated strong capabilities in data-intensive analytical tasks, yet their outputs are rarely verifiable: a reliance on linear text trajectories makes their reasoning difficult to audit. In particular, de…

arXiv cs.AI TIER_1 English(EN) · Zhenbang Du, Jun Luo, Zhiwei Zheng, Xiangchi Yuan, Kejing Xia, Dachuan Shi, Qirui Jin, Qijia He, Shaofeng Zou, Yingbin Liang, Wenke Lee · 2026-06-16 04:00

PACT: Privileged-Trace Co-Training for Multi-Turn Tool-Use Agents

arXiv:2606.16215v1 Announce Type: cross Abstract: Multi-turn tool-use agents must reason, call tools, and adapt to observations across several interaction turns. Post-training such agents is challenging, as reinforcement learning often suffers from sparse rewards and weak credit …

arXiv cs.AI TIER_1 English(EN) · Yixuan Wang, Yiyang Zhou, Yiming Liang, Congyu Zhang, Fuxiao Liu, Jiawei Zhou, Huaxiu Yao · 2026-06-16 04:00

Not All Skills Help: Measuring and Repairing Agent Knowledge

arXiv:2606.15390v1 Announce Type: cross Abstract: LLM agents can improve without weight updates by accumulating natural-language skills from experience, but current systems entrust every decision about which skills to keep and how to apply them to LLM judgment alone. We argue tha…

arXiv cs.AI TIER_1 English(EN) · Hongtao Lyu, Dingyan Zhang, Mingyu Wu, Xingda Wei, Haibo Chen · 2026-06-16 04:00

CoAgent: Concurrency Control for Multi-Agent Systems

arXiv:2606.15376v1 Announce Type: cross Abstract: Multi-agent LLM systems -- coding agents, devops agents, document agents -- now routinely run several agents in parallel against the same git tree, Kubernetes cluster, or document. As soon as two of them mutate shared state, they …

arXiv cs.AI TIER_1 English(EN) · Xinhang Ma, Taoran Li, Chaowei Xiao, Zhiyuan Yu, Ning Zhang, Yevgeniy Vorobeychik · 2026-06-16 04:00

AutoDojo: Adaptive Attacks Expose the Problems of Superficial Defenses and User-Unspecified Constraints in LLM Agents

arXiv:2606.15057v1 Announce Type: cross Abstract: Indirect prompt injection (IPI) is a major security threat to LLM-powered agents. Thus, a growing body of work has proposed a variety of defensive approaches against IPI. These can be grouped into three broad categories: 1) promp…

arXiv cs.AI TIER_1 English(EN) · Kirill Vasilevski (Justina), Ximing Dong (Justina), Benjamin Rombaut (Justina), Ruochen Deng (Justina), Jiahuei Lin (Justina), Arthur Leung, Dayi Lin, Boyuan Chen, Shaowei Wang, Ahmed E. Hassan · 2026-06-16 04:00

Beyond Correctness: Enhancing Architectural Reasoning in Code LLMs through Scalable Agentic Judgment Annotation

arXiv:2606.14948v1 Announce Type: cross Abstract: LLMs have substantially improved software engineering yet real-world development requires architectural understanding. Such understanding is prohibitively expensive to label manually and impossible to verify through tests alone. W…

arXiv cs.AI TIER_1 English(EN) · Andoni Rodríguez, Alberto Pozanco, Daniel Borrajo · 2026-06-16 04:00

Is Your Agent Playing Dead? Deployed LLM Agents Exhibit Constraint-Evasive Fabrication and Playing-Dead Behaviour

arXiv:2606.14831v1 Announce Type: cross Abstract: This paper presents and characterizes a spectrum of previously unreported behaviours we term Constraint-Evasive Fabrication (CEF): when an LLM agent operates under irreconcilable constraints (where no response can simultaneously s…

arXiv cs.AI TIER_1 English(EN) · Dong Ho Kang, Hyeonjeong Cha, Daein Weon · 2026-06-16 04:00

Knowledge-Based Zero-Replay Debugging of Multi-Agent LLM Traces

arXiv:2606.14805v1 Announce Type: cross Abstract: Reliable operation of multi-agent large language model (LLM) systems depends on debugging long execution traces, where the few causally decisive events are buried in unstructured logs of messages, routes, memory writes, and tool c…

arXiv cs.AI TIER_1 English(EN) · Hanqi Li, Jing Peng, Zijian Wang, Lu Chen, Kai Yu · 2026-06-16 04:00

XFlow: An Executable Protocol Programming System for Reliable Multi-Agent Workflows

arXiv:2606.14790v1 Announce Type: cross Abstract: LLM-based multi-agent systems increasingly coordinate planning, reasoning, tool use, and human interaction, yet their reliability remains limited. A central source of this limitation is the underspecified prompt--harness boundary.…

arXiv cs.AI TIER_1 English(EN) · Rui Cao, Jiannong Cao, Bo Yuan, Zhiyuan Wen, Mingjin Zhang · 2026-06-16 04:00

Fact Check: Feasibility-Aware Long-Term Action Anticipation with Multi-Agent Collaboration

arXiv:2606.14778v1 Announce Type: cross Abstract: Long-term action anticipation (LTA) aims to predict an ordered sequence of future verb-noun actions from a partially observed video. While this task serves as the foundation for embodied intelligence, anticipating physically feasi…

arXiv cs.AI TIER_1 English(EN) · Rahul Suresh Babu, Rohit Shukla · 2026-06-16 04:00

GIST-CMTF: Goal-State Inference for Causal Minimal Tool Filtering in LLM Agents

arXiv:2606.16813v1 Announce Type: new Abstract: Tool-augmented LLM agents rely on runtime filtering to decide which tools should be visible at each step. Causal Minimal Tool Filtering (CMTF) reduces tool-choice confusion by exposing only the next causally necessary tool frontier,…

arXiv cs.AI TIER_1 English(EN) · Tianyi Zhang, Zhonghao Qi · 2026-06-16 04:00

Skill-to-LoRA: From Using Skills to Learning Behaviors, with Token Optimization for Efficient LLM Agents

arXiv:2606.16769v1 Announce Type: new Abstract: Agent skills are commonly distributed as SKILL.md files: human-readable procedural documents that describe workflows, tools, resources, and domain conventions. While convenient for inspection and reuse, this design requires the same…

arXiv cs.AI TIER_1 English(EN) · Issa Sugiura, Daichi Hattori, Kazuo Araragi, Keita Ogawa, Shota Onose, Taro Makino, Teppei Usuki, Takashi Ishida · 2026-06-16 04:00

CoffeeBench: Benchmarking Long-Horizon LLM Agents in Heterogeneous Multi-Agent Economies

arXiv:2606.16613v1 Announce Type: new Abstract: As LLM agents become capable of increasingly long-horizon tasks, evaluating their performance in economic systems is becoming increasingly important. Unlike existing benchmarks that primarily evaluate a single agent interacting with…

arXiv cs.AI TIER_1 English(EN) · Shiyang Chen · 2026-06-16 04:00

Observed, Not Selected: An Attention-Partition Account of Tool-Selection Failures in LLM Agents

arXiv:2606.16364v1 Announce Type: new Abstract: LLM agents mis-call tools, and the natural guess is that the model failed to see the right tool in a crowded harness. We show the opposite through a lens concurrent work sets aside -- the model's attention to labeled tool-definition…

arXiv cs.AI TIER_1 English(EN) · Bing Hao, Ruijie Wang, Haodong Qian, Yunlong Chu, Yuhang Liu, Yumeng Lin, Minglai Shao, Jianxin Li · 2026-06-16 04:00

AdaSTORM: Scaling LLM Reasoning on Dynamic Graphs via Adaptive Spatio-Temporal Multi-Agent Collaboration

arXiv:2606.16328v1 Announce Type: new Abstract: Large Language Models (LLMs) demonstrate remarkable potential in dynamic graph reasoning, but suffer from a scaling bottleneck: current models can only handle graphs with tens of nodes, constrained by exponential reasoning overhead …

arXiv cs.AI TIER_1 English(EN) · Rahul Khedar, Eshita, Sneha Teja Sree Reddy Thondapu, Mayank Malhotra, Arup Das, Jitesh Chandra, Yun-Shiuan Chuang, Chaitanya Kulkarni, Arun Menon, Linsey Pang, Avinash Karn, Mouli V, Prakhar Mehrotra · 2026-06-16 04:00

State-Based Synthetic Data Generation for Tool-Augmented LLMs

arXiv:2606.16307v1 Announce Type: new Abstract: Training tool-augmented LLM agents requires large corpora of multi-turn, tool-grounded conversational data that is expensive to annotate, privacy-constrained in production settings, and largely absent from public datasets. We presen…

arXiv cs.AI TIER_1 English(EN) · Junjia Qi, Zichuan Fu, Jingtong Gao, Wenlin Zhang, Hanyu Yan, Xian Wu, Xiangyu Zhao · 2026-06-16 04:00

LLM-as-Code Agentic Programming for Agent Harness

arXiv:2606.15874v1 Announce Type: new Abstract: Every major LLM agent framework gives the LLM the role of orchestrator; the model decides what to do next, when to call tools, and when to stop. We argue that token explosion, control-flow hallucination, and unreliable completion ar…

arXiv cs.AI TIER_1 English(EN) · Pavel Surynek · 2026-06-16 04:00

Unassigned Agents in Compilation-Based Multi-Agent Path Finding

arXiv:2606.15797v1 Announce Type: new Abstract: Compilation-based techniques represent an important stream of solvers for multi-agent path finding (MAPF) due to their modularity and adaptability for non-standard variants of the problem. While in the standard MAPF the task is to n…

arXiv cs.AI TIER_1 English(EN) · Juheon Yi, Jinglu Wang, Xiaoyi Zhang, Yan Lu · 2026-06-16 04:00

A Multi-Agent Framework for Time-Sensitive Complementary Collaboration in Minecraft

arXiv:2606.15684v1 Announce Type: new Abstract: We present TickingCollabBench, a Minecraft-based multi-agent benchmark for a novel class of time-sensitive complementary collaboration tasks. Our benchmark reflects four core characteristics of real-world collaboration: agent hetero…

arXiv cs.AI TIER_1 English(EN) · Jiwan Chung, JiHyuk Byun, Vibhav Vineet, Seon Joo Kim · 2026-06-16 04:00

What Went Wrong? Process-Level Web Agent Evaluation via Semantic State Tracking

arXiv:2606.15673v1 Announce Type: new Abstract: Web agents act through long interaction sequences, yet existing benchmarks evaluate only terminal success, discarding all process information and offering little guidance on improvement. In this work, we conduct a process-level anal…

arXiv cs.AI TIER_1 English(EN) · Sidi Deng · 2026-06-16 04:00

Your Agent Has a Genome: Sequence-Level Behavioral Analysis and Runtime Governance of LLM-Powered Autonomous Agents

arXiv:2606.15579v1 Announce Type: new Abstract: We propose Base Sequence Analysis, a framework that encodes the runtime behavior of LLM-powered autonomous agents into compact symbolic sequences using a four-letter alphabet: X (Explore), E (Execute), P (Plan), and V (Verify). Draw…
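
The encoding idea in this abstract can be sketched in a few lines. This is an illustrative sketch only, not the authors' implementation: the step labels (`explore`, `execute`, `plan`, `verify`), the `encode_trace` helper, and the bigram count are assumptions introduced here. Only the four-letter alphabet comes from the abstract.

```python
from collections import Counter
from typing import Iterable

# Assumed mapping from logged step types to the four letters named in the abstract.
LETTER_BY_STEP = {"explore": "X", "execute": "E", "plan": "P", "verify": "V"}


def encode_trace(steps: Iterable[str]) -> str:
    """Encode a list of step-type labels as a compact symbolic sequence."""
    return "".join(LETTER_BY_STEP[step] for step in steps)


def bigram_counts(sequence: str) -> Counter:
    """Count adjacent letter pairs (a hypothetical downstream statistic)."""
    return Counter(a + b for a, b in zip(sequence, sequence[1:]))


# Hypothetical agent trace, made up for illustration.
trace = ["plan", "explore", "explore", "execute", "verify", "execute", "verify"]
sequence = encode_trace(trace)
pairs = bigram_counts(sequence)
```

Once behavior is a string, ordinary sequence tooling (n-gram counts, pattern matching) applies to it; which statistics the paper actually uses is not stated in the truncated abstract.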

arXiv cs.AI TIER_1 English(EN) · Rahul Suresh Babu, Laxmipriya Ganesh Iyer · 2026-06-16 04:00

ToolMenuBench: Benchmarking Tool-Menu Filtering Strategies for Reliable and Efficient LLM Agents

arXiv:2606.15508v1 Announce Type: new Abstract: Tool-augmented large language model agents increasingly operate over large tool libraries, but existing evaluations often focus on whether a model can call a tool correctly rather than how the visible tool menu shapes reliability, e…

arXiv cs.AI TIER_1 English(EN) · Sanhorn Chen, Xiaoyang Chen, Boyu Liu, Roy Zhao · 2026-06-16 04:00

Towards Quantitatively Verifiable Agentic Data Science: Irregular Time-Series Question Answering via Tool-Grounded Reasoning

arXiv:2606.15107v1 Announce Type: new Abstract: Time series data in real-world deployments is overwhelmingly irregular. Observations are asynchronous, missing values are informative rather than random, and sampling frequencies vary across sensors and operational windows. However,…

arXiv cs.AI TIER_1 English(EN) · Agnieszka Mensfelt, Adarsh Prabhakaran, Adrian Haret, Vince Trencsenyi, Kostas Stathis · 2026-06-16 04:00

PrologMCP: A Standardized Prolog Tool Interface for LLM Agents

arXiv:2606.14935v1 Announce Type: new Abstract: Frontier reasoning-tuned language models still fail on deductive tasks at depth, and the cost of improved performance through extended internal reasoning scales poorly. Symbolic delegation offers a complementary route: a language mo…

Hugging Face Daily Papers TIER_1 English(EN) · 2026-06-16 00:00

From Trainee to Trainer: LLM Training Environments Designed for Reinforcement Learning with Multi-Agent Reasoning

A framework automates environment redesign in reinforcement learning for large language models by having the policy analyze failures and suggest configuration changes, achieving superior performance over larger proprietary models and fixed-environment baselines.

Hugging Face Daily Papers TIER_1 English(EN) · 2026-06-16 00:00

GameCraft-Bench: Can Agents Build Playable Games End-to-End in Real Game Engines?

End-to-end game generation presents significant challenges for coding agents, requiring them to create complete playable games from natural language descriptions while meeting specific evaluation criteria for engine grounding, artifact completeness, and interactive verification.

Hugging Face Daily Papers TIER_1 English(EN) · 2026-06-16 00:00

OPD-Evolver: Cultivating a Holistic Agent Evolver via On-Policy Distillation

OPD-Evolver is a self-evolving agent framework that combines slow-fast co-evolution with on-policy self-distillation to enhance memory management and policy learning across multiple domains.

arXiv cs.IR (Information Retrieval) TIER_1 English(EN) · Chenyan Xiong · 2026-06-15 18:48

Beyond Parallel Sampling: Diversified Query Initialization for Agentic Search

Test-time scaling for agentic search typically increases depth (i.e., more turns and tokens per trajectory) or breadth (i.e., more parallel rollouts). Here we focus on breadth scaling, showing that standard parallel sampling yields diminishing returns, tracing this to query redun…

arXiv cs.MA (Multiagent) TIER_1 English(EN) · Ningyu Zhang · 2026-06-15 17:46

TokenPilot: Cache-Efficient Context Management for LLM Agents

As LLM agents are deployed in long-horizon sessions, context accumulation drives up inference costs. Existing approaches utilize text pruning or dynamic memory eviction to minimize token footprints; however, their unconstrained sequence mutations alter layouts, introducing prefix…

arXiv cs.AI TIER_1 English(EN) · Rohit Shukla · 2026-06-15 14:57

GIST-CMTF: Goal-State Inference for Causal Minimal Tool Filtering in LLM Agents

Tool-augmented LLM agents rely on runtime filtering to decide which tools should be visible at each step. Causal Minimal Tool Filtering (CMTF) reduces tool-choice confusion by exposing only the next causally necessary tool frontier, but it assumes that the user request has alread…

arXiv cs.AI TIER_1 English(EN) · Zhonghao Qi · 2026-06-15 14:17

Skill-to-LoRA: From Using Skills to Learning Behaviors, with Token Optimization for Efficient LLM Agents

Agent skills are commonly distributed as SKILL.md files: human-readable procedural documents that describe workflows, tools, resources, and domain conventions. While convenient for inspection and reuse, this design requires the same reusable procedure to be repeatedly injected in…

arXiv cs.CL TIER_1 English(EN) · Ruslan Salakhutdinov · 2026-06-15 14:08

MyPCBench: A Benchmark for Personal Computer-Use Agents

Current benchmarks for computer-use agents evaluate models in impersonal environments. This leaves a gap between evaluation and deployment where personal assistants are expected to work across a user's whole digital life, including their context, historical data, and logged-in ac…

arXiv cs.AI TIER_1 English(EN) · Takashi Ishida · 2026-06-15 12:04

CoffeeBench: Benchmarking Long-Horizon LLM Agents in Heterogeneous Multi-Agent Economies

As LLM agents become capable of increasingly long-horizon tasks, evaluating their performance in economic systems is becoming increasingly important. Unlike existing benchmarks that primarily evaluate a single agent interacting with a passive environment, economic systems are inh…

arXiv cs.AI TIER_1 English(EN) · Zhicheng Dou · 2026-06-15 11:50

VeriGraph: Towards Verifiable Data Analysis Agents

LLM-based agents have demonstrated strong capabilities in data-intensive analytical tasks, yet their outputs are rarely verifiable: a reliance on linear text trajectories makes their reasoning difficult to audit. In particular, deterministic computations over raw data and semanti…

arXiv cs.CL TIER_1 English(EN) · Yangqiu Song · 2026-06-15 11:37

SING: Synthetic Intent Graphs for Scalable Proactive Tool Discovery in LLM Agents

Large language model (LLM) agents increasingly rely on agent harnesses that manage context, tools, and multi-turn execution, making tools a central interface for acting in realistic digital environments. As harness-connected tool ecosystems expand to hundreds or thousands of APIs…

arXiv cs.CL TIER_1 English(EN) · Gabriel Stanovsky · 2026-06-15 11:23

Can LLM Agents Infer World Models? Evidence from Agentic Automata Learning

We propose agentic automata learning to evaluate the extent to which tool-calling LLM agents can uncover hidden environments through interaction. In our setup, an agent should uncover a hidden deterministic finite automaton (DFA) by interacting with an oracle through (1) membersh…
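
The setup this abstract describes can be illustrated with a toy oracle. This is a minimal sketch under stated assumptions, not the paper's benchmark: the `DFA` and `MembershipOracle` classes and the even-number-of-`a` language are hypothetical, and the query type is read from the truncated "(1) membersh…" and assumed to mean membership queries.

```python
from dataclasses import dataclass


@dataclass
class DFA:
    start: str
    accepting: set
    transitions: dict  # maps (state, symbol) -> next state

    def accepts(self, word: str) -> bool:
        """Run the automaton on `word` and report whether it ends in an accepting state."""
        state = self.start
        for symbol in word:
            state = self.transitions[(state, symbol)]
        return state in self.accepting


@dataclass
class MembershipOracle:
    """Answers 'is this word in the hidden language?' and counts the queries made."""
    hidden: DFA
    queries: int = 0

    def member(self, word: str) -> bool:
        self.queries += 1
        return self.hidden.accepts(word)


# Hypothetical hidden language: strings over {a, b} with an even number of 'a's.
hidden = DFA(
    start="even",
    accepting={"even"},
    transitions={
        ("even", "a"): "odd", ("even", "b"): "even",
        ("odd", "a"): "even", ("odd", "b"): "odd",
    },
)
oracle = MembershipOracle(hidden)
answers = {word: oracle.member(word) for word in ["", "a", "ab", "abab"]}
```

An agent in such a setup only sees `member` results, so the query counter gives one simple way to compare how efficiently different agents probe the hidden automaton.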

arXiv cs.CL TIER_1 Dansk(DA) · Dianbo Sui · 2026-06-15 10:24

SkillWiki: A Living Knowledge Infrastructure for Agent Skills

While knowledge is managed through Wikipedia and software through GitHub, agent skills still lack an infrastructure for large-scale production, governance, and evolution. SkillWiki is a living knowledge infrastructure that supports the organization, grounding, and continuous evol…

arXiv cs.CL TIER_1 English(EN) · Prakhar Mehrotra · 2026-06-15 07:13

State-Based Synthetic Data Generation for Tool-Augmented LLMs

Training tool-augmented LLM agents requires large corpora of multi-turn, tool-grounded conversational data that is expensive to annotate, privacy-constrained in production settings, and largely absent from public datasets. We present StateGen, a synthetic data generation platform…

arXiv cs.CL TIER_1 English(EN) · Wenke Lee · 2026-06-15 04:46

PACT: Privileged-Trace Co-Training for Multi-Turn Tool-Use Agents

Multi-turn tool-use agents must reason, call tools, and adapt to observations across several interaction turns. Post-training such agents is challenging, as reinforcement learning often suffers from sparse rewards and weak credit assignment despite matching the prompt-only infere…

arXiv cs.AI TIER_1 English(EN) · Olly Styles · 2026-06-15 04:00

WorkBench Revisited: Workplace Agents Two Years On

arXiv:2606.13715v1 Announce Type: new Abstract: The best agent on WorkBench in March 2024, GPT-4, completed 43% of tasks and took an unintended harmful action, such as emailing the wrong person, on 26% of them. We re-visit the benchmark in June 2026 and find that the best agent t…

arXiv cs.CL TIER_1 English(EN) · Kang Zhou, Sangmin Woo, Haibo Ding, Kiran Ramnath, Subramanian Chidambaram, Aosong Feng, Vinayak Arannil, Muhyun Kim, Ishan Singh, Darren Wang, Zhichao Xu, Megha Gandhi, Nirmal Prabhu, Soumya Smruti Mishra, Vivek Singh, Gouri Pandeshwar, Lin Lee Cheong · 2026-06-15 04:00

An Empirical Study of Automating Agent Evaluation

arXiv:2605.11378v2 Announce Type: replace Abstract: Agent evaluation requires assessing complex multi-step behaviors involving tool use and intermediate reasoning, making it costly and expertise-intensive. A natural question arises: can frontier coding assistants reliably automat…

arXiv cs.CL TIER_1 English(EN) · Tan Zhu, Tong Yao, Kananart Kuwaranancharoen, Amit Singh, Yushang Lai, Deepa Mohan, Shankara Bhargava · 2026-06-15 04:00

Graph-Based Target Backpropagation for Context Adaptation in Multi-LLM Agent Systems

arXiv:2606.14155v1 Announce Type: cross Abstract: Context adaptation automates prompt engineering in LLM-based systems by iteratively revising tunable prompts from task feedback, without modifying model weights. Extending this paradigm to multi-LLM agentic systems is crucial: exi…

arXiv cs.CL TIER_1 English(EN) · Jixuan Chen, Jianzhi Shen, Haoqiang Kang, Zhi Hong, Qingyi Jiang, Soham Bose, Yiming Zhang, Leon Leng, Amit Vyas, Lingjun Mao, Siru Ouyang, Kun Zhou, Lianhui Qin · 2026-06-15 04:00

AgentSpec: Understanding Embodied Agent Scaffolds through Controlled Composition

arXiv:2606.14674v1 Announce Type: new Abstract: LLM agents are increasingly built not as single model calls, but as scaffolded systems that combine reasoning, memory, reflection, action execution, and learning. While such scaffolds often improve performance, they are often embedd…

arXiv cs.CL TIER_1 English(EN) · Xinbei Ma, Congmin Zheng, Jiyang Qiu, Jiale Hong, Yao Yao, Xiangmou Qu, Jiaxin Yin, Xingyu Lou, Jun Wang, Weiwen Liu, Weinan Zhang, Zhuosheng Zhang, Hai Zhao · 2026-06-15 04:00

Retrospective Progress-Aware Self-Refinement for LLM Agent Training

arXiv:2606.14302v1 Announce Type: new Abstract: LLM-based agents trained with reinforcement learning optimize step-wise action prediction but lack metacognitive awareness of task progress, inducing a gap that hinders long-horizon scaling. A pilot study reveals that online progres…

arXiv cs.CL TIER_1 English(EN) · Md Amirul Islam, Sumiran Thakur, Huancheng Chen, Su Min Park, Jiayun Wang, Gyuhak Kim · 2026-06-15 04:00

CacheRL: Multi-Turn Tool-Calling Agents via Cached Replay and Hybrid Rewards

arXiv:2606.14179v1 Announce Type: new Abstract: We present CacheRL, a system for training small agent foundation models that achieves 92 percent process accuracy on multi-step tool-calling tasks, approaching GPT-5's 94 percent while requiring 100 times less compute. Our approach …

arXiv cs.AI TIER_1 English(EN) · Rui Ye, Keduan Huang, Qimin Wu, Yuzhu Cai, Tian Jin, Xianghe Pang, Xiangrui Liu, Jiaqi Su, Chen Qian, Bohan Tang, Kaiqu Liang, Jiaao Chen, Yue Hu, Zhenfei Yin, Rongye Shi, Bo An, Yang Gao, Wenjun Wu, Lei Bai, Siheng Chen · 2026-06-15 04:00

MASLab: A Unified and Comprehensive Codebase for LLM-Based Multi-Agent Systems

arXiv:2505.16988v2 Announce Type: replace-cross Abstract: LLM-based multi-agent systems (MAS) have demonstrated significant potential in enhancing single LLMs to address complex and diverse tasks in practical applications. Despite considerable advancements, the field lacks a unif…

arXiv cs.AI TIER_1 English(EN) · Minseo Kim · 2026-06-15 04:00

tap: A File-Based Protocol for Heterogeneous LLM Agent Collaboration

arXiv:2606.14445v1 Announce Type: cross Abstract: Existing multi-agent software development systems have proposed many forms of agent collaboration, including role-based collaboration and automated code review. However, many systems assume a common runtime, a central conversation…

arXiv cs.AI TIER_1 English(EN) · Rui Melo, Riccardo Fogliato, Sean Zhou, Pratiksha Thaker, Zhiwei Steven Wu · 2026-06-15 04:00

SEVRA-BENCH: Social Engineering Vulnerabilities in Review Agents

arXiv:2606.13757v1 Announce Type: cross Abstract: Large language model (LLM) reviewers are increasingly used in pull-request (PR) workflows, where their approvals help decide which code is merged into a repository. This raises a question that benchmarks for static vulnerability d…

arXiv cs.AI TIER_1 English(EN) · Shikun Liu, Mufei Li, Dongqi Fu, Haoyu Wang, Yinglong Xia, Hong Li, Hong Yan, Pan Li · 2026-06-15 04:00

Direct Latent-Space Synthesis for Parallel Branches in LLM-Agent Workflows

arXiv:2606.14672v1 Announce Type: new Abstract: Large language models increasingly serve as execution engines for agentic systems, yet they still consume context through a sequential text interface. This creates a mismatch with modern structured agent workflows, in which independ…

arXiv cs.AI TIER_1 English(EN) · Zhongyuan Wang, Pratyusha Vemuri · 2026-06-15 04:00

When the Tool Decides Everything: LLM Agents Blindly Rely on GNN Tools, and Stronger Backbones Rely More

arXiv:2606.14476v1 Announce Type: new Abstract: A growing line of work equips large language model (LLM) agents with graph neural networks (GNNs) as callable tools, assuming the agent exercises judgment over when and how much to rely on such a tool. We test this directly. We expo…

arXiv cs.AI TIER_1 English(EN) · Xinbei Ma, Jiyang Qiu, Yao Yao, Zheng Wu, Yijie Lu, Xiangmou Qu, Jiaxin Yin, Xingyu Lou, Jun Wang, Weiwen Liu, Weinan Zhang, Zhuosheng Zhang, Hai Zhao · 2026-06-15 04:00

Evolving Communication Strategies for Proactive LLM Agents

arXiv:2606.14314v1 Announce Type: new Abstract: LLM agents have rapidly evolved into autonomous systems, yet a persistent information gap remains between users and agents: communication is costly, while users' identical preferences further limit information exchange. To investiga…

arXiv cs.AI TIER_1 English(EN) · Tingyang Chen, Shuo Lu, Kang Zhao, Weicheng Meng, Hanlin Teng, Tianhao Li, Chao Li, Xule Liu, Jian Liang, Zhizhong Zhang, Yuan Xie, Heng Qu, Kun Shao, Jian Luan · 2026-06-15 04:00

HarnessX: A Composable, Adaptive, and Evolvable Agent Harness Foundry

arXiv:2606.14249v1 Announce Type: new Abstract: AI agent performance depends critically on the runtime harness, comprising the prompts, tools, memory, and control flow that mediate how a model observes, reasons, and acts. Yet today's harnesses remain largely hand-crafted and stat…

arXiv cs.AI TIER_1 English(EN) · Yinglun Zhu · 2026-06-15 04:00

Bridging the Reflection Gap: Free Calibration Rewards for Agentic RL

arXiv:2606.14211v1 Announce Type: new Abstract: LLMs are increasingly deployed as agents that interact with external environments and observe feedback such as execution results, error messages, and tool outputs. A well-functioning agent should be able to leverage this feedback to…

arXiv cs.AI TIER_1 English(EN) · Yihan Xia, Taotao Wang · 2026-06-15 04:00

When Should Agent Trust Be Conditioned? Characterizing and Attacking Skill-Conditioned Reputation in Agent Populations

arXiv:2606.14200v1 Announce Type: new Abstract: Open platforms increasingly route tasks among heterogeneous LLM agents--differing in base model, scaffold, and tool stack--whose competence varies sharply by skill: an agent excellent at one skill may be useless at another. The stan…

arXiv cs.AI TIER_1 English(EN) · Theodore Meek, Siyuan Ge, Di Qiu Xiang, Simon Chess, Vasily Ilin · 2026-06-15 04:00

Formalizing Numerical Analysis: An Agentic Pipeline and Quality Audit Beyond Kernel Acceptance

arXiv:2606.14000v1 Announce Type: new Abstract: Recent work has demonstrated that coding agents can formalize entire advanced mathematics textbooks in Lean 4, yet existing efforts concentrate on branches of mathematics already well-represented in mathlib and measure success solel…

arXiv cs.AI TIER_1 English(EN) · Laxmipriya Ganesh Iyer, Rahul Suresh Babu · 2026-06-15 04:00

Capability Minimization as a Security Primitive: Risk-Aware Causal Gating for Least-Privilege LLM Agents

arXiv:2606.13884v1 Announce Type: new Abstract: Modern decision systems increasingly rely on learned components whose outputs may be confident yet wrong, exposing downstream actions to costly errors. We introduce Risk-Aware Causal Gating (RACG), a framework that decides whether t…

arXiv cs.AI TIER_1 Italiano(IT) · Fan Zhang, Vireo Zhang, Shengju Qian, Haoxuan Li, Hao Wu, Jinyang Wu, Donghao Zhou, Zhihong Zhu, Zheng Lian, Xin Wang, Pheng-Ann Heng · 2026-06-15 04:00

Orchestra-o1: Omni-Modal Agent Orchestration

arXiv:2606.13707v1 Announce Type: new Abstract: The recent success of agent swarms has shifted the paradigm of large language model (LLM)-based agents from single-agent workflows to multi-agent systems, highlighting the importance of agent orchestration for task decomposition and…

arXiv cs.LG TIER_1 English(EN) · Shi Pan, Ming Luo · 2026-06-15 04:00

How Task Structure Limits Multi-Agent Success: An Information-Theoretic Analysis

arXiv:2606.13733v1 Announce Type: cross Abstract: Multi-agent systems (MAS) were expected to overcome the limitation of single-agent systems (SAS) through collaboration. However, under typicality conditions on the task's constraint graph and bounded inter-agent communication, we …

arXiv cs.LG TIER_1 English(EN) · Mykola Vysotskyi, Runqi Lin, Grzegorz Biziel, Michal Zakrzewski, Sebastian Montagna, Damian Rynczak, Shreyansh Padarha, Kumail Alhamoud, Zihao Fu, William Lugoloobi, Kai Rawal, Hanna Yershova, Xander Davies, Taras Rumezhak, Guohao Li, Fazl Barez, Baoyuan… · 2026-06-15 04:00

Through the Trial: Re-Evaluating Agent Capabilities Beyond Familiar Environments

arXiv:2606.14397v1 Announce Type: new Abstract: As agentic systems continue to evolve and are widely deployed in real-world scenarios, there is a growing demand to faithfully evaluate their capabilities. However, current benchmarks are typically built on popular applications with…

Hugging Face Daily Papers TIER_1 English(EN) · 2026-06-15 00:00

TokenPilot: Cache-Efficient Context Management for LLM Agents

TokenPilot is a dual-granularity context management framework that reduces inference costs in long-horizon LLM sessions by stabilizing prompt prefixes and conservatively managing context segments.

Hugging Face Daily Papers TIER_1 English(EN) · 2026-06-15 00:00

MyPCBench: A Benchmark for Personal Computer-Use Agents

MyPCBench evaluates computer-use agents as personal assistants in a simulated Linux desktop environment with real-world web applications, revealing that Claude Opus 4.6 achieves the highest task completion rate of 55.4% while struggling with multi-application tasks and long trajec…

Hugging Face Daily Papers TIER_1 English(EN) · 2026-06-15 00:00

VisualClaw: A Real-Time Personalized Agent for the Physical World

VisualClaw is a self-evolving multimodal agent that reduces deployment costs through hybrid encoding and skill evolution while improving video-QA accuracy across multiple benchmarks.

arXiv cs.MA (Multiagent) TIER_1 English(EN) · Haibo Chen · 2026-06-13 16:15

CoAgent: Concurrency Control for Multi-Agent Systems

Multi-agent LLM systems -- coding agents, devops agents, document agents -- now routinely run several agents in parallel against the same git tree, Kubernetes cluster, or document. As soon as two of them mutate shared state, they enter the regime classical concurrency control has…

arXiv cs.CL TIER_1 English(EN) · Lianhui Qin · 2026-06-12 17:39

AgentSpec: Understanding Embodied Agent Scaffolds through Controlled Composition

LLM agents are increasingly built not as single model calls, but as scaffolded systems that combine reasoning, memory, reflection, action execution, and learning. While such scaffolds often improve performance, they are often embedded in tightly coupled pipelines, making it diffi…

arXiv cs.AI TIER_1 English(EN) · Pan Li · 2026-06-12 17:39

Direct Latent-Space Synthesis for Parallel Branches in LLM-Agent Workflows

Large language models increasingly serve as execution engines for agentic systems, yet they still consume context through a sequential text interface. This creates a mismatch with modern structured agent workflows, in which independent branches explore subtasks, retrieve evidence…

Alignment Forum TIER_1 English(EN) · bilalchughtai · 2026-06-12 17:14

Building and Evaluating Model Diffing Agents

This is the second in a series of research updates from the Google DeepMind Language Model Interpretability team, in interpretability and adjacent areas. The first post can be found at https://www.lesswrong.com/posts/aTcsN5ZZDnMFJvRiG/models-may-behav…

arXiv cs.AI TIER_1 English(EN) · Pratyusha Vemuri · 2026-06-12 14:13

When the Tool Decides Everything: LLM Agents Blindly Rely on GNN Tools, and Stronger Backbones Rely More

A growing line of work equips large language model (LLM) agents with graph neural networks (GNNs) as callable tools, assuming the agent exercises judgment over when and how much to rely on such a tool. We test this directly. We expose a frozen GNN to a ReAct-style LLM agent as an…

arXiv cs.AI TIER_1 English(EN) · Minseo Kim · 2026-06-12 13:28

tap: A File-Based Protocol for Heterogeneous LLM Agent Collaboration

Existing multi-agent software development systems have proposed many forms of agent collaboration, including role-based collaboration and automated code review. However, many systems assume a common runtime, a central conversation server, or the same API family. Under these assum…

arXiv cs.LG TIER_1 English(EN) · Adel Bibi · 2026-06-12 12:32

Through the Trial: Re-evaluating Agent Capabilities Beyond Familiar Environments

As agentic systems continue to evolve and are widely deployed in real-world scenarios, there is a growing demand to faithfully evaluate their capabilities. However, current benchmarks are typically built on popular applications with relatively simple tasks and focus on a narrow s…

arXiv cs.AI TIER_1 English(EN) · Hai Zhao · 2026-06-12 09:54

Evolving Communication Strategies for Proactive LLM Agents

LLM agents have rapidly evolved into autonomous systems, yet a persistent information gap remains between users and agents: communication is costly, while users' identical preferences further limit information exchange. To investigate how agents should communicate across modaliti…

arXiv cs.CL TIER_1 English(EN) · Hai Zhao · 2026-06-12 09:38

Retrospective Progress-Aware Self-Refinement for LLM Agent Training

LLM-based agents trained with reinforcement learning optimize step-wise action prediction but lack metacognitive awareness of task progress, inducing a gap that hinders long-horizon scaling. A pilot study reveals that online progress prompting hurts performance while retrospectiv…

arXiv cs.AI TIER_1 English(EN) · Jian Luan · 2026-06-12 08:27

HarnessX: A Composable, Adaptive, and Evolvable Agent Harness Foundry

AI agent performance depends critically on the runtime harness, comprising the prompts, tools, memory, and control flow that mediate how a model observes, reasons, and acts. Yet today's harnesses remain largely hand-crafted and static: each new model or task still demands bespoke…

arXiv cs.AI TIER_1 English(EN) · Yinglun Zhu · 2026-06-12 07:47

Bridging the Reflection Gap: Free Calibration Rewards for Agentic RL

LLMs are increasingly deployed as agents that interact with external environments and observe feedback such as execution results, error messages, and tool outputs. A well-functioning agent should be able to leverage this feedback to accurately assess its own performance. Yet we f…

arXiv cs.AI TIER_1 English(EN) · Taotao Wang · 2026-06-12 07:32

When Should Agent Trust Be Conditioned? Characterizing and Attacking Skill-Conditioned Reputation in Agent Populations

Open platforms increasingly route tasks among heterogeneous LLM agents--differing in base model, scaffold, and tool stack--whose competence varies sharply by skill: an agent excellent at one skill may be useless at another. The standard reputation approach summarizes each agent b…

arXiv cs.CL TIER_1 English(EN) · Gyuhak Kim · 2026-06-12 07:01

CacheRL: Multi-Turn Tool-Calling Agents via Cached Replay and Hybrid Rewards

We present CacheRL, a system for training small agent foundation models that achieves 92 percent process accuracy on multi-step tool-calling tasks, approaching GPT-5's 94 percent while requiring 100 times less compute. Our approach addresses three challenges in practical agent tr…

arXiv cs.CL TIER_1 English(EN) · Shankara Bhargava · 2026-06-12 06:27

Graph-Based Targeted Backpropagation for Context Adaptation in Multi-LLM Agent Systems

Context adaptation automates prompt engineering in LLM-based systems by iteratively revising tunable prompts from task feedback, without modifying model weights. Extending this paradigm to multi-LLM agentic systems is crucial: existing methods suffer from inaccurate credit assign…

arXiv cs.AI TIER_1 English(EN) · Longkun Hao, Hongyu Lin, Hao Li, Zhichao Yang, Haojie Hao, Dongshuo Huang, Haitao Yang, Hongyu Ge, Ming jie Xie, Yanjun Wu, Zi Hao Yin, Yan Bai, Yihang Lou · 2026-06-12 04:00

Speculative Rollback Correction for High-Quality, Diverse Web Agent Imitation

arXiv:2606.12485v1 Announce Type: cross Abstract: Training interactive web agents through imitation learning from expert trajectories has emerged as a highly effective approach. However, determining the optimal timing for expert intervention presents a critical challenge in this …

arXiv cs.CL TIER_1 English(EN) · Yijun Ma, Zehong Wang, Yiyang Li, Ziming Li, Xiaoguang Guo, Weixiang Sun, Chuxu Zhang, Yanfang Ye · 2026-06-12 04:00

ProPlay: Programmatic World Models for Self-Evolving LLM Agents

arXiv:2606.12780v1 Announce Type: cross Abstract: Self-evolving agents are expected to improve through interaction without external supervision, but this remains difficult in partially observable environments where agents must explore actively, learn from limited feedback, and de…

arXiv cs.CL TIER_1 English(EN) · Yaxin Du, Yifan Zhou, Yujie Ge, Jiajun Wang, Xianghe Pang, Shuo Tang, Tuney Zheng, Bryan Dai, Jian Yang, Siheng Chen · 2026-06-12 04:00

HyperTool: Beyond Step-Wise Tool Calls for Tool-Augmented Agents

arXiv:2606.13663v1 Announce Type: new Abstract: Tool-augmented LLM agents commonly rely on step-wise atomic tool calls, where each invocation, observation, and value transfer is exposed in the main reasoning trace. This creates an execution-granularity mismatch: locally de…

arXiv cs.CL TIER_1 English(EN) · Elias Lumer, Sahil Sen, Kevin Paul, Vamse Kumar Subbiah · 2026-06-12 04:00

Harnessing Recursive Agents

arXiv:2606.13643v1 Announce Type: new Abstract: Recursive language models (RLMs) showed that recursion over model calls is an effective strategy for long-context reasoning, and production coding agents have begun to write code that spawns subagents at scale, most recently in Anth…

arXiv cs.CL TIER_1 English(EN) · Kunfeng Chen, Qihuang Zhong, Juhua Liu, Bo Du · 2026-06-12 04:00

SkillCAT: Contrastive Assessment and Topology-Aware Skill Self-Evolution for LLM Agents

arXiv:2606.13317v1 Announce Type: new Abstract: Skill self-evolution methods for LLM agents aim to turn execution trajectories into reusable skill documents, but current pipelines typically learn from one trajectory per task, merge candidate skill patches before checking them, an…

arXiv cs.AI TIER_1 English(EN) · Xu Li, Simon Yu, Minzhou Pan, Yiyou Sun, Bo Li, Dawn Song, Xue Lin, Weiyan Shi · 2026-06-12 04:00

Less Safe Across Multiple Turns: Benchmarking and Defending Against Multi-Turn Safety Risks in Tool-Using Agents

arXiv:2602.13379v2 Announce Type: replace-cross Abstract: LLM-based agents are becoming increasingly capable, yet their safety lags behind. This creates a gap between what agents can do and should do. This gap widens as agents engage in multi-turn interactions and employ diverse …

arXiv cs.AI TIER_1 English(EN) · Zihao Wang, Yiming Li, Yutong Wu, Zheyu Liu, Kangjie Chen, Fok Kar Wai, Pin-Yu Chen, Vrizlynn L. L. Thing, Bo Li, Dacheng Tao, Tianwei Zhang · 2026-06-12 04:00

Who Pays? A Stakeholder-Centric Prompt Injection Benchmark for Real-World Web Agents

arXiv:2606.13385v1 Announce Type: cross Abstract: Web agents driven by large language models (LLMs) are increasingly deployed in real-world environments, where they operate over untrusted web content and execute actions with direct consequences. This makes them vulnerable to prom…

arXiv cs.AI TIER_1 English(EN) · Saehun Chun, Wonje Choi, Sera Choi, Sanghyun Ahn, Honguk Woo · 2026-06-12 04:00

Functional Cache Grafting: Robust and Fast Code Policy Synthesis for Embodied Agents

arXiv:2606.13097v1 Announce Type: cross Abstract: Code-writing large language models (CodeLLMs) generate executable code policies for embodied agents by translating natural language goals and environmental constraints into structured control programs. However, policy generation i…

arXiv cs.AI TIER_1 English(EN) · Chejian Xu, Zhaorun Chen, Jingyang Zhang, Freddy Lecue, Avni Kothari, Sarah Tan, Wenbo Guo, Bo Li · 2026-06-12 04:00

MAStrike: Shapley-Based Coordinated Red Teaming of Multi-Agent Systems

arXiv:2606.12918v1 Announce Type: cross Abstract: Hierarchical multi-agent systems (MAS) are rapidly being deployed in high-stakes workflows across domains such as finance and software engineering. In these systems, safety and security are inherently distributed across role-speci…

arXiv cs.AI TIER_1 English(EN) · Tarun Sharma · 2026-06-12 04:00

SMSR: Certified Defense Against Runtime Memory Poisoning in Persistent LLM Agent Systems

arXiv:2606.12703v1 Announce Type: cross Abstract: Retrieval-augmented generation (RAG) agents increasingly run with persistent memory that accumulates across user sessions. This creates a new attack surface: an adversary interacting only through normal channels can inject crafted…

arXiv cs.AI TIER_1 English(EN) · Tianyu Ding, Jianhong Xin, Juan Pablo De la Cruz Weinstein · 2026-06-12 04:00

Keeping the Policy Gradient in Charge: Sibling-Guided Credit Distillation for Long-Horizon Tool-Use Agents

arXiv:2606.12634v1 Announce Type: cross Abstract: Long-horizon tool-use reinforcement learning can learn from outcome verification, but its trajectory-level advantage is broadcast across many reasoning, API, and answer tokens. Self-distillation promises a denser signal by reusing…

arXiv cs.AI TIER_1 English(EN) · Ruxue Shi, Yili Wang, Mengnan Du, Qinggang Zhang, Rui Miao, Yixin Liu, Xin Wang · 2026-06-12 04:00

SAIGuard: Communication State Simulation for Proactive Defense in LLM Multi-Agent Systems

arXiv:2606.12474v1 Announce Type: cross Abstract: LLM-based multi-agent systems (MAS) solve complex tasks through inter-agent collaboration, but their communication-driven nature also allows security risks to spread across agents and trigger system-wide failures. Existing MAS def…

arXiv cs.AI TIER_1 English(EN) · Zongsheng Cao, Bihao Zhan, Jinxin Shi, Jiong Wang, Fangchen Yu, Zhijie Zhong, Zijie Guo, Tianshuo Peng, Zhuo Liu, Yi Xie, Xiang Zhuang, Yue Fan, Runmin Ma, Shiyang Feng, Xiangchao Yan, Anran Liu, Peng Ye, Wenlong Zhang, Shufei Zhang, Chunfeng Song, Fengh… · 2026-06-12 04:00

Agents-K1: Towards Agent-Native Knowledge Orchestration

arXiv:2606.13669v1 Announce Type: new Abstract: Current LLM-based research agents have advanced through agent orchestration, yet largely overlook scientific knowledge orchestration. Existing works often reduce papers to abstracts, surface mentions, and flat "cites" edges, …

arXiv cs.AI TIER_1 English(EN) · Amy Xin, Jiening Siow, Junjie Wang, Zijun Yao, Fanjin Zhang, Jian Song, Lei Hou, Juanzi Li · 2026-06-12 04:00

EurekAgent: Agent Environment Engineering Is All You Need for Autonomous Scientific Discovery

arXiv:2606.13662v1 Announce Type: new Abstract: LLM-based agents have shown increasing potential in automating scientific discovery. Given an optimizable metric and an execution environment, they can propose, validate, and iterate scientific solutions, and have produced results t…

arXiv cs.AI TIER_1 English(EN) · Xiaoyuan Liu, Jianhong Tu, Yuqi Chen, Siyuan Xie, Sihan Ren, Tianneng Shi, Gal Gantar, Evan Sandoval, Donghyun Lee, Daniel Miao, Peter J. Gilbert, Nick Hynes, Mauro Staver, Warren He, David Marn, Andrew Low, Xi Zhang, Elron Bandel, Michal Shmueli-Scheuer… · 2026-06-12 04:00

AgentBeats: Agentifying Agent Evaluation for Openness, Standardization, and Reproducibility

arXiv:2606.13608v1 Announce Type: new Abstract: Agent systems are advancing quickly across domains, but their evaluation remains fragmented. Most benchmarks rely on fixed, LLM-centric harnesses that require heavy integration, create test-production mismatch, and limit fair compar…

arXiv cs.AI TIER_1 English(EN) · King Yeung Tsang, Zihao Zhao, Vishal Venkataramani, Haizhou Shi, Zixuan Ke, Semih Yavuz, Shafiq Joty, Hao Wang · 2026-06-12 04:00

Reward Modeling for Multi-Agent Orchestration

arXiv:2606.13598v1 Announce Type: new Abstract: Multi-Agent Systems (MAS) built on Large Language Models (LLMs) require effective orchestration to coordinate specialized agents, yet training such orchestrators is hindered by limited supervision and high computational cost. We pro…

arXiv cs.AI TIER_1 English(EN) · Ali Elahi, Barbara Di Eugenio · 2026-06-12 04:00

Multi-Agent Protocols with Aggregated Confidence Signals

arXiv:2606.13591v1 Announce Type: new Abstract: Confidence is used for reliability, oversight, and a range of downstream decision tasks in Natural Language Processing (NLP), yet no existing method produces or evaluates a confidence for the output of a multiagent system. Prior wor…

arXiv cs.AI TIER_1 English(EN) · Prathyusha Jwalapuram, Hehai Lin, Chuyuan Li, Fangkai Jiao, Sudong Wang, Yifei Ming, Zixuan Ke, Chengwei Qin, Giuseppe Carenini, Shafiq Joty · 2026-06-12 04:00

The Illusion of Multi-Agent Advantage

arXiv:2606.13003v1 Announce Type: new Abstract: Prevailing wisdom posits that Multi-Agent Systems (MAS) are superior to Single-Agent Systems (SAS), citing advantages like context protection, parallel processing and distributed decision-making. However, empirical support for this …

arXiv cs.AI TIER_1 English(EN) · Jetlir Duraj, Jayanth Yetukuri, Shuang Zhou, Dhruv Varma, Rui Kong, Ishita Khan, Qunzhi Zhou · 2026-06-12 04:00

Iterative Refinement Search: A Two-Agent Simulation Framework for Evaluating Agentic Search Architectures in E-Commerce

arXiv:2606.12924v1 Announce Type: new Abstract: We present a modular two-agent simulation framework for evaluating conversational shopping assistant architectures. An independent buyer agent, configured with personas, missions, and patience levels, is paired with an interchangeab…

arXiv cs.AI TIER_1 English(EN) · Xiaoxuan Wang, Haixin Wang, Alexander Taylor, Jason Cong, Yizhou Sun, Wei Wang · 2026-06-12 04:00

HarnessBridge: A Learnable Bidirectional Controller for LLM Agent Harnesses

arXiv:2606.12882v1 Announce Type: new Abstract: Large language models are increasingly deployed as agents for long-horizon tasks, yet their performance is shaped not only by model capability and environment design, but also by the harness that mediates agent--environment interact…

arXiv cs.AI TIER_1 English(EN) · Renmin Cheng (The Hong Kong University of Science and Technology), Changhao Chen (The Hong Kong University of Science and Technology) · 2026-06-12 04:00

WISE: A Long-Horizon Agent in Minecraft with Why-Which Reasoning

arXiv:2606.12852v1 Announce Type: new Abstract: Rapid advances have been made in developing general-purpose embodied agents in environments like Minecraft through the adoption of LLM-augmented hierarchical approaches. Despite their promise, low-level controllers often become perfo…

arXiv cs.AI TIER_1 English(EN) · Woong Shin, Craig A. Bridges, Marshall T. McDonnell, Rafael Ferreira da Silva · 2026-06-12 04:00

Fantastic Scientific Agents and How to Build Them: AgentBuild for Rietveld Refinement

arXiv:2606.12834v1 Announce Type: new Abstract: As scientific workflows shift from deterministic executables to LLM-based agents, the development practices on offer, such as fine-tuning, reinforcement learning, and prompt-and-go, bury the scientist's judgment. We propose treating…

arXiv cs.AI TIER_1 English(EN) · Kushal Raj Bhandari, Ling Yue, Ching-Yun Ko, Dhaval Patel, Shaowu Pan, Pin-Yu Chen, Jianxi Gao · 2026-06-12 04:00

Evoflux: Inference-Time Evolution of Executable Tool Workflows for Compact Agents

arXiv:2606.12674v1 Announce Type: new Abstract: Compact language models (LMs) reduce cost, latency, and deployment risk for tool agents. Yet MCP-style tool use requires more than isolated function calling: an agent must discover tools from live catalogs, satisfy schemas, preserve…

arXiv cs.AI TIER_1 English(EN) · Vasily Ilin · 2026-06-12 00:45

Formalizing Numerical Analysis: An Agent Pipeline and Quality Audit Beyond Kernel Acceptance

Recent work has demonstrated that coding agents can formalize entire advanced mathematics textbooks in Lean 4, yet existing efforts concentrate on branches of mathematics already well-represented in mathlib and measure success solely through kernel acceptance. We address both lim…

Hugging Face Daily Papers TIER_1 English(EN) · 2026-06-12 00:00

HarnessX: A Composable, Adaptive, and Evolvable Agent Harness Foundry

HarnessX enables adaptive and evolvable AI agent runtime interfaces through compositional primitives, trace-driven evolution, and feedback loops that improve both harness design and model training.

Hugging Face Daily Papers TIER_1 English(EN) · 2026-06-11 17:59

InterleaveThinker: Enhancing Agentic Interleaved Generation

Recent image generators have demonstrated impressive photorealism and instruction-following capabilities in single-image generation and editing. However, constrained by their architectures, they cannot achieve interleaved generation (text-image sequence), which has crucial applic…

arXiv cs.AI TIER_1 English(EN) · Lei Bai · 2026-06-11 17:58

Agents-K1: Towards Agent-Native Knowledge Orchestration

Current LLM-based research agents have advanced through agent orchestration, yet largely overlook scientific knowledge orchestration. Existing works often reduce papers to abstracts, surface mentions, and flat "cites" edges, omitting key entities, claims, evidence, mechani…

arXiv cs.CL TIER_1 English(EN) · Siheng Chen · 2026-06-11 17:56

HyperTool: Beyond Step-Wise Tool Calls for Tool-Augmented Agents

Tool-augmented LLM agents commonly rely on step-wise atomic tool calls, where each invocation, observation, and value transfer is exposed in the main reasoning trace. This creates an execution-granularity mismatch: locally deterministic tool workflows are unfolded into rep…

arXiv cs.AI TIER_1 English(EN) · Juanzi Li · 2026-06-11 17:56

EurekAgent: Agent Environment Engineering Is All You Need for Autonomous Scientific Discovery

LLM-based agents have shown increasing potential in automating scientific discovery. Given an optimizable metric and an execution environment, they can propose, validate, and iterate scientific solutions, and have produced results that outperform human-designed approaches. As mod…

arXiv cs.CL TIER_1 English(EN) · Juanzi Li · 2026-06-11 17:56

EurekAgent: Agent Environment Engineering Is All You Need for Autonomous Scientific Discovery

LLM-based agents have shown increasing potential in automating scientific discovery. Given an optimizable metric and an execution environment, they can propose, validate, and iterate scientific solutions, and have produced results that outperform human-designed approaches. As mod…

arXiv cs.CL TIER_1 English(EN) · Vamse Kumar Subbiah · 2026-06-11 17:47

Harnessing Recursive Agents

Recursive language models (RLMs) showed that recursion over model calls is an effective strategy for long-context reasoning, and production coding agents have begun to write code that spawns subagents at scale, most recently in Anthropic's dynamic workflows. We name and study the…

arXiv cs.AI TIER_1 English(EN) · Dawn Song · 2026-06-11 17:23

AgentBeats: Agentifying Agent Evaluation for Openness, Standardization, and Reproducibility

Agent systems are advancing quickly across domains, but their evaluation remains fragmented. Most benchmarks rely on fixed, LLM-centric harnesses that require heavy integration, create test-production mismatch, and limit fair comparison across diverse agent designs. The root prob…

arXiv cs.MA (Multiagent) TIER_1 English(EN) · Hao Wang · 2026-06-11 17:16

Reward Modeling for Multi-Agent Orchestration

Multi-Agent Systems (MAS) built on Large Language Models (LLMs) require effective orchestration to coordinate specialized agents, yet training such orchestrators is hindered by limited supervision and high computational cost. We propose Orchestration Reward Modeling (OrchRM), a s…

arXiv cs.MA (Multiagent) TIER_1 English(EN) · Qing Qu · 2026-06-11 17:12

See What I See, Know What I Think: Dense Latent Communication Between Heterogeneous Agents

Multi-agent systems communicate mostly through text, paying a lossy and expensive decode and re-encode cost. KV-cache communication is a promising alternative, yet most prior work is homogeneous, using duplicate copies of the same model, and avoids the central challenge of cross-…

arXiv cs.MA (Multiagent) TIER_1 English(EN) · Barbara Di Eugenio · 2026-06-11 17:12

Multi-Agent Protocols with Aggregated Confidence Signals

Confidence is used for reliability, oversight, and a range of downstream decision tasks in Natural Language Processing (NLP), yet no existing method produces or evaluates a confidence for the output of a multiagent system. Prior work uses confidence within multiagent debate (MAD)…

arXiv cs.AI TIER_1 English(EN) · Tianwei Zhang · 2026-06-11 14:12

Who Pays? A Stakeholder-Centric Prompt Injection Benchmark for Real-World Web Agents

Web agents driven by large language models (LLMs) are increasingly deployed in real-world environments, where they operate over untrusted web content and execute actions with direct consequences. This makes them vulnerable to prompt-injection attacks, in which seemingly benign co…

arXiv cs.CL TIER_1 English(EN) · Bo Du · 2026-06-11 13:12

SkillCAT: Contrastive Assessment and Topology-Aware Skill Self-Evolution for LLM Agents

Skill self-evolution methods for LLM agents aim to turn execution trajectories into reusable skill documents, but current pipelines typically learn from one trajectory per task, merge candidate skill patches before checking them, and load the full skill corpus before inference. W…

arXiv cs.MA (Multiagent) TIER_1 English(EN) · Shafiq Joty · 2026-06-11 07:39

The Illusion of Multi-Agent Advantage

Prevailing wisdom posits that Multi-Agent Systems (MAS) are superior to Single-Agent Systems (SAS), citing advantages like context protection, parallel processing and distributed decision-making. However, empirical support for this claim relies primarily on comparisons with SAS b…

arXiv cs.CL TIER_1 English(EN) · Kexin Ding, Yang Zhou, Can Jin, Feng Tong, Mu Zhou, Dimitris N. Metaxas · 2026-06-11 04:00

Agent Skill Evaluation and Evolution: Framework and Benchmark

arXiv:2606.11435v1 Announce Type: new Abstract: The growth of agent skills has transformed how agentic systems are built, evaluated, and deployed. As skill libraries continue to scale, rigorous evaluation becomes critical to ensuring their utility, quality, and safety in real-wor…

arXiv cs.CL TIER_1 English(EN) · Shi Liu, Jiayao Chen, Chengwei Qin, Yanqing Hu, Jufan Zhang, Linyi Yang · 2026-06-11 04:00

Notes2Skills: From Lab Notes to Certainty-Aware Scientific Agent Skills

arXiv:2606.11897v1 Announce Type: new Abstract: Scientific discovery workflows usually contain and rely heavily on lab notes, where researchers record observations, interpret uncertain results, and plan follow-up experiments. Such informative lab notes preserve evolving scientifi…

arXiv cs.CL TIER_1 English(EN) · Andrew Semenov, Svyatoslav Dorofeev · 2026-06-11 04:00

Beyond Compression: Structured Context Eviction for Long-Horizon Agents

arXiv:2606.11213v1 Announce Type: new Abstract: We present Context Window Lifecycle (CWL), a context-management scheme that gives long-horizon LLM agents an effectively unbounded working horizon. As a session accumulates history, CWL keeps the context within budget through gradua…

arXiv cs.AI TIER_1 English(EN) · Issa Hanou, Eric Kemmeren, Devin Wild Thomas, Mathijs de Weerdt · 2026-06-11 04:00

Precomputing Multi-Agent Path Replanning Using Temporal Flexibility

arXiv:2601.04884v3 Announce Type: replace Abstract: Executing a multi-agent plan can be challenging when an agent is delayed, because this typically creates conflicts with other agents. So, we need to quickly find a new safe plan. Replanning only the delayed agent often does not …

arXiv cs.AI TIER_1 English(EN) · Zhuoran Li, Ling Pan, Jiatai Huang, Longbo Huang · 2026-06-11 04:00

Improving Generalization and Data Efficiency in Offline Multi-Agent Reinforcement Learning with Diffusion Models

arXiv:2307.01472v2 Announce Type: replace Abstract: We present a novel Diffusion Offline Multi-agent Model (DOM2) for offline Multi-Agent Reinforcement Learning (MARL). Different from existing algorithms that rely mainly on conservatism in policy design, DOM2 enhances policy expr…

arXiv cs.AI TIER_1 English(EN) · Jiachun Li, Zhuoran Jin, Tianyi Men, Yupu Hao, Kejian Zhu, Lingshuai Wang, Dongqi Huang, Longxiang Wang, Shengjia Hua, Lu Wang, Jinshan Gao, Hongbang Yuan, Ruilin Xu, Kang Liu, Jun Zhao · 2026-06-11 04:00

Agent Environment Engineering for Large Language Models: A Survey of Environment Modeling, Synthesis, Evaluation, and Applications

arXiv:2606.12191v1 Announce Type: cross Abstract: Environments serve as interactive systems for large language model (LLM) based agents across diverse scenarios and play a crucial role in driving the continual evolution of model capabilities. Despite this importance, existing wor…

arXiv cs.AI TIER_1 English(EN) · Sawyer Zhang, Alexander Wang, Sophie Lei · 2026-06-11 04:00

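Layer-Isolated Evaluation: Gating the Deterministic Scaffolding of a Production LLM Agent with an LLM-Free, Regression-Locked Test Harness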

arXiv:2606.11686v1 Announce Type: cross Abstract: End-to-end task-success is the dominant way to evaluate LLM agents, but one aggregate number tells you that an agent regressed, not where. We present layer-isolated evaluation: a deployed ordering agent is decomposed into a fixed …

arXiv cs.AI TIER_1 English(EN) · Tu Lan, Chaowei Xiao · 2026-06-11 04:00

Runtime Skill Auditing: Targeted Runtime Probing for Agent Skill Security

arXiv:2606.11671v1 Announce Type: cross Abstract: Agent skills let LLM agents reuse instructions, resources, tools, and workflows, but they also create a new place for malicious behavior to hide. A skill may look benign in its documentation or code while becoming harmful only whe…

arXiv cs.AI TIER_1 English(EN) · Siyuan Luo, Nairong Zheng, Lin Zhou, Tiankuo Yao, Shengyou Yuan, Haojia Yu, Cong Pang, Jiapeng Luo, Lewei Lu · 2026-06-11 04:00

ISE: An Execution-Grounded Recipe for Multi-Turn OS-Agent Trajectories

arXiv:2606.11520v1 Announce Type: cross Abstract: Training capable OS agents requires data that simultaneously captures structured user intents, multi-turn task delegation, and grounded tool execution--properties absent from existing datasets. We propose ISE (Intent -> Simulate -…

arXiv cs.AI TIER_1 English(EN) · Lingzhi Yuan, Chenghao Deng, Fangxu Yu, Souradip Chakraborty, Mohammad Rostami, Furong Huang · 2026-06-11 04:00

FlowBank: Query-Adaptive Agentic Workflow Optimization via Precomputation and Reuse

arXiv:2606.11290v1 Announce Type: cross Abstract: Large Language Model (LLM)-based multi-agent systems are increasingly powerful, but current agentic workflow optimization paradigms make an unsatisfying trade-off. Task-level methods spend substantial offline compute yet deploy on…

arXiv cs.AI TIER_1 English(EN) · Zhiyu Chen, Zihan Guo, Bo Huang, Bingwei Lu, Jianghao Lin, Yuanjian Zhou, Weinan Zhang · 2026-06-11 04:00

SkillJuror: Measuring How Agent Skill Organization Changes Runtime Behavior

arXiv:2606.11543v1 Announce Type: new Abstract: Agent Skills augment large language model (LLM) agents with procedural knowledge at inference time, but current benchmarks rarely distinguish what a Skill says from how it is organized. We study this distinction through Progressive …

arXiv cs.AI TIER_1 English(EN) · Adithya Srinivasan, Devesh Paragiri · 2026-06-11 04:00

Search Methods for Long-Horizon Research Agents

arXiv:2606.11522v1 Announce Type: new Abstract: Autoresearch agents now propose, evaluate, and select scientific candidates against a metric, and that metric is usually an aggregate reduced over a heterogeneous space of regions, slices, or cohorts. We show that when scientific va…

arXiv cs.AI TIER_1 English(EN) · Ahasan Kabir, Jiaqi Xue, Mengxin Zheng, Qian Lou · 2026-06-11 04:00

INFRAMIND: Infrastructure-Aware Multi-Agent Orchestration

arXiv:2606.11440v1 Announce Type: new Abstract: Existing multi-agent LLM orchestration methods, ranging from brute-force ensembles to learned routers, select models and topologies based on task and model features. However, these methods do not consider the runtime state of the se…

Hugging Face Daily Papers TIER_1 English(EN) · 2026-06-11 00:00

See What I See, Know What I Think: Dense Latent Communication Between Heterogeneous Agents

Heterogeneous multi-agent systems can effectively transfer knowledge through aligned KV-cache communication, achieving better performance than text-based methods with reduced computational costs.

Hugging Face Daily Papers TIER_1 English(EN) · 2026-06-11 00:00

HarnessBridge: A Learnable Bidirectional Controller for LLM Agent Harnesses

Learnable harness controller called HarnessBridge is introduced to parameterize agent-environment interfaces through bidirectional projections, achieving performance comparable to specialized harnesses with reduced computational overhead.

Hugging Face Daily Papers TIER_1 English(EN) · 2026-06-11 00:00

EurekAgent: Agent Environment Engineering Is All You Need for Autonomous Scientific Discovery

Environment engineering enhances autonomous scientific discovery by designing structured agent environments that optimize behaviors like exploration and collaboration while mitigating issues such as reward hacking and human oversight friction, as demonstrated by the EurekAgent sy…

Hugging Face Daily Papers TIER_1 English(EN) · 2026-06-11 00:00

InterleaveThinker: Enhancing Agentic Interleaved Generation

InterleaveThinker enables interleaved generation capabilities for image generators through a multi-agent pipeline with planner and critic agents, achieving performance comparable to state-of-the-art models while enhancing reasoning benchmarks.

arXiv cs.MA (Multiagent) TIER_1 English(EN) · Ozlem Ozmen Garibay · 2026-06-10 21:55

Smarter Saboteurs, Better Fixers: Scalability and Safety of Linear Multi-Agent Workflows

As LLM-based multi-agent systems (MAS) are deployed in the wild, the resilience of their collaboration structures against adversarial compromise becomes a critical safety concern. Attackers may leverage prompt-injection or jailbreaking to sabotage individual agents within MAS wor…

arXiv cs.MA (Multiagent) TIER_1 English(EN) · Olly Styles · 2026-06-10 21:21

WorkBench Revisited: Workplace Agents Two Years On

The best agent on WorkBench in March 2024, GPT-4, completed 43% of tasks and took an unintended harmful action, such as emailing the wrong person, on 26% of them. We re-visit the benchmark in June 2026 and find that the best agent to date, Claude Opus 4.8, completes 89% and takes…

Hugging Face Daily Papers TIER_1 English(EN) · 2026-06-10 17:47

APPO: Agentic Procedural Policy Optimization

Agentic Reinforcement Learning method that improves multi-turn tool-use capabilities by refining branching decisions and credit assignment through fine-grained decision points and procedure-level advantage scaling.

arXiv cs.AI TIER_1 English(EN) · Jun Zhao · 2026-06-10 15:15

Agent Environment Engineering for Large Language Models: A Survey of Environment Modeling, Synthesis, Evaluation, and Applications

Environments serve as interactive systems for large language model (LLM) based agents across diverse scenarios and play a crucial role in driving the continual evolution of model capabilities. Despite this importance, existing work lacks a systematic categorization and deep analy…

arXiv cs.CL TIER_1 English(EN) · Linyi Yang · 2026-06-10 10:25

Notes2Skills: From Lab Notes to Certainty-Aware Scientific Agent Skills

Scientific discovery workflows usually contain and rely heavily on lab notes, where researchers record observations, interpret uncertain results, and plan follow-up experiments. Such informative lab notes preserve evolving scientific reasoning and author uncertainty, rather than …

Hugging Face Daily Papers TIER_1 English(EN) · 2026-06-10 10:25

Notes2Skills: From Lab Notes to Certainty-Aware Scientific Agent Skills

Scientific discovery workflows usually contain and rely heavily on lab notes, where researchers record observations, interpret uncertain results, and plan follow-up experiments. Such informative lab notes preserve evolving scientific reasoning and author uncertainty, rather than …

Hugging Face Daily Papers TIER_1 English(EN) · 2026-06-10 05:55

Layer-Isolated Evaluation: Gating the Deterministic Scaffolding of a Production LLM Agent with an LLM-Free, Regression-Locked Test Harness

End-to-end task-success is the dominant way to evaluate LLM agents, but one aggregate number tells you that an agent regressed, not where. We present layer-isolated evaluation: a deployed ordering agent is decomposed into a fixed taxonomy of layers (ontology, intent, routing, dec…

arXiv cs.CL TIER_1 English(EN) · Sophie Lei · 2026-06-10 05:55

Layer-Isolated Evaluation: Gating the Deterministic Scaffolding of a Production LLM Agent with an LLM-Free, Regression-Locked Test Harness

End-to-end task-success is the dominant way to evaluate LLM agents, but one aggregate number tells you that an agent regressed, not where. We present layer-isolated evaluation: a deployed ordering agent is decomposed into a fixed taxonomy of layers (ontology, intent, routing, dec…

arXiv cs.MA (Multiagent) TIER_1 English(EN) · Xin Wang · 2026-06-10 05:37

SAIGuard: Communication State Simulation for Proactive Defense in LLM Multi-Agent Systems

LLM-based multi-agent systems (MAS) solve complex tasks through inter-agent collaboration, but their communication-driven nature also allows security risks to spread across agents and trigger system-wide failures. Existing MAS defenses mainly follow a reactive paradigm after exec…

arXiv cs.CL TIER_1 English(EN) · Yunan Lu, Ryan Shea, Yusen Zhang, Zhou Yu · 2026-06-10 04:00

VISTA: A Versatile Interactive User Simulation Toolkit for Agent Evaluation

arXiv:2606.11079v1 Announce Type: new Abstract: Evaluation remains a critical bottleneck for interactive agent development. Existing evaluation methods often rely on static benchmarks, which fail to capture the dynamic, multi-step nature of agentic behavior and struggle to expose…

arXiv cs.CL TIER_1 English(EN) · Jayoo Hwang, Xiaowen Zhang, Vedant Padwal · 2026-06-10 04:00

WebChallenger: A Reliable and Efficient Generalist Web Agent

arXiv:2606.10423v1 Announce Type: new Abstract: Autonomous web navigation remains challenging for LLM agents, and the strongest generalist systems rely on proprietary reasoning models whose inference cost is prohibitive for the repetitive tasks where such agents would be most use…

arXiv cs.AI TIER_1 English(EN) · Youjin Wang, Run Zhou, Yingjie Ma, Rong Fu, Jiani Liang, Shuaishuai Cao, Min Huang, Tao Fang, Liangming Pan · 2026-06-10 04:00

ASA: Backbone-Training-Free Representation Engineering for Tool-Calling Agents

arXiv:2602.04935v3 Announce Type: replace-cross Abstract: Adapting LLM agents to domain-specific tool calling remains notably brittle under evolving interfaces. Prompt and schema engineering is easy to deploy but often fragile under distribution shift and strict parsers, while co…

arXiv cs.AI TIER_1 English(EN) · Samuel Holt, Max Ruiz Luyten, Thomas Pouplin, Mihaela van der Schaar · 2026-06-10 04:00

Fact-Augmented Lookahead Planning for LLM Agents

arXiv:2506.09171v2 Announce Type: replace-cross Abstract: Large Language Models (LLMs) are increasingly capable, but LLM agents still struggle to plan effectively in interactive, partially observable, long-horizon environments when search is unguided or recent history is insuffic…

arXiv cs.AI TIER_1 English(EN) · Hangtao Zhang, Chenyu Zhu, Xianlong Wang, Ziqi Zhou, Changgan Yin, Minghui Li, Lulu Xue, Yichen Wang, Shengshan Hu, Aishan Liu, Peijin Guo, Leo Yu Zhang · 2026-06-10 04:00

BadRobot: Jailbreaking Embodied LLM Agents in the Physical World

arXiv:2407.20242v5 Announce Type: replace-cross Abstract: Embodied AI represents systems where AI is integrated into physical entities. Large Language Model (LLM), which exhibits powerful language understanding abilities, has been extensively employed in embodied AI by facilitati…

arXiv cs.AI TIER_1 English(EN) · Valerie Chen, Rohit Malhotra, Xingyao Wang, Juan Michelini, Xuhui Zhou, Aditya Bharat Soni, Hoang H. Tran, Calvin Smith, Ameet Talwalkar, Graham Neubig · 2026-06-10 04:00

How Can We Assess Human-Agent Interactions? A Case Study in Software Agent Design

arXiv:2510.09801v3 Announce Type: replace Abstract: While benchmarks measure the accuracy of LLM-powered agents, they mostly assume full automation, failing to represent the collaborative nature of real-world use cases. In this paper, we make two major steps towards the rigorous …

arXiv cs.AI TIER_1 English(EN) · Genta Indra Winata, Amartya Chakraborty, Yuzhen Lin, Swasthi P Rao, Shikhhar Siingh, Houhan Lu, Nadia Bathaee, Sriharsha Hatwar, Paresh Dashore, Anmol Jain, Kshitij Tayal, Xiuzhu Lin, Anirban Das, Sambit Sahu, Shi-Xiong Zhang · 2026-06-10 04:00

T1-Bench: Benchmarking Multi-Scenario Agents in Real-World Domains

arXiv:2606.11070v1 Announce Type: cross Abstract: Recent advances in reasoning and tool-calling capabilities of large language models (LLMs) have enabled increasingly capable agentic systems. However, existing benchmarks remain limited in task complexity, realism, and domain dive…

arXiv cs.AI TIER_1 English(EN) · Yuzhen Mao, Azalia Mirhoseini · 2026-06-10 04:00

Decentralized Multi-Agent Systems with Shared Context

arXiv:2606.10662v1 Announce Type: cross Abstract: Multi-agent systems (MAS) can scale large language model reasoning at test time by decomposing complex problems into parallel subtasks. However, most existing MAS rely on centralized orchestration, where a main agent assigns work,…

arXiv cs.AI TIER_1 English(EN) · Yuchen Ling, Shengcheng Yu, Zhenyu Chen, Chunrong Fang · 2026-06-10 04:00

Toward Secure LLM Agents: Threat Surfaces, Attacks, Defenses, and Evaluation

arXiv:2606.10749v1 Announce Type: cross Abstract: Large language model (LLM) agents are rapidly moving from conversational interfaces to software components that plan, invoke tools, maintain memory, and act on external environments. This transition changes the nature of security …

arXiv cs.AI TIER_1 English(EN) · Sirui Liang, Bohan Yu, Peiyu Wang, Shiguang Guo, Wenxing Hu, Pengfei Cao, Jian Zhao, Cao Liu, Ke Zeng, Xunliang Cai, Kang Liu · 2026-06-10 04:00

STAGE-Claw: Automated State-Based Agent Benchmarking for Real-World Scenarios

arXiv:2606.10394v1 Announce Type: new Abstract: Large language models are increasingly used to power personal agents for everyday applications, but evaluating these agents remains a challenge. Existing benchmarks still rely on sandboxed artifacts, static task design, and coarse s…

arXiv cs.AI TIER_1 English(EN) · Abhilasha Lodha, Mahsa Pahlavikhah Varnosfaderani, Abir Chakraborty, Abhinav Mithal · 2026-06-10 04:00

Less Context, Better Agents: Effective Context Engineering for Long-Horizon Tool-Using LLM Agents

arXiv:2606.10209v1 Announce Type: new Abstract: Large language models deployed as autonomous agents for enterprise workflows face a key challenge: verbose tool responses from enterprise systems can cause context overflow, stale-state errors, and high inference cost. We study this…

arXiv cs.AI TIER_1 English(EN) · Junli Zha, Jinbo Wang, Chao Zhou, Xiang Song · 2026-06-10 04:00

Trace2Policy: From Expert Behavior Traces to Self-Evolving Decision Agents

arXiv:2606.10457v1 Announce Type: new Abstract: Decision rules that enterprise experts apply tacitly -- in auditing, compliance, and contract review -- can be systematically recovered and improved through iterative error analysis. We present Trace2Policy, whose core mech…

arXiv cs.AI TIER_1 English(EN) · Filippo Tonini, Federico Torrielli, Anton Danholt Lautrup, Peter Schneider-Kamp, Mustafa Mert Çelikok, Lukas Galke Poech · 2026-06-10 04:00

Arbiter Agent: Continuously Monitoring Multi-Agent Conversations to Detect Emergent Misalignment

arXiv:2606.10747v1 Announce Type: new Abstract: As AI systems built from multiple language-model agents become more common, they are increasingly used to make decisions together: discussing, negotiating, and acting on shared tasks. While individual agents may appear well-aligned …

arXiv cs.AI TIER_1 English(EN) · Huanshuo Dong, Keyao Zhang, Hong Wang, Zhezheng Hao, Zhiwei Zhuang, Ziyan Liu, Jiacong Wang, Gengyuan Liu, Xin Jin · 2026-06-10 04:00

AutoPDE: Reliable Agentic PDE Solving via Explicitly Represented Solver Strategies

arXiv:2606.10752v1 Announce Type: new Abstract: Numerical solvers for partial differential equations (PDEs) are core computational tools in science and engineering. Building reliable PDE solvers requires not only executable code, but a numerical solver strategy, a set of decision…

arXiv cs.AI TIER_1 English(EN) · Xucong Wang, Ziyu Ma, Shidong Yang, Tongwen Huang, Pengkun Wang, Yong Wang, Xiangxiang Chu · 2026-06-10 04:00

Role-Agent: Bootstrapping LLM Agents via Dual-Role Evolution

arXiv:2606.10917v1 Announce Type: new Abstract: Although Large Language Model (LLM) agents have demonstrated strong performance on complex tasks, their learning is often limited by inefficient interaction feedback and static training environments, which hinder broader generalizat…

arXiv cs.AI TIER_1 English(EN) · Andrew Bo Liu, Samira Nedungadi, Bryce Cai, Alex Kleinman, Harmon Bhasin, Seth Donoughe · 2026-06-10 04:00

ABC-Bench: An Agentic Bio-Capabilities Benchmark for Biosecurity

arXiv:2606.11150v1 Announce Type: new Abstract: Large language models (LLMs) are rapidly acquiring capabilities relevant to biological research, from literature synthesis to interpretation of experimental data. Increasingly, LLM agents can also perform in silico biology tasks tha…

arXiv cs.AI TIER_1 English(EN) · Yijia Shao, Zora Zhiruo Wang, Neel Ahuja, Yicheng Wang, Bowen Liu, Diyi Yang · 2026-06-10 04:00

CollabSkill: Evaluating Human-Agent Collaboration on Real-World Tasks

arXiv:2606.09833v1 Announce Type: cross Abstract: AI agents are reshaping the workspace, leading to drastic change of how humans work. Despite the considerable potential of human-agent collaboration both in preserving human agency and generating economic value, this paradigm rema…

arXiv cs.AI TIER_1 English(EN) · Sawyer Zhang, Alexander Wang, Sophie Lei · 2026-06-10 04:00

A One-in-Five Catch Rate: Blind Spots of LLM-as-Judge in a Production Multi-Turn Transactional Agent

arXiv:2606.10315v1 Announce Type: cross Abstract: LLM-as-judge is the default instrument for evaluating conversational agents, yet its reliability is almost always reported as agreement with human ratings, not recall of real defects. We study a deployed multi-turn food-and-bevera…

arXiv cs.AI TIER_1 English(EN) · David Hofer, Edoardo Debenedetti, Florian Tramèr · 2026-06-10 04:00

Evaluating Automated Prompt Injection Attacks in Agentic Settings

arXiv:2606.10525v1 Announce Type: cross Abstract: Indirect prompt injection poses a critical threat to LLM agents that interact with untrusted external data, yet automated attack methods--proven effective for jailbreaking--remain underexplored in realistic agentic settings. We pr…

arXiv cs.LG TIER_1 English(EN) · Laksh Advani · 2026-06-10 04:00

From Confident Wrap-Up to Silent Failure: Characterizing False Success in LLM Agents

arXiv:2606.09863v1 Announce Type: new Abstract: LLM agents can fail silently by asserting task completion when the environment state shows otherwise. We study this failure mode, false success, across two agent benchmarks: 9,876 tau2-bench trajectories from 8 model families and 1,…

arXiv cs.CL TIER_1 English(EN) · Shuwen Xu, Zhitao He, Yi R. (May) Fung · 2026-06-10 04:00

RedAct: Red-Teaming Agent Capability Tracing for Procedural Skill Protection

arXiv:2606.10813v1 Announce Type: cross Abstract: Users rely on execution traces to observe agent behavior, diagnose failures, and ensure accountability. These traces contain rich procedural detail, including tool invocations, intermediate decisions, and error-recovery logic. Yet…

Hugging Face Daily Papers TIER_1 English(EN) · 2026-06-10 01:11

SkillJuror: Measuring How Agent Skill Organization Changes Runtime Behavior

Agent Skills augment large language model (LLM) agents with procedural knowledge at inference time, but current benchmarks rarely distinguish what a Skill says from how it is organized. We study this distinction through Progressive Disclosure, where a concise root file points age…

Hugging Face Daily Papers TIER_1 English(EN) · 2026-06-10 00:00

Orchestra-o1: Omnimodal Agent Orchestration

An omnimodal agent orchestration framework is presented that enables efficient collaboration across multiple modalities through unified task decomposition and specialized sub-agent execution, achieving superior performance on complex multimodal benchmarks.

Hugging Face Daily Papers TIER_1 English(EN) · 2026-06-10 00:00

Evoflux: Inference-Time Evolution of Executable Tool Workflows for Compact Agents

Evoflux enables compact language models to execute tool workflows more reliably by using evolutionary search to repair failed plans during inference, significantly improving execution feasibility compared to traditional fine-tuning methods.

Hugging Face Daily Papers TIER_1 English(EN) · 2026-06-10 00:00

Agent Environment Engineering for Large Language Models: A Survey of Environment Modeling, Synthesis, Evaluation, and Applications

Large language model agents require specialized environments for training and evaluation, which can be categorized by their engineering lifecycle stages and evolved through various paradigms including neural and symbolic approaches.

Hugging Face Daily Papers TIER_1 English(EN) · 2026-06-10 00:00

RedAct: Red-Teaming Agent Capability Tracing for Procedural Skill Protection

Users rely on execution traces to observe agent behavior, diagnose failures, and ensure accountability. These traces contain rich procedural detail, including tool invocations, intermediate decisions, and error-recovery logic. Yet this detail can expose private procedural skills,…

arXiv cs.CL TIER_1 English(EN) · Lewei Lu · 2026-06-09 23:44

ISE: An Execution-Grounded Recipe for Multi-Turn OS-Agent Trajectories

Training capable OS agents requires data that simultaneously captures structured user intents, multi-turn task delegation, and grounded tool execution--properties absent from existing datasets. We propose ISE (Intent -> Simulate -> Execute), a three-stage synthesis paradigm that …

arXiv cs.AI TIER_1 English(EN) · Seth Donoughe · 2026-06-09 17:35

ABC-Bench: An Agentic Bio-Capabilities Benchmark for Biosecurity

Large language models (LLMs) are rapidly acquiring capabilities relevant to biological research, from literature synthesis to interpretation of experimental data. Increasingly, LLM agents can also perform in silico biology tasks that previously required experienced human biologis…

arXiv cs.CL TIER_1 English(EN) · Zhou Yu · 2026-06-09 16:39

VISTA: A General-Purpose Interactive User Simulation Toolkit for Agent Evaluation

Evaluation remains a critical bottleneck for interactive agent development. Existing evaluation methods often rely on static benchmarks, which fail to capture the dynamic, multi-step nature of agentic behavior and struggle to expose meaningful failure modes. While user-simulation…

arXiv cs.CL TIER_1 English(EN) · Shi-Xiong Zhang · 2026-06-09 16:32

T1-Bench: Benchmarking Multi-Scenario Agents in Real-World Domains

Recent advances in reasoning and tool-calling capabilities of large language models (LLMs) have enabled increasingly capable agentic systems. However, existing benchmarks remain limited in task complexity, realism, and domain diversity, and often fail to capture interactions that…

arXiv cs.AI TIER_1 English(EN) · Xiangxiang Chu · 2026-06-09 14:28

Role-Agent: Bootstrapping LLM Agents via Dual-Role Evolution

Although Large Language Model (LLM) agents have demonstrated strong performance on complex tasks, their learning is often limited by inefficient interaction feedback and static training environments, which hinder broader generalization. To address these limitations, this paper in…

Hugging Face Daily Papers TIER_1 English(EN) · 2026-06-09 14:28

Role-Agent: Bootstrapping LLM Agents via Dual-Role Evolution

Role-Agent framework enables LLM agents to function as both agent and environment through bootstrapped co-evolution, improving performance via environment-aware reasoning and targeted practice.

arXiv cs.CL TIER_1 English(EN) · Fung · 2026-06-09 12:57

RedAct: Red-Teaming Agent Capability Tracing for Procedural Skill Protection

Users rely on execution traces to observe agent behavior, diagnose failures, and ensure accountability. These traces contain rich procedural detail, including tool invocations, intermediate decisions, and error-recovery logic. Yet this detail can expose private procedural skills,…

arXiv cs.AI TIER_1 English(EN) · Xin Jin · 2026-06-09 12:02

AutoPDE: Reliable Agentic PDE Solving via Explicitly Represented Solver Strategies

Numerical solvers for partial differential equations (PDEs) are core computational tools in science and engineering. Building reliable PDE solvers requires not only executable code, but a numerical solver strategy, a set of decisions about discretization, stabilization, solver co…

arXiv cs.AI TIER_1 English(EN) · Chunrong Fang · 2026-06-09 12:01

Toward Secure LLM Agents: Threat Surfaces, Attacks, Defenses, and Evaluation

Large language model (LLM) agents are rapidly moving from conversational interfaces to software components that plan, invoke tools, maintain memory, and act on external environments. This transition changes the nature of security risk. In agentic settings, failures are no longer …

arXiv cs.AI TIER_1 English(EN) · Lukas Galke Poech · 2026-06-09 11:57

Arbiter Agent: Continuously Monitoring Multi-Agent Conversations to Detect Emergent Misalignment

As AI systems built from multiple language-model agents become more common, they are increasingly used to make decisions together: discussing, negotiating, and acting on shared tasks. While individual agents may appear well-aligned when tested on their own, problems can arise fro…

arXiv cs.MA (Multiagent) TIER_1 English(EN) · Azalia Mirhoseini · 2026-06-09 10:13

Decentralized Multi-Agent Systems with Shared Context

Multi-agent systems (MAS) can scale large language model reasoning at test time by decomposing complex problems into parallel subtasks. However, most existing MAS rely on centralized orchestration, where a main agent assigns work, collects outputs, and merges results. As the numb…

arXiv cs.MA (Multiagent) TIER_1 English(EN) · Sumit Gulwani · 2026-06-09 08:14

SkillAxe: Sharpening LLM-Generated Agent Skills via Evaluation-Guided Self-Refinement

Skill documents, structured natural-language instructions that guide Large Language Model (LLM) agents, are critical to modern agent frameworks, yet LLMs struggle to write skills that actually work. On SkillsBench, human-authored skills improve pass rates by 16.2 percentage point…

arXiv cs.AI TIER_1 English(EN) · Florian Tramèr · 2026-06-09 07:54

Evaluating Automated Prompt Injection Attacks in Agentic Settings

Indirect prompt injection poses a critical threat to LLM agents that interact with untrusted external data, yet automated attack methods--proven effective for jailbreaking--remain underexplored in realistic agentic settings. We present a comprehensive empirical evaluation of auto…

arXiv cs.CL TIER_1 English(EN) · Vedant Padwal · 2026-06-09 04:53

WebChallenger: A Reliable and Efficient Generalist Web Agent

Autonomous web navigation remains challenging for LLM agents, and the strongest generalist systems rely on proprietary reasoning models whose inference cost is prohibitive for the repetitive tasks where such agents would be most useful. We argue this gap stems not from insufficie…

Hugging Face Daily Papers TIER_1 English(EN) · 2026-06-09 04:16

STAGE-Claw: Automated State-Based Agent Benchmarking for Real-World Scenarios

Large language models are increasingly used to power personal agents for everyday applications, but evaluating these agents remains a challenge. Existing benchmarks still rely on sandboxed artifacts, static task design, and coarse scoring, which hinder scalability and limit progr…

arXiv cs.AI TIER_1 English(EN) · Po-Ya Angela Wang, Chinmaya Mishra, Aslı Özyürek, Paula Rubio-Fernández, Esam Ghaleb · 2026-06-09 04:00

Aligned but Not Partner-Specific: Disentangling How Multimodal LLM Agents Succeed in Reference Games Without Human-Like Conventions

arXiv:2606.08081v1 Announce Type: cross Abstract: Repeated reference games test whether interlocutors replace their initially long descriptions with shorter, partner-specific conventions grounded in shared interaction history. Prior work shows that multimodal LLMs fail to become …

arXiv cs.AI TIER_1 English(EN) · Suchismita Naik, Samir Passi, Mihaela Vorvoreanu, Scott Saponas, Amanda Hall · 2026-06-09 04:00

"There Is a Dilemma Here": How Early Builders of Multi-Agent LLM Systems Envision Transparency

arXiv:2606.08323v1 Announce Type: cross Abstract: Multi-agent large language model (LLM) systems are rapidly emerging, yet transparency, a cornerstone of responsible AI, remains under-defined in these distributed architectures, which have complexities of inter-agent coordination …

arXiv cs.AI TIER_1 English(EN) · Deepak Akkil, Ravi Kokku, Karthik Vikram, Tamer Abuelsaad, Aditya Vempaty, Satya Nitta · 2026-06-09 04:00

Emergence World: A Platform for Evaluating Long-Horizon Multi-Agent Autonomy

arXiv:2606.08367v1 Announce Type: cross Abstract: Most evaluations of LLM agents look like exams: a discrete task, a clean environment, a score in minutes or hours. We argue that this approach is mismatched with the deployment conditions of autonomous systems, where the relevant …

arXiv cs.AI TIER_1 English(EN) · Zhengyi Zhuo, Yan Liu · 2026-06-09 04:00

Predicting the Emergent Mindsets of SWE Agents by Launching a Wild Code-Comprehension Journey

arXiv:2606.08500v1 Announce Type: cross Abstract: Software engineering agents (SWE agents) increasingly work through tool-mediated trajectories in real repositories, yet their behavior remains difficult to characterize in concrete, observable terms. These trajectories record tool…

arXiv cs.AI TIER_1 English(EN) · Yuhan Ma, Stefan Schmid · 2026-06-09 04:00

SecureClaw: Taking Back Control of LLM Agents

arXiv:2606.09549v1 Announce Type: cross Abstract: Tool-using large language model (LLM) agents face two distinct security failures: unauthorized external actions and exposure of sensitive plaintext inside the runtime before any final output check can intervene. Existing defenses …

arXiv cs.AI TIER_1 English(EN) · Rakibul Hasan Rajib, Mengxin Zheng, Qian Lou · 2026-06-09 04:00

AGENTSERVESIM: A Hardware-Aware Simulator for Multi-Turn LLM Agent Serving

arXiv:2606.09613v1 Announce Type: cross Abstract: Multi-turn LLM agents interleave model calls with external tool invocations, shifting serving from stateless request processing to stateful program execution. Serving these workloads requires scheduling, KV-cache management, and r…

arXiv cs.AI TIER_1 English(EN) · Mingxian Lin, Shengju Qian, Yuqi Liu, Yi-Hua Huang, Yiyu Wang, Wei Huang, Yitang Li, Fan Zhang, Zeyu Hu, Lingting Zhu, Xin Wang, Xiaojuan Qi · 2026-06-09 04:00

OmniGameArena: A Unified UE5 Benchmark for VLM Game Agents and Their Improvement Dynamics

arXiv:2606.09826v1 Announce Type: cross Abstract: Vision-language model (VLM) agents are increasingly deployed in interactive game environments. Yet game benchmarks for VLM agents typically report a single first-attempt score per (agent, game) pair, focus on single-agent Solo pla…

arXiv cs.AI TIER_1 English(EN) · Sihao Hu, Tiansheng Huang, Gaowen Liu, Ramana Rao Kompella, Fatih Ilhan, Selim Furkan Tekin, Yichang Xu, Zachary Yahn, Ling Liu · 2026-06-09 04:00

A Survey on Large Language Model-Based Game Agents

arXiv:2404.02039v5 Announce Type: replace Abstract: Game environments provide rich, controllable settings that simulate many aspects of real-world complexity. As such, game agents offer a valuable testbed for exploring capabilities relevant to Artificial General Intelligence. Re…

arXiv cs.AI TIER_1 English(EN) · Trung-Kiet Huynh, Dao-Sy Duy-Minh, Thanh-Bang Cao, Phong-Hao Le, Hong-Dan Nguyen, Phu-Quy Nguyen-Lam, Minh-Luan Nguyen-Vo, Hong-Phat Pham, Phu-Hoa Pham, Thien-Kim Than, Chi-Nguyen Tran, Huy Tran, Gia-Thoai Tran-Le, Alessio Buscemi, Le Hong Trang, The Anh… · 2026-06-09 04:00

Payoff Scale Shapes Cooperation Among LLM Agents Across Languages

arXiv:2601.19082v2 Announce Type: replace Abstract: Large language models (LLMs) are increasingly deployed as autonomous agents that negotiate, coordinate, and act on behalf of users. Whether they cooperate in such settings is no longer just an academic question, but a central is…

arXiv cs.LG TIER_1 English(EN) · Nivya Talokar, Ayush K Tarun, Murari Mandal, Maksym Andriushchenko, Antoine Bosselut · 2026-06-09 04:00

Helpful to a Fault: Measuring Illicit Assistance in Multi-Turn, Multilingual LLM Agents

arXiv:2602.16346v4 Announce Type: replace-cross Abstract: LLM-based agents execute real-world workflows via tools and memory. These affordances enable ill-intended adversaries to also use these agents to carry out complex misuse scenarios. Existing agent misuse benchmarks largely…

arXiv cs.LG TIER_1 English(EN) · Najmul Hasan, Prashanth BusiReddyGari · 2026-06-09 04:00

GRPO Cannot Close the Multi-Agent Coordination Gap

arXiv:2606.07845v1 Announce Type: cross Abstract: We measure how well current large language models coordinate as multiple agents sharing a common resource, using the dining philosophers problem as a clean test bed. Across 630 episodes spanning seven models and three philosopher …

arXiv cs.LG TIER_1 English(EN) · Zhiwei Li, Yong Hu · 2026-06-09 04:00

SkillHone: A Tool for Continual Agent Skill Evolution via Persistent Decision History

arXiv:2606.08671v1 Announce Type: new Abstract: Agent skills extend language-model agents with task-specific procedures, scripts, and references, but the tasks and environments they target continually change. Existing methods improve skills in bounded runs and retain only the fin…

arXiv cs.LG TIER_1 English(EN) · Yi Xie, Zhanke Zhou, Chentao Cao, Bo Liu, Bo Han · 2026-06-09 04:00

DICE: Entropy-Regularized Equilibrium Selection for Stable Multi-Agent LLM Coordination

arXiv:2606.08068v1 Announce Type: new Abstract: Multi-agent large language model (LLM) systems often fail to reliably outperform a single strong model equipped with best-of-N sampling. We argue that a core source of this instability is ill-posed equilibrium selection: current sys…

arXiv cs.AI TIER_1 English(EN) · Safayat Bin Hakim, Keyan Guo, Wenkai Tan, Alvaro Velasquez, Shouhuai Xu, Houbing Herbert Song · 2026-06-09 04:00

ANNEAL: Adapting LLM Agents via Controlled Symbolic Patch Learning

arXiv:2605.16309v2 Announce Type: replace Abstract: LLM-based agents can recover from individual execution errors, yet they repeatedly fail on the same fault when the underlying process knowledge--operator schemas, preconditions, and constraints--remains unrepaired. Existing self…

arXiv cs.AI TIER_1 English(EN) · Xinyu Zhu, Yuzhu Cai, Zexi Liu, Cheng Wang, Fengyang Li, Wenkai Jin, Wanxu Liu, Zehao Bing, Bingyang Zheng, Jingyi Chai, Shuo Tang, Rui Ye, Yuwen Du, Xianghe Pang, Yaxin Du, Tingjia Miao, Yuzhi Zhang, Ruoxue Liao, Zhaohan Ding, Linfeng Zhang, Yanfeng Wan… · 2026-06-09 04:00

EvoMaster: An Evolvable Foundational Agent Framework for Large-Scale Agentic Science

arXiv:2604.17406v3 Announce Type: replace Abstract: The convergence of large language models and agents is catalyzing a new era of scientific discovery: Agentic Science. While the scientific method is inherently iterative, existing agent frameworks are predominantly static, narro…

arXiv cs.AI TIER_1 English(EN) · Yiyang Zhao, Zhuo Zhang, Qingxuan Le, Lizhen Qu, Zenglin Xu · 2026-06-09 04:00

Beyond Goodhart's Law: A Dynamic Benchmark for Compliance Evaluation of Multi-Agent Systems

arXiv:2606.07805v1 Announce Type: new Abstract: The rapid evolution of Large Language Models (LLMs) from passive assistants to autonomous, execution-capable agents has introduced critical operational risks. Most current evaluation frameworks neglect procedural compliance, leading…

arXiv cs.AI TIER_1 English(EN) · Akshay J. Dave, David Grabaskas, Joseph A. Renevitz, Richard B. Vilim · 2026-06-09 04:00

Overcoming the Regulatory Bottleneck Through an Agent-to-Agent Protocol: A Nuclear Energy Case Study

arXiv:2606.07866v1 Announce Type: new Abstract: Regulatory review of advanced nuclear reactor designs routinely spans more than three years and consumes hundreds of millions of dollars in combined regulator and applicant labor. We present the Regulatory Context Protocol (RCP), an…

arXiv cs.AI TIER_1 English(EN) · Rahul Suresh Babu, Laxmipriya Ganesh Iyer · 2026-06-09 04:00

Contract2Tool: Learning Preconditions and Effects for Reliable Tool-Augmented LLM Agents

arXiv:2606.07904v1 Announce Type: new Abstract: Tool-augmented large language model agents increasingly rely on external APIs, but standard tool schemas describe how to call a tool, not when the tool is causally appropriate or what task state it produces. Causal tool filtering ad…

arXiv cs.AI TIER_1 English(EN) · Amine El Hattami, Nicolas Chapados, Christopher Pal · 2026-06-09 04:00

SKILL.nb: Selective Formalization and Gated Execution for Persistent Agent Workflows

arXiv:2606.08049v1 Announce Type: new Abstract: AI agents increasingly turn past experience into reusable artifacts such as code, workflows, and procedural memories. Reuse can improve efficiency, but it also creates a lifecycle reliability problem: artifacts that succeed once may…

arXiv cs.AI TIER_1 English(EN) · Zayx Shawn · 2026-06-09 04:00

PACE: Anytime-Valid Acceptance Testing for Self-Evolving Agents

arXiv:2606.08106v1 Announce Type: new Abstract: Self-evolving agents improve by repeatedly proposing changes to their own prompts, skills, or workflows and keeping those that score higher on a small held-out set. Almost all effort has gone into the proposer that generates candida…

arXiv cs.AI TIER_1 English(EN) · Hyogon Ryu, Jeonghwan Kim, Yewon Lim, Chaeun Lee, Jeongwook Kim, Donghoon Ham · 2026-06-09 04:00

Online Agent-as-a-Judge: Situation-Generating Evaluation for Interactive Agents

arXiv:2606.08200v1 Announce Type: new Abstract: Evaluating LLM-powered interactive social agents is challenging because socially relevant behaviors depend not only on isolated outputs, but also on prior interactions, social roles, and downstream actions. Existing methods typicall…

arXiv cs.AI TIER_1 English(EN) · Junyi Yao, Zihao Zheng · 2026-06-09 04:00

Beyond Agent Architecture: Execution Assumptions and Reproducibility in LLM Trading Systems

arXiv:2606.08285v1 Announce Type: new Abstract: Large language models (LLMs) and agentic systems are increasingly proposed for financial trading, yet their reported performance remains difficult to compare because studies vary in data provenance, temporal split discipline, execut…

arXiv cs.AI TIER_1 English(EN) · Kale-ab Abebe Tessera, Andras Szecsenyi, Cameron Barker, Alexander Rutherford, Davide Paglieri, Aidan Scannell, Henry Gouk, Elliot J. Crowley, Tim Rocktäschel, Amos Storkey · 2026-06-09 04:00

Benchmarking Open-Ended Multi-Agent Coordination in Language Agents

arXiv:2606.08340v1 Announce Type: new Abstract: As language models are increasingly deployed as autonomous agents, they must coordinate with others over long horizons in open-ended interactive tasks. Yet existing evaluations rarely test these demands together, instead emphasising…

arXiv cs.AI TIER_1 English(EN) · Lu Jia, Haibo Tong, Feifei Zhao, Jindong Li, Dongqi Liang, Ping Wu, Qian Zhang, Yi Zeng · 2026-06-09 04:00

VESTA: A Fully Automated Scenario Generation and Safety Evaluation Framework for LLM Agents

arXiv:2606.08531v1 Announce Type: new Abstract: Large language models (LLMs) are increasingly evolving from simple text-based interaction systems into LLM agents that can maintain memory, use tools, access external environments, and execute tasks. As their capabilities and autono…

arXiv cs.AI TIER_1 English(EN) · Mark Burgess · 2026-06-09 04:00

Quantitative Promise Theory: Intentionality and Reasoning in Autonomous Agents

arXiv:2606.08552v1 Announce Type: new Abstract: I discuss some quantitative representations of Promise Theory for processes involving autonomous agents. Agent models are common in software systems, machine learning, and biology, for example, but may also apply to physics and othe…

arXiv cs.AI TIER_1 English(EN) · Adrian de Valois-Franklin, Alex Bogdan · 2026-06-09 04:00

RAILS: Verification-Native Clearing for Agentic Commerce

arXiv:2606.08790v1 Announce Type: new Abstract: Autonomous agents negotiate, purchase, deploy code, and move funds, but no neutral mechanism determines whether they met their delegated obligation, who is responsible when they did not, or which settlement action follows. This is t…

arXiv cs.AI TIER_1 English(EN) · Cheonsu Jeong · 2026-06-09 04:00

Agent Economics: An Entropy-Controlled Pluralistic Alignment Framework for Preventing Artificial Hivemind in Autonomous Agents

arXiv:2606.09039v1 Announce Type: new Abstract: This study proposes the Behavioral Protocol Framework (BPF), an entropy-controlled pluralistic alignment framework designed to address two critical challenges in autonomous agent economies: the hivemind effect arising from excessive…

arXiv cs.AI TIER_1 English(EN) · Xiaofeng Lin, Yingxu Wang, Tung Sum Thomas Kwok, Daniel Guo, Sahil Arun Nale, Charles Fleming, Guang Cheng · 2026-06-09 04:00

REFLECT: Intervention-Supported Error Attribution for Silent Failures in LLM Agent Trajectories

arXiv:2606.09071v1 Announce Type: new Abstract: Large language model (LLM) agents now solve complex tasks through long plan-and-execution traces, yet the ability to locate errors in a completed trace still lags far behind, especially in the silent failure regime. Existing…

arXiv cs.AI TIER_1 English(EN) · Qianjun Pan, Yutao Yang, Junsong Li, Jie Zhou, Kai Chen, Xin Li, Qin Chen, Liang He · 2026-06-09 04:00

Anything2Skill: Compiling External Knowledge into Agent-Reusable Skills

arXiv:2606.09316v1 Announce Type: new Abstract: Retrieval-augmented generation (RAG) enables agents to access external knowledge at inference time, but it primarily retrieves fragmented declarative evidence, leaving agents to repeatedly infer task procedures from passages, manual…

arXiv cs.AI TIER_1 English(EN) · Wanli Li, Bowen Zhou, Yunyao Yu, Zhou Xu, Yifan Yang, Dongsheng Li, Caihua Shan · 2026-06-09 04:00

WeaveBench: A Long-Horizon, Real-World Computer-Use Agent Benchmark for Hybrid Interfaces

arXiv:2606.09426v1 Announce Type: new Abstract: Computer-use agents (CUAs) increasingly operate in runtimes that combine visual desktop control, command-line execution, code editing, browsers, and external tools. Existing benchmarks, however, often evaluate these interfaces as se…

arXiv cs.AI TIER_1 English(EN) · Bojie Rong, Zheyu Shen, Qiaoping Wang, Pengfei Kang, Yang Xu, Yawen Wei, Hanyu Wu, Zhi Zhao, Leihao Pei, Linquan Jiang · 2026-06-09 04:00

AliyunConsoleAgent: Training Web Agents in Real Cloud Environments via Distillation and Reinforcement Learning

arXiv:2606.09447v1 Announce Type: new Abstract: We present AliyunConsoleAgent, a web agent framework for automated documentation verification in real-world cloud consoles. Major cloud platforms encompass hundreds of products with rapid feature iteration, causing console UIs to fr…

arXiv cs.AI TIER_1 English(EN) · Pu Ning, Quan Chen, Kun Tao, Xinyu Tang, Tianshu Wang, Qianggang Cao, Xinyu Kong, Zujie Wen, Zhiqiang Zhang, Jun Zhou · 2026-06-09 04:00

SearchSwarm: Toward Delegation Intelligence for Long-Horizon Deep Research in Agentic LLMs

arXiv:2606.09730v1 Announce Type: new Abstract: Large language models are increasingly expected to handle complex, long-horizon real-world tasks whose context demands can grow without bound, yet model context windows remain inherently finite. Recent work explores a paradigm where…

arXiv cs.AI TIER_1 English(EN) · Arsalan Shahid, Gordon Suttie, Philip Black · 2026-06-09 04:00

Collaborative Human-Agent Protocol (CHAP)

arXiv:2606.09751v1 Announce Type: new Abstract: Foundation models are moving from response generation into operational roles. They plan across steps, call tools, request human input, coordinate with other agents, and increasingly carry responsibility for work that affects custome…

arXiv cs.AI TIER_1 English(EN) · Matthew Ho, Brian Liu, Jixuan Chen, Audrey Wang, Lianhui Qin · 2026-06-09 04:00

SIGA: A Self-Evolving Coding-Agent Adapter for Scientific Simulation

arXiv:2606.09774v1 Announce Type: new Abstract: Advanced scientific simulators expose specialized input languages that turn simulation goals into executable configurations, but learning them can cost domain scientists hours to days. We study simulator setup as a problem of agent-…

arXiv cs.AI TIER_1 English(EN) · Jiajie Li, Erwei Wang, Zhiru Zhang, Samuel Bayliss · 2026-06-09 04:00

From Human-Guided to Autonomous: An Agent Skill System for End-to-End LLM Deployment on Spatial NPUs

arXiv:2606.07586v1 Announce Type: cross Abstract: Spatial neural processing units (NPUs) provide an energy-efficient platform for edge LLM inference, but efficiently deploying an LLM end-to-end on such hardware remains labor-intensive. Although AI coding agents have begun to lowe…

arXiv cs.AI TIER_1 English(EN) · Bowen Ren, Heyan Huang, Yinghao Li, Yang Gao · 2026-06-09 04:00

MetaEvo: A Meta-Optimization Framework for Experience-Driven Agent Evolution

arXiv:2606.07603v1 Announce Type: cross Abstract: Large language models (LLMs) exhibit strong reasoning capabilities, yet most LLM-based agents are statically deployed and unable to improve through task interactions. Existing experience-driven methods often rely on memory or heur…

arXiv cs.AI TIER_1 English(EN) · Veronique Ziegler · 2026-06-09 04:00

IRAM-Omega-Q: A Computational Framework for Uncertainty Regulation in Adaptive Agents

arXiv:2603.16020v2 Announce Type: replace Abstract: Adaptive agents operating under uncertainty must do more than optimize task outputs: they must maintain a workable internal state under noise, perturbation, and changing conditions. This paper introduces IRAM-Omega-Q, a computat…

arXiv cs.AI TIER_1 English(EN) · Rishi Desai, Jesse Hu, Joan Cabezas, Neel Harsola, Pratyush Shukla, Roey Ben Chaim, Adnan El Assadi, Omkaar Mukund Kamath, Fenil Faldu, Prannay Hebbar, Jiankai Sun, Yiyuan Li, Pramod Srinivasan, Ishan Gupta, Christopher Settles, Daniel Wang, Derek Chen, … · 2026-06-09 04:00

SWE-Marathon: Can Agents Autonomously Complete Ultra-Long-Horizon Software Work?

arXiv:2606.07682v1 Announce Type: cross Abstract: AI agents are increasingly expected to complete long-horizon workflows that require sustained progress over hours, millions of tokens, and complex environments. Yet current agent benchmarks largely evaluate short-form tasks, such …

arXiv cs.AI TIER_1 English(EN) · Faisal Fareed · 2026-06-09 04:00

Cost-Aware Speculative Execution for LLM-Agent Workflows: An Integrated Five-Dimensional Approach

arXiv:2606.07846v1 Announce Type: cross Abstract: LLM-agent workflows chain model calls and tool invocations, and spend most of their wall-clock time waiting on upstream operations before downstream ones can start. Speculative execution can reclaim that idle time by launching a d…

arXiv cs.AI TIER_1 English(EN) · Jaineet Shah · 2026-06-09 04:00

Causal Agent Replay: Counterfactual Attribution of LLM-Agent Failures

arXiv:2606.08275v1 Announce Type: cross Abstract: When an LLM agent fails -- issues a refund it should not have, calls the wrong tool, leaks data -- existing tooling answers what happened (observability) or whether it passed (evaluation), but not which step caused the failure. Th…

arXiv cs.MA (Multiagent) TIER_1 English(EN) · Foutse Khomh · 2026-06-09 02:18

Game-Theoretic Multi-Agent Control for Robust Contextual Reasoning in LLMs

Large Language Models (LLMs) in multi-turn interactions maintain evolving context rather than generating isolated responses, making them vulnerable to prompt-injection and context-poisoning attacks in which locally plausible adversarial fragments gradually distort reasoning traje…

arXiv cs.CL TIER_1 English(EN) · Sophie Lei · 2026-06-09 02:11

A One-in-Five Catch Rate: Blind Spots of LLM-as-Judge in a Production Multi-Turn Transactional Agent

LLM-as-judge is the default instrument for evaluating conversational agents, yet its reliability is almost always reported as agreement with human ratings, not recall of real defects. We study a deployed multi-turn food-and-beverage ordering agent and measure how many genuine qua…

Hugging Face Daily Papers TIER_1 English(EN) · 2026-06-09 00:00

Decentralized Multi-Agent Systems with Shared Context

Decentralized Language Models (DeLM) framework enables scalable large language model reasoning through parallel agents that asynchronously coordinate via a shared verified context, improving performance and efficiency over centralized approaches.

Hugging Face Daily Papers TIER_1 English(EN) · 2026-06-09 00:00

Arbiter Agent: Continuously Monitoring Multi-Agent Conversations to Detect Emergent Misalignment

A multi-agent system monitoring framework identifies misaligned behavior through real-time inspection with resource constraints, demonstrating effective detection of misalignment types under various conditions.

Hugging Face Daily Papers TIER_1 Deutsch(DE) · 2026-06-09 00:00

WebChallenger: A Reliable and Efficient Generalist Web Agent

WebChallenger presents a web agent framework that improves autonomous navigation through structured page representation and cognitive-inspired mechanisms, achieving high performance with open-weight models.

arXiv cs.AI TIER_1 English(EN) · Xiaojuan Qi · 2026-06-08 17:59

OmniGameArena: A Unified UE5 Benchmark for VLM Game Agents and Their Improvement Dynamics

Vision-language model (VLM) agents are increasingly deployed in interactive game environments. Yet game benchmarks for VLM agents typically report a single first-attempt score per (agent, game) pair, focus on single-agent Solo play, and lack unified protocols for evaluating heter…

arXiv cs.AI TIER_1 English(EN) · Lianhui Qin · 2026-06-08 17:35

SIGA: A Self-Evolving Coding-Agent Adapter for Scientific Simulation

Advanced scientific simulators expose specialized input languages that turn simulation goals into executable configurations, but learning them can cost domain scientists hours to days. We study simulator setup as a problem of agent-tool interface grounding: what minimal simulator…

arXiv cs.AI TIER_1 English(EN) · Philip Black · 2026-06-08 17:11

Collaborative Human-Agent Protocol (CHAP)

Foundation models are moving from response generation into operational roles. They plan across steps, call tools, request human input, coordinate with other agents, and increasingly carry responsibility for work that affects customers, claims, code, contracts, and clinical decisi…

arXiv cs.AI TIER_1 English(EN) · Jun Zhou · 2026-06-08 16:52

SearchSwarm: Towards Delegation Intelligence in Agentic LLMs for Long-Horizon Deep Research

Large language models are increasingly expected to handle complex, long-horizon real-world tasks whose context demands can grow without bound, yet model context windows remain inherently finite. Recent work explores a paradigm where a main agent decomposes tasks and dispatches su…

arXiv cs.AI TIER_1 English(EN) · Qian Lou · 2026-06-08 15:20

AGENTSERVESIM: A Hardware-Aware Simulator for Multi-Turn LLM Agent Serving

Multi-turn LLM agents interleave model calls with external tool invocations, shifting serving from stateless request processing to stateful program execution. Serving these workloads requires scheduling, KV-cache management, and routing policies that use program-level context, in…

Hugging Face Daily Papers TIER_1 English(EN) · 2026-06-08 15:20

AGENTSERVESIM: A Hardware-Aware Simulator for Multi-Turn LLM Agent Serving

Multi-turn LLM agents interleave model calls with external tool invocations, shifting serving from stateless request processing to stateful program execution. Serving these workloads requires scheduling, KV-cache management, and routing policies that use program-level context, in…

arXiv cs.AI TIER_1 English(EN) · Stefan Schmid · 2026-06-08 14:29

SecureClaw: Taking Back Control of LLM Agents

Tool-using large language model (LLM) agents face two distinct security failures: unauthorized external actions and exposure of sensitive plaintext inside the runtime before any final output check can intervene. Existing defenses usually protect one boundary, either the planner/r…

arXiv cs.AI TIER_1 English(EN) · Linquan Jiang · 2026-06-08 12:55

AliyunConsoleAgent: Training Web Agents in Real Cloud Environments via Distillation and Reinforcement Learning

We present AliyunConsoleAgent, a web agent framework for automated documentation verification in real-world cloud consoles. Major cloud platforms encompass hundreds of products with rapid feature iteration, causing console UIs to frequently diverge from their corresponding docume…

arXiv cs.AI TIER_1 English(EN) · Caihua Shan · 2026-06-08 12:39

WeaveBench: A Long-Horizon, Real-World Benchmark for Computer-Use Agents Across Hybrid Interfaces

Computer-use agents (CUAs) increasingly operate in runtimes that combine visual desktop control, command-line execution, code editing, browsers, and external tools. Existing benchmarks, however, often evaluate these interfaces as separable capabilities, leaving long-horizon cross…

arXiv cs.CL TIER_1 English(EN) · Guanbin Li · 2026-06-08 04:58

Bridging the Agent-World Gap: Text World Models for LLM-Based Agents

Large language model (LLM)-based agents are increasingly used in interactive textual environments, from web navigation and code editing to tool use and long-horizon dialogue. Yet many remain largely reactive, mapping observations to actions without an explicit model of how these …

arXiv cs.CL TIER_1 English(EN) · Xintao Wang, Sirui Zheng, Hongqiu Wu, Weiyuan Li, Jen-tse Huang, Minghao Zhu, Can Zu, Qi Deng, Jiawei Wang, Qianyu He, Heng Wang, Xiaojian Wu, Yunzhe Tao · 2026-06-08 04:00

Agentopia: Long-Term Life Simulation and Learning in an Agent Society

arXiv:2606.07513v1 Announce Type: new Abstract: Humans learn from social life. Simulating this process with LLM-powered agents represents a promising research direction, raising a natural question: whether LLMs can learn from such simulated social experience to better understand …

arXiv cs.CL TIER_1 English(EN) · Shubham Gaur, Ian Lane · 2026-06-08 04:00

Signal-Driven Observation for Long-Horizon Web Agents

arXiv:2606.06708v1 Announce Type: new Abstract: Web agents operating over long horizons ingest raw DOM and accessibility trees -- routinely tens of thousands of tokens -- at every action step, causing progressive context degradation that erodes reasoning well before tasks complet…

arXiv cs.AI TIER_1 English(EN) · Linyao Chen, Bo Huang, Qinlao Zhao, Shuai Shao, Zhi Han, Zicai Cui, Ziheng Zhang, Guangtao Zeng, Wenzheng Tang, Yikun Wang, Yuanjian Zhou, Zimian Peng, Yong Yu, Weiwen Liu, Hiroki Kobayashi, Weinan Zhang · 2026-06-08 04:00

SW-$A^2$-Bench: Benchmarking Autonomous Software Agent Generation for the Agentic Web

arXiv:2604.04226v2 Announce Type: replace-cross Abstract: The Agentic Web is emerging as a paradigm in which autonomous software agents interact with online resources and with each other to accomplish user goals. However, the capacity of Agentic Web is still limited by insufficie…

arXiv cs.AI TIER_1 English(EN) · Karolina Korgul, Yushi Yang, Arkadiusz Drohomirecki, Piotr Błaszczyk, Will Howard, Lukas Aichberger, Chris Russell, Philip H. S. Torr, Adam Mahdi, Adel Bibi · 2026-06-08 04:00

It's a Trap! A Task-Redirecting Agent Persuasion Benchmark for Web Agents

arXiv:2512.23128v2 Announce Type: replace-cross Abstract: Web-based agents powered by large language models are increasingly used for tasks such as email management or professional networking. Their reliance on dynamic web content, however, makes them vulnerable to prompt injecti…

arXiv cs.AI TIER_1 English(EN) · Chuan Xiao, Zhengbo Jiao, Shaobo Wang, Wei Wang, Bing Zhao, Hu Wei, Linfeng Zhang, Lin Qu · 2026-06-08 04:00

Socratic-SWE: Self-Evolving Coding Agents via Trace-Derived Agent Skills

arXiv:2606.07412v1 Announce Type: cross Abstract: LLM-driven software engineering agents have become a central testbed for real-world language-model capability, yet their training remains limited by the availability of high-quality SWE tasks. Existing synthetic data methods typic…

arXiv cs.AI TIER_1 English(EN) · Haoran Xu, Lei Zhang, Iadh Ounis, Xianbin Wang · 2026-06-08 04:00

Hierarchical Certified Semantic Commitment for Byzantine-Fault-Tolerant LLM-Agent Collaboration

arXiv:2606.07316v1 Announce Type: cross Abstract: Byzantine collaboration among large-language-model agents requires a finality-control primitive: given delivered stochastic, structured natural-language proposals, the protocol must decide whether the round supports a commit, what…

arXiv cs.AI TIER_1 English(EN) · Dutao Zhang, Liaotian · 2026-06-08 04:00

Queen-Bee Agents: A BeeSpec-Centered Architecture for Governed Enterprise MCP Orchestration

arXiv:2606.06545v1 Announce Type: cross Abstract: Enterprise agent systems increasingly need to connect large language models to private tools, internal knowledge, and Model Context Protocol (MCP) interfaces. In this setting, raw task capability is insufficient: organizations als…

arXiv cs.AI TIER_1 English(EN) · Yuxuan Zhao, Sijia Chen, Ningxin Su · 2026-06-08 04:00

When Does Multi-Agent Collaboration Help? An Entropy Perspective

arXiv:2602.04234v6 Announce Type: cross Abstract: Multi-agent systems (MAS) have emerged as a prominent paradigm for leveraging large language models (LLMs) to tackle complex tasks. However, the mechanisms governing the effectiveness of MAS built upon publicly available LLMs, spe…

arXiv cs.AI TIER_1 English(EN) · Lingyong Yan, Can Xu, Yukun Zhao, Wenxuan Li, Qingyang Chen, Jiulong Wu, Wenli Song, Xiangnan Li, Weixian Shi, Yiqun Chen, Xuchen Ma, Yuchen Li, Jiashu Zhao, Shuaiqiang Wang, Jianmin Wu, Dawei Yin · 2026-06-08 04:00

DuMate-DeepResearch: An Auditable Multi-Agent System with Recursive Search and Rubric-Based Reasoning

arXiv:2606.07299v1 Announce Type: new Abstract: Deep Research (DR) has emerged as a new agentic paradigm to tackle complex, open-ended research tasks, demanding systems that can iteratively frame problems, acquire evidence, verify sources, and synthesize long-form reports. In pra…

arXiv cs.AI TIER_1 English(EN) · Xiaoou Liu, Tiejin Chen, Weibo Li, Xiyang Hu, Hua Wei · 2026-06-08 04:00

The Sim-to-Real Gap of Foundation Model Agents: A Unified MDP Perspective

arXiv:2606.07017v1 Announce Type: new Abstract: Foundation model agents are increasingly deployed for real-world decision-making, but suffer from the sim-to-real gap. While robotics and classical control have mature frameworks to address this gap, the foundation model community i…

arXiv cs.AI TIER_1 English(EN) · Yijin Zhou, Linqian Zeng, Xiaoya Lu, Wenyuan Xie, Dongrui Liu, Junchi Yan, Jing Shao · 2026-06-08 04:00

Exploring Agentic Tool-Calling Decisions via Uncertainty-Aligned Reinforcement Learning

arXiv:2606.06976v1 Announce Type: new Abstract: Large language model (LLM)-based agents often make suboptimal tool-use decisions, including unsupported tool invocation and hallucinated direct responses, which may accumulate errors throughout multi-step interactions. Existing appr…

arXiv cs.AI TIER_1 English(EN) · Zhiling Yan, Dingjie Song, Hanrong Zhang, Wei Liang, Yuxuan Zhang, Yutong Dai, Lifang He, Philip S. Yu, Ran Xu, Xiang Li, Lichao Sun · 2026-06-08 04:00

OpenSkill: Open-World Self-Evolution for LLM Agents

arXiv:2606.06741v1 Announce Type: new Abstract: Self-evolving agents require adaptation after deployment, but existing approaches assume a usable learning loop, such as curated skills, successful trajectories, or verifier signals. Real open-world deployments may provide none of …

arXiv cs.AI TIER_1 English(EN) · Ruida Wang, Jerry Huang, Pengcheng Wang, Xuanqing Liu, Luyang Kong, Tong Zhang · 2026-06-08 04:00

Lean4Agent: Formal Modeling and Verification of Agent Workflows and Trajectories

arXiv:2606.06523v1 Announce Type: new Abstract: Equipping Large Language Models (LLMs) to execute reliable multi-step workflows has become a central challenge in artificial intelligence. Despite recent advances in LLMs' agentic capabilities, most agent systems still lack formal m…

arXiv cs.LG TIER_1 English(EN) · Farhad Rezazadeh, Amir Ashtari Gargari, Hatim Chergui, Sandra Lagen, Merouane Debbah, Houbing Song, Lingjia Liu · 2026-06-08 04:00

Agentic World Modeling for 6G: Near-Real-Time Generative State-Space Reasoning

arXiv:2511.02748v2 Announce Type: replace-cross Abstract: We argue that sixth-generation (6G) intelligence is not fluent token prediction but the capacity to imagine and choose -- to simulate future scenarios, weigh trade-offs, and act with calibrated uncertainty. We reframe open…

arXiv cs.LG TIER_1 English(EN) · Yudi Zhang, Meng Fang, Zhenfang Chen, Mykola Pechenizkiy · 2026-06-08 04:00

Self-Evolving LLM Agents with In-Distribution Optimization

arXiv:2606.07367v1 Announce Type: new Abstract: Large Language Models (LLMs) have recently emerged as powerful controllers for interactive agents in complex environments, yet training them to perform reliable long-horizon decision making remains a fundamental challenge. A key dif…

arXiv cs.CL TIER_1 English(EN) · Jiaru Zou, Ling Yang, Yunzhe Qi, Sirui Chen, Mengting Ai, Ke Shen, Jingrui He, Mengdi Wang · 2026-06-08 04:00

AutoTool: Dynamic Tool Selection and Integration for Agentic Reasoning

arXiv:2512.13278v2 Announce Type: replace Abstract: Agentic reinforcement learning has advanced large language models (LLMs) to reason through long chain-of-thought trajectories while interleaving external tool use. Existing approaches assume a fixed inventory of tools, which lim…

arXiv cs.CL TIER_1 English(EN) · Zihao Deng, Yining Zhu, Leiming Wang, Jingfei Lu, Junbo Wang, Chuncheng Ran, Yu Yang, Dixuan Yang, Jikun Shen · 2026-06-08 04:00

Tree-of-Experience: A Structured Experience-Management Solution for Self-Evolving Agents in Low-Repetition, Implicit-Reward Environments

arXiv:2606.06960v1 Announce Type: new Abstract: Experience-based self-evolution is crucial for LLM agents, but existing benchmarks often assume explicit goals, stable task patterns, and clear feedback. We study a more challenging setting: low-repetition tasks with implicit reward…

Hugging Face Daily Papers TIER_1 English(EN) · 2026-06-08 00:00

Bridging the Agent-World Gap: Text World Models for LLM-Based Agents

Text world models serve as transition models for LLM-based agents in interactive environments, enabling planning and efficient learning by predicting environmental changes from textual states and actions.

Hugging Face Daily Papers TIER_1 English(EN) · 2026-06-08 00:00

OmniGameArena: A Unified UE5 Benchmark for VLM Game Agents and Their Improvement Dynamics

OmniGameArena presents a unified benchmark for evaluating vision-language model agents in diverse game settings with a reflection-based improvement protocol that tracks performance evolution and skill generalization.

Hugging Face Daily Papers TIER_1 English(EN) · 2026-06-08 00:00

SearchSwarm: Towards Delegation Intelligence in Agentic LLMs for Long-Horizon Deep Research

A large language model trained on synthesized delegation intelligence achieves superior performance on long-horizon research tasks through task decomposition and subagent coordination.

Hugging Face Daily Papers TIER_1 English(EN) · 2026-06-08 00:00

WeaveBench: A Long-Horizon, Real-World Benchmark for Computer-Use Agents Across Hybrid Interfaces

WeaveBench presents a comprehensive benchmark for evaluating computer-use agents across multiple interfaces, revealing significant challenges in long-horizon task orchestration and highlighting the limitations of traditional performance assessment methods.

arXiv cs.MA (Multiagent) TIER_1 English(EN) · Jiaxuan Guo · 2026-06-07 23:26

PerspectiveGap: A Benchmark for Multi-Agent Orchestration Prompting

Real-world LLM applications are moving beyond single-agent workflows toward orchestrated multi-agent systems, yet current models still struggle to determine what each sub-agent needs to know. To measure this, we introduce PerspectiveGap, a benchmark for evaluating LLMs' ability t…

arXiv cs.MA (Multiagent) TIER_1 English(EN) · Alex Bogdan · 2026-06-07 19:12

RAILS: Verification-Native Clearing for Agentic Commerce

Autonomous agents negotiate, purchase, deploy code, and move funds, but no neutral mechanism determines whether they met their delegated obligation, who is responsible when they did not, or which settlement action follows. This is the agentic clearing problem. Tool protocols (MCP…

arXiv cs.MA (Multiagent) TIER_1 English(EN) · Mark Burgess · 2026-06-07 10:12

Quantitative Promise Theory: Intentionality and Reasoning in Autonomous Agents

I discuss some quantitative representations of Promise Theory for processes involving autonomous agents. Agent models are common in software systems, machine learning, and biology, for example, but may also apply to physics and other forms of engineering. I describe how Bayesian …

arXiv cs.AI TIER_1 English(EN) · Yi Zeng · 2026-06-07 09:23

VESTA: A Fully Automated Scenario Generation and Safety Evaluation Framework for LLM Agents

Large language models (LLMs) are increasingly evolving from simple text-based interaction systems into LLM agents that can maintain memory, use tools, access external environments, and execute tasks. As their capabilities and autonomy expand, the safety risks they face also becom…

arXiv cs.AI TIER_1 English(EN) · Yan Liu · 2026-06-07 07:57

Predicting the Emergent Mindsets of SWE Agents by Launching a Wild Code-Comprehension Journey

Software engineering agents (SWE agents) increasingly work through tool-mediated trajectories in real repositories, yet their behavior remains difficult to characterize in concrete, observable terms. These trajectories record tool use, intermediate reasoning, evidence selection, …

arXiv cs.MA (Multiagent) TIER_1 English(EN) · Satya Nitta · 2026-06-06 22:59

Emergence World: A Platform for Evaluating Long-Horizon Multi-Agent Autonomy

Most evaluations of LLM agents look like exams: a discrete task, a clean environment, a score in minutes or hours. We argue that this approach is mismatched with the deployment conditions of autonomous systems, where the relevant timescale can be weeks to months, and where the dy…

arXiv cs.CL TIER_1 English(EN) · Jian Guo · 2026-06-06 21:40

Bayesian-Agent: Posterior-Guided Skill Evolution for LLM Agent Harnesses

LLM agents increasingly rely on external inference conditions: prompts, tools, memory, SOPs, skills, and harness feedback. These assets can improve task execution without changing model weights, but they are often revised by heuristic reflection or by reusing observed successes a…

arXiv cs.MA (Multiagent) TIER_1 English(EN) · Amos Storkey · 2026-06-06 21:13

A Benchmark for Open-Ended Multi-Agent Coordination of Language Agents

As language models are increasingly deployed as autonomous agents, they must coordinate with others over long horizons in open-ended interactive tasks. Yet existing evaluations rarely test these demands together, instead emphasising single-agent tasks, short interactions, or high…

arXiv cs.AI TIER_1 English(EN) · Amanda Hall · 2026-06-06 20:24

"There Is a Dilemma Here": How Early Adopters Building Multi-Agent LLM Systems Conceptualize Transparency

Multi-agent large language model (LLM) systems are rapidly emerging, yet transparency, a cornerstone of responsible AI, remains under-defined in these distributed architectures, which have complexities of inter-agent coordination and orchestration. In this paper, we present one o…

arXiv cs.AI TIER_1 English(EN) · Zihao Zheng · 2026-06-06 18:14

Beyond Agent Architecture: Execution Assumptions and Reproducibility in LLM Trading Systems

Large language models (LLMs) and agentic systems are increasingly proposed for financial trading, yet their reported performance remains difficult to compare because studies vary in data provenance, temporal split discipline, execution timing, turnover treatment, and transaction-…

arXiv cs.AI TIER_1 English(EN) · Jaineet Shah · 2026-06-06 17:44

Causal Agent Replay: Counterfactual Attribution of LLM-Agent Failures

When an LLM agent fails -- issues a refund it should not have, calls the wrong tool, leaks data -- existing tooling answers what happened (observability) or whether it passed (evaluation), but not which step caused the failure. The obvious heuristics are wrong: the step that exec…

arXiv cs.AI TIER_1 English(EN) · Donghoon Ham · 2026-06-06 14:37

Online Agent-as-a-Judge: Situation-Generating Evaluation for Interactive Agents

Evaluating LLM-powered interactive social agents is challenging because socially relevant behaviors depend not only on isolated outputs, but also on prior interactions, social roles, and downstream actions. Existing methods typically allow a target agent to act freely in an envir…

arXiv cs.MA (Multiagent) TIER_1 English(EN) · Dexing Liu · 2026-06-06 13:22

Silent Failures in LLM Agent Systems: The Entropy Principle and the Inevitable Disorder of Autonomous Agents

Large Language Model (LLM) agent systems suffer from failures that occur without external triggers -- no injection, no adversarial input, no resource exhaustion. These silent failures -- unexpected deviations from intended behavior under normal conditions -- are routinely misattr…

arXiv cs.MA (Multiagent) TIER_1 English(EN) · Zayx Shawn · 2026-06-06 11:12

PACE: Anytime-Valid Acceptance Testing for Self-Evolving Agents

Self-evolving agents improve by repeatedly proposing changes to their own prompts, skills, or workflows and keeping those that score higher on a small held-out set. Almost all effort has gone into the proposer that generates candidates; we argue the weak point is the acceptor, th…
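
The propose-and-keep loop described above can be sketched in a few lines. This is only an illustration of the baseline acceptance rule the abstract mentions (keep a change when it scores higher on a small held-out set), not the PACE acceptor itself, whose details are cut off in this snippet. The names `heldout_score` and `greedy_self_evolve`, and the toy "agents" modeled as plain functions, are assumptions made for the example.

```python
def heldout_score(agent, heldout):
    """Fraction of held-out (input, expected) pairs the agent answers correctly."""
    return sum(agent(x) == y for x, y in heldout) / len(heldout)


def greedy_self_evolve(agent, proposals, heldout):
    """Baseline acceptor: keep a proposal only if it beats the current score.

    Hypothetical sketch; a small held-out set makes this rule noisy, which is
    the weakness the abstract points at.
    """
    best_score = heldout_score(agent, heldout)
    history = [best_score]
    for candidate in proposals:
        score = heldout_score(candidate, heldout)
        if score > best_score:
            agent, best_score = candidate, score
        history.append(best_score)
    return agent, history


# Toy held-out set for the task "double the input".
heldout = [(1, 2), (2, 4), (3, 6)]
base = lambda x: x + 1      # correct only for x = 1
worse = lambda x: 0         # never correct, so it is rejected
better = lambda x: 2 * x    # always correct, so it is accepted

final_agent, history = greedy_self_evolve(base, [worse, better], heldout)
```

With only three held-out items, a single lucky or unlucky case moves the score by a third, so repeated greedy acceptance of this kind can easily keep changes that are not real improvements.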

arXiv cs.CL TIER_1 English(EN) · Esam Ghaleb · 2026-06-06 10:05

Aligned but Not Partner-Specific: Disentangling How Multimodal LLM Agents Succeed in Reference Games Without Human-Like Conventions

Repeated reference games test whether interlocutors replace their initially long descriptions with shorter, partner-specific conventions grounded in shared interaction history. Prior work shows that multimodal LLMs fail to become more efficient across rounds, although they align …

arXiv cs.MA (Multiagent) TIER_1 English(EN) · Christopher Pal · 2026-06-06 08:27

SKILL.nb: Selective Formalization and Gated Execution for Persistent Agent Workflows

AI agents increasingly turn past experience into reusable artifacts such as code, workflows, and procedural memories. Reuse can improve efficiency, but it also creates a lifecycle reliability problem: artifacts that succeed once may fail under environment drift, underspecified ta…

arXiv cs.AI TIER_1 English(EN) · Najmul Hasan, Prashanth BusiReddyGari · 2026-06-06 04:00

DPBench: Structural Determinants of Multi-Agent LLMs Under Simultaneous Resource Contention

arXiv:2602.13255v2 Announce Type: replace Abstract: We present DPBench, a benchmark for evaluating coordination in multi-agent systems built from large language models. Existing benchmarks measure task-level success under a fixed protocol; the structural conditions under which co…

arXiv cs.AI TIER_1 English(EN) · Chen Huang, Yuhao Wu, Wenxuan Zhang · 2026-06-06 04:00

What Should Agents Say? Action-State Communication for Efficient Multi-Agent Systems

arXiv:2606.05304v1 Announce Type: new Abstract: Multi-agent systems (MAS) built on large language models are typically organized around roles, pipelines, and turn schedules, while the content that agents pass to one another is often left as unconstrained natural language. However…

arXiv cs.AI TIER_1 English(EN) · Yuhang Fu, Ruishan Fang, Jiaqi Shao, Huiyu Zheng, Zhengtao Zhu, Bing Luo, Tao Lin · 2026-06-06 04:00

Do More Agents Help? Controlled, Protocol-Compliant Evaluation of LLM Agent Workflows

arXiv:2606.05670v1 Announce Type: new Abstract: Does adding more agents help an LLM workflow once compared systems share the same benchmark loader, tool access, answer contract, usage accounting, and trajectory logging? We introduce BenchAgent, an evaluation framework that places…

arXiv cs.AI TIER_1 English(EN) · Chengqi Dong, Chuhuai Yue, Hang He, Yandong Liu, Fenghe Tang, S Kevin Zhou, Xiaohan Wang, Jiajun Chai, Guojun Yin · 2026-06-06 04:00

TAPO: Tool-Aware Policy Optimization for Multimodal Search Agents via Credit Transfer

arXiv:2606.05784v1 Announce Type: new Abstract: We identify and formally characterize credit misassignment as a systematic failure mode of GRPO in tool-augmented multimodal search agents: its uniform broadcast of trajectory-level advantages to all tokens causes valuable tool-use …
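
The "uniform broadcast" the abstract criticizes can be made concrete with a small sketch. It assumes the usual GRPO formulation, in which each sampled trajectory gets one advantage obtained by normalizing its reward within the sampled group, and every token of that trajectory inherits the same value. The function names and toy numbers are hypothetical, and TAPO's own credit-transfer rule is not shown because the abstract is truncated.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-8):
    """One advantage per trajectory: reward normalized within its group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def broadcast_to_tokens(advantages, token_counts):
    """Copy each trajectory-level advantage onto every token of that trajectory."""
    return [[a] * n for a, n in zip(advantages, token_counts)]


# Four sampled trajectories for the same prompt (toy numbers).
rewards = [1.0, 0.0, 0.0, 1.0]   # trajectory-level outcome rewards
token_counts = [3, 2, 4, 1]      # tokens generated per trajectory

adv = group_advantages(rewards)
per_token = broadcast_to_tokens(adv, token_counts)
```

Every token in a failed trajectory receives the same negative value here, including tokens of a tool call that was itself useful, which is the kind of misassignment the abstract describes.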

arXiv cs.AI TIER_1 English(EN) · Dongsheng Zhu, Xuchen Ma, Yucheng Shen, Xiang Li, Yukun Zhao, Shuaiqiang Wang, Lingyong Yan, Dawei Yin · 2026-06-06 04:00

When Tools Fail: Benchmarking Dynamic Replanning and Exception Recovery in LLM Agents

arXiv:2606.05806v1 Announce Type: new Abstract: Existing benchmarks evaluate Tool-Integrated Reasoning (TIR) in LLMs on idealized "happy paths", largely overlooking real-world tool failures. We introduce ToolMaze, a benchmark for dynamic path discovery and error recovery in TIR…

arXiv cs.AI TIER_1 English(EN) · Dianxing Shi, Junqi He, Junhao Chen, Bowen Wang, Yuta Nakashima · 2026-06-06 04:00

Towards Healthy Evolution: Exploring the Role and Mechanisms of Human-Machine Interaction in Self-Evolving Systems

arXiv:2606.06114v1 Announce Type: new Abstract: Self-evolving agents improve through continual self-play and self-generated learning signals, but autonomous evolution can also cause capability degradation and safety drift. Although human feedback has proven effective for static a…

arXiv cs.AI TIER_1 English(EN) · Patrick Wilhelm, Odej Kao · 2026-06-06 04:00

From Reward-Hacking Activations to Agent Risk States: Context-Calibrated Mechanistic Monitoring in LLM Agents

arXiv:2606.06223v1 Announce Type: new Abstract: Language-model agents act through repeated cycles of observation, reasoning, and action selection, making safety monitoring depend on both internal model state and environment context. We study reward-hacking monitors in ReAct-style…

arXiv cs.AI TIER_1 English(EN) · Yasmine Omri, Ziyu Gan, Zachary Broveak, Robin Geens, Zexue He, Alex Pentland, Marian Verhelst, Tsachy Weissman, Thierry Tambe · 2026-06-06 04:00

Agent Memory: Characterization and System Implications of Stateful Long-Horizon Workloads

arXiv:2606.06448v1 Announce Type: new Abstract: LLM agents are increasingly deployed on long-horizon tasks requiring sustained reasoning over extended interaction histories. Realizing this at scale requires agents to persistently store, retrieve, and update their own memory acros…

arXiv cs.AI TIER_1 English(EN) · Loïs Vanhée, Melania Borit · 2026-06-06 04:00

RAINO: Anchoring Agents in Reality, a Systematic Review and Conceptual Framework for Realism in Agent-Based Modelling

arXiv:2606.05167v1 Announce Type: cross Abstract: Realism is a central yet seemingly under-theorized concept in Agent-Based Modelling. This paper presents a Systematic Literature Review, aiming to identify how realism is currently operationalized and demonstrated. The results sho…

arXiv cs.AI TIER_1 English(EN) · Shipi Dhanorkar, Samir Passi, Mihaela Vorvoreanu · 2026-06-06 04:00

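Human Oversight of Agentic Systems in Practice: Examining the Oversight Work, Challenges, and Heuristics of Developers Using Software Agents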

arXiv:2606.05391v1 Announce Type: cross Abstract: Autonomous software agents hold promise to increase developer productivity but make mistakes and exhibit novel failure modes, making human oversight central to successful human-agent collaboration. Existing research on agent overs…

arXiv cs.AI TIER_1 English(EN) · Jintao Huang, Xiaomin Li, Gaurav Mittal, Yu Hu · 2026-06-06 04:00

ADK Arena: Evaluating Agent Development Kits with LLMs as Developers

arXiv:2606.05548v1 Announce Type: cross Abstract: The rapid proliferation of Agent Development Kits (ADKs), SDK-level frameworks for building LLM-powered autonomous agents, has outpaced any empirical understanding of how framework choice affects agent performance. We propose \tex…

arXiv cs.AI TIER_1 English(EN) · Eric Bridgeford, Hayden Helm · 2026-06-06 04:00

Detecting Perspective Shifts in Multi-Agent Systems

arXiv:2512.05013v2 Announce Type: replace Abstract: Generative models augmented with external tools and update mechanisms (or agents) have demonstrated capabilities beyond intelligent prompting of base models. As agent use proliferates, dynamic multi-agent systems have n…

Hugging Face Daily Papers TIER_1 English(EN) · 2026-06-06 00:00

Bayesian-Agent: Posterior-Guided Skill Evolution for LLM Agent Harnesses

Bayesian-Agent presents a framework that treats reusable skills and SOPs as hypotheses for model success, using Bayesian inference to guide agent behavior and improve task performance through posterior-guided harness optimization.

arXiv cs.MA (Multiagent) TIER_1 English(EN) · Richard B. Vilim · 2026-06-05 21:54

Overcoming Regulatory Bottlenecks with Agent-to-Agent Protocols: A Nuclear Energy Case Study

Regulatory review of advanced nuclear reactor designs routinely spans more than three years and consumes hundreds of millions of dollars in combined regulator and applicant labor. We present the Regulatory Context Protocol (RCP), an Agent-to-Agent communication standard that repl…

arXiv cs.MA (Multiagent) TIER_1 English(EN) · Faisal Fareed · 2026-06-05 21:13

Cost-Aware Speculative Execution for LLM-Agent Workflows: An Integrated Five-Dimensional Approach

LLM-agent workflows chain model calls and tool invocations, and spend most of their wall-clock time waiting on upstream operations before downstream ones can start. Speculative execution can reclaim that idle time by launching a downstream operation with a predicted upstream inpu…
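
The basic idea of speculative execution in an agent workflow can be sketched as follows: start the downstream step on a predicted upstream result while the real upstream step is still running, then keep the speculative output only if the prediction turns out to be right. This is a minimal sketch under stated assumptions. The helper `run_with_speculation` and the toy `lookup_city` and `fetch_weather` steps are hypothetical, and the cost model that the paper's title refers to is not represented here.

```python
from concurrent.futures import ThreadPoolExecutor


def run_with_speculation(upstream, downstream, predicted_input):
    """Run downstream on a guess while upstream is still in flight.

    Returns (result, speculation_hit). Hypothetical sketch, not the paper's method.
    """
    with ThreadPoolExecutor(max_workers=2) as pool:
        upstream_future = pool.submit(upstream)
        speculative_future = pool.submit(downstream, predicted_input)

        actual_input = upstream_future.result()
        if actual_input == predicted_input:
            # Prediction was right: the downstream work is already done.
            return speculative_future.result(), True

        # Misprediction: discard the speculative work and rerun on the real input.
        speculative_future.cancel()  # best effort; a running task is not interrupted
        return downstream(actual_input), False


def lookup_city():
    """Stand-in for a slow upstream model or tool call."""
    return "Paris"


def fetch_weather(city):
    """Stand-in for the downstream tool call."""
    return f"weather({city})"


hit = run_with_speculation(lookup_city, fetch_weather, "Paris")
miss = run_with_speculation(lookup_city, fetch_weather, "Rome")
```

A misprediction still pays for the wasted downstream call, which is why a practical scheduler would weigh the expected latency saved against the cost of discarded work.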

arXiv cs.MA (Multiagent) TIER_1 English(EN) · Prashanth BusiReddyGari · 2026-06-05 21:13

GRPO Cannot Close the Multi-Agent Coordination Gap

We measure how well current large language models coordinate as multiple agents sharing a common resource, using the dining philosophers problem as a clean test bed. Across 630 episodes spanning seven models and three philosopher counts, four frontier closed-source systems reach …

arXiv cs.NE (Neural & Evolutionary) TIER_1 English(EN) · Alane Suhr · 2026-06-05 19:59

Representational Similarity and Model Behavior in Multi-Agent Interaction

Researchers have shown that neural similarity among humans predicts social closeness and cooperative success, whereas innovation often emerges from interactions among dissimilar individuals. We investigate whether these principles extend to artificial intelligence by examining in…

arXiv cs.MA (Multiagent) TIER_1 English(EN) · Zenglin Xu · 2026-06-05 19:33

Beyond Goodhart's Law: A Dynamic Benchmark for Compliance Evaluation of Multi-Agent Systems

The rapid evolution of Large Language Models (LLMs) from passive assistants to autonomous, execution-capable agents has introduced critical operational risks. Most current evaluation frameworks neglect procedural compliance, leading to "Machiavellian" behaviors where agents str…

arXiv cs.CL TIER_1 English(EN) · Yunzhe Tao · 2026-06-05 17:59

Agentopia: Long-Term Life Simulation and Learning in an Agent Society

Humans learn from social life. Simulating this process with LLM-powered agents represents a promising research direction, raising a natural question: whether LLMs can learn from such simulated social experience to better understand and replicate human behavior. However, prior age…

arXiv cs.AI TIER_1 English(EN) · Lin Qu · 2026-06-05 16:00

Socratic-SWE: Self-Evolving Coding Agents via Trace-Derived Agent Skills

LLM-driven software engineering agents have become a central testbed for real-world language-model capability, yet their training remains limited by the availability of high-quality SWE tasks. Existing synthetic data methods typically create tasks through fixed mutation or bug-in…

arXiv cs.LG TIER_1 English(EN) · Mykola Pechenizkiy · 2026-06-05 15:09

Self-Evolving LLM Agents with In-Distribution Optimization

Large Language Models (LLMs) have recently emerged as powerful controllers for interactive agents in complex environments, yet training them to perform reliable long-horizon decision making remains a fundamental challenge. A key difficulty lies in credit assignment: agents often …

arXiv cs.MA (Multiagent) TIER_1 English(EN) · Xianbin Wang · 2026-06-05 14:35

Hierarchical Certified Semantic Commitment for Byzantine-Fault-Tolerant LLM-Agent Collaboration

Byzantine collaboration among large-language-model agents requires a finality-control primitive: given delivered stochastic, structured natural-language proposals, the protocol must decide whether the round supports a commit, what kind of commit, or a typed safe abort. Naive aggr…

arXiv cs.AI TIER_1 English(EN) · Dawei Yin · 2026-06-05 14:10

DuMate-DeepResearch: An Auditable Multi-Agent System with Recursive Search and Rubric-Based Reasoning

Deep Research (DR) has emerged as a new agentic paradigm to tackle complex, open-ended research tasks, demanding systems that can iteratively frame problems, acquire evidence, verify sources, and synthesize long-form reports. In practice, however, current DR systems are constrain…

arXiv cs.CL TIER_1 English(EN) · Hua Wei · 2026-06-05 08:00

The Sim-to-Real Gap of Foundation Model Agents: A Unified MDP Perspective

Foundation model agents are increasingly deployed for real-world decision-making, but suffer from the sim-to-real gap. While robotics and classical control have mature frameworks to address this gap, the foundation model community is treating agent robustness as an entirely novel…

arXiv cs.AI TIER_1 English(EN) · Jing Shao · 2026-06-05 07:08

Exploring Agentic Tool-Calling Decisions via Uncertainty-Aligned Reinforcement Learning

Large language model (LLM)-based agents often make suboptimal tool-use decisions, including unsupported tool invocation and hallucinated direct responses, which may accumulate errors throughout multi-step interactions. Existing approaches mainly improve these behaviors through in…

arXiv cs.CL TIER_1 English(EN) · Jikun Shen · 2026-06-05 06:39

Tree-of-Experience: A Structured Experience-Management Solution for Self-Evolving Agents in Low-Repetition, Implicit-Reward Environments

Experience-based self-evolution is crucial for LLM agents, but existing benchmarks often assume explicit goals, stable task patterns, and clear feedback. We study a more challenging setting: low-repetition tasks with implicit rewards, where past experience is difficult to reuse a…

arXiv cs.LG TIER_1 English(EN) · Kaixuan Liu, Guojun Xiong, Weinan Zhang, Shengpu Tang · 2026-06-05 04:00

Autoregressive Diffusion World Models for Off-Policy Evaluation of LLM Agents

arXiv:2606.05558v1 Announce Type: new Abstract: Evaluating large language model (LLM) agents in multi-turn interactive environments is expensive and risky, as it requires online environment interaction. We propose ADWM (Autoregressive Diffusion World Model), an evaluation framewo…

arXiv cs.CL TIER_1 English(EN) · Yang Li, Jiaxiang Liu, Jiang Cai, Mingkun Xu · 2026-06-05 04:00

AURA: Intent-Directed Probing for Implicit Needs in Situated LLM Agents

arXiv:2606.05557v1 Announce Type: new Abstract: A situated query like "where is Lin Wei?" often encodes more than its literal content: the user may also want to know whether Lin Wei is free, in a good mood, or worth interrupting now. Standard tool-use agents answer the literal qu…

arXiv cs.LG TIER_1 English(EN) · Yoga Sri Varshan Varadharajan, Bodun Hu, Saurabh Agarwal, Aditya Akella · 2026-06-05 04:00

CUCo: An Agentic Framework for Compute and Communication Co-Design

arXiv:2603.02376v2 Announce Type: replace-cross Abstract: Computation and communication in distributed LLM training and inference are traditionally optimized in isolation; expert-crafted systems such as DeepEP, FLUX, and TokenWeave show the potential of co-design but require deep…

arXiv cs.LG TIER_1 English(EN) · Muhammad Talha Sharif, Abdul Rehman · 2026-06-05 04:00

Critic-Guided Heterogeneous Multi-Agent Reasoning for Reliable Mathematical Problem Solving

arXiv:2606.05704v1 Announce Type: cross Abstract: Recent Large Language Models (LLMs) have shown impressive reasoning abilities, but they are still susceptible to hallucinations, intermediate reasoning mistakes, and unreliable reasoning results in complex mathematical reasoning p…

arXiv cs.LG TIER_1 English(EN) · Oleeviya Babu Poikarayil, Cédric Schockaert, Abdulrahman Nahhas, Christian Daase, Mursal Dawodi, Jawid Ahmad Baktash · 2026-06-05 04:00

GenAutoML: An Agentic Framework for Dynamic Architecture Generation and Optimization in Time-Series Analysis

arXiv:2606.05860v1 Announce Type: new Abstract: Designing neural architectures for time-series forecasting and anomaly detection remains a resource-intensive task that often requires substantial domain expertise. Traditional Automated Machine Learning (AutoML) systems typically r…

arXiv cs.CL TIER_1 English(EN) · Yingzhuo Liu · 2026-06-05 04:00

Beyond Tokens: A Unified Framework for Latent Communication in LLM-Driven Multi-Agent Systems

arXiv:2606.05711v1 Announce Type: new Abstract: Multi-agent systems built on large language models (LLMs) have become a prevailing paradigm for tackling complex reasoning, planning, and tool-use tasks. The dominant communication protocol in such systems is natural language: agent…

arXiv cs.CL TIER_1 English(EN) · Shaoyang Xu, Jingshen Zhang, Long P. Hoang, Jinyuan Li, Wenxuan Zhang · 2026-06-05 04:00

Beyond Alignment: Value Diversity as a Collective Property in Multicultural Agent Systems

arXiv:2606.05985v1 Announce Type: new Abstract: Multicultural multi-agent systems are increasingly deployed in globally diverse settings, where different agents are grounded in different cultural backgrounds. Existing cultural evaluation focuses on value alignment: how closely a …

arXiv cs.CL TIER_1 English(EN) · Jiaju Chen, Bo Sun, Yuxuan Lu, Yun Wang, Dakuo Wang, Bingsheng Yao · 2026-06-05 04:00

CollabSim: A CSCW-Based Approach to Studying the Collaborative Abilities of LLM Agents Through Controlled Multi-Agent Experiments

arXiv:2606.06399v1 Announce Type: new Abstract: Multi-agent systems (MAS) built on large language models have shown growing promise, with their effectiveness resting on agents' ability to coordinate through text-based channels much as human teams do. Yet recent study suggests tha…

arXiv cs.LG TIER_1 English(EN) · Hao Bai, Rui Yang, Chenlu Ye, Spencer Whitehead, Aviral Kumar, Tong Zhang · 2026-06-05 04:00

AsyncWebRL: Efficient Multi-Step Reinforcement Learning for Visual Web Agents

arXiv:2606.05597v1 Announce Type: new Abstract: Training vision-language web agents with multi-step RL is compute-intensive, with two dominant forms of inefficiency: idle GPUs in synchronous RL, and trajectories that use more steps and tokens than necessary. We present AsyncWebRL…

arXiv cs.IR (Information Retrieval) TIER_1 English(EN) · Xueguang Ma · 2026-06-05 03:47

Toward Interaction Spaces for Retrieval in Agentic Search

Retrieval for search agents is still inherited from non-agentic information retrieval: a retriever ranks the corpus and the agent reads a small set of returned documents. Recent direct corpus interaction (DCI) work shows that agents can instead interact with the raw corpus throug…

Hugging Face Daily Papers TIER_1 English(EN) · 2026-06-05 00:00

Socratic-SWE: Self-Evolving Coding Agents via Trace-Derived Agent Skills

Socratic-SWE enables self-evolving software engineering agents by leveraging historical solving traces to generate targeted repair tasks that improve agent performance through iterative refinement.

Hugging Face Daily Papers TIER_1 English(EN) · 2026-06-05 00:00

DuMate-DeepResearch: An Auditable Multi-Agent System with Recursive Search and Rubric-Based Reasoning

A multi-agent framework for deep research tasks that addresses planning, evidence acquisition, and report synthesis through decoupled components and dynamic optimization mechanisms.

Hugging Face Daily Papers TIER_1 English(EN) · 2026-06-05 00:00

Toward Interaction Spaces for Retrieval in Agentic Search

RISE framework constructs bounded interaction spaces for agentic search by combining BM25 retrieval with preprocessed document indexing to enable efficient corpus exploration while maintaining high accuracy at scale.

arXiv cs.CL TIER_1 English(EN) · Lichao Sun · 2026-06-04 21:55

OpenSkill: Open-World Self-Evolution for LLM Agents

Self-evolving agents require adaptation after deployment, but existing approaches assume a usable learning loop, such as curated skills, successful trajectories, or verifier signals. Real open-world deployments may provide none of these, offering only a task prompt. In this work…

arXiv cs.CL TIER_1 English(EN) · Ian Lane · 2026-06-04 20:48

Signal-Driven Observation for Long-Horizon Web Agents

Web agents operating over long horizons ingest raw DOM and accessibility trees -- routinely tens of thousands of tokens -- at every action step, causing progressive context degradation that erodes reasoning well before tasks complete. We argue that this coupling of observation fr…

arXiv cs.AI TIER_1 English(EN) · Thierry Tambe · 2026-06-04 17:44

Agent Memory: Characterization and System Implications of Stateful Long-Horizon Workloads

LLM agents are increasingly deployed on long-horizon tasks requiring sustained reasoning over extended interaction histories. Realizing this at scale requires agents to persistently store, retrieve, and update their own memory across sessions. A rich ecosystem of agent memory sys…

arXiv cs.CL TIER_1 English(EN) · Bingsheng Yao · 2026-06-04 17:06

CollabSim: A CSCW-Based Approach to Studying the Collaborative Abilities of LLM Agents Through Controlled Multi-Agent Experiments

Multi-agent systems (MAS) built on large language models have shown growing promise, with their effectiveness resting on agents' ability to coordinate through text-based channels much as human teams do. Yet recent study suggests that MAS often falter not because agents lack indiv…

Hugging Face Daily Papers TIER_1 English(EN) · 2026-06-04 14:34

From Reward-Hacking Activations to Agent Risk States: Context-Calibrated Mechanistic Monitoring in LLM Agents

Language-model agents act through repeated cycles of observation, reasoning, and action selection, making safety monitoring depend on both internal model state and environment context. We study reward-hacking monitors in ReAct-style agents acting in Gameable ALFWorld and WebShop.…

arXiv cs.AI TIER_1 English(EN) · Odej Kao · 2026-06-04 14:34

From Reward-Hacking Activations to Agent Risk States: Context-Calibrated Mechanistic Monitoring in LLM Agents

Language-model agents act through repeated cycles of observation, reasoning, and action selection, making safety monitoring depend on both internal model state and environment context. We study reward-hacking monitors in ReAct-style agents acting in Gameable ALFWorld and WebShop.…

arXiv cs.AI TIER_1 English(EN) · Yuta Nakashima · 2026-06-04 13:03

Toward Healthy Evolution: Exploring the Role and Mechanisms of Human-Machine Interaction in Self-Evolving Systems

Self-evolving agents improve through continual self-play and self-generated learning signals, but autonomous evolution can also cause capability degradation and safety drift. Although human feedback has proven effective for static and post-trained agents, its role in self-evolvin…

arXiv cs.CL TIER_1 English(EN) · Wenxuan Zhang · 2026-06-04 10:26

Beyond Alignment: Value Diversity as a Collective Property in Multicultural Agent Systems

Multicultural multi-agent systems are increasingly deployed in globally diverse settings, where different agents are grounded in different cultural backgrounds. Existing cultural evaluation focuses on value alignment: how closely a single agent matches a target culture. Yet align…

Hugging Face Daily Papers TIER_1 English(EN) · 2026-06-04 09:26

Retrospective Harness Optimization: Improving LLM Agents via Self-Preference over Trajectory Rollouts

Retrospective Harness Optimization (RHO) is a self-supervised method that improves AI agent performance by optimizing agent harness using only past trajectories through diverse task selection, parallel re-solving, and self-validation techniques.

Hugging Face Daily Papers TIER_1 English(EN) · 2026-06-04 04:52

Critic-Guided Heterogeneous Multi-Agent Reasoning for Reliable Mathematical Problem Solving

Recent Large Language Models (LLMs) have shown impressive reasoning abilities, but they are still susceptible to hallucinations, intermediate reasoning mistakes, and unreliable reasoning results in complex mathematical reasoning problems. In this study, we introduce a critic-base…

arXiv cs.AI TIER_1 English(EN) · Pietro Lugato, Luca Lavezzo, Jason Mohoney, Hasan Ozturk, Muhammad Hassan Ahmed, Juan Pablo Salas, Viphava Ohm, Krittin Phornsiricharoenphant, Gabriele Benelli, Mariarosaria D'Alfonso, Manasvita Joshi, Warren Nam, Aron Soha, Samantha Sunnarborg, Austin S… · 2026-06-04 04:00

Archi: Agentic Operations in the CMS Experiment

arXiv:2606.04755v1 Announce Type: cross Abstract: We present Archi, an open-source, end-to-end framework for scientific collaborations that combines the systematic ingestion and organization of heterogeneous data sources with the deployment of configurable, private, and extensibl…

arXiv cs.AI TIER_1 English(EN) · Xinyu Lu, Tianshu Wang, Pengbo Wang, Zujie Wen, Zhiqiang Zhang, Jun Zhou, Boxi Cao, Yaojie Lu, Hongyu Lin, Xianpei Han, Le Sun · 2026-06-04 04:00

The Meta-Agent Challenge: Can Current Agents Achieve Autonomous Agent Development?

arXiv:2606.04455v1 Announce Type: new Abstract: Current AI benchmarks evaluate agents on task execution within human-designed workflows. These evaluations fundamentally fail to measure a critical next-level capability: whether models can autonomously develop agent systems. We int…

arXiv cs.AI TIER_1 English(EN) · Zhichao Yang, Yuanze Hu, Haojie Hao, Longkun Hao, Dongshuo Huang, Hongyu Lin, Gen Li, Lanqing Hong, Yihang Lou, Yan Bai · 2026-06-04 04:00

MIRAGE: Mobile Agents with Implicit Reasoning and Generative World Models

arXiv:2606.04627v1 Announce Type: new Abstract: Mobile agents are increasingly expected to operate everyday applications from screenshots and language goals, where reliable control requires reasoning over screen affordances, multi-step navigation, and future state changes. Howeve…

arXiv cs.AI TIER_1 English(EN) · Zachary Blumenfeld, Jim Webber · 2026-06-04 04:00

AIP: A Graph Representation for Learning and Governing Agent Skills

arXiv:2606.04781v1 Announce Type: new Abstract: Agent Skills today consist largely of free-form prose requiring the agent to read, interpret, and re-derive how to act in every session. This imposes two compounding costs: reduced reliability on implementation-heavy tasks, and diff…

arXiv cs.AI TIER_1 English(EN) · Samuel H. Christie V, Amit K. Chopra, Munindar P. Singh · 2026-06-04 04:00

Strabo: Declarative Specification and Implementation of Agentic Interaction Protocols

arXiv:2606.05043v1 Announce Type: new Abstract: The last few years have witnessed major advances in the modeling and implementation of multiagent systems based on declarative interaction protocols. Our contribution, Strabo, establishes the relevance of these advances to ongoing i…

arXiv cs.AI TIER_1 English(EN) · Zexun Wang · 2026-06-04 04:00

Proof-Carrying Agent Actions: Model-Agnostic Runtime Governance for Heterogeneous Agent Systems

arXiv:2606.04104v1 Announce Type: cross Abstract: Agent systems execute through runtimes with very different control points: local coding tools, framework SDKs, managed agent platforms, API gateways, and observer-only integrations. A high-risk action such as publishing data exter…

arXiv cs.AI TIER_1 English(EN) · Zhen Yang, Xiaogang Xu, Wen Wang, Cong Chen, Xander Xu, Ying-Cong Chen · 2026-06-04 04:00

Streaming Communication in Multi-Agent Reasoning

arXiv:2606.05158v1 Announce Type: cross Abstract: Multi-agent reasoning systems adopt a "generate-then-transfer" paradigm that forces end-to-end latency to scale linearly with pipeline depth. We introduce StreamMA, a multi-agent reasoning system that streams each reasoning step t…

arXiv cs.AI TIER_1 English(EN) · Pavan C Shekar, Aswanth Krishnan · 2026-06-04 04:00

Adaptive Minds: Empowering Agents with LoRA-as-Tools

arXiv:2510.15416v2 Announce Type: replace Abstract: We investigate a framework in which LoRA adapters are treated as callable tools that a base language model can dynamically select and invoke. We hypothesize that, when adapters are trained to provide strong domain-specific gains…

arXiv cs.AI TIER_1 English(EN) · Jinbo Liu, Defu Cao, Yifei Wei, Tianyao Su, Yuan Liang, Yushun Dong, Yan Liu, Yue Zhao, Xiyang Hu · 2026-06-04 04:00

Topology Matters: Measuring Memory Leakage in Multi-Agent LLMs

arXiv:2512.04668v4 Announce Type: replace-cross Abstract: Graph topology is a fundamental determinant of memory leakage in multi-agent LLM systems, yet its effects remain poorly quantified. We introduce MAMA (Multi-Agent Memory Attack), a controlled evaluation framework for compa…

arXiv cs.AI TIER_1 English(EN) · Hojjat Navidan, Mohammad Cheraghinia, Jaron Fontaine, Mohamed Seif, Eli De Poorter, H. Vincent Poor, Ingrid Moerman, Adnan Shahid · 2026-06-04 04:00

Toward Autonomous O-RAN: A Multi-Scale Agentic AI Framework for Real-Time Network Control and Management

arXiv:2602.14117v2 Announce Type: replace-cross Abstract: Open Radio Access Networks (O-RAN) promise flexible 6G network access through disaggregated, software-driven components and open interfaces, but this programmability also increases operational complexity. Multiple control …

arXiv cs.CL TIER_1 English(EN) · Xinyu Pang, Zhanke Zhou, Xuan Li, Fangrui Lv, Shanshan Wei, Sen Cui, Bo Han, Changshui Zhang · 2026-06-04 04:00

Deliberate Evolution: Agentic Reasoning for Sample-Efficient LLM-Based Symbolic Regression

arXiv:2606.04360v1 Announce Type: new Abstract: Symbolic regression (SR) discovers compact mathematical expressions from data, yet recent LLM-based evolutionary methods remain sample-inefficient because they rely mainly on scalar feedback such as MSE. We identify a core limitatio…

arXiv cs.CL TIER_1 English(EN) · Yuqian Wu, Zhijie Deng, Wei Chen, Junwei Li, Yutian Jiang, Junle Chen, Zhengjun Huang, Qingxiang Liu, Jing Tang, Jiaheng Wei, Yuxuan Liang · 2026-06-04 04:00

LifeSide: Benchmarking Agents as Lifelong Digital Companions

arXiv:2606.04660v1 Announce Type: new Abstract: Lifelong digital companions must integrate cross-session cues, continually update their understanding of users, and adapt to shifting privacy boundaries. Existing evaluations fail to capture this, testing memory recall and short-ter…

arXiv cs.CL TIER_1 English(EN) · Haoyu Sun, Wenxuan Wang, Mingyang Song, Jujie He, Weinan Zhang, Yang Liu, Yang Yang, Yu Cheng · 2026-06-04 04:00

Agent Planning Benchmark: A Diagnostic Framework for LLM Agent Planning Capabilities

arXiv:2606.04874v1 Announce Type: new Abstract: Planning is central to LLM agents: before acting, an agent must decompose goals, select tools, reason over constraints, and decide when a task is infeasible. Yet existing agent evaluations often report only end-to-end success, makin…

arXiv cs.CL TIER_1 English(EN) · Aliakbar Mehdizadeh, Martin Hilbert · 2026-06-04 04:00

Exploring the Topology and Memory of Consensus: How LLM Agents Agree, Fragment, or Stabilize When Forming Conventions

arXiv:2606.04197v1 Announce Type: cross Abstract: How much should an LLM agent remember, and how should multi-agent systems be connected when trying to reach consensus? We show these two design choices interact in a way that flips the sign of memory's effect on coordination. Acro…

Hugging Face Daily Papers TIER_1 English(EN) · 2026-06-04 00:00

AURA: Intent-Directed Probing for Surfacing Implicit Needs in Situated LLM Agents

AURA enhances query answering by incorporating an intent inference step that estimates implicit needs and optimizes tool usage through gap scoring, achieving better implicit-need coverage and reduced probe consumption compared to standard approaches.

Hugging Face Daily Papers TIER_1 English(EN) · 2026-06-04 00:00

When Tools Fail: Benchmarking Dynamic Replanning and Exception Recovery in LLM Agents

ToolMaze benchmark reveals that real-world tool failures significantly degrade TIR performance, with implicit semantic failures causing the most severe drops and dynamic replanning emerging as a key bottleneck.

Hugging Face Daily Papers TIER_1 English(EN) · 2026-06-04 00:00

OpenSkill: Open-World Self-Evolution for LLM Agents

OpenSkill enables self-evolving agents to develop skills and verification signals from scratch using open-world resources without target-task supervision, achieving high automated performance across benchmarks.

Hugging Face Daily Papers TIER_1 English(EN) · 2026-06-04 00:00

AsyncWebRL: Efficient Multi-Step Reinforcement Learning for Visual Web Agents

AsyncWebRL improves vision-language web agent training through asynchronous reinforcement learning and trajectory normalization modifications, achieving faster throughput and better performance on challenging tasks.

arXiv cs.MA (Multiagent) TIER_1 English(EN) · Amit K. Chopra · 2026-06-03 19:52

Ahoy: LLMs Enacting Multiagent Interaction Protocols

An interaction protocol formalizes how the agents in a multiagent system interact, which facilitates implementing agents. Existing approaches yield agent implementations specific to the selected protocols. How can we engineer intelligent agents that can enact protocols but are pr…

arXiv cs.CL TIER_1 English(EN) · Ying-Cong Chen · 2026-06-03 17:57

Streaming Communication in Multi-Agent Reasoning

Multi-agent reasoning systems adopt a "generate-then-transfer" paradigm that forces end-to-end latency to scale linearly with pipeline depth. We introduce StreamMA, a multi-agent reasoning system that streams each reasoning step to downstream agents as soon as it is generated, pi…

arXiv cs.AI TIER_1 English(EN) · Munindar P. Singh · 2026-06-03 16:05

Strabo: Declarative Specification and Implementation of Agentic Interaction Protocols

The last few years have witnessed major advances in the modeling and implementation of multiagent systems based on declarative interaction protocols. Our contribution, Strabo, establishes the relevance of these advances to ongoing industry efforts in Agentic AI. Specifically, we …

arXiv cs.MA (Multiagent) TIER_1 English(EN) · Dexing Liu · 2026-06-03 13:55

Channel Fracture: An Architectural Blind Spot in Scheduled Cross-Agent Memory Injection in Multi-Agent Orchestration Systems

Multi-agent AI orchestration systems increasingly rely on persistent memory to maintain context across sessions, agents, and tasks. When one agent must inject knowledge into another agent's memory -- a common requirement in hierarchical team architectures -- the delivery mechanis…

arXiv cs.MA (Multiagent) TIER_1 English(EN) · Levent Liu · 2026-06-03 13:55

Channel Fracture: An Architectural Blind Spot in Scheduled Cross-Agent Memory Injection in Multi-Agent Orchestration Systems

Multi-agent AI orchestration systems increasingly rely on persistent memory to maintain context across sessions, agents, and tasks. When one agent must inject knowledge into another agent's memory -- a common requirement in hierarchical team architectures -- the delivery mechanis…

arXiv cs.CL TIER_1 English(EN) · Yu Cheng · 2026-06-03 13:37

Agent Planning Benchmark: A Diagnostic Framework for LLM Agent Planning Capabilities

Planning is central to LLM agents: before acting, an agent must decompose goals, select tools, reason over constraints, and decide when a task is infeasible. Yet existing agent evaluations often report only end-to-end success, making it difficult to determine whether failures ste…

Hugging Face Daily Papers TIER_1 English(EN) · 2026-06-03 12:02

AIP: A Graph Representation for Learning and Governing Agent Skills

Agent Skills today consist largely of free-form prose requiring the agent to read, interpret, and re-derive how to act in every session. This imposes two compounding costs: reduced reliability on implementation-heavy tasks, and difficulty in skill creation and improvement, since …

arXiv cs.LG TIER_1 English(EN) · Jim Webber · 2026-06-03 12:02

AIP: A Graph Representation for Learning and Governing Agent Skills

Agent Skills today consist largely of free-form prose requiring the agent to read, interpret, and re-derive how to act in every session. This imposes two compounding costs: reduced reliability on implementation-heavy tasks, and difficulty in skill creation and improvement, since …

Hugging Face Daily Papers TIER_1 English(EN) · 2026-06-03 11:38

Archi: Agentic Operations in the CMS Experiment

We present Archi, an open-source, end-to-end framework for scientific collaborations that combines the systematic ingestion and organization of heterogeneous data sources with the deployment of configurable, private, and extensible agents that retrieve and reason over them. An in…

arXiv cs.IR (Information Retrieval) TIER_1 English(EN) · Christoph Paus · 2026-06-03 11:38

Archi: Agentic Operations in the CMS Experiment

We present Archi, an open-source, end-to-end framework for scientific collaborations that combines the systematic ingestion and organization of heterogeneous data sources with the deployment of configurable, private, and extensible agents that retrieve and reason over them. An in…

arXiv cs.CL TIER_1 English(EN) · Yuxuan Liang · 2026-06-03 09:37

LifeSide: Benchmarking Agents as Lifelong Digital Companions

Lifelong digital companions must integrate cross-session cues, continually update their understanding of users, and adapt to shifting privacy boundaries. Existing evaluations fail to capture this, testing memory recall and short-term empathy in isolation. To bridge this gap, we i…

arXiv cs.AI TIER_1 English(EN) · Yan Bai · 2026-06-03 09:01

MIRAGE: Mobile Agents with Implicit Reasoning and Generative World Models

Mobile agents are increasingly expected to operate everyday applications from screenshots and language goals, where reliable control requires reasoning over screen affordances, multi-step navigation, and future state changes. However, many agents externalize this computation as l…

arXiv cs.AI TIER_1 English(EN) · Yingqi Zhang · 2026-06-03 04:00

Agent libOS: A Library-OS-Inspired Runtime for Long-Running, Capability-Controlled LLM Agents

arXiv:2606.03895v1 Announce Type: cross Abstract: Large language model (LLM) agents are evolving from request-response assistants into long-running software actors: they maintain state across model calls, fork subtasks, wait for external events, request human authority, generate …

arXiv cs.AI TIER_1 English(EN) · Farooq Shaikh · 2026-06-03 04:00

FORGE: Multi-Agent Progressive Exploitation and Detection Engineering

arXiv:2606.03453v1 Announce Type: cross Abstract: Vulnerability disclosure volumes now far exceed organizational assessment capacity, yet three adjacent research communities (proof-of-concept generation, vulnerability prioritization, and detection rule engineering) operate largel…

arXiv cs.AI TIER_1 English(EN) · Jianglin Lu, Hailing Wang, Xu Ma, Qihua Dong, Mingyuan Zhang, Yizhou Wang, Yun Fu · 2026-06-03 04:00

MUSE: A Unified Agentic Framework for MLLMs

arXiv:2606.03005v1 Announce Type: cross Abstract: Despite rapid progress, multimodal large language models (MLLMs) still fail on tasks that humans solve effortlessly, such as navigating a grid maze from a screenshot or selecting the correct puzzle piece. Rather than retraining th…

arXiv cs.AI TIER_1 English(EN) · Hengrui Gu, Xiaotian Han, Kaixiong Zhou · 2026-06-03 04:00

WRIT: Write-Read-Intensive Trajectory Synthesis for Multi-Turn User-Facing Agents

arXiv:2606.02908v1 Announce Type: cross Abstract: Multi-turn user-facing agents must infer user intent from incomplete requests, collect missing information through dialogue and tools, and execute valid actions. A training trajectory records this process as an interleaved sequenc…

arXiv cs.AI TIER_1 English(EN) · Zhenting Qi, Huangyuan Su, Ao Qu, Chenyu Wang, Yu Yao, Han Zheng, Kushal Chattopadhyay, Guowei Xu, Zihan Wang, Weirui Ye, Vijay Janapa Reddi, Ju Li, Paul Pu Liang, Himabindu Lakkaraju, Sham Kakade, Yilun Du · 2026-06-03 04:00

Economy of Minds: Emergent Multi-Agent Intelligence with Economic Interactions

arXiv:2606.02859v1 Announce Type: cross Abstract: How can a population of agents self-orchestrate and self-adapt into stronger collective intelligence without centralized control? Inspired by Friedrich Hayek's economic theory of decentralized coordination in markets, we study thi…

arXiv cs.AI TIER_1 English(EN) · Blaž Bertalanič, Carolina Fortuna · 2026-06-03 04:00

The Ringelmann Effect in Multi-Agent LLM Systems: A Scaling Law for Effective Team Size

arXiv:2606.02646v1 Announce Type: cross Abstract: Inference-time multi-agent LLM scaling lacks a shared unit: counting nominal agents conflates cost with independent evidence. We derive a two-parameter scaling law $R(N) = N_\text{eff}/N = 1/(1+c(N-1)N^{-\beta})$ where the regime …
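The scaling law quoted in this abstract can be evaluated directly. The following is a minimal sketch, assuming only the formula as printed in the snippet; the function names and the parameter values in the usage note are hypothetical illustrations, not values from the paper, and the truncated part of the abstract is not reconstructed.

```python
def efficiency_ratio(n: int, c: float, beta: float) -> float:
    """Return R(N) = N_eff / N = 1 / (1 + c * (N - 1) * N**(-beta)).

    n    : nominal number of agents (N >= 1)
    c    : first parameter of the law as printed (value is an assumption here)
    beta : exponent of the law as printed (value is an assumption here)
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    return 1.0 / (1.0 + c * (n - 1) * n ** (-beta))


def effective_team_size(n: int, c: float, beta: float) -> float:
    """Return N_eff = N * R(N), the effective number of agents."""
    return n * efficiency_ratio(n, c, beta)


if __name__ == "__main__":
    # Hypothetical parameters, chosen only to show how the curve behaves.
    for n in (1, 2, 4, 8, 16):
        print(n, efficiency_ratio(n, c=0.5, beta=1.0), effective_team_size(n, c=0.5, beta=1.0))
```

Two limiting cases follow from the formula itself: with c = 0 every agent counts fully (R = 1), and with c = 1, beta = 0 the ratio collapses to 1/N, so the effective team size stays at one agent regardless of N.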

arXiv cs.AI TIER_1 English(EN) · Linwu Zhu, Liqiang Gao, Yan Chen, Dan Zhu, Jian Huang · 2026-06-03 04:00

LAP: An Agent-to-Instrument Protocol for Autonomous Science

arXiv:2606.03755v1 Announce Type: new Abstract: Autonomous science is moving from demonstration to infrastructure. Large language model agents now plan experiments, and self-driving laboratories execute them. Yet every such system rebuilds the link between the reasoning agent and…

arXiv cs.AI TIER_1 English(EN) · Louis Nisiotis, Aimilios Hadjiliasi · 2026-06-03 04:00

From Prompts to Services: An SLM-Based Agent Orchestration Gateway for AI-Driven Virtual Worlds

arXiv:2606.03557v1 Announce Type: new Abstract: As generative AI capabilities expand, AI-driven virtual worlds face a growing architectural challenge. Users interact through in-world interfaces in multimodal ways, yet their requests demand fundamentally different AI backend model…

arXiv cs.AI TIER_1 English(EN) · Linyue Pan, Yaoming Zhu, Lin Qiu, Xuezhi Cao, Xunliang Cai · 2026-06-03 04:00

SAGE: Quantitative Evaluation of Social Evolution in Agent Ecosystems

arXiv:2606.03544v1 Announce Type: new Abstract: Self-improving language agents are typically evaluated in isolation: an agent attempts a task, receives feedback, and iteratively refines its own behavior. Yet agents increasingly operate alongside peers whose strategies and outcome…

arXiv cs.AI TIER_1 English(EN) · Taiyu Zhu, Yifan Wu, Weilin Jin, Ying Li, Gang Huang · 2026-06-03 04:00

StepFinder: A Temporal-Semantic Framework for Failure Attribution in Multi-Agent Systems

arXiv:2606.03467v1 Announce Type: new Abstract: LLM-based multi-agent systems exhibit remarkable collaborative capabilities in complex multi-step tasks. However, these systems are highly sensitive to single-step execution errors that can propagate through agent interactions and l…

arXiv cs.AI TIER_1 English(EN) · Po-Nien Kung, Linfeng Song, Dawsen Hwang, Jinsung Yoon, Chun-Liang Li, Simone Severini, Mirek Olšák, Edward Lockhart, Quoc V Le, Burak Gokturk, Thang Luong, Tomas Pfister, Nanyun Peng · 2026-06-03 04:00

LEAP: Supercharging Large Language Models for Formal Mathematics with an Agentic Framework

arXiv:2606.03303v1 Announce Type: new Abstract: Large Language Models (LLMs) exhibit strong informal mathematical reasoning but struggle to generate mechanically verifiable proofs in formal languages like Lean. We present LEAP, an agentic framework that enables general-purpose fo…

arXiv cs.AI TIER_1 English(EN) · Mengyi Deng, Zhiwei Li, Xin Li, Tingyu Zhu, Ying Zhao, Zhijiang Guo, Wei Wang · 2026-06-03 04:00

Uncertainty-Aware Clarification and Information Gain in LLM Agents

arXiv:2606.03135v1 Announce Type: new Abstract: Large Language Model (LLM) agents often operate under underspecified user instructions, where latent uncertainty over user intent leads to erroneous tool actions. To address this challenge, we propose a goal-oriented clarification f…

arXiv cs.AI TIER_1 English(EN) · Victor Ojewale, Suresh Venkatasubramanian · 2026-06-03 04:00

What Benchmarks Cannot Measure: The Case for Evaluating Abstention in Autonomous Agents

arXiv:2606.02965v1 Announce Type: new Abstract: Benchmarks for autonomous agents measure whether agents complete tasks, yet this framing is systematically blind to whether an agent should have proceeded at all. Agents trained under human-feedback objectives develop a structural t…

arXiv cs.AI TIER_1 English(EN) · Chirag Parmar, Akshat Mehta, Henglin Wu, Jagadish Ramamurthy, Shweta Medhekar · 2026-06-03 04:00

When Help Backfires and How to Fix It: Multi-Agent Debate for Data Cleaning

arXiv:2606.02866v1 Announce Type: new Abstract: When does multi-agent debate help data cleaning, and when does it hurt? Across three benchmarks, four model families, and over 6,000 task-condition pairs, we find debate's effect reverses sign: it degrades generation across all four…

arXiv cs.AI TIER_1 English(EN) · Daocheng Fu, Jianbiao Mei, Rong Wu, Xuemeng Yang, Jia Xu, Ding Wang, Pinlong Cai, Yong Liu, Licheng Wen, Botian Shi · 2026-06-03 04:00

An Agent's First Day: Benchmarking Learning, Exploration, and Scheduling in Workplace Scenarios

arXiv:2601.08173v2 Announce Type: replace Abstract: The rapid evolution of Multi-modal Large Language Models (MLLMs) has advanced workflow automation; however, existing research mainly targets performance upper bounds in static environments, overlooking robustness for stochastic …

arXiv cs.AI TIER_1 English(EN) · Jiahao Huang, Peilan Xu, Xiaoya Nan, Wenjian Luo · 2026-06-03 04:00

Co-Evolving Agent Architectures and Interpretable Reasoning for Automated Optimization

arXiv:2604.17708v2 Announce Type: replace Abstract: Automating operations research (OR) with large language models (LLMs) remains limited by hand-crafted reasoning--execution workflows. Complex OR tasks require adaptive coordination among problem interpretation, mathematical form…

arXiv cs.AI TIER_1 English(EN) · Yuxiang Wei, Zhiqing Sun, Emily McMilin, Jonas Gehring, David Zhang, Gabriel Synnaeve, Daniel Fried, Lingming Zhang, Sida Wang · 2026-06-03 04:00

Training Superintelligent Software Agents Through Self-Play SWE-RL

arXiv:2512.18552v3 Announce Type: replace-cross Abstract: While current software agents powered by large language models (LLMs) and agentic reinforcement learning (RL) can boost programmer productivity, their training data (e.g., GitHub issues and pull requests) and environments …

arXiv cs.CL TIER_1 English(EN) · Zheng Liu, Longxiang Zhang, Xintong Wang, Zhiang Xu, Shaoxiong Zhan, Xin Shan, Wen Huang, Tao Dai, Shu-Tao Xia, Chengfu Huo, Liang Ding · 2026-06-03 04:00

ARBOR: Online Process Rewards for Search Agents via a Reusable Rubric Buffer

arXiv:2606.03239v1 Announce Type: new Abstract: LLM-based search agents are trained predominantly with outcome-only reward, leaving the search process itself unsupervised. This signal degenerates on outcome-homogeneous groups where all sampled trajectories share the same correctn…

arXiv cs.CL TIER_1 English(EN) · Zongwei Lv, Zhewen Tan, Yaoming Li, Yilun Yao, Yuxuan Tian, Lin Sun, Xiangzheng Zhang, Weihong Lin, Tong Yang, Guangxiang Zhao · 2026-06-03 04:00

RealClawBench: A Live OpenClaw Benchmark from Real Developer-Agent Sessions

arXiv:2606.03889v1 Announce Type: new Abstract: Agent benchmarks should reflect what users actually ask deployed agents to do, yet existing benchmarks often miss key realism properties of real developer-agent sessions. We introduce RealClawBench, a live benchmark framework built …

arXiv cs.CL TIER_1 English(EN) · Tao Chen, Gangwei Jiang, Pengyu Cheng, Siyuan Huang, Yihao Liu, Jingwei Ni, Jiaqi Guo, Mengyu Zhou, Kai Tang, Junling Liu, Qinliang Su, Xiaoxi Jiang, Guanjun Jiang · 2026-06-03 04:00

Skill-RM: Unifying Heterogeneous Evaluation Criteria via Agent Skills

arXiv:2606.03980v1 Announce Type: cross Abstract: Reward models (RMs) provide critical feedback signals for LLM post-training, notably in reinforced fine-tuning (RFT) and reinforcement learning (RL) pipelines. However, current reward evaluation relies on heterogeneous criteria su…

arXiv cs.LG TIER_1 English(EN) · Saptarshi Mitra, Yifan Zhang, Rachid Karami, Phyo Pyae Moe Aung, Nazmul Takbir, Sreetama Sarkar, Souvik Kundu, Sitao Huang · 2026-06-03 04:00

MOSAIC: Efficient Mixture-of-Agents Scheduling via Adaptive Aggregation and Inference Concurrency

arXiv:2606.03014v1 Announce Type: new Abstract: Mixture-of-Agents (MoA) systems improve reasoning accuracy by routing each query to multiple expert LLMs and aggregating their outputs. Efficiently executing this workload on limited GPU resources has bottlenecks. Skill-based routin…

arXiv cs.LG TIER_1 English(EN) · Sangeun Park, Minhae Kwon · 2026-06-03 04:00

Multi$^2$: Hierarchical Multi-Agent Decision Making with LLM-Based Agents in Interactive Environments

arXiv:2606.03698v1 Announce Type: new Abstract: A central goal of large language model (LLM) research is to build agentic systems that can plan, act, and adapt through sustained interaction with dynamic environments. While recent LLM-based agents exhibit impressive contextual rea…

arXiv cs.AI TIER_1 English(EN) · Dongwon Jung, Peng Shi, Muhao Chen, Yi Zhang · 2026-06-03 04:00

FutureWeaver: Planning Test-Time Compute for Multi-Agent Systems with Modular Collaboration

arXiv:2512.11213v2 Announce Type: replace Abstract: Scaling test-time computation has been shown to significantly improve large language model (LLM) performance without additional training. However, extending these techniques to multi-agent systems remains challenging: existing a…

arXiv cs.MA (Multiagent) TIER_1 English(EN) · Jiangbo Yu · 2026-06-03 00:25

Organizational Control Layer: Governance Infrastructure for the Execution Boundary of LLM Agent Systems

LLM-based agents are increasingly deployed in workflows where generated outputs may directly trigger state-changing actions. This creates an execution-boundary problem: proposed actions must be governed before they are executed. We study this problem through economically conseque…

Hugging Face Daily Papers TIER_1 English(EN) · 2026-06-03 00:00

What Should Agents Say? Action-State Communication for Efficient Multi-Agent Systems

Multi-agent systems using large language models suffer from inefficient token consumption in agent-to-agent communication, which PACT addresses by structuring messages as compact action-state records that improve performance-cost trade-offs across different system architectures.

Hugging Face Daily Papers TIER_1 English(EN) · 2026-06-03 00:00

Streaming Communication in Multi-Agent Reasoning

StreamMA enables efficient multi-agent reasoning by streaming intermediate results and leveraging reliable early steps to improve both latency and effectiveness across various reasoning tasks.

Hugging Face Daily Papers TIER_1 English(EN) · 2026-06-03 00:00

The Meta-Agent Challenge: Can Current Agents Achieve Autonomous Agent Development?

The Meta-Agent Challenge evaluates AI models' ability to autonomously develop agent systems through iterative programming within constrained environments, revealing significant gaps in current models' self-improvement capabilities.

arXiv cs.MA (Multiagent) TIER_1 English(EN) · Martin Hilbert · 2026-06-02 20:31

Exploring the Topology and Memory of Consensus: How LLM Agents Agree, Fragment, or Stabilize When Forming Conventions

How much should an LLM agent remember, and how should multi-agent systems be connected when trying to reach consensus? We show these two design choices interact in a way that flips the sign of memory's effect on coordination. Across 432 simulation runs of a networked Naming Game …

Hugging Face Daily Papers TIER_1 English(EN) · 2026-06-02 17:56

Skill-RM: Unifying Heterogeneous Evaluation Criteria via Agent Skills

Reward models (RMs) provide critical feedback signals for LLM post-training, notably in reinforced fine-tuning (RFT) and reinforcement learning (RL) pipelines. However, current reward evaluation relies on heterogeneous criteria such as rule-based verifiers, ground-truth reference…

arXiv cs.CL TIER_1 English(EN) · Guanjun Jiang · 2026-06-02 17:56

Skill-RM: Unifying Heterogeneous Evaluation Criteria via Agent Skills

Reward models (RMs) provide critical feedback signals for LLM post-training, notably in reinforced fine-tuning (RFT) and reinforcement learning (RL) pipelines. However, current reward evaluation relies on heterogeneous criteria such as rule-based verifiers, ground-truth reference…

arXiv cs.AI TIER_1 English(EN) · Yingqi Zhang · 2026-06-02 16:53

Agent libOS: A Library-OS-Inspired Runtime for Long-Running, Capability-Controlled LLM Agents

Large language model (LLM) agents are evolving from request-response assistants into long-running software actors: they maintain state across model calls, fork subtasks, wait for external events, request human authority, generate tools, and perform side effects that must be resum…

Hugging Face Daily Papers TIER_1 English(EN) · 2026-06-02 16:53

Agent libOS: A Library-OS-Inspired Runtime for Long-Running, Capability-Controlled LLM Agents

Large language model (LLM) agents are evolving from request-response assistants into long-running software actors: they maintain state across model calls, fork subtasks, wait for external events, request human authority, generate tools, and perform side effects that must be resum…

arXiv cs.CL TIER_1 English(EN) · Guangxiang Zhao · 2026-06-02 16:51

RealClawBench: A Live OpenClaw Benchmark from Real Developer-Agent Sessions

Agent benchmarks should reflect what users actually ask deployed agents to do, yet existing benchmarks often miss key realism properties of real developer-agent sessions. We introduce RealClawBench, a live benchmark framework built from real OpenClaw sessions to capture the distr…

arXiv cs.AI TIER_1 English(EN) · Jian Huang · 2026-06-02 15:03

LAP: An Agent-to-Instrument Protocol for Autonomous Science

Autonomous science is moving from demonstration to infrastructure. Large language model agents now plan experiments, and self-driving laboratories execute them. Yet every such system rebuilds the link between the reasoning agent and the physical instrument from scratch, against f…

arXiv cs.LG TIER_1 English(EN) · Minhae Kwon · 2026-06-02 14:20

Multi$^2$: Hierarchical Multi-Agent Decision-Making with LLM-Based Agents in Interactive Environments

A central goal of large language model (LLM) research is to build agentic systems that can plan, act, and adapt through sustained interaction with dynamic environments. While recent LLM-based agents exhibit impressive contextual reasoning, their long-horizon decision-making remai…

arXiv cs.IR (Information Retrieval) TIER_1 English(EN) · Xing Sun · 2026-06-02 12:30

Skills Are Not Documents: A Conditional-Query Benchmark and Two-Stage Retriever for LLM Agent Skill Routing

LLM agents complete complex tasks by composing multiple skills, and skill retrieval is a front-end stage for agents. Skill retrieval differs fundamentally from traditional document retrieval at the supervision level: top-K joint correctness depends not only on the semantic releva…

arXiv cs.CL TIER_1 English(EN) · Xunliang Cai · 2026-06-02 12:08

SAGE: Quantitative Evaluation of Social Evolution in Agent Ecosystems

Self-improving language agents are typically evaluated in isolation: an agent attempts a task, receives feedback, and iteratively refines its own behavior. Yet agents increasingly operate alongside peers whose strategies and outcomes are publicly visible. This raises an under-stu…

arXiv cs.MA (Multiagent) TIER_1 English(EN) · Farooq Shaikh · 2026-06-02 10:32

FORGE: Multi-Agent Progressive Exploitation and Detection Engineering

Vulnerability disclosure volumes now far exceed organizational assessment capacity, yet three adjacent research communities (proof-of-concept generation, vulnerability prioritization, and detection rule engineering) operate largely in isolation. Existing automated exploit generat…

arXiv cs.CL TIER_1 English(EN) · Liang Ding · 2026-06-02 06:58

ARBOR: Online Process Rewards for Search Agents via a Reusable Rubric Buffer

LLM-based search agents are trained predominantly with outcome-only reward, leaving the search process itself unsupervised. This signal degenerates on outcome-homogeneous groups where all sampled trajectories share the same correctness, yielding zero within-group advantage and no…

arXiv cs.AI TIER_1 English(EN) · Xianyou Li, Weiran Yan, Yichao Wu, Penghao Liang, Mengwei Yuan, Jianan Liu, Jing Yang · 2026-06-02 04:00

Early Diagnosis of Wasted Computation in Multi-Agent LLM Systems via Failure-Aware Observability

arXiv:2606.01365v1 Announce Type: new Abstract: Tool-using multi-agent large language model (LLM) systems spend computation through model tokens, tool calls, retries, and code execution before producing an answer. When a run fails, final-answer evaluation reveals the endpoint but…

arXiv cs.AI TIER_1 English(EN) · Ruiyin Li, Yiran Zhang, Xiyu Zhou, Yangxiao Cai, Peng Liang, Weisong Sun, Jifeng Xuan, Zhi Jin, Yang Liu · 2026-06-02 04:00

Bridging Requirements and Architecture: Multi-Agent Orchestration with External Knowledge and Hierarchical Memory

arXiv:2606.01385v1 Announce Type: cross Abstract: Software architecture design is a critical yet inherently complex and knowledge-intensive phase that requires balancing competing quality attributes and adapting to evolving requirements. Traditionally, this process has been time-…

arXiv cs.AI TIER_1 English(EN) · Nagarjuna Kanamarlapudi, Praveen K · 2026-06-02 04:00

LLM Consortia for Software Design Refinement: A Controlled Experiment on Multi-Agent Collaboration Topologies

arXiv:2606.01490v1 Announce Type: cross Abstract: We present a controlled experiment evaluating 12 multi-agent LLM collaboration topologies for software architecture design. Using a $2\times2\times2$ factorial design (Authority $\times$ Roles $\times$ Dynamics), we conducted 520 …

arXiv cs.AI TIER_1 English(EN) · Ankur Sharma, Deep Shah · 2026-06-02 04:00

Agent Operating Systems (AOS): Integrating Agent Control Planes into Traditional Operating Systems and Beyond

arXiv:2606.01508v1 Announce Type: cross Abstract: Traditional operating systems were designed around deterministic programs, explicit control flow, and human-initiated workflows. Their core abstractions (processes, threads, system calls, files, and permissions) assume bounded behav…

arXiv cs.AI TIER_1 English(EN) · Zewen Liu, Zhan Shi, Yisi Sang, Bing He, Minhua Lin, Tianxin Wei, Dakuo Wang, Benoit Dumoulin, Wei Jin, Hanqing Lu · 2026-06-02 04:00

Adaptive Auto-Harness: Continual Self-Improvement of Deployed Agentic Systems on Open-Ended Task Streams

arXiv:2606.01770v1 Announce Type: cross Abstract: Auto-harness systems such as A-Evolve, GEPA, and Meta-Harness improve LLM agents by optimizing prompts, skills, tools, memories, and supporting infrastructure from execution feedback, but they are typically evaluated on fixed offl…

arXiv cs.AI TIER_1 English(EN) · Mikael Gorsky · 2026-06-02 04:00

ASE-26: A Curriculum for Agentic Software Engineering as a Discipline

arXiv:2606.01152v1 Announce Type: cross Abstract: The work of a professional software engineer has begun to consist, increasingly, of directing agents rather than writing code, and the empirical evidence for the shift is now several years deep. Anthropic's Economic Index puts aut…

arXiv cs.AI TIER_1 English(EN) · Hiskias Dingeto, Will Leeney · 2026-06-02 04:00

AgentRedBench: Dynamic Red-Teaming and Integration-Aware Defense for LLM Agents with SaaS Integrations

arXiv:2606.02240v1 Announce Type: cross Abstract: Indirect prompt injection in tool-use agents is a concrete production threat: LLM agents read from integrations (third-party services such as Gmail, Salesforce, or Jira accessed through tool calls) whose response content the user …

arXiv cs.AI TIER_1 English(EN) · Thanh Luong Tuan · 2026-06-02 04:00

Dynamic Coordination Strategy Selection for Enterprise Multi-Agent Systems

arXiv:2606.00804v1 Announce Type: cross Abstract: Enterprise multi-agent systems increasingly expose multiple coordination patterns, but deployments often lack evidence for when to use consensus, debate, synthesis, or a simpler single-agent workflow. This paper evaluates whether …

arXiv cs.AI TIER_1 English(EN) · Jialing Li, Zhouhong Gu, Yin Cai, Hongwei Feng · 2026-06-02 04:00

A Study of the Scaling Behavior of Multi-Agent Systems Driven by a Single Large Language Model

arXiv:2606.00655v1 Announce Type: cross Abstract: The burgeoning field of LLM-based Multi-Agent Systems (MAS) promises to tackle complex tasks through collaborative intelligence, yet fundamental questions regarding their scaling behavior and intrinsic collective dynamics remain u…

arXiv cs.AI TIER_1 English(EN) · Su Wang, Pin Qian, Yihang Chen, Junxian You, Xiaoyuan Wang, Xiaochong Jiang, Lifei Liu, Haoran Yu, Jingzhou Xu · 2026-06-02 04:00

When Safe Skills Collide: Measuring Compositional Risk in Agent Skill Ecosystems

arXiv:2606.00448v1 Announce Type: cross Abstract: LLM agents increasingly rely on community-contributed skills that expand an agent's operational capability set. We study a core safety problem in agentic AI systems: whether individually safe skills can compose into unsafe install…

arXiv cs.AI TIER_1 English(EN) · Maria Katarine Santana Barbosa, Kelvin L. Dias · 2026-06-02 04:00

AgentxGCore: Agentic AI for the Next-Generation Mobile Core Network

arXiv:2606.00417v1 Announce Type: cross Abstract: To meet the stringent requirements of emerging applications and the increasingly complex network management and operation, the Next Generation Mobile Networks (NextG), or 6G, will adopt an AI-native architecture on the Core Networ…

arXiv cs.AI TIER_1 English(EN) · Marisa Ferrara Boston, Glen Hanson, Effi Georgala, JD Hudgens, Heather Frase · 2026-06-02 04:00

Monitoring Agentic Systems Before They Are Reliable

arXiv:2606.02494v1 Announce Type: cross Abstract: Agentic systems entering production typically operate as partially integrated assemblies where structural defects, not task-level errors, dominate the failure landscape. At this maturity level, task-level error detection may be in…

arXiv cs.AI TIER_1 English(EN) · Haoxiang Zhang, Qixin Xu, Zhuofeng Li, Lei Zhang, Pengcheng Jiang, Yu Zhang, Julian McAuley · 2026-06-02 04:00

Masking Stale Observations Helps Search Agents, Until It Doesn't: A State Map and Its Mechanism

arXiv:2606.00408v1 Announce Type: cross Abstract: Long-horizon search agents accumulate large amounts of retrieved content across many tool calls, making context-budget efficiency increasingly important. A minimal intervention is to mask stale observations from the context as the…

arXiv cs.AI TIER_1 English(EN) · Nazmus Ashrafi · 2026-06-02 04:00

How Generation Architecture Shapes Code Complexity in Multi-Agent LLM Systems: A Paired Study on HumanEval

arXiv:2606.00308v1 Announce Type: cross Abstract: Large-language-model code generation has shifted from single-shot prompting to multi-agent orchestrations (analyst, coder, tester, and debugger pipelines) and is evaluated almost exclusively on functional correctness. Whether th…

arXiv cs.AI TIER_1 English(EN) · Aditya Kumar, Zhihan Lei, Jerry Yan, Joshua W. Momo, Lauhitya Reddy, Rafael Enrique Cabrera Jimenez, Cassandra A. Cohen, Arthur Kajiyama, William W. Cohen · 2026-06-02 04:00

Learning to Build Practical Agentic Systems

arXiv:2606.00189v1 Announce Type: cross Abstract: Automated design and optimization of agentic LLM-based systems leads to sophisticated systems that substantially improve result quality over off-the-shelf agentic patterns. However, studies of fielded agentic systems show that pro…

arXiv cs.AI TIER_1 English(EN) · Tong Liu, Cheng Qian, Matej Cief, Yuan He, Daniele Dan, Nikolaos Aletras, Gabriella Kazai · 2026-06-02 04:00

On the Effectiveness and Efficiency of Agentic Tool-Calling and RL Training

arXiv:2606.00135v1 Announce Type: cross Abstract: Tool-calling is a central component of modern large language model (LLM) agents, equipping them with skills beyond their parametric knowledge. This paper studies tool-calling along two complementary axes: effectiveness, i.e., how …

arXiv cs.AI TIER_1 English(EN) · Yun Qu, Boyuan Wang, Yuhang Jiang, Jianzhun Shao, Yixiu Mao, Heming Zou, Chang Liu, Cheems Wang, Meiqin Liu, Xiangyang Ji · 2026-06-02 04:00

Stop Wandering, Find the Key: LLMs Discriminate Key States for Efficient Multi-Agent Exploration

arXiv:2410.02511v2 Announce Type: replace Abstract: With expansive state-action spaces, efficient multi-agent exploration remains a longstanding challenge in reinforcement learning. Although pursuing novelty, diversity, or uncertainty attracts increasing attention, redundant effo…

arXiv cs.CL TIER_1 English(EN) · James Xu Zhao, Hui Chen, Bryan Hooi, See-Kiong Ng · 2026-06-02 04:00

FineVerify: Scaling Test-Time Compute for Agentic Search via Fine-Grained Self-Verification

arXiv:2606.00660v1 Announce Type: new Abstract: Agentic search requires language model agents to explore many sources and answer complex information-seeking questions. Scaling test-time compute is a promising way to improve these agents, but current approaches can fail, because c…

arXiv cs.CL TIER_1 English(EN) · Xiqi Hao, Zengqing Wu, Yu-Xuan Qiu, Chuan Xiao, Ruiqi Xu, Shuyuan Zheng, Jianbin Qin · 2026-06-02 04:00

Not All "Flips" Are Convergence: Decomposing Stance Convergence in Multi-Agent LLM Debate

arXiv:2606.00820v1 Announce Type: new Abstract: Multi-agent debate (MAD) is a promising strategy for improving LLM reasoning, but when agents converge on a shared answer, it is unclear whether that convergence reflects genuine deliberation or social compliance. We show that the c…

arXiv cs.CL TIER_1 English(EN) · Mingju Chen, Can Lv, Guibin Zhang, Heng Chang, Shiji Zhou · 2026-06-02 04:00

HarnessForge: Joint Harness and Policy Evolution for Adaptive Agent Systems

arXiv:2606.01779v1 Announce Type: new Abstract: LLM agents are increasingly expected to operate across heterogeneous task regimes that require distinct execution paradigms. This challenges fixed agent systems and motivates system-level meta-adaptation beyond isolated component up…

arXiv cs.CL TIER_1 English(EN) · Danqing Wang, Akshay Sivaraman, Lei Li · 2026-06-02 04:00

CRAB-Bench: Evaluating LLM Agents under Complex Task Dependencies and Human-Aligned User Simulation

arXiv:2606.01815v1 Announce Type: new Abstract: Evaluating LLM agents in realistic service scenarios requires complex task dependencies, imperfect user behavior, and an evaluation that accommodates multiple valid solutions. We introduce CRAB-Bench (Constraint-based Realistic Agen…

arXiv cs.CL TIER_1 English(EN) · Yujiong Shen, Yajie Yang, Zhiheng Xi, Binze Hu, Huayu Sha, Jiazheng Zhang, Qiyuan Peng, Junlin Shang, Jixuan Huang, Yutao Fan, Jingqi Tong, Shihan Dou, Ming Zhang, Lei Bai, Zhenfei Yin, Tao Gui, Xingjun Ma, Qi Zhang, Xuanjing Huang, Yu-Gang Jiang · 2026-06-02 04:00

SciAgentGym: Benchmarking Multi-Step Scientific Tool Use in LLM Agents

arXiv:2602.12984v2 Announce Type: replace Abstract: Scientific reasoning inherently demands integrating sophisticated toolkits to navigate domain-specific knowledge. Yet, current benchmarks largely overlook agents' ability to orchestrate tools for such rigorous workflows. To brid…

arXiv cs.CL TIER_1 English(EN) · Yifan Shi, Jiayi Wang, Minyi Wu, Ye Fan, Jialong Shi, Jianyong Sun · 2026-06-02 04:00

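MIRROR: A Multi-Agent Framework with Iterative Adaptive Revision and Hierarchical Retrieval for Optimization Modeling in Operations Research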

arXiv:2602.03318v3 Announce Type: replace Abstract: Operations Research (OR) relies on expert-driven modeling, a slow and fragile process ill-suited to novel scenarios. While large language models (LLMs) can automatically translate natural language into optimization models, existi…

arXiv cs.CL TIER_1 Română(RO) · Jing Yu Koh, Ruslan Salakhutdinov, Daniel Fried · 2026-06-02 04:00

Multi-Agent Computer Use

arXiv:2606.01533v1 Announce Type: cross Abstract: Computer use agents (CUAs) today are primarily deployed as single serial agents. This setup is suboptimal for complex long-horizon tasks that benefit from task decomposition, parallel execution, and consistent re-planning based on…

arXiv cs.AI TIER_1 English(EN) · Youngmin Im, Byeongung Jo, Jaeyoung Wi, Seungwoo Baek, Tae Hoon Min, Joo Hyung Lee, Sangeun Oh, Insik Shin, Sunjae Lee · 2026-06-02 04:00

MobiBench: A Multi-Branch, Modular Benchmark for Mobile GUI Agents

arXiv:2512.12634v4 Announce Type: replace Abstract: Mobile GUI Agents, AI agents capable of interacting with mobile applications on behalf of users, have the potential to transform human-computer interaction. However, current evaluation practices for GUI agents face two fundament…

arXiv cs.CL TIER_1 English(EN) · Zixuan Zhu, Yitong Hu, Yong Dai, Junfeng Fang, Chunyang Jiang, Senkang Hu, Yuzhi Zhao · 2026-06-02 04:00

Unified Context Evolution for LLM Agents

arXiv:2606.02304v1 Announce Type: new Abstract: LLM-based agents can solve multi-step interactive tasks by combining reasoning with environment feedback, yet each episode starts from the same fixed context and any useful strategy discovered along the way is lost once the task end…

arXiv cs.CL TIER_1 English(EN) · Wenhang Shi, Jinhao Dong, Yiren Chen, Zhe Zhao, Shuqing Bian, Wei Lu, Xiaoyong Du · 2026-06-02 04:00

Scaling Agentic Capabilities via Grounded Interaction Synthesis

arXiv:2606.02001v1 Announce Type: new Abstract: General agentic intelligence hinges on the ability to interact with diverse real-world tools to complete complex tasks, a capability fundamentally tied to the quality of interaction data. To bypass the prohibitive costs of human ann…

arXiv cs.AI TIER_1 English(EN) · Yifan Bao, Xinyu Xi, Xinyu Liu, Wen Ge, Lei Jiang, Kevin Zhang, Raad Khraishi, Yihao Ang, Anthony K. H. Tung, Lukasz Szpruch, Hao Ni · 2026-06-02 04:00

MOSAIC: Modular Orchestration for Structured Agentic Intelligence and Composition

arXiv:2606.00708v1 Announce Type: new Abstract: Automated data science is a structured model-selection problem. A solution must choose data transformations, feature representations, architecture, training procedure, evaluation protocol, and refinement strategy for a task. AutoML …

arXiv cs.AI TIER_1 English(EN) · Md Nakhla Rafi, Md Ahasanuzzaman, Dong Jae Kim, Zhijie Wang, Tse-Hsun Chen · 2026-06-02 04:00

FALAT: Tracing Failures in LLM Agent Trajectories via Dependency-Guided Search

arXiv:2606.00765v1 Announce Type: new Abstract: LLM-based agents increasingly solve complex tasks through long trajectories involving reasoning steps, tool calls, and inter-agent communication. However, when these agents fail, it is often unclear which agent caused the failure an…

arXiv cs.AI TIER_1 English(EN) · Jonah Leshin, Manish Shah, Ian Timmis · 2026-06-02 04:00

Tracking the Behavioral Trajectories of Adaptive Agents

arXiv:2606.02536v1 Announce Type: new Abstract: Text files such as skill files, memory files, and behavioral configuration files play a central role in defining how modern agents act. Through edits by humans or the agents themselves, these files may evolve over time, directly ste…

arXiv cs.AI TIER_1 English(EN) · Leheng Chen, Zihao Liu, Wanyi He, Bin Dong · 2026-06-02 04:00

Iteris: An Agentic Research Loop for Computational Mathematics

arXiv:2606.02484v1 Announce Type: new Abstract: Recent advances in large language models and agentic AI systems have enabled significant progress in mathematical discovery, from solving competition problems to tackling research-level conjectures. However, open problems in computa…

arXiv cs.AI TIER_1 English(EN) · Wenhao Wang, Peizhi Niu, Gongyi Zou, Xiyuan Yang, Jingxing Wang, Haoting Shi, Yaxin Du, Jingyi Chai, Xianghe Pang, Shuo Tang, Yanfeng Wang, Siheng Chen · 2026-06-02 04:00

MCP-Persona: Benchmarking LLM Agents on Real Personal Applications via Environment Simulation

arXiv:2606.02470v1 Announce Type: new Abstract: The Model Context Protocol (MCP) has emerged as a transformative standard for connecting large language models (LLMs) with external data sources and tools, and has been rapidly adopted across personal applications and development pl…

arXiv cs.AI TIER_1 English(EN) · Youwei Liu, Jian Wang, Hanlin Wang, Wenjie Li · 2026-06-02 04:00

COMAP: Co-Evolution of World Models and Agent Policies for LLM Agents

arXiv:2606.02372v1 Announce Type: new Abstract: Equipping language agents with world models enables them to anticipate environment dynamics and evaluate candidate actions before execution. However, existing textual world models are typically fixed after training, preventing them …

arXiv cs.AI TIER_1 English(EN) · Yao Guan, Lin Wang, Zhihu Lu, Ziyi Wang, Wenzhu Yan, Qiang Duan · 2026-06-02 04:00

MOC: Multi-Order Communication in LLM-Based Multi-Agent Systems

arXiv:2606.02359v1 Announce Type: new Abstract: Despite the remarkable progress of Large Language Model (LLM) based Multi-Agent Systems, most research focuses on optimizing coordination topology while largely underexploring the equally critical problem: how to transmit and optimi…

arXiv cs.AI TIER_1 English(EN) · Iñaki Dellibarda Varela, R. Sendra-Arranz, Pablo Romero-Sorozabal, J. M. Valverde-García, Annemarie F. Laudanski, Álvaro Gutiérrez, Eduardo Rocon, Manuel Cebrian · 2026-06-02 04:00

POIROT: Agent Interrogation for Failure Detection in Multi-Agent Systems

arXiv:2606.02282v1 Announce Type: new Abstract: Orchestrating Large Language Models into Multi-Agent Systems (LLM-MAS) has unlocked remarkable reasoning capabilities, yet emergent failures and hallucinations that resist characterisation block their deployment in safety-critical d…

arXiv cs.AI TIER_1 English(EN) · Lichao Wang, Zhaoxing Ren, Tianzhuo Yang, Jiaming Ji, Chi Harold Liu, Yaodong Yang, Juntao Dai · 2026-06-02 04:00

SafeMCP: Proactive Power Regulation for LLM Agent Defense via Environment-Grounded Foresight Reasoning

arXiv:2606.01991v1 Announce Type: new Abstract: As Large Language Model (LLM) agents increasingly leverage the Model Context Protocol (MCP) to operate in complex environments, the expansion of their action spaces offers agents unsafe capabilities and underscores the risk of power…

arXiv cs.AI TIER_1 English(EN) · Kuan Li, Shuo Zhang, Huacan Wang, Fangzhou Yu, Zecheng Sheng, Yi Gu, Weipeng Ming, Lei Xue, Chen Liu, Sen Hu, Ronghao Chen, Siyue Lin, Yuqing Hou, Xiaofeng Mou, Yi Xu · 2026-06-02 04:00

SMH-Bench: Benchmarking LLM Agents for Environment-Aware Reasoning and Action in Smart Homes

arXiv:2606.01912v1 Announce Type: new Abstract: Smart homes are evolving toward complex state-dependent living environments, requiring Large Language Models (LLMs) to reason over user intent, preferences, and multi-device interactions. However, existing smart-home benchmarks ofte…

arXiv cs.AI TIER_1 English(EN) · Bin Chen, Xinye Liao, Yiming Liu, Xin Liao, Chonghan Liu · 2026-06-02 04:00

CAPF: Guiding Search-Agent Rollouts via Credit-Decayed Privileged Feedback

arXiv:2606.01830v1 Announce Type: new Abstract: Recent LLM search agents use reinforcement learning with verifiable rewards (RLVR) to learn search-augmented reasoning from outcome rewards. On hard problems, these agents rarely sample end-to-end successful rollouts, leaving outcom…

arXiv cs.AI TIER_1 English(EN) · Donghwan Kim, Prakhar Singh, Younghoon Min, Jongryool Kim, Jongse Park, Kiwan Maeng · 2026-06-02 04:00

Characterizing Multi-Model Agentic AI Systems on General Tasks via Trace-Driven Simulation

arXiv:2606.01725v1 Announce Type: new Abstract: Agentic AI completes tasks through iterative planning, tool use, and reasoning based on observed outcomes. Despite its popularity, its system-level behavior remains poorly understood, particularly for complex datasets and agent arch…

arXiv cs.AI TIER_1 English(EN) · Nicole Rose Schneider, Davide Ghilardi, Giacomo Piccinini, Jacopo Tagliabue · 2026-06-02 04:00

"Technical Issues": Data-Centric Optimization for Lakehouse Agents

arXiv:2606.01185v1 Announce Type: new Abstract: Coding agents are becoming users of data infrastructure, but their success depends not only on model quality: it also depends on the skills and environment files that teach agents how to use a system. We study how to optimize these …

arXiv cs.AI TIER_1 English(EN) · Xuancheng Zhu, Yang Yue, Shuaibing Wan, Zihan Dou, Xiaohan Zhang, Yongrui Liu, Guoshun Nan · 2026-06-02 04:00

Can LLM Agents Sustain Long-Term Organizational Dynamics?

arXiv:2606.01199v1 Announce Type: new Abstract: Large language agents are increasingly used for social simulation, yet it remains unclear whether they can sustain coherent behavior in structured organizations, where goals must propagate through hierarchy, tasks depend on prior ex…

arXiv cs.AI TIER_1 English(EN) · Yangbo Wei, Zhen Huang, Shaoqiang Lu, Junhong Qian, Qifan Wang, Chen Wu, Lei He · 2026-06-02 04:00

SkillSmith: Co-Evolving Skills and Tools for Self-Improving Agent Systems

arXiv:2606.01314v1 Announce Type: new Abstract: Recent self-evolving agents have shown that skills can be discovered, refined, and accumulated through execution. However, existing skill-evolution frameworks typically assume a fixed tool layer and evaluate each skill independently…

arXiv cs.AI TIER_1 English(EN) · Junze Zhu, Weihao Chen, Xuanwang Zhang, Zhen Wu, Xinyu Dai · 2026-06-02 04:00

Identify Your Orchestrator: An Entropy-Dynamics Perspective on LLM Multi-Agent Systems

arXiv:2606.01351v1 Announce Type: new Abstract: The transition from single-turn models to Multi-Agent Systems (MAS) promises enhanced problem-solving capabilities, yet the centralized orchestration topology remains a critical point of fragility. To analyze this, we propose a Mean…

arXiv cs.AI TIER_1 English(EN) · Hejia Zhang, Zhongming Yu, Chia-Tung Ho, Haoxing Ren, Brucek Khailany, Jishen Zhao · 2026-06-02 04:00

LLM4Cov: Execution-Aware Agentic Learning for High-Coverage Testbench Generation

arXiv:2602.16953v3 Announce Type: replace Abstract: Execution-aware LLM agents offer a promising paradigm for learning from tool feedback, but such feedback can be expensive and slow to obtain, making online reinforcement learning (RL) less practical in certain scenarios. High-co…

arXiv cs.AI TIER_1 English(EN) · Shengda Fan, Xuyan Ye, Yupeng Huo, Zhi-Yuan Chen, Yiju Guo, Shenzhi Yang, Wenkai Yang, Shuqi Ye, Jingwen Chen, Haotian Chen, Xin Cong, Yankai Lin · 2026-06-02 04:00

AgentProcessBench: Diagnosing Step-Level Process Quality in Tool-Using Agents

arXiv:2603.14465v2 Announce Type: replace Abstract: While Large Language Models (LLMs) have evolved into tool-using agents, they remain brittle in long-horizon interactions. Unlike mathematical reasoning where errors are often rectifiable via backtracking, tool-use failures frequ…

arXiv cs.AI TIER_1 English(EN) · Huayi Lai, Shichao Song, Simin Niu, Hanyu Wang, Jiawei Yang, Zhouxing Wang, Zhiqiang Yin, Xun Liang · 2026-06-02 04:00

RoleCDE: Benchmarking and Mitigating Role-Alignment Trade-offs in Role-Playing Agents

arXiv:2606.01552v1 Announce Type: new Abstract: Role-playing agents (RPAs) are widely used to steer large language models (LLMs) toward role-consistent behavior, yet existing benchmarks mainly evaluate surface-level fidelity and offer limited insight into decision making under role…

arXiv cs.AI TIER_1 English(EN) · Rahul Suresh Babu, Adarsh Agrawal · 2026-06-02 04:00

A Self-Healing Agentic Orchestrator for Reliable Tool-Augmented Large Language Model Systems

arXiv:2606.01416v1 Announce Type: new Abstract: Tool-augmented large language model (LLM) agents rely on orchestration layers that coordinate planning, retrieval, tool invocation, validation, memory, and recovery. In these systems, failures arise not only from model errors, but a…

arXiv cs.AI TIER_1 English(EN) · Wenchang Duan, Zhenguo Gao, Jinguo Xian, Yi Shi · 2026-06-02 04:00

MAVEN-T: Enhanced Heterogeneous Distillation for Real-Time Multi-Agent Trajectory Prediction

arXiv:2604.10169v2 Announce Type: replace Abstract: Trajectory prediction is a key component of autonomous driving systems because future motions directly affect collision checking, behavior planning, and control. The task remains challenging under dense interactions, heterogeneo…

arXiv cs.AI TIER_1 English(EN) · Jiaru Zou, Ruizhong Qiu, Gaotang Li, Xiyuan Yang, Katherine Tieu, Pan Lu, Ke Shen, Hanghang Tong, Yejin Choi, Jingrui He, James Zou, Mengdi Wang, Ling Yang · 2026-06-02 04:00

Latent Collaboration in Multi-Agent Systems

arXiv:2511.20639v3 Announce Type: replace-cross Abstract: Multi-agent systems (MAS) extend large language models (LLMs) from independent single-model reasoning to coordinative system-level intelligence. While existing LLM agents depend on text-based mediation for reasoning and co…

arXiv cs.AI TIER_1 English(EN) · Yuxuan Qiao, Dongqin Liu, Hongchang Yang, Wei Zhou, Songlin Hu · 2026-06-02 04:00

Agent Tools Orchestration Leaks More: Dataset, Benchmark, and Mitigation

arXiv:2512.16310v3 Announce Type: replace-cross Abstract: LLM-based agents increasingly use multiple external tools to complete complex tasks. We study Tools Orchestration Privacy Risk (TOP-R): an agent may combine individually non-sensitive tool returns and disclose an unintende…

arXiv cs.AI TIER_1 English(EN) · Xiaochen Zhu, Caiqi Zhang, Yizhou Chi, Tom Stafford, Nigel Collier, Andreas Vlachos · 2026-06-02 04:00

Demystifying Multi-Agent Debate: The Role of Confidence and Diversity

arXiv:2601.19921v2 Announce Type: replace-cross Abstract: Multi-agent debate (MAD) is widely used to improve large language model (LLM) performance through test-time scaling, yet recent work shows that vanilla MAD often underperforms simple majority vote despite higher computatio…

arXiv cs.AI TIER_1 English(EN) · Bardia Mohammadi, Nearchos Potamitis, Lars Klein, Akhil Arora, Laurent Bindschaedler · 2026-06-02 04:00

Atomix: Timely, Transactional Tool Use for Reliable Agentic Workflows

arXiv:2602.14849v2 Announce Type: replace-cross Abstract: LLM agents execute multi-step workflows that mutate external state through tools. Common orchestrators treat tool return as the settlement trigger, so faults, speculation, and concurrent agents can leave partial effects, l…

arXiv cs.AI TIER_1 English(EN) · Simon Storf, Rich Barton-Cooper, James Peters-Gill, Marius Hobbhahn · 2026-06-02 04:00

Constitutional Black-Box Monitoring for Scheming in LLM Agents

arXiv:2603.00829v2 Announce Type: replace-cross Abstract: Safe deployment of Large Language Model (LLM) agents in autonomous settings requires reliable oversight mechanisms. A central challenge is detecting scheming, where agents covertly pursue misaligned goals. One approach to …

arXiv cs.MA (Multiagent) TIER_1 English(EN) · Dheeraj Kumar · 2026-06-02 03:59

SPOQ: Specialist Orchestrated Queuing for Multi-Agent Software Engineering

Multi-agent AI systems show promise for automating software engineering tasks, yet existing approaches suffer from coordination overhead, quality control gaps, and limited human oversight. We introduce SPOQ (Specialist Orchestrated Queuing), a methodology combining three innovati…

Hugging Face Daily Papers TIER_1 English(EN) · 2026-06-02 00:00

Lean4Agent: Formal Modeling and Verification for Agentic Workflows and Trajectories

Large language models can be equipped with formal verification frameworks using dependent-type languages to improve multi-step workflow reliability and performance.

Hugging Face Daily Papers TIER_1 English(EN) · 2026-06-02 00:00

Agent libOS: A Library-OS-Inspired Runtime for Long-Running, Capability-Controlled LLM Agents

Agent libOS provides a runtime substrate for long-running LLM agents with process-like execution, tool management, and security boundaries implemented through explicit capabilities and runtime primitives.

Hugging Face Daily Papers TIER_1 English(EN) · 2026-06-02 00:00

Skill-RM: Unifying Heterogeneous Evaluation Criteria via Agent Skills

Skill-RM presents a unified reward modeling framework that treats reward computation as a structured agentic task, enabling dynamic evidence aggregation and consistent evaluation across diverse applications.

arXiv cs.MA (Multiagent) TIER_1 English(EN) · Shweta Medhekar · 2026-06-01 20:29

When Help Backfires and How to Fix It: Multi-Agent Debate for Data Cleaning

When does multi-agent debate help data cleaning, and when does it hurt? Across three benchmarks, four model families, and over 6,000 task-condition pairs, we find debate's effect reverses sign: it degrades generation across all four models (-1.6 to -15.5pp) through critique-induc…

arXiv cs.MA (Multiagent) TIER_1 English(EN) · Yilun Du · 2026-06-01 20:21

Economic Minds: Emergent Multi-Agent Intelligence through Economic Interaction

How can a population of agents self-orchestrate and self-adapt into stronger collective intelligence without centralized control? Inspired by Friedrich Hayek's economic theory of decentralized coordination in markets, we study this question through an agent economy in which agent…

Hugging Face Daily Papers TIER_1 English(EN) · 2026-06-01 17:40

Tracking the Behavioral Trajectories of Adaptive Agents

Text files such as skill files, memory files, and behavioral configuration files play a central role in defining how modern agents act. Through edits by humans or the agents themselves, these files may evolve over time, directly steering the agent's behavior in future interaction…

arXiv cs.AI TIER_1 English(EN) · Ian Timmis · 2026-06-01 17:40

Tracking the Behavioral Trajectories of Adaptive Agents

Text files such as skill files, memory files, and behavioral configuration files play a central role in defining how modern agents act. Through edits by humans or the agents themselves, these files may evolve over time, directly steering the agent's behavior in future interaction…

Hugging Face Daily Papers TIER_1 English(EN) · 2026-06-01 17:01

Monitoring Agentic Systems Before They Are Reliable

Agentic systems entering production typically operate as partially integrated assemblies where structural defects, not task-level errors, dominate the failure landscape. At this maturity level, task-level error detection may be infeasible: structural failure modes mask the signal…

arXiv cs.AI TIER_1 English(EN) · Heather Frase · 2026-06-01 17:01

Monitoring Agentic Systems Before They Are Reliable

Agentic systems entering production typically operate as partially integrated assemblies where structural defects, not task-level errors, dominate the failure landscape. At this maturity level, task-level error detection may be infeasible: structural failure modes mask the signal…

arXiv cs.AI TIER_1 English(EN) · Bin Dong · 2026-06-01 16:54

Iteris: An Agentic Research Loop for Computational Mathematics

Recent advances in large language models and agentic AI systems have enabled significant progress in mathematical discovery, from solving competition problems to tackling research-level conjectures. However, open problems in computational mathematics have received comparatively l…

arXiv cs.AI TIER_1 English(EN) · Siheng Chen · 2026-06-01 16:44

MCP-Persona: Benchmarking LLM Agents on Real-World Personal Applications via Environment Simulation

The Model Context Protocol (MCP) has emerged as a transformative standard for connecting large language models (LLMs) with external data sources and tools, and has been rapidly adopted across personal applications and development platforms. However, existing benchmarks predominan…

arXiv cs.AI TIER_1 English(EN) · Wenjie Li · 2026-06-01 15:21

COMAP: Co-Evolving World Models and Policies for LLM Agents

Equipping language agents with world models enables them to anticipate environment dynamics and evaluate candidate actions before execution. However, existing textual world models are typically fixed after training, preventing them from adapting to the on-policy state-action dist…

arXiv cs.AI TIER_1 English(EN) · Qiang Duan · 2026-06-01 15:06

MOC: Multi-Order Communication in LLM-Based Multi-Agent Systems

Despite the remarkable progress of Large Language Model (LLM) based Multi-Agent Systems, most research focuses on optimizing coordination topology while largely underexploring the equally critical problem: how to transmit and optimize messages among agents effectively? Current co…

arXiv cs.CL TIER_1 English(EN) · Yuzhi Zhao · 2026-06-01 14:25

Unified Context Evolution for Enhancing LLM Agents

LLM-based agents can solve multi-step interactive tasks by combining reasoning with environment feedback, yet each episode starts from the same fixed context and any useful strategy discovered along the way is lost once the task ends. Existing approaches either limit learning to …

arXiv cs.AI TIER_1 English(EN) · Manuel Cebrian · 2026-06-01 14:05

POIROT: Agent Interrogation for Failure Detection in Multi-Agent Systems

Orchestrating Large Language Models into Multi-Agent Systems (LLM-MAS) has unlocked remarkable reasoning capabilities, yet emergent failures and hallucinations that resist characterisation block their deployment in safety-critical domains -- a gap made legally untenable by emergi…

arXiv cs.AI TIER_1 English(EN) · Will Leeney · 2026-06-01 13:34

AgentRedBench: Dynamic Red-Teaming and Integration-Aware Defense for SaaS-Integrated LLM Agents

Indirect prompt injection in tool-use agents is a concrete production threat: LLM agents read from integrations (third-party services such as Gmail, Salesforce, or Jira accessed through tool calls) whose response content the user neither writes nor controls. Existing benchmarks u…

arXiv cs.CL TIER_1 English(EN) · Xiaoyong Du · 2026-06-01 09:57

Scaling Agentic Capabilities via Grounded Interaction Synthesis

General agentic intelligence hinges on the ability to interact with diverse real-world tools to complete complex tasks, a capability fundamentally tied to the quality of interaction data. To bypass the prohibitive costs of human annotation, prevailing paradigms depend entirely on…

arXiv cs.CL TIER_1 English(EN) · Juntao Dai · 2026-06-01 09:48

SafeMCP: Proactive Power Regulation for LLM Agent Defense via Environment-Grounded Lookahead Reasoning

As Large Language Model (LLM) agents increasingly leverage the Model Context Protocol (MCP) to operate in complex environments, the expansion of their action spaces offers agents unsafe capabilities and underscores the risk of power-seeking. While broad action space and greater e…

Hugging Face Daily Papers TIER_1 English(EN) · 2026-06-01 07:00

HarnessForge: Joint Harness and Policy Evolution for Adaptive Agent Systems

LLM agents face challenges in heterogeneous task regimes requiring distinct execution paradigms, prompting the need for system-level meta-adaptation that goes beyond component updates.

Hugging Face Daily Papers TIER_1 English(EN) · 2026-06-01 06:51

Adaptive Auto-Harness: Continual Self-Improvement for Agent Systems Deployed on Open-Ended Task Streams

Adaptive Auto-Harness framework addresses dynamic task streams by decomposing performance gaps into evolution and adaptation losses, utilizing a stateful multi-agent evolver and harness tree with solve-time routing for sustained performance improvement.

arXiv cs.AI TIER_1 English(EN) · Kewei Xu, Xiaoben Lu, Shuofei Qiao, Zihan Ding, Haoming Xu, Lei Liang, Ningyu Zhang · 2026-06-01 04:00

LongDS-Bench: On the Failures of Long-Horizon Agentic Data Analysis

arXiv:2605.30434v1 Announce Type: cross Abstract: Real-world data analysis is inherently iterative, yet existing benchmarks mostly evaluate isolated or short interactive tasks, leaving agents' ability to track evolving analytical context over long horizons untested. We introduce …

arXiv cs.AI TIER_1 English(EN) · Minhua Lin, Juncheng Wu, Zijun Wang, Zhan Shi, Yisi Sang, Bing He, Zewen Liu, Tianxin Wei, Zongyu Wu, Zhiwei Zhang, Dakuo Wang, Xiang Zhang, Benoit Dumoulin, Cihang Xie, Yuyin Zhou, Suhang Wang, Hanqing Lu · 2026-06-01 04:00

Upgraded but Not Benefiting: Disentangling Evolution Capabilities in Self-Evolving LLM Agents

arXiv:2605.30621v1 Announce Type: new Abstract: LLM agents are increasingly deployed as systems built around editable external harnesses, including prompts, skills, memories and tools, that shape task execution without changing model parameters. Harness self-evolution adapts such…

arXiv cs.CL TIER_1 English(EN) · Jianxiang Yu, Jiapeng Zhu, Bochen Lin, Qier Cui, Zichen Ding, Xiang Li · 2026-06-01 04:00

Skills Are Not One-Size-Fits-All: Model-Aware Skill Alignment for LLM Agents

arXiv:2605.30723v1 Announce Type: new Abstract: LLM agents increasingly retrieve externally curated skills (procedural instructions retrieved at decision time) to improve performance on long-horizon interactive tasks. Existing skill libraries are typically treated as model-agnostic…

arXiv cs.AI TIER_1 English(EN) · Qiran Zou, Hou Hei Lam, Wenhao Zhao, Tingting Chen, Yiming Tang, Samson Yu, Yingtao Zhu, Srinivas Anumasa, Zufeng Zhang, Tianyi Zhang, Chang Liu, Zhengyao Jiang, Anirudh Goyal, Dianbo Liu · 2026-06-01 04:00

FML-bench: A Controlled Study of AI Research Agent Strategies Through the Lens of Search Dynamics

arXiv:2605.17373v2 Announce Type: replace-cross Abstract: AI research agents accelerate ML research by automating hypothesis generation, experimentation, and empirical refinement. Existing agent strategies range from greedy hill-climbing to tree search and evolutionary optimizati…

arXiv cs.AI TIER_1 English(EN) · Hao Xiang Li, Michael Amir, Amanda Prorok · 2026-06-01 04:00

Scaling Multi-Agent Environment Co-Design with Diffusion Models

arXiv:2511.03100v2 Announce Type: replace-cross Abstract: The agent-environment co-design paradigm jointly optimises agent policies and environment configurations in search of improved system performance. With application domains ranging from warehouse logistics to windfarm manag…

arXiv cs.AI TIER_1 English(EN) · Xiaolin Zhou, Jinbo Liu, Li Li, Ryan A. Rossi, Xiyang Hu · 2026-06-01 04:00

Counterfactual Trace Auditing of LLM Agent Skills

arXiv:2605.11946v2 Announce Type: replace Abstract: Large Language Model agents are increasingly augmented with agent skills. Current evaluation methods for skills remain limited. Most deployed benchmarks report only pass rate before and after a skill is attached, treating the sk…

arXiv cs.AI TIER_1 English(EN) · Abhishek Chandwani, Ishan Gupta · 2026-06-01 04:00

LH-Bench: Skill-Grounded Evaluation of Long-Horizon Agents on Subjective Enterprise Tasks

arXiv:2603.22744v2 Announce Type: replace Abstract: Large language models excel on objectively verifiable tasks such as math and programming, where evaluation reduces to unit tests or a single correct answer. In contrast, real-world enterprise work is often subjective and context…

arXiv cs.AI TIER_1 English(EN) · Srivatsa Kundurthy, Clara Na, Colton Moraine, Anoushka Mohta, Case Winter, George Fang, John Ling, Emma Strubell, Zach Kirshner · 2026-06-01 04:00

BlueFin: Benchmarking LLM Agents on Financial Spreadsheets

arXiv:2605.30907v1 Announce Type: cross Abstract: We present BlueFin, a benchmark that tasks large language model (LLM) agents with synthesis, manipulation, and comprehension tasks over spreadsheet workbooks in the professional finance domain. Though estimates of the global popul…

arXiv cs.AI TIER_1 English(EN) · Omkar Ghugarkar, Vishvesh Bhat, Muhammad Ahmed Mohsin, Asad Aali · 2026-06-01 04:00

MAVEN: Improving Generalization in Agentic Tool Calling

arXiv:2605.30738v1 Announce Type: new Abstract: Generalization across agentic tool-calling environments remains a central challenge for reliable agentic reasoning systems. Although large language models achieve strong results on individual benchmarks, their ability to compose rea…

arXiv cs.AI TIER_1 English(EN) · Junjie Nian, Kang Chen, Ge Zhang, Yixin Cao, Yugang Jiang · 2026-06-01 04:00

TraceGraph: Shared Decision Landscapes for Diagnosing and Improving Agent Trajectories

arXiv:2605.31308v1 Announce Type: new Abstract: Agent benchmarks increasingly record rich interaction trajectories, yet evaluation often reduces each rollout to a pass rate or reward score. We introduce TraceGraph, a graph-based framework that turns released multi-model agent tra…

arXiv cs.AI TIER_1 English(EN) · Yunpeng Zhou · 2026-06-01 04:00

Diagnosing Failure Modes of Shared-State Collaboration in Resource-Constrained Visual Agents

arXiv:2605.31354v1 Announce Type: new Abstract: Modular visual reasoning systems increasingly rely on shared working memory for multi-step collaboration, yet the failure dynamics of intermediate state evolution in low-capacity regimes remain underexplored. We study failure modes …

arXiv cs.AI TIER_1 English(EN) · Yujie Luo, Xiangyuan Ru, Jingsheng Zheng, Jingjing Wang, Yuqi Zhu, Jintian Zhang, Runnan Fang, Kewei Xu, Ye Liu, Zheng Wei, Jiang Bian, Zang Li, Shumin Deng · 2026-06-01 04:00

Exploring Autonomous Agentic Data Engineering for Model Specialization

arXiv:2605.30407v1 Announce Type: cross Abstract: Large Language Models (LLMs) have demonstrated strong performance on general tasks, while often struggling to adapt to specialized domains without high-quality domain-specific data. Existing LLM-based data curation methods primari…

arXiv cs.AI TIER_1 English(EN) · George Fatouros, Georgios Makridis, George Kousiouris, John Soldatos, Dimosthenis Kyriazis · 2026-06-01 04:00

An Organization-Scoped LLM Agent Runtime Architecture for Regulated Cybersecurity Operations

arXiv:2605.30604v1 Announce Type: cross Abstract: Regulated cybersecurity workflows lack a runtime substrate that enforces organization-level scope across retrieval, tool calls, memory, findings, reports, and audit while remaining model-agnostic and locally deployable. Recent lar…

arXiv cs.AI TIER_1 English(EN) · Madhav Jivrajani, Ramnatthan Alagappan, Aishwarya Ganesan · 2026-06-01 04:00

Sophrosyne: Agentic Exploration of Relational Data Systems Requires Moderation

arXiv:2605.30862v1 Announce Type: cross Abstract: Text2SQL agents powered by LLMs translate natural language intent into SQL by exploring the data system through tool calls before formulating the query. However, to ensure secure and scoped access, data systems construct environme…

arXiv cs.LG TIER_1 English(EN) · Jeffrey Seely, Bartłomiej Cupiał, Llion Jones · 2026-06-01 04:00

Learning Multi-Agent Coordination via Sheaf-ADMM

arXiv:2605.31005v1 Announce Type: new Abstract: We present a differentiable optimization framework for multi-agent coordination. An input is decomposed into overlapping local views, each processed by an agent that solves a convex subproblem parameterized by a neural encoder. Agen…

arXiv cs.CL TIER_1 English(EN) · Tianjie Ju, Yueqing Sun, Zheng Wu, Wei Zhang, Yaqi Huo, Xi Su, Qi Gu, Xunliang Cai, Gongshen Liu, Zhuosheng Zhang · 2026-06-01 04:00

MineExplorer: Evaluating the Open-World Exploration Capabilities of MLLM Agents in Minecraft

arXiv:2605.30931v1 Announce Type: new Abstract: Multimodal large language models (MLLMs) have shown strong capabilities in perception, reasoning, and action generation. However, their ability to sustain exploration in dynamic open worlds remains unclear. Existing embodied and gam…

arXiv cs.MA (Multiagent) TIER_1 English(EN) · David Lo · 2026-06-01 02:26

Agent System Operations: Taxonomy, Challenges, and Future Directions

As the reasoning capabilities of Large Language Models (LLMs) continue to advance, LLM-based agent systems offer advantages in flexibility and interpretability over traditional systems, garnering increasing attention. However, despite the widespread research interest and industri…

arXiv cs.MA (Multiagent) TIER_1 Română(RO) · Daniel Fried · 2026-06-01 01:29

Multi-Agent Computer Use

Computer use agents (CUAs) today are primarily deployed as single serial agents. This setup is suboptimal for complex long-horizon tasks that benefit from task decomposition, parallel execution, and consistent re-planning based on new information. In this paper, we argue that we …

Hugging Face Daily Papers TIER_1 English(EN) · 2026-06-01 00:08

Agent Operating Systems (AOS): Integrating Agentic Control Planes into Traditional Operating Systems and Beyond

Traditional operating systems were designed around deterministic programs, explicit control flow, and human-initiated workflows. Their core abstractions (processes, threads, system calls, files, and permissions) assume bounded behavior and predictable interaction patterns. Agentic …

Hugging Face Daily Papers TIER_1 English(EN) · 2026-06-01 00:00

MCP-Persona: Benchmarking LLM Agents on Real-World Personal Applications via Environment Simulation

MCP-Persona benchmark evaluates agent performance on personalized tools interacting with individual accounts and local databases, revealing significant challenges in current SOTA agents.

Hugging Face Daily Papers TIER_1 Română(RO) · 2026-06-01 00:00

Multi-Agent Computer Use

Multi-agent computer use systems outperform single-agent approaches on complex tasks by enabling parallel execution and dynamic task decomposition through directed acyclic graphs.

Hugging Face Daily Papers TIER_1 English(EN) · 2026-06-01 00:00

Economy of Minds: Emergent Multi-Agent Intelligence with Economic Interactions

Decentralized agent economies with auction-based competition and wealth accumulation enable emergent collective intelligence without central coordination, outperforming monolithic approaches in complex reasoning and optimization tasks.

arXiv cs.MA (Multiagent) TIER_1 English(EN) · Praveen K · 2026-05-31 23:15

LLM Consortia for Software Design Refinement: A Controlled Experiment on Multi-Agent Collaboration Topologies

We present a controlled experiment evaluating 12 multi-agent LLM collaboration topologies for software architecture design. Using a $2\times2\times2$ factorial design (Authority $\times$ Roles $\times$ Dynamics), we conducted 520 experimental runs across 8 design tasks of varying…

arXiv cs.MA (Multiagent) TIER_1 English(EN) · Carolina Fortuna · 2026-05-31 16:19

The Ringelmann Effect in Multi-Agent LLM Systems: A Scaling Law for Effective Team Size

Inference-time multi-agent LLM scaling lacks a shared unit: counting nominal agents conflates cost with independent evidence. We derive a two-parameter scaling law $R(N) = N_{\text{eff}}/N = 1/(1+c(N-1)N^{-\beta})$ where the regime exponent $\beta$ classifies any configuration into one of …

arXiv cs.MA (Multiagent) TIER_1 English(EN) · Thanh Luong Tuan · 2026-05-30 16:43

Dynamic Coordination Strategy Selection for Enterprise Multi-Agent Systems

Enterprise multi-agent systems increasingly expose multiple coordination patterns, but deployments often lack evidence for when to use consensus, debate, synthesis, or a simpler single-agent workflow. This paper evaluates whether coordination strategy should be selected dynamical…

arXiv cs.MA (Multiagent) TIER_1 English(EN) · Hongwei Feng · 2026-05-30 09:57

Scaling Behavior of Multi-Agent Systems Powered by a Single AI

The burgeoning field of LLM-based Multi-Agent Systems (MAS) promises to tackle complex tasks through collaborative intelligence, yet fundamental questions regarding their scaling behavior and intrinsic collective dynamics remain underexplored. This paper systematically investigat…

Hugging Face Daily Papers TIER_1 English(EN) · 2026-05-30 00:00

FineVerify: Scaling Test-Time Compute for Agentic Search via Fine-Grained Self-Verification

FineVerify is a self-verification framework for agentic search that improves accuracy through decomposed sub-question checking and trajectory selection.

arXiv cs.IR (Information Retrieval) TIER_1 English(EN) · Julian McAuley · 2026-05-29 22:51

Masking Stale Observations Helps Search Agents, Until It Doesn't: A Regime Map and Its Mechanisms

Long-horizon search agents accumulate large amounts of retrieved content across many tool calls, making context-budget efficiency increasingly important. A minimal intervention is to mask stale observations from the context as the trajectory progresses, but it remains unclear whe…

arXiv cs.AI TIER_1 English(EN) · Yunpeng Zhou · 2026-05-29 14:29

Diagnosing Failure Modes of Shared-State Collaboration in Resource-Constrained Visual Agents

Modular visual reasoning systems increasingly rely on shared working memory for multi-step collaboration, yet the failure dynamics of intermediate state evolution in low-capacity regimes remain underexplored. We study failure modes of collaborative reasoning with weak learners (4…

arXiv cs.AI TIER_1 English(EN) · Yugang Jiang · 2026-05-29 13:40

TraceGraph: Shared Decision Landscapes for Diagnosing and Improving Agent Trajectories

Agent benchmarks increasingly record rich interaction trajectories, yet evaluation often reduces each rollout to a pass rate or reward score. We introduce TraceGraph, a graph-based framework that turns released multi-model agent trajectories into shared decision landscapes. For e…

arXiv cs.AI TIER_1 English(EN) · Priyam Sahoo, Gaurav Mittal, Xiaomin Li, Shengjie Ma, Benjamin Steenhoek, Pingping Lin, Yu Hu · 2026-05-29 04:00

AgentLens: Revealing the Lucky-Pass Problem in SWE-Agent Evaluation

arXiv:2605.12925v2 Announce Type: replace-cross Abstract: Evaluation of software engineering (SWE) agents is dominated by a binary signal: whether the final patch passes the tests. This outcome-only view treats a principled solution and a chaotic tri…

arXiv cs.CL TIER_1 English(EN) · Shuyu Zhang, Yaqi Shi, Lu Wang · 2026-05-29 04:00

PatchBoard: Schema-Grounded State Mutation for Reliable and Auditable LLM Multi-Agent Collaboration

arXiv:2605.29313v1 Announce Type: new Abstract: LLM multi-agent systems often coordinate through natural-language dialogue or loosely structured shared memory, making intermediate state difficult to validate, attribute, and audit. We introduce PatchBoard, a schema-grounded collab…

arXiv cs.CL TIER_1 English(EN) · Masafumi Enomoto, Ryoma Obara, Haochen Zhang, Masafumi Oyamada · 2026-05-29 04:00

Revisiting Observation Reduction for Web Agents: A Comprehensive Evaluation with a Lightweight Framework

arXiv:2605.29397v1 Announce Type: new Abstract: HTML observations in LLM-based web agents are extremely long, and while many reduction methods have been proposed, it remains unclear which methods reduce overall agent latency while maintaining performance. The main obstacle is the…

arXiv cs.AI TIER_1 English(EN) · Yutong Wang, Siyuan Xiong, Xuebo Liu, Wenkang Zhou, Liang Ding, Miao Zhang, Min Zhang · 2026-05-29 04:00

AgentDropoutV2: Optimizing Information Flow in Multi-Agent Systems via Test-Time Rectify-or-Reject Pruning

arXiv:2602.23258v2 Announce Type: replace Abstract: While Multi-Agent Systems (MAS) excel in complex reasoning, they suffer from the cascading impact of erroneous information from individual agents. Current solutions often resort to rigid structural engineering or expensive fine-…

arXiv cs.AI TIER_1 English(EN) · Yu Li, Mingyang Yi, Xiuyu Li, Ju Fan, Fuxin Jiang, Binbin Chen, Peng Li, Jie Song, Tieying Zhang · 2026-05-29 04:00

Reasoning and Tool Use Compete in Agentic Reinforcement Learning: From Quantifying Interference to Disentangled Tuning

arXiv:2602.00994v2 Announce Type: replace Abstract: Agentic Reinforcement Learning (ARL) trains large language models to interleave reasoning with external tool execution to solve complex tasks. Most existing ARL methods train a single set of parameters to support both reasoning …

arXiv cs.AI TIER_1 English(EN) · Zehua Pei, Hui-Ling Zhen, Shixiong Kai, Sinno Jialin Pan, Yunhe Wang, Mingxuan Yuan, Bei Yu · 2026-05-29 04:00

SCOPE: Prompt Evolution for Enhancing Agent Effectiveness

arXiv:2512.15374v2 Announce Type: replace Abstract: Large Language Model (LLM) agents are increasingly deployed in environments that generate massive, dynamic contexts. However, a critical bottleneck remains: while agents have access to this context, their static prompts lack the…

arXiv cs.AI TIER_1 English(EN) · Jiazhen Yuan, Zhike Gong, Jinquan Hang, Zhengbiao Bai, Wei Zhao · 2026-05-29 04:00

Graph-Enhanced Policy Optimization in LLM Agent Training

arXiv:2510.26270v2 Announce Type: replace Abstract: Multi-step LLM agents in interactive environments represent a crucial step toward long-horizon decision-making. To train such agents, group-based reinforcement learning is widely adopted, which reinforces trajectories with highe…

arXiv cs.AI TIER_1 English(EN) · Henrique Assumpção, Diego Ferreira, Leandro Campos, Fabricio Murai · 2026-05-29 04:00

CodeEvolve: An Open-Source Evolutionary Coding Agent for Algorithm Discovery and Optimization

arXiv:2510.14150v5 Announce Type: replace Abstract: We introduce CodeEvolve, an open-source framework that couples large language models with island-based evolutionary search for end-to-end algorithmic discovery. CodeEvolve integrates inspiration-based crossover, meta-prompting, …

arXiv cs.CL TIER_1 English(EN) · Xuanliang Zhang, Dingzirui Wang, Keyan Xu, Qingfu Zhu, Wanxiang Che · 2026-05-29 04:00

Scaling Laws for Agent Harnesses via Effective Feedback Compute

arXiv:2605.29682v1 Announce Type: new Abstract: Agent harnesses increasingly determine the performance of language-model systems by deciding how models call tools, receive feedback, verify intermediate states, store memory, and revise solutions. Yet current test-time scaling anal…

arXiv cs.CL TIER_1 English(EN) · Ziyang Ma, Dingyi Zhang, Sichu Liang, Jiajia Chu, Pengfei Xia, Hui Zang, Deyu Zhou · 2026-05-29 04:00

CONCAT: Consensus- and Confidence-Based Ad Hoc Teaming for Efficient LLM-Based Multi-Agent Systems

arXiv:2605.29612v1 Announce Type: cross Abstract: Although large language model (LLM) based multi-agent systems (MAS) show their capability to solve complex tasks and achieve higher performance over single agent systems, they lead to huge computational overheads because of heavy …

arXiv cs.CL TIER_1 English(EN) · Jinnuo Liu, Chuke Liu, Hua Shen · 2026-05-29 04:00

ValueFlow: Measuring the Propagation of Value Perturbations in Multi-Agent LLM Systems

arXiv:2602.08567v2 Announce Type: replace-cross Abstract: Multi-agent large language model (LLM) systems increasingly consist of agents that observe and respond to one another's outputs. While value alignment is typically evaluated for isolated models, how value perturbations pro…

arXiv cs.CL TIER_1 English(EN) · Kenan Li, Qirui Jin, Liao Zhu, Xiaosong Huang, Yijia Wu, Yikai Zhang, Xin Zhang, Zijian Jin, Yufan Huang, Elsie Nallipogu, Chaoyun Zhang, Yu Kang, Saravan Rajmohan, Qingwei Lin, Wenke Lee, Dongmei Zhang · 2026-05-29 04:00

ORACLE-SWE: Quantifying the Contribution of Oracle Information Signals to SWE Agents

arXiv:2604.07789v2 Announce Type: replace-cross Abstract: Recent advances in language model (LM) agents have significantly improved automated software engineering (SWE). Prior work has proposed various agentic workflows and training strategies as well as analyzed failure modes of…

arXiv cs.LG TIER_1 English(EN) · Weicheng Xue · 2026-05-29 04:00

Representation Signatures and Risk-Feedback Alignment in LLM Trading Agents

arXiv:2605.28850v1 Announce Type: new Abstract: We study behavioral alignment and representation dynamics of large language model (LLM) agents in financial decision environments. Using TradeArena, an auditable trading-agent testbed with risk reports, execution simulation, memory,…

arXiv cs.LG TIER_1 English(EN) · Kexin Chu, Dawei Xiang, Wei Zhang · 2026-05-29 04:00

Latency-Quality Routing for Functionally Equivalent Tools in LLM Agents

arXiv:2605.14241v2 Announce Type: replace Abstract: Tool-augmented LLM agents increasingly access the same tool type through multiple functionally equivalent providers, such as web-search APIs, retrievers, or LLM backends exposed behind a shared interface. This creates a provider…

arXiv cs.AI TIER_1 English(EN) · Corrado Rainone, Davide Belli, Bence Major, Arash Behboodi · 2026-05-29 04:00

When Cloud Agents Meet On-Device Agents: Lessons from Hybrid Multi-Agent Systems

arXiv:2605.30102v1 Announce Type: cross Abstract: The design space of agentic AI inference spans two extremes: frontier large language models (LLMs), typically hosted in the cloud and offering strong performance across a wide range of tasks at substantially high cost, and more co…

arXiv cs.AI TIER_1 English(EN) · Alejandra Zambrano, Sara Vera Marjanovic, Imene Kerboua, Xing Han Lù, Leila Kosseim · 2026-05-29 04:00

Does the Way You Plan Matter? An Empirical Study of Planning Representations for LLM Web Agents

arXiv:2605.29927v1 Announce Type: cross Abstract: Despite recent advances, LLM-based web agents still struggle with limited exploration, omission of critical steps, and sensitivity to task constraints. Prior work suggests that many of these failures stem from weaknesses in planni…

arXiv cs.AI TIER_1 English(EN) · Xiang Liu, Sa Song, Zhaowei Zhang, Huiying Lan, Jason Zeng, Ming Wu, Michael Heinrich, Yong Sun, Ceyao Zhang · 2026-05-29 04:00

Agora: Toward Autonomous Bug Detection in Production-Grade Consensus Protocols with LLM Agents

arXiv:2605.29910v1 Announce Type: cross Abstract: Consensus protocols form the backbone of distributed systems and blockchains, where implementation bugs can cause data corruption and financial losses. While LLM-based approaches show promise in code analysis, they struggle with d…

arXiv cs.AI TIER_1 English(EN) · Francisco León Zúñiga Bolívar (Institución Universitaria Colegio Mayor del Cauca) · 2026-05-29 04:00

Evolutionary Dynamics of Cooperation in Next-Generation LLM Agent Systems: A Cross-Provider Empirical Extension

arXiv:2605.29874v1 Announce Type: cross Abstract: Do next-generation LLM agents inherit the cooperative biases documented in their predecessors, or does scale and provider diversity reshape equilibrium behaviour in competitive multi-agent settings? Willis et al. established a ben…

arXiv cs.AI TIER_1 English(EN) · Zhezheng Hao, Tianfu Wang, Huanshuo Dong, Ziyan Liu, Hong Wang, Xiankun Lin, Qiang Lin, Can Wang, Hande Dong, Jiawei Chen · 2026-05-29 04:00

Teamwork: Collaborative Self-Evolution of LLM-Based Multi-Agent Systems

arXiv:2605.29790v1 Announce Type: cross Abstract: LLM-based multi-agent systems (MAS) have emerged as an effective paradigm for complex and long-horizon tasks. However, in real-world tasks, MAS often exhibit various failures during execution and such failures are difficult to eli…

arXiv cs.AI TIER_1 English(EN) · Wentao Hu, Zhendong Chu, Yiming Zhang, Junda Wu, Ming Jin, Xiangyu Zhao, Yilei Shao, Yanfeng Wang, Qingsong Wen · 2026-05-29 04:00

SkillBrew: Multi-Objective Skill Bank Curation for LLM Agents

arXiv:2605.29440v1 Announce Type: cross Abstract: Retrieval-augmented LLM agents increasingly rely on curated skill banks: collections of reusable textual principles that guide decision making on complex tasks. Existing approaches typically expand these banks in an append-only fa…

arXiv cs.AI TIER_1 English(EN) · Rongsheng Zhang, Jiji Tang, Junnan Ren, Zuyi Bao, Weijie Chen, Ruofan Hu, Zhou Zhao, Tangjie Lv, Yan Zhang · 2026-05-29 04:00

DynSess: A Dynamic Session-Level Evaluation and Optimization Framework for Role-Playing Agents

arXiv:2605.29256v1 Announce Type: cross Abstract: Role-playing with large language models is fundamentally a session-level task, requiring agents to sustain character identity and interaction quality across extended multi-turn conversations. Yet existing evaluation and optimizati…

arXiv cs.AI TIER_1 English(EN) · Abel Yagubyan · 2026-05-29 04:00

How Consistent Are LLM Agents? Measuring Behavioral Reproducibility in Multi-Step Tool-Calling Pipelines

arXiv:2605.28840v1 Announce Type: cross Abstract: Large language model (LLM) agents with tool-calling capabilities are increasingly deployed in production systems, yet a fundamental reliability question remains under-explored: does the same agent behave the same way twice? We pre…

arXiv cs.AI TIER_1 English(EN) · Hao-Xiang Xu, Chong Deng, Jiaqing Liu, Wen Wang, Qian Chen, Lujia Bao, Xiangang Li, Zhen-Hua Ling · 2026-05-29 04:00

GenesisFunc: Multi-Agent Data Generation for Accurate and Generalizable Function Calling

arXiv:2605.28835v1 Announce Type: cross Abstract: Large Language Models (LLMs) extend their capabilities through function-calling (FC), which relies on training data with high quality, diversity, and broad coverage of scenario. However, obtaining and annotating real function-call…

arXiv cs.AI TIER_1 English(EN) · Yulei Ye, Wenhao Li, Zhong Wen, Yunshu Huang, Yichen Hu, Zifan Wei, Yige Wang, Xinyu Xie, Haoxuan Yang, Yanjun Huang, Ruijia Li, Hong Qian, Yu Song, Bo Jiang, Bingdong Li, Lijun Li, Bo Zhang, Pinlong Cai, Xingcheng Xu, Shuangye Chen, Xia Hu, Liang He, Ai… · 2026-05-29 04:00

AgentSchool: An LLM-Driven Multi-Agent Educational Simulator

arXiv:2605.30144v1 Announce Type: new Abstract: Despite the rapid deployment of LLMs into classrooms, validating educational AI remains uniquely intractable: interventions act on developing learners whose cognitive and social trajectories are irreversibly shaped, while real-world…

arXiv cs.AI TIER_1 English(EN) · Minyang Hu, Bo Yang, Zhinuo Zhou, Jiachen Liang, Guo Jiahao, Yiyang Yin, Xiongwei Han · 2026-05-29 04:00

Redundant or Necessary? A Benchmark for Detecting Redundant Steps in Agent Trajectories

arXiv:2605.29893v1 Announce Type: new Abstract: LLM-based agents have demonstrated strong capabilities in solving complex tasks through multi-step reasoning and tool use. However, existing evaluation protocols primarily focus on task success, overlooking a critical aspect of agen…

arXiv cs.AI TIER_1 English(EN) · Yanchao Li, Wanhao Liu, Ben Gao, Jiaqing Xie, Zhehong Ai, Na Zou, Yuqiang Li, Tianfan Fu · 2026-05-29 04:00

SkillsInjector: Dynamically Constructing Skill Context for LLM Agents

arXiv:2605.29794v1 Announce Type: new Abstract: LLM agents now draw on growing skill libraries to handle complex tasks. However, injecting more skills does not always improve task completion and can even degrade it. Existing methods still treat skill injection as a static step, s…

arXiv cs.AI TIER_1 English(EN) · Kevin Wang, Anna Thöni, Benjamin Kempinski, Bobby Cheng, Jianzhu Yao, Benjamin Finch, Leon Guertler, Viraj Nadkarni, Yihan Jiang, Aliaksei Korshuk, Alexander Buyantuev, Ilya Makarov, Siyuan Wu, Yu-Chi Cheng, Yan-Ru Ju, Ti-Rong Wu, I-Hsuan Chu, Yu-Yu Ya… · 2026-05-29 04:00

MINDGAMES: A Live Evaluation Platform for Social and Strategic Reasoning in Multi-Agent LLMs

arXiv:2605.29512v1 Announce Type: new Abstract: Large language models (LLMs) are increasingly deployed as interactive agents, yet their capacity for social and strategic reasoning over extended interaction remains poorly understood. Existing evaluations rely on static vignettes o…

arXiv cs.AI TIER_1 English(EN) · Shijie Cao, Yuan Yuan, Jing Liu · 2026-05-29 04:00

Reconciling Real-Time Constraints with Long-Horizon Reasoning: An Asynchronous Agent Framework for Dynamic Scheduling

arXiv:2605.29262v1 Announce Type: new Abstract: The Dynamic Flexible Job Shop Scheduling Problem (DFJSP) necessitates a trade-off between instant reaction to stochastic disturbances and global optimization of production goals. Conventional priority rules are insufficiently flexib…

arXiv cs.AI TIER_1 English(EN) · Weizhi Zhang, Xinyang Zhang, Chenwei Zhang, Liangwei Yang, Jingbo Shang, Zhepei Wei, Henry Peng Zou, Zijie Huang, Zhengyang Wang, Yifan Gao, Xiaoman Pan, Lian Xiong, Jingguo Liu, Philip S. Yu, Xian Li · 2026-05-29 04:00

PersonaAgent: Bridging Memory and Action for Personalized LLM Agents

arXiv:2506.06254v2 Announce Type: replace Abstract: Large Language Model (LLM) empowered agents have recently emerged as advanced paradigms that exhibit impressive capabilities in a wide range of domains and tasks. Despite their potential, current LLM agents often adopt a one-siz…

arXiv cs.AI TIER_1 English(EN) · Chelsea Zou, Yiheng Yao, Selena She, Noah Goodman, Robert D. Hawkins · 2026-05-29 04:00

CalBench: Evaluating the Coordination-Privacy Trade-off in Multi-Agent LLMs

arXiv:2605.09823v2 Announce Type: replace-cross Abstract: Personal AI assistants are beginning to act as delegates with access to calendars, inboxes, and user preferences. Calendar scheduling makes the trust problem concrete: an assistant must coordinate with other assistants whi…

arXiv cs.AI TIER_1 English(EN) · Yibing Liu, Yangze Liu, Xiaolong Yin, Bin Wang, Chong Zhang, Hao Yin, Zhongyi Han · 2026-05-29 04:00

OpenClawBench: Benchmarking Process-Side Anomalies in Real-World Agent Execution Trajectories

arXiv:2605.29253v1 Announce Type: new Abstract: Task success can hide process anomalies in real-world agent executions. An agent may pass the final task oracle while still accumulating unresolved ambiguity, unsafe external writes, ignored errors, weakly grounded commitments, or c…

arXiv cs.AI TIER_1 English(EN) · Jiahao Huang, Fei Cheng, Junfeng Jiang, Zefan Yu, Akiko Aizawa · 2026-05-29 04:00

BenchTrace: A Benchmark for Testing Reflection and Controlled Evolution in LLM Agents

arXiv:2605.29225v1 Announce Type: new Abstract: Self-evolving agents improve over time by reflecting on past failures, but existing evaluation is limited in two ways: it measures only task scores, leaving reflection quality unknown, and it relies on agents' own episode runs, offe…

arXiv cs.AI TIER_1 English(EN) · Tenghao Huang, Kung-Hsiang Huang, Prafulla Kumar Choubey, Yilun Zhou, Muhao Chen, Jonathan May, Chien-Sheng Wu · 2026-05-29 04:00

GTA: Generating Long-Horizon Tasks for Web Agents at Scale

arXiv:2605.29218v1 Announce Type: new Abstract: Web agents, which couple language models with browsing and tool-use capabilities, show promise as open web assistants. Yet progress is increasingly limited by the lack of scalable, process-level supervision. Existing benchmarks are …

arXiv cs.AI TIER_1 English(EN) · Yifei He, Rui Yang, Hao Bai, Tong Zhang, Han Zhao · 2026-05-29 04:00

PRO-CUA: Process Reward Optimization for Computer Use Agents

arXiv:2605.29119v1 Announce Type: new Abstract: Computer use agents (CUAs) have shown strong potential for automating complex digital workflows, yet their training remains constrained by costly live environment interaction and limited high-quality supervision. Existing filtered b…

arXiv cs.AI TIER_1 English(EN) · Shreyas Fadnavis, Praitayini Kanakaraj, Felix Wyss · 2026-05-29 04:00

Beyond Consensus: Trace-Level Synthesis in Mixtures of Agents

arXiv:2605.29116v1 Announce Type: new Abstract: When multiple LLM agents solve the same problem, standard practice compresses each agent's reasoning into a majority vote or layered synthesis, treating agreement as the finish line. We show this is unnecessarily lossy: an LLM aggre…

arXiv cs.AI TIER_1 English(EN) · Tyler Akidau, Tyler Rockwood, Johannes Brüderl, Marc Millstone · 2026-05-29 04:00

The Importance of Out-of-Band Metadata for Secure Autonomous Agents: Redpanda Agentic Data Plane

arXiv:2605.29082v1 Announce Type: new Abstract: AI agents are increasingly expected to operate as digital employees: accessing enterprise data, making decisions, and taking actions autonomously. But agents are simultaneously less predictable than humans -- prone to hallucination,…

arXiv cs.AI TIER_1 English(EN) · Zixuan Wang, Dingming Li, Hongxing Li, Yanrui Miao, Shuo Chen, Yuchen Yan, Wenqi Zhang, Yongliang Shen, Weiming Lu, Jun Xiao, Yueting Zhuang · 2026-05-29 04:00

GroundAct: Can LLM Agents Ground Actions in Environmental State?

arXiv:2508.05614v2 Announce Type: replace-cross Abstract: LLM agents achieve 85-96% success on tasks where instructions fully specify the action, but drop to 29-53% when action feasibility depends on environmental state that the instruction does not mention. We argue that this ga…

Hugging Face Daily Papers TIER_1 English(EN) · 2026-05-29 00:00

MineExplorer: Evaluating the Open-World Exploration Capabilities of MLLM Agents in Minecraft

MineExplorer benchmark evaluates multimodal large language models' open-world exploration capabilities in Minecraft through atomic and multi-hop tasks designed via multi-agent synthesis.

Hugging Face Daily Papers TIER_1 English(EN) · 2026-05-29 00:00

Masking Stale Observations Helps Search Agents, Until It Doesn't: A State Diagram and Its Mechanism

Observation masking in long-horizon search agents shows variable accuracy gains depending on the interaction between retriever capability and model capacity, following an asymmetric inverted-U pattern.

Hugging Face Daily Papers TIER_1 English(EN) · 2026-05-29 00:00

Skills Are Not One-Size-Fits-All: Model-Aware Skill Alignment for LLM Agents

Model-aware skill alignment framework adapts skills to different backbones through hierarchical evolution and lightweight rewriter training, achieving superior performance across interactive tasks.

arXiv cs.IR (Information Retrieval) TIER_1 English(EN) · Dimosthenis Kyriazis · 2026-05-28 21:51

An Organization-Scoped LLM Agent Runtime Architecture for Regulated Cybersecurity Operations

Regulated cybersecurity workflows lack a runtime substrate that enforces organization-level scope across retrieval, tool calls, memory, findings, reports, and audit while remaining model-agnostic and locally deployable. Recent large language model (LLM) agent systems report stron…

arXiv cs.MA (Multiagent) TIER_1 English(EN) · Ningyu Zhang · 2026-05-28 18:00

LongDS-Bench: On the Failures of Long-Horizon Agentic Data Analysis

Real-world data analysis is inherently iterative, yet existing benchmarks mostly evaluate isolated or short interactive tasks, leaving agents' ability to track evolving analytical context over long horizons untested. We introduce LongDS, a benchmark for long-horizon, multi-turn d…

arXiv cs.MA (Multiagent) TIER_1 English(EN) · Gennady Pekhimenko · 2026-05-28 17:54

SpecBench: Evaluating Specification-Level Reasoning in Software Engineering LLM Agents

Software engineering (SWE) agents are transitioning from code generation to full software development lifecycle automation. A critical phase in this lifecycle is specification design: transforming initial proposals into carefully considered requirements through expert review. Exi…

arXiv cs.IR (Information Retrieval) TIER_1 English(EN) · Shumin Deng · 2026-05-28 17:50

Exploring Autonomous Agentic Data Engineering for Model Specialization

Large Language Models (LLMs) have demonstrated strong performance on general tasks, while often struggling to adapt to specialized domains without high-quality domain-specific data. Existing LLM-based data curation methods primarily rely on human-designed workflows, leaving it un…

arXiv cs.MA (Multiagent) TIER_1 English(EN) · Xiangfeng Wang · 2026-05-28 16:05

AgentSchool: An LLM-Driven Multi-Agent Educational Simulator

Despite the rapid deployment of LLMs into classrooms, validating educational AI remains uniquely intractable: interventions act on developing learners whose cognitive and social trajectories are irreversibly shaped, while real-world trials are slow, ethically constrained, and ins…

arXiv cs.MA (Multiagent) TIER_1 English(EN) · Arash Behboodi · 2026-05-28 15:45

When Cloud Agents Meet Device Agents: Lessons from Hybrid Multi-Agent Systems

The design space of agentic AI inference spans two extremes: frontier large language models (LLMs), typically hosted in the cloud and offering strong performance across a wide range of tasks at substantially high cost, and more cost-efficient small language models (SLMs), which a…

arXiv cs.CL TIER_1 English(EN) · Leila Kosseim · 2026-05-28 13:39

Does How You Plan Matter? An Empirical Study of Planning Representations for LLM Web Agents

Despite recent advances, LLM-based web agents still struggle with limited exploration, omission of critical steps, and sensitivity to task constraints. Prior work suggests that many of these failures stem from weaknesses in planning, yet the impact of alternative natural language…

arXiv cs.MA (Multiagent) TIER_1 English(EN) · Francisco León Zúñiga Bolívar · 2026-05-28 12:58

Evolutionary Dynamics of Cooperation in Next-Generation LLM Agent Systems: A Cross-Provider Empirical Extension

Do next-generation LLM agents inherit the cooperative biases documented in their predecessors, or does scale and provider diversity reshape equilibrium behaviour in competitive multi-agent settings? Willis et al. established a benchmark for this question using evolutionary game t…

arXiv cs.MA (Multiagent) TIER_1 English(EN) · Jiawei Chen · 2026-05-28 11:40

Team Evolution: Collaborative Self-Evolution of LLM-Based Multi-Agent Systems

LLM-based multi-agent systems (MAS) have emerged as an effective paradigm for complex and long-horizon tasks. However, in real-world tasks, MAS often exhibit various failures during execution and such failures are difficult to eliminate during design. This motivates experience-dr…

arXiv cs.CL TIER_1 English(EN) · Wanxiang Che · 2026-05-28 09:45

Scaling Laws for Agent Harnesses via Effective Feedback Compute

Agent harnesses increasingly determine the performance of language-model systems by deciding how models call tools, receive feedback, verify intermediate states, store memory, and revise solutions. Yet current test-time scaling analyses often parameterize this process by raw expe…

arXiv cs.MA (Multiagent) TIER_1 English(EN) · Deyu Zhou · 2026-05-28 08:47

CONCAT: Consensus- and Confidence-Based Ad Hoc Teaming for Efficient LLM-Based Multi-Agent Systems

Although large language model (LLM) based multi-agent systems (MAS) show their capability to solve complex tasks and achieve higher performance over single agent systems, they lead to huge computational overheads because of heavy communication between agents. Previous research ha…

arXiv cs.IR (Information Retrieval) TIER_1 English(EN) · Qingsong Wen · 2026-05-28 06:33

SkillBrew: Multi-Objective Skill Bank Curation for LLM Agents

Retrieval-augmented LLM agents increasingly rely on curated skill banks: collections of reusable textual principles that guide decision making on complex tasks. Existing approaches typically expand these banks in an append-only fashion, continuously adding new skills without remo…

arXiv cs.AI TIER_1 English(EN) · Yaoyang Luo, Zhi Zheng, Ziwei Zhao, Tong Xu, Zhao Jielun, Wenjun Xue, Yong Chen, Enhong Chen · 2026-05-28 04:00

Defending LLM-Based Multi-Agent Systems Against Collaborative Attacks with Sentence-Level Error Correction

arXiv:2605.28104v1 Announce Type: new Abstract: Recent years have witnessed the rapid development of Large Language Model-based Multi-Agent Systems (MAS), which excel at collaborative decision-making and complex problem-solving. However, malicious agents in MAS may inject misinfo…

arXiv cs.AI TIER_1 English(EN) · Md Hafizur Rahman, Zafaryab Haider, Tanzim Mahfuz, Prabuddha Chakraborty · 2026-05-28 04:00

HARP: Measuring Harm Amplification in Multi-Agent LLM Systems

arXiv:2605.27489v1 Announce Type: cross Abstract: Multi-agent LLM systems decompose workflows across agents, tools, shared context, memory, and decision gates. This modularity improves interpretability, but creates a propagation risk: a bounded perturbation to one component can b…

arXiv cs.AI TIER_1 English(EN) · Susanna Cifani, Mario Luca Bernardi, Marta Cimitile · 2026-05-28 04:00

An Adaptive Multimodal Agent Framework for Automated Workflow Execution

arXiv:2605.28607v1 Announce Type: new Abstract: Modern information systems require autonomous agents capable of navigating complex workflows, yet current methodologies often struggle with the transition from structured metadata parsing to general environmental perception. While t…

arXiv cs.AI TIER_1 English(EN) · Tomer Keren, Nitay Calderon, Asaf Yehudai, Yotam Perlitz, Michal Shmueli-Scheuer, Roi Reichert · 2026-05-28 04:00

TASTE Matters: Improving the Coverage and Difficulty of Agent Benchmarks

arXiv:2605.28556v1 Announce Type: new Abstract: As agent capabilities advance, existing benchmarks, such as $\tau^2$-Bench, are becoming increasingly saturated. Yet constructing new benchmark tasks remains complex, costly, and labor-intensive. Moreover, the standard approach, in …

arXiv cs.AI TIER_1 English(EN) · Liang Cheng, Mingsheng Cai, Jiuming Jiang, Luo Mai · 2026-05-28 04:00

Do Agents Know What They Cannot Do? Evaluating Feasibility Awareness in Tool-Using Agents

arXiv:2605.28532v1 Announce Type: new Abstract: Tool-using agents often incur substantial computational cost due to long reasoning chains and iterative tool usage. In practical scenarios, many tasks become infeasible under constrained tool environments, where the capabilities req…

arXiv cs.AI TIER_1 English(EN) · Chenyu Zhou, Xinyun Lu, Jiangyue Zhao, Jianghao Lin, Dongdong Ge, Yinyu Ye · 2026-05-28 04:00

OR-Space: A Full-Lifecycle Workspace Benchmark for Industrial Optimization Agents

arXiv:2605.28158v1 Announce Type: new Abstract: Large language model (LLM) agents are increasingly used to assist with operations research (OR) modeling, yet existing OR-oriented benchmarks often reduce evaluation to one-shot translation from a self-contained problem statement in…

arXiv cs.AI TIER_1 English(EN) · Zejian Eric Wu, Zhongyi Jiang, Yuan Zhuang, Paul Jen-Hwa Hu · 2026-05-28 04:00

An Examination of Agent Bias Amplification and Suppression in Multi-Agent Systems

arXiv:2605.28098v1 Announce Type: new Abstract: Multi-agent systems are increasingly deployed to support various tasks where agents interact to achieve individual and collective objectives. Although these systems can enhance task performance and decision-making, fairness preserva…

arXiv cs.AI TIER_1 English(EN) · Zhenyu Cui, Xiangzhong Luo · 2026-05-28 04:00

Do Agents Think Deeper? A Mechanistic Study of Layer-Wise Dynamics in Sequential Planning

arXiv:2605.27935v1 Announce Type: new Abstract: Recent mechanistic studies suggest that large language models (LLMs) may utilize their depth inefficiently in standard single-turn tasks. Whether this still holds in autonomous agent settings, where models must perform multi-turn pl…

arXiv cs.AI TIER_1 English(EN) · Yilun Yao, Xinyu Tan, Chao-Hsuan Liu, Yaoming Li, Zhengyang Wang, Wenhan Yu, Zhewen Tan, Yuxuan Tian, Guangxiang Zhao, Lin Sun, Xiangzheng Zhang, Tong Yang · 2026-05-28 04:00

Harness-Bench: Measuring Cross-Model Harness Effects in Realistic Agent Workflows

arXiv:2605.27922v1 Announce Type: new Abstract: LLM agents are increasingly deployed as executable systems that use tools, modify workspaces, and produce concrete artifacts. In such workflows, performance depends not only on the base model, but also on the harness: the system lay…

arXiv cs.AI TIER_1 English(EN) · Hongxiang Lin, Zhirui Kuai, Erpeng Xue, Lei Wang · 2026-05-28 04:00

SKILLC: Learning Autonomous Skill Internalization in LLM Agents via Contrastive Credit Assignment

arXiv:2605.27899v1 Announce Type: new Abstract: Structured skill prompts improve exploration in long-horizon agentic reinforcement learning (RL). Skill-augmented RL methods retain external skills at inference, while skill-internalization RL methods withdraw them during training t…

arXiv cs.AI TIER_1 English(EN) · Pengyu Zhu, Lijun Li, Yaxing Lyu, Qianxin Luo, Jingyi Yang, Yi Liu, Tingfeng Hui, Xinyu Yuan, Li Sun, Sen Su, Jing Shao · 2026-05-28 04:00

A Unified Framework for Evaluating LLM Agent Capabilities

arXiv:2605.27898v1 Announce Type: new Abstract: As LLMs are increasingly deployed as agents, reliable assessment of their agentic capabilities has become essential. However, reported benchmark scores often jointly reflect model capability and the implementation choices each bench…

arXiv cs.AI TIER_1 English(EN) · Thao Nguyen, Heng Ji · 2026-05-28 04:00

MolLingo: Molecule-Native Representations for LLM-Driven Scientific Agents

arXiv:2605.27853v1 Announce Type: new Abstract: We present MolLingo, a multi-agent system that emulates the reasoning process of a chemist to automate molecular design. Existing LLM-based approaches either operate as standalone generative models without access to external tools o…

arXiv cs.AI TIER_1 English(EN) · Yi Ding, Zijie Xuan, Haowei Zhou, Zhenyu Ju, Xiaoxiao Dong, Jingwen Zhang, Xingyu Zhu, Leixin Sun, Haochi Zhang · 2026-05-28 04:00

TCP-MCP: Landscape-Guided Co-Evolution of Prompts and Communication Topologies for Multi-Agent Systems

arXiv:2605.27850v1 Announce Type: new Abstract: Effective multi-agent systems cannot be designed by selecting prompts or communication graphs in isolation. Agent behavior depends on the information an agent receives, while the usefulness of a communication edge depends on how the…

arXiv cs.AI TIER_1 English(EN) · Lu Yan, Xuan Chen, Xiangyu Zhang · 2026-05-28 04:00

Diagnosing Live Intra-Policy Instruction Conflicts in Large Language Model Agents Using Observed Resolution Configurations

arXiv:2605.27784v1 Announce Type: new Abstract: LLM agents are governed by long-lived natural-language prompt policies, but individually reasonable standing rules can interact in uninspected ways. We study live intra-policy rule-conflict diagnosis: finding rule pairs inside a sin…

arXiv cs.AI TIER_1 English(EN) · Aman Priyanshu, Supriti Vijay, Esha Pahwa · 2026-05-28 04:00

Got Secrets? LLM Agents Can't Keep Them: Evaluating Privacy in Multi-Agent Systems

arXiv:2605.27766v1 Announce Type: new Abstract: LLM safety evaluations predominantly test models in isolation, yet deployed AI agents increasingly operate within persistent social environments alongside other agents. We introduce a Moltbook-style simulation platform where thousan…

arXiv cs.AI TIER_1 English(EN) · Rui Zhang, Chaeeun Kim, Liting Hu · 2026-05-28 04:00

A Policy-Driven Runtime Layer for Agentic LLM Serving

arXiv:2605.27744v1 Announce Type: new Abstract: Multi-agent LLM systems have become the dominant production workload, but the serving stack was not built for them. The agent framework above knows agent identities, role, schemas, and dispatch structure but never sees an engine-lev…

arXiv cs.AI TIER_1 English(EN) · Xijie Zeng, Frank Rudzicz · 2026-05-28 04:00

Voluntary Collusion via Secret Tools in Competitive LLM Agents

arXiv:2605.27593v1 Announce Type: new Abstract: Even when a tool is explicitly described as unfair and harmful to others, ostensibly safety-aligned LLM agents still voluntarily engage in secret collusion whenever doing so confers a strategic advantage. To investigate this phenome…

arXiv cs.AI TIER_1 English(EN) · Shijie Cao, Yuan Yuan, Jing Liu · 2026-05-28 04:00

DynaSchedBench: A Calibrated Dynamic Scheduling Benchmark for LLM-Based Scheduling Agents and the Observability Paradox

arXiv:2605.27566v1 Announce Type: new Abstract: Progress in neural combinatorial optimization for Dynamic Flexible Job Shop Scheduling Problem (DFJSP) is currently hindered by a methodological tension: static benchmarks encourage benchmark overfitting, while uncalibrated generato…

arXiv cs.LG TIER_1 English(EN) · Chenxi Wang, Ruiyang Huang, Jiayan Sun, Lei Wei, Yifan Wu · 2026-05-28 04:00

Hidden Attacks: Uncovering Attacks Lurking in Latent-Based Multi-Agent Systems

arXiv:2605.28214v1 Announce Type: cross Abstract: Latent-based multi-agent systems replace parts of explicit inter-agent communication with hidden representations, offering a new direction for efficient and flexible agent collaboration. However, moving coordination into latent sp…

arXiv cs.CL TIER_1 (CA) · Xinze Li, Yuhang Zang, Yixin Cao, Aixin Sun · 2026-05-28 04:00

Skill-as-Pseudocode: Refactoring Skill Libraries into Pseudocode for LLM Agents

arXiv:2605.27955v1 Announce Type: cross Abstract: Markdown skill libraries for LLM agents ship as free-form prose, forcing the agent to re-derive both the input schema and the concrete invocation syntax on every retrieval. We observe that this often produces a "confused -> re-ret…

arXiv cs.CL TIER_1 English(EN) · Seunghyuk Cho, Sunghyun Choi, Jaeseung Heo, Youngbin Choi, Saemi Moon, MoonJeong Park, Dongwoo Kim · 2026-05-28 04:00

Long Live the Librarian! A Persistent Search Sub-Agent for Energy-Efficient Multi-Agent Software Engineering Systems

arXiv:2605.27787v1 Announce Type: cross Abstract: Multi-agent systems (MAS) have substantially advanced autonomous software engineering (SWE), but their growing inference energy demands raise sustainability concerns. In this paper, we demonstrate that this cost is concentrated in…

arXiv cs.CL TIER_1 English(EN) · Mingyu Lu, Yushan Huang, Chris Lin, Su-In Lee · 2026-05-28 04:00

Agents That Matter: Optimizing Multi-Agent LLMs via Removal-Based Attribution

arXiv:2605.27621v1 Announce Type: cross Abstract: As multi-agent systems (MAS) become increasingly complex, identifying the contributions of individual agents is critical for system optimization. However, existing approaches lack a rigorous, unified framework for credit assignmen…

arXiv cs.CL TIER_1 English(EN) · Nicole Hsing, Asuka Yuxi Zheng, Yi Zhao, Haoqin Tu, Jen-Tse Huang · 2026-05-28 04:00

Align Just Once: Propagating Cooperative Behavior in Multi-Agent Systems via Seed Agents

arXiv:2605.27586v1 Announce Type: cross Abstract: Ensuring agent behaviors in distributed open multi-agent systems remains challenging, especially as populations grow and unaligned agents may exist. We show that a single aligned agent can propagate cooperative behaviors to untrai…

arXiv cs.CL TIER_1 English(EN) · Jihyeong Park, Ingeol Baek, Jeonghyun Park, Hwanhee Lee · 2026-05-28 04:00

Beyond a Single Path: Evaluating and Enhancing Divergent Thinking in Interactive LLM Agents

arXiv:2605.28465v1 Announce Type: new Abstract: Divergent thinking is a core dimension of creativity, yet existing evaluations of Large Language Models (LLMs) treat them as single-turn text generations, failing to capture how an agent reasons through iterative interaction. To add…

arXiv cs.CL TIER_1 English(EN) · Ling-Yue Ge, Lan-Zhe Guo · 2026-05-28 04:00

Roles on Rails: Contract-Preserving Role Evolution in Multi-Agent Structured Reasoning

arXiv:2605.28433v1 Announce Type: new Abstract: Role-based LLM multi-agent systems need adaptive role pools, yet adapting such systems is not merely a matter of prompt optimization: roles often carry structural obligations, including capability coverage, message compatibility, va…

arXiv cs.CL TIER_1 English(EN) · Bin Wu, Guanyun Zou, Bingbing Wang, Huan Zhao, Chuan Shi · 2026-05-28 04:00

Ask Now, Use Later: A Benchmark for Evaluating the Foresight Gap in Long-Horizon LLM Agents

arXiv:2605.28108v1 Announce Type: new Abstract: A long-lived LLM agent, such as OpenClaw, earns its value by acting on a user's preferences and constraints across sessions, not just the current request. Yet today's agents keep what a user volunteers but rarely ask for what stays …

arXiv cs.AI TIER_1 English(EN) · Mason Nakamura, Abhinav Kumar, Saswat Das, Sahar Abdelnabi, Saaduddin Mahmud, Ferdinando Fioretto, Shlomo Zilberstein, Eugene Bagdasarian · 2026-05-28 04:00

Colosseum: Auditing Collusion in Cooperative Multi-Agent Systems

arXiv:2602.15198v2 Announce Type: replace-cross Abstract: Multi-agent systems, where LLM agents communicate through free-form language, enable sophisticated coordination for solving complex cooperative tasks. This surfaces a unique safety problem when a group of agents forms a co…

arXiv cs.AI TIER_1 English(EN) · Ankush Kadu, Aswanth Krishnan · 2026-05-28 04:00

ReflexGrad: Within-Episode Failure Recovery for LLM Agents via Progress-Gated Dual-Process Routing

arXiv:2511.14584v3 Announce Type: replace-cross Abstract: We present ReflexGrad, a dual-process architecture for within-episode failure recovery in LLM agents without demonstrations. When agents commit to a wrong approach early and exhaust the step budget, the post-failure trajec…

arXiv cs.AI TIER_1 English(EN) · Tianshi Xu, Huifeng Wen, Meng Li · 2026-05-28 04:00

Adapt the Interface, Not the Model: Runtime Adapters for Deterministic LLM Agents

arXiv:2605.22166v2 Announce Type: replace Abstract: LLM agents are shaped not only by their language models, but also by the runtime harness that mediates observation, tool use, action execution, feedback interpretation, and trajectory control. While existing agent adaptation met…

arXiv cs.AI TIER_1 English(EN) · Gioele Molinari, Florian Felten, Soheyl Massoudi, Mark Fuge · 2026-05-28 04:00

EngiAI: A Multi-Agent Framework and Benchmark Suite for LLM-Driven Engineering Design

arXiv:2605.19743v2 Announce Type: replace Abstract: Large Language Model (LLM) agents are increasingly applied to engineering design tasks, yet existing evaluation frameworks do not adequately address multi-agent systems that combine simulation, retrieval, and manufacturing prepa…

arXiv cs.AI TIER_1 English(EN) · Hanqing Yang, Narjes Nourzad, Shiyu Chen, Marie Siew, Jingdi Chen, Carlee Joe-Wong · 2026-05-28 04:00

COOP$^2$: Defining, Observing, and Repairing Cooperation in LLM Multi-Agent Systems

arXiv:2603.00349v2 Announce Type: replace Abstract: Many complex tasks require extended effort, diverse capabilities, or coordinated actions beyond what a single agent can provide. However, simply adding more agents does not guarantee better performance, as effective cooperation …

arXiv cs.AI TIER_1 English(EN) · Hanqing Yang, Hyungwoo Lee, Yuhang Yao, Zhiwei Liu, Kay Liu, Jingdi Chen, Carlee Joe-Wong · 2026-05-28 04:00

DIG to Heal: Scaling General-Purpose Agent Collaboration via Explainable Dynamic Decision Paths

arXiv:2603.00309v2 Announce Type: replace Abstract: The increasingly popular agentic AI paradigm promises to harness the power of multiple, general-purpose large language model (LLM) agents to collaboratively complete complex tasks. While many agentic AI systems reduce complexity…

arXiv cs.AI TIER_1 English(EN) · Tommaso Castellani, Naimeng Ye, Daksh Mittal, Thomson Yen, Emmanouil Koukoumidis, William Zeng, Hongseok Namkoong · 2026-05-28 04:00

SynthTools: A Framework for Scaling Synthetic Tools for Agent Development

arXiv:2511.09572v2 Announce Type: replace Abstract: For agentic systems to use external tools to solve complex, long-horizon tasks, we need a large set of diverse and controllable tool-use environments. We introduce SynthTools, a fully LLM-based pipeline spanning the entire lifec…

arXiv cs.AI TIER_1 English(EN) · Suji Kim, Kangsan Kim, Sung Ju Hwang · 2026-05-28 04:00

Learning from Weaknesses: Automated Domain Specialization for Small Computer-Use Agents

arXiv:2605.28775v1 Announce Type: cross Abstract: Computer-use agents (CUAs) have recently made substantial progress, but deploying a separate large expert for each software domain remains expensive. Small open computer-use agents are more practical specialization targets, but th…

arXiv cs.AI TIER_1 English(EN) · Luca Beurer-Kellner, Aleksei Kudrinskii, Marco Milanta, Kristian Bonde Nielsen, Hemang Sarkar, Liran Tal · 2026-05-28 04:00

Technical Report: Exploring Emerging Threats in the Agent Skill Ecosystem

arXiv:2605.28588v1 Announce Type: cross Abstract: We analyzed 3,984 AI agent skills from major marketplaces and found 76 confirmed malicious payloads, including credential theft, backdoor installation, and data exfiltration. 13.4% of all skills contain at least one critical-level…

arXiv cs.AI TIER_1 English(EN) · Yubin Qu, Yi Liu, Gelei Deng, Yanjun Zhang, Yuekang Li, Ying Zhang, Leo Yu Zhang · 2026-05-28 04:00

SNARE: Adaptive Scenario Synthesis for Eliciting Overeager Behavior in Coding Agents

arXiv:2605.28122v1 Announce Type: cross Abstract: A coding agent executes a benign task as a sequence of shell, file, and network actions, any of which can quietly exceed the authorized scope while the task still completes. We call this overeager behavior: the prompt is not adver…

arXiv cs.AI TIER_1 English(EN) · Swanand Rao · 2026-05-28 04:00

Tool Forge: A Verification-Carrying Toolchain for Governed Agent Execution

arXiv:2605.28000v1 Announce Type: cross Abstract: Large language model agents are increasingly expected to perform operational work: calling APIs, manipulating files, assembling workflows, and acting inside enterprise systems. Yet the tool layer on which this execution depends is…

arXiv cs.AI TIER_1 English(EN) · Cheng Qian, Jiayu Liu, Heng Ji · 2026-05-28 04:00

UserHarness: Harnessing User Thinking to Build Stronger Agent Theory of Mind

arXiv:2605.27721v1 Announce Type: cross Abstract: Understanding what a user believes and intends is central to building effective agent assistants. This ability is often evaluated through Theory-of-Mind (ToM) tasks, where success requires reasoning from the user's perspective. Ho…

arXiv cs.AI TIER_1 English(EN) · Yipeng Ouyang, Xin Huang, Bingjie Liu, Zhongchun Zheng, Yuhao Gu, Xianwei Zhang · 2026-05-28 04:00

Benchmarks Are Not Enough: RAMP for Runtime Evaluation of Agentic Models in Production Systems

arXiv:2605.27492v1 Announce Type: cross Abstract: LLM agents are rapidly evolving from coding assistants into autonomous software engineering systems. However, existing evaluation methodologies remain largely centered on static, isolated, and short-horizon benchmarks that fail to…