WorkflowPerturb: Calibrated Stress Tests for Evaluating Multi-Agent Workflow Metrics

New benchmark WorkflowPerturb stress-tests evaluation metrics for multi-agent LLM systems

By the PulseAugur editorial team · [1 source] · 2026-06-16 04:00

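Researchers have introduced WorkflowPerturb, a new benchmark designed to stress-test the evaluation metrics used for multi-agent LLM systems. The benchmark contains more than 4,900 golden workflows and nearly 45,000 perturbed variants, covering three types of change: missing steps, compressed steps, and description changes. The goal is to make metric scores better calibrated and easier to interpret, so that engineers can better judge whether a change is safe in production.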
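The three perturbation types named above can be illustrated with a minimal Python sketch. It is not the benchmark's actual generation procedure, which the summary does not describe. The `Step` type, the three helper functions, and the toy `step_overlap` metric (Jaccard overlap of step names) are all hypothetical names introduced here for illustration only.

```python
import random
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Step:
    """Hypothetical workflow step: a short name plus a free-text description."""
    name: str
    description: str


def drop_steps(workflow: List[Step], k: int, rng: random.Random) -> List[Step]:
    """Missing steps: remove k randomly chosen steps, preserving order."""
    if not 0 <= k <= len(workflow):
        raise ValueError("k must be between 0 and the workflow length")
    removed = set(rng.sample(range(len(workflow)), k))
    return [s for i, s in enumerate(workflow) if i not in removed]


def compress_steps(workflow: List[Step], start: int, length: int) -> List[Step]:
    """Compressed steps: merge a run of adjacent steps into a single step."""
    run = workflow[start:start + length]
    if len(run) < 2:
        raise ValueError("need at least two adjacent steps to compress")
    merged = Step(
        name="+".join(s.name for s in run),
        description=" ".join(s.description for s in run),
    )
    return workflow[:start] + [merged] + workflow[start + length:]


def reword_description(workflow: List[Step], index: int, text: str) -> List[Step]:
    """Description change: keep the structure, rewrite one description."""
    out = list(workflow)  # copy so the golden workflow is left untouched
    out[index] = Step(out[index].name, text)
    return out


def step_overlap(gold: List[Step], variant: List[Step]) -> float:
    """Toy metric (assumption): Jaccard overlap of step names."""
    a = {s.name for s in gold}
    b = {s.name for s in variant}
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)
```

Because this toy metric compares only step names, a reworded description leaves its score at 1.0, while dropping one of four steps lowers it to 0.75. A perturbation of known size paired with the metric's response is the kind of check a stress test like this relies on.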

Impact: Improves the evaluation of multi-agent LLM systems, supporting safe deployment and change management.

Ranking rationale: The cluster contains a research paper detailing a new benchmark for evaluating multi-agent LLM systems.


AI-generated summary · Google Gemini · from 1 source.

Sources [1]

arXiv cs.AI · Madhav Kanda, Sharad Agarwal, Rodrigo Fonseca, Alok Gautam Kumbhare, Pedro Las-Casas · 2026-06-16 04:00

WorkflowPerturb: Calibrated Stress Tests for Evaluating Multi-Agent Workflow Metrics

arXiv:2602.17990v2 Announce Type: replace Abstract: Multi-agent LLM systems that generate structured workflows from natural-language requests are now deployed in production across cloud automation, DevOps, and enterprise process orchestration. Operating such systems exposes a rec…
