PulseAugur

New benchmark WorkflowPerturb stress-tests multi-agent LLM evaluation metrics

Researchers have introduced WorkflowPerturb, a benchmark for stress-testing the metrics used to evaluate multi-agent LLM systems. It includes over 4,900 golden workflows and nearly 45,000 perturbed variants covering three types of change: missing steps, compressed steps, and description changes. The goal is to improve the calibration and interpretability of metric scores, so that engineers can better assess whether a change is safe in a production environment.
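The three perturbation types named above can be illustrated with a small sketch. This is not the paper's generation procedure, which the summary does not describe; the `Step` type, the function names, and the example workflow are all hypothetical and serve only to show what each kind of change does to a golden workflow.

```python
from dataclasses import dataclass, replace
from typing import List


@dataclass(frozen=True)
class Step:
    """Hypothetical workflow step: a short name plus a free-text description."""
    name: str
    description: str


def drop_step(workflow: List[Step], index: int) -> List[Step]:
    """Missing-step perturbation: return a copy without the step at `index`."""
    return workflow[:index] + workflow[index + 1:]


def compress_steps(workflow: List[Step], index: int) -> List[Step]:
    """Compressed-step perturbation: merge step `index` with the step after it."""
    first, second = workflow[index], workflow[index + 1]
    merged = Step(
        name=f"{first.name}+{second.name}",
        description=f"{first.description}; {second.description}",
    )
    return workflow[:index] + [merged] + workflow[index + 2:]


def change_description(workflow: List[Step], index: int, text: str) -> List[Step]:
    """Description-change perturbation: reword one step, keep the structure."""
    changed = replace(workflow[index], description=text)
    return workflow[:index] + [changed] + workflow[index + 1:]


# Hypothetical golden workflow (assumed example, not taken from the benchmark).
golden = [
    Step("provision", "Create the virtual machine"),
    Step("configure", "Install and configure the web server"),
    Step("verify", "Run a health check against the endpoint"),
]

variants = {
    "missing": drop_step(golden, 1),
    "compressed": compress_steps(golden, 0),
    "description": change_description(golden, 2, "Ping the endpoint once"),
}
```

Each function returns a new list and leaves `golden` untouched, so one reference workflow can yield many variants. A metric under test would then score each variant against `golden`, and a well-calibrated metric should score them in a way that reflects how much was changed.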

IMPACT Improves evaluation of multi-agent LLM systems, aiding safe deployment and change management.

RANK_REASON The cluster contains a research paper detailing a new benchmark for evaluating multi-agent LLM systems.


AI-generated summary · Google Gemini · from 1 source.

COVERAGE [1]

  1. arXiv cs.AI TIER_1 English(EN) · Madhav Kanda, Sharad Agarwal, Rodrigo Fonseca, Alok Gautam Kumbhare, Pedro Las-Casas

    WorkflowPerturb: Calibrated Stress Tests for Evaluating Multi-Agent Workflow Metrics

    arXiv:2602.17990v2. Abstract: Multi-agent LLM systems that generate structured workflows from natural-language requests are now deployed in production across cloud automation, DevOps, and enterprise process orchestration. Operating such systems exposes a rec…