Brief · PulseAugur

TOOL · arXiv cs.AI · 8h

WorkflowPerturb: Calibrated Stress Tests for Evaluating Multi-Agent Workflow Metrics

Researchers have introduced WorkflowPerturb, a benchmark for stress-testing the metrics used to evaluate multi-agent LLM workflows. It contains over 4,900 golden workflows and nearly 45,000 perturbed variants spanning three perturbation types: missing steps, compressed steps, and description changes. The goal is to make metric scores better calibrated and easier to interpret, so that engineers can judge whether a workflow change is safe to ship to production.

IMPACT Improves evaluation of multi-agent LLM systems, aiding safe deployment and change management.
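
The three perturbation types named in the summary can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the brief does not describe how WorkflowPerturb represents workflows or generates variants, so the list-of-steps format, the function names (`drop_step`, `compress_steps`, `change_description`), and the sample steps are all hypothetical.

```python
from copy import deepcopy

# Hypothetical toy representation: a workflow is an ordered list of steps.
# This is NOT the benchmark's actual data format.
golden = [
    {"id": "s1", "description": "Fetch the customer record"},
    {"id": "s2", "description": "Validate the billing address"},
    {"id": "s3", "description": "Issue the refund"},
]


def drop_step(workflow, index):
    """Missing-step perturbation: remove the step at `index`."""
    perturbed = deepcopy(workflow)
    del perturbed[index]
    return perturbed


def compress_steps(workflow, index):
    """Compressed-step perturbation: merge step `index` with the next step."""
    perturbed = deepcopy(workflow)
    first, second = perturbed[index], perturbed[index + 1]
    merged = {
        "id": first["id"] + "+" + second["id"],
        "description": first["description"] + "; " + second["description"],
    }
    perturbed[index:index + 2] = [merged]
    return perturbed


def change_description(workflow, index, new_text):
    """Description-change perturbation: reword one step, keep the structure."""
    perturbed = deepcopy(workflow)
    perturbed[index]["description"] = new_text
    return perturbed


# One variant per perturbation type; the golden workflow is left untouched.
variants = {
    "missing": drop_step(golden, 1),
    "compressed": compress_steps(golden, 0),
    "description": change_description(golden, 2, "Send the refund"),
}

for name, workflow in variants.items():
    print(name, [step["id"] for step in workflow])
```

A metric under test would then score each variant against `golden`. Judging by the summary, the point of the benchmark is to check whether those scores move in a calibrated, interpretable way as the perturbation changes.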

WorkflowPerturb
Sharad Agarwal