Researchers have developed a new metric called Agentic Success Rate (ASR) to evaluate the workflow fidelity of LLM-based agent systems in payment processes. Traditional metrics like Task Success Rate (TSR) and Agent Handoff F1-Score (HF1) fail to detect critical deviations, such as skipping confirmation checkpoints. The ASR metric, applied to 18 LLMs and over 90,000 payment tasks, revealed that models like GPT-4.1 could achieve perfect scores on existing metrics while still exhibiting workflow shortcuts, whereas GPT-5.2 demonstrated perfect ASR. This new evaluation method has been shown to significantly improve task success rates, especially in regulated domains. AI
影响 Introduces a more robust evaluation metric for LLM agents, crucial for reliable deployment in sensitive workflows like payments.
排序理由 Academic paper introducing a new evaluation metric for LLM-based agent systems.
- Agent Handoff F1-Score
- Agentic Success Rate
- GPT-4.1
- GPT-5.2
- Hierarchical Multi-Agent System for Payments
- LLM
- Task Success Rate
AI 生成摘要 · Google Gemini · 来自 2 个来源。 我们如何撰写摘要 →