TOOL · arXiv cs.AI · English (EN) · 7h

MBABench: Evaluating LLM Agents on End-to-End Spreadsheet Tasks in Finance

Researchers have introduced MBABench, a benchmark that evaluates large language model (LLM) agents on complex, end-to-end spreadsheet tasks relevant to the finance industry. Agents must create complete spreadsheets for financial modeling, forecasting, and scenario analysis, and the benchmark assesses their output on accuracy, formula quality, and formatting. Anthropic's Claude family of models performed best, but even the top-performing agents struggled to meet professional finance standards consistently, particularly as task complexity increased. The results indicate that current LLM agents are not yet ready for demanding real-world financial workflows.

IMPACT Highlights the limitations of current LLM agents on complex financial tasks, suggesting that further development is needed before widespread enterprise adoption in finance.

Anthropic
Claude
LLM agents
MBABench