WorkstreamBench: Evaluating LLM Agents on End-to-End Spreadsheet Tasks in Finance
A new research paper introduces WorkstreamBench, a benchmark designed to evaluate Large Language Model (LLM) agents on complex, end-to-end spreadsheet tasks relevant to the finance industry. The benchmark assesses agents across accuracy, formula correctness, and output formatting, aiming to measure their ability to produce professional-quality financial models and forecasts. While Anthropic's Claude family of models performed best, even the leading agents struggled with tasks beyond simple calculations and frequently failed to meet professional finance standards, indicating a gap between current LLM agent capabilities and real-world enterprise demands.
IMPACT: Highlights the limitations of current LLM agents on complex, real-world financial tasks and the need for further development of agent capabilities.