ORAgentBench: Can LLM Agents Solve Challenging Operations Research Tasks End to End?
ORAgentBench is a new benchmark that evaluates whether large language model (LLM) agents can carry out complex operations research (OR) tasks end to end. It comprises 107 human-reviewed tasks spanning a range of operational scenarios, each with its own data, configuration, and submission requirements. Initial experiments with fourteen frontier agent-model configurations showed that current LLM agents are not yet reliable for practical OR work: the best-performing agent succeeded on only 35.51% of tasks, and agents struggled with both feasibility and solution quality.
AI IMPACT: Highlights the current limitations of LLM agents in complex, real-world problem-solving domains such as operations research.