ORAgentBench is a new benchmark for evaluating how well large language model (LLM) agents perform complex operations research (OR) tasks. It comprises 107 human-reviewed tasks spanning a range of operational scenarios, each with its own data, configurations, and submission requirements. Initial experiments with fourteen frontier agent-model configurations show that current LLM agents are not yet reliable for practical OR work: the best-performing agent succeeded on only 35.51% of tasks and struggled with feasibility and solution quality.
AI IMPACT Highlights the current limitations of LLM agents in complex, real-world problem-solving domains such as operations research.
RANK_REASON Publication of a new academic benchmark paper on arXiv.
AI-generated summary · Google Gemini · from 1 source.