Researchers have introduced OptiVerse, a new benchmark designed to evaluate Large Language Models (LLMs) on optimization problems that extend beyond traditional mathematical and combinatorial tasks. The benchmark includes 1,000 problems across domains like stochastic optimization and optimal control, with varying difficulty levels. In experiments, even advanced models such as GPT-5.2 and Gemini-3 struggled with the harder problems, with modeling and logic errors emerging as significant limitations. To address this, the researchers proposed a Dual-View Auditor Agent to improve the LLM's modeling accuracy.
Impact: Establishes a new evaluation standard for LLMs in complex optimization, potentially guiding future model development.
Ranking rationale: This is a research paper introducing a new benchmark for evaluating LLMs on optimization problems.
AI-generated summary · Google Gemini · from 1 source.