RetailBench: Benchmarking long horizon reasoning and coherent decision making of LLM agents in realistic retail environments

New benchmark RetailBench evaluates the long-horizon decision making of LLM agents

By the PulseAugur editorial team · [1 source] · 2026-06-16 04:00

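Researchers have introduced RetailBench, a new benchmark designed to evaluate how well large language model (LLM) agents sustain long-horizon reasoning and decision making in realistic retail environments. The benchmark simulates the operation of a supermarket over an extended period and requires the agent to manage pricing, inventory, customer feedback, and other aspects of the business. An evaluation of seven LLMs showed significant performance differences: only a few models completed the full simulation horizon, and the models fell short of the optimal strategy in both net worth and sales.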

Impact: The benchmark should help researchers develop more capable LLM agents for complex, long-horizon tasks.

Ranking rationale: The cluster contains an academic paper detailing a new benchmark for LLM agents.


AI-generated summary · Google Gemini · based on 1 source.

Sources [1]

arXiv cs.AI · Linghua Zhang, Jun Wang, Jingtong Wu, Zhisong Zhang · 2026-06-16 04:00

RetailBench: Benchmarking long horizon reasoning and coherent decision making of LLM agents in realistic retail environments

arXiv:2606.15862v1 Announce Type: new Abstract: Large language model (LLM) agents have made rapid progress on short-horizon, well-scoped tasks, yet their ability to sustain coherent decisions in dynamic long-horizon environments remains uncertain. We introduce RetailBench, a data…

