
TOOL · arXiv cs.AI English(EN) · 8h

RetailBench: Benchmarking long horizon reasoning and coherent decision making of LLM agents in realistic retail environments

Researchers have introduced RetailBench, a benchmark for evaluating the long-horizon reasoning and decision-making of LLM agents in realistic retail environments. It simulates supermarket operations over extended periods, requiring agents to manage pricing, inventory, and customer feedback. Across the seven LLMs evaluated, performance varied widely: only a few survived the full simulation horizon, and all fell short of an oracle policy on net worth and sales.

IMPACT The benchmark gives researchers a testbed for developing LLM agents that stay capable on complex, long-horizon tasks.
