RetailBench: Benchmarking long-horizon reasoning and coherent decision-making of LLM agents in realistic retail environments
Researchers have introduced RetailBench, a benchmark for evaluating the long-horizon reasoning and decision-making of LLM agents in realistic retail environments. The benchmark simulates supermarket operations over extended periods, requiring agents to manage aspects such as pricing, inventory, and customer feedback. Evaluations of seven LLMs showed significant variation in performance: only a few survived the full simulation horizon, and all fell short of an oracle policy in net worth and sales.
AI IMPACT: This benchmark could help researchers develop more capable LLM agents for complex, long-term tasks.