Researchers have introduced KellyBench, a new benchmark that evaluates the long-horizon sequential decision-making of language models in dynamic environments. The benchmark simulates sports betting markets, specifically the English Premier League, and challenges agents to maximize bankroll growth using historical data and public odds. In initial evaluations even advanced models struggled: the best performer lost 8% on average and many models experienced financial ruin, indicating a significant gap relative to human expert strategies.
Summary written by gemini-2.5-flash-lite from 2 sources.
IMPACT: Highlights limitations of current frontier models in complex, dynamic environments, suggesting a need for improved adaptive strategies.
RANK_REASON: The cluster describes a new academic benchmark for evaluating AI models.