New KellyBench benchmark reveals AI models fail sports betting markets

By PulseAugur Editorial · Summary by gemini-2.5-flash-lite from 2 sources

Researchers have introduced KellyBench, a new benchmark designed to evaluate the long-horizon sequential decision-making capabilities of language models in dynamic environments. The benchmark simulates sports betting markets, specifically the English Premier League, challenging agents to maximize bankroll growth using historical data and public odds. Initial evaluations showed that even advanced models struggled, with the best performer losing 8% on average and many experiencing financial ruin, indicating a significant gap compared to human expert strategies.
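The benchmark's name presumably alludes to the Kelly criterion, the classic rule for sizing bets to maximize long-run bankroll growth. The sketch below is not taken from the paper; it assumes a single binary bet at decimal odds with a known win probability, and the function names `kelly_fraction` and `expected_log_growth` are hypothetical. It illustrates why stake sizing matters: betting more than the Kelly fraction can turn a favorable bet into negative expected growth.

```python
import math


def kelly_fraction(p_win: float, decimal_odds: float) -> float:
    """Kelly stake as a fraction of bankroll for one binary bet.

    Assumes p_win is the bettor's estimated win probability and
    decimal_odds is the total payout per unit staked (stake included).
    Returns 0.0 when the bet has no positive edge.
    """
    b = decimal_odds - 1.0  # net profit per unit staked on a win
    if b <= 0:
        return 0.0
    f = (b * p_win - (1.0 - p_win)) / b
    return max(f, 0.0)


def expected_log_growth(p_win: float, decimal_odds: float, fraction: float) -> float:
    """Expected log growth of the bankroll per bet at a given stake fraction.

    Requires 0 <= fraction < 1 so that a loss never wipes out the bankroll.
    """
    b = decimal_odds - 1.0
    win_term = p_win * math.log(1.0 + fraction * b)
    loss_term = (1.0 - p_win) * math.log(1.0 - fraction)
    return win_term + loss_term


if __name__ == "__main__":
    # Illustrative numbers only, not results from the benchmark.
    p, odds = 0.55, 2.0
    f_star = kelly_fraction(p, odds)
    print("Kelly fraction:", round(f_star, 4))
    print("Growth at Kelly stake:", expected_log_growth(p, odds, f_star))
    print("Growth at 3x Kelly stake:", expected_log_growth(p, odds, 3 * f_star))
```

In practice the win probability is itself an estimate, so an agent in a setting like this would also need to account for model error and changing conditions across a season; this sketch ignores both.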


IMPACT Highlights limitations of current frontier models in complex, dynamic environments, suggesting a need for improved adaptive strategies.

RANK_REASON The cluster describes a new academic benchmark for evaluating AI models.


COVERAGE [2]

arXiv cs.AI TIER_1 · Thomas Grady, Kip Parker, Iliyan Zarov, Henry Course, Chengxi Taylor, Ross Taylor · 2026-05-01 04:00

KellyBench: A Benchmark for Long-Horizon Sequential Decision Making

arXiv:2604.27865v1 Announce Type: new Abstract: Language models are saturating benchmarks for procedural tasks with narrow objectives. But they are increasingly being deployed in long-horizon, non-stationary environments with open-ended goals. In this paper we introduce KellyBenc…

arXiv cs.AI TIER_1 · Ross Taylor · 2026-04-30 13:47

KellyBench: A Benchmark for Long-Horizon Sequential Decision Making

Language models are saturating benchmarks for procedural tasks with narrow objectives. But they are increasingly being deployed in long-horizon, non-stationary environments with open-ended goals. In this paper we introduce KellyBench, an environment for evaluating sequential deci…
