PulseAugur · research

New KellyBench benchmark reveals AI models fail at sports betting markets

Researchers have introduced KellyBench, a new benchmark designed to evaluate the long-horizon sequential decision-making capabilities of language models in dynamic environments. The benchmark simulates sports betting markets, specifically the English Premier League, challenging agents to maximize bankroll growth using historical data and public odds. In initial evaluations even advanced models struggled: the best performer lost 8% on average, and many went bankrupt entirely, indicating a significant gap compared to human expert strategies.
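The benchmark's name presumably references the Kelly criterion, the classic rule for sizing bets to maximize long-run logarithmic bankroll growth. The summary does not describe the paper's actual scoring or agent interface, so the following is only an illustrative sketch of the staking rule the name evokes: given a win probability and decimal odds, the Kelly fraction is the share of bankroll to stake.

```python
def kelly_fraction(p: float, decimal_odds: float) -> float:
    """Kelly-optimal fraction of bankroll to stake on a bet.

    p            -- estimated probability of winning
    decimal_odds -- total payout per unit staked (stake included)
    """
    b = decimal_odds - 1.0          # net odds received on a win
    f = (p * b - (1.0 - p)) / b     # edge divided by net odds
    return max(f, 0.0)              # never stake on a negative-edge bet


def simulate_bankroll(bankroll: float, bets) -> float:
    """Apply Kelly staking over a sequence of (p, decimal_odds, won) bets."""
    for p, odds, won in bets:
        stake = bankroll * kelly_fraction(p, odds)
        bankroll += stake * (odds - 1.0) if won else -stake
    return bankroll


# A 55% win probability at even money (decimal odds 2.0) implies staking
# 10% of bankroll; a negative-edge bet implies staking nothing.
print(kelly_fraction(0.55, 2.0))  # 0.1
print(kelly_fraction(0.40, 2.0))  # 0.0
```

An agent that overestimates its win probabilities will systematically overstake and, under repeated betting, can lose its bankroll even when individual bets look favorable, which is one plausible reading of the ruin observed in the evaluations.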

Summary written by gemini-2.5-flash-lite from 2 sources.

IMPACT Highlights limitations of current frontier models in complex, dynamic environments, suggesting a need for improved adaptive strategies.

RANK_REASON The cluster describes a new academic benchmark for evaluating AI models.

Read on arXiv cs.AI →

COVERAGE [2]

  1. arXiv cs.AI TIER_1 · Thomas Grady, Kip Parker, Iliyan Zarov, Henry Course, Chengxi Taylor, Ross Taylor

    KellyBench: A Benchmark for Long-Horizon Sequential Decision Making

    arXiv:2604.27865v1. Abstract: Language models are saturating benchmarks for procedural tasks with narrow objectives. But they are increasingly being deployed in long-horizon, non-stationary environments with open-ended goals. In this paper we introduce KellyBenc…

  2. arXiv cs.AI TIER_1 · Ross Taylor ·

    KellyBench: A Benchmark for Long-Horizon Sequential Decision Making

    Language models are saturating benchmarks for procedural tasks with narrow objectives. But they are increasingly being deployed in long-horizon, non-stationary environments with open-ended goals. In this paper we introduce KellyBench, an environment for evaluating sequential deci…