Researchers from the Max Planck Institute have introduced FutureSim, a new benchmark that evaluates AI agents' ability to forecast real-world events using only historical web data. By withholding all information from after the prediction date, the benchmark simulates a realistic forecasting scenario. In early tests, models like GPT-5.5 running in the Codex harness performed strongly on some prediction markets, such as the Super Bowl, but struggled with others, like UK elections and the Grammys, indicating that forecasting capabilities remain narrow.
Summary written by gemini-2.5-flash-lite from 1 source.
IMPACT Tests AI agents' ability to forecast events using historical data, showing that capabilities remain narrow beyond trivia-style questions.
RANK_REASON The cluster describes the release of a new academic benchmark for evaluating AI agents.