Researchers from the Max Planck Institute have introduced FutureSim, a new benchmark that evaluates AI agents' ability to forecast real-world events using only historical web data. By withholding all information from after the prediction date, the benchmark simulates a realistic forecasting scenario. In early tests, models like GPT-5.5 running in the Codex harness performed strongly on some prediction markets, such as the Super Bowl, but struggled with others, like UK elections and the Grammys, indicating that forecasting capabilities remain narrow.
Summary written by gemini-2.5-flash-lite from 1 source.
IMPACT Tests AI agents' ability to forecast events using historical data, showing that capabilities remain narrow beyond trivia-style questions.
RANK_REASON The cluster describes the release of a new academic benchmark for evaluating AI agents.