An experiment benchmarked three leading LLMs (Claude Opus 4.8, GPT-5.2, and Gemini 3.1 Pro) on their ability to predict the 2026 World Cup. Each model was tested under three conditions: using only its internal knowledge, with access to web browsing, and with a standardized dataset of FIFA rankings and Elo ratings. The design aimed to isolate whether performance differences stemmed from the models' inherent knowledge or from their data retrieval and processing capabilities. The models' predictions were inconsistent across conditions, varying with the information provided; GPT-5.2 showed peculiar behavior such as inventing football rules, and Claude misinterpreted schema documentation.
IMPACT This experiment highlights LLM limitations in consistency and adherence to rules, suggesting a need for improved prompt engineering and data handling for complex predictive tasks.
RANK_REASON The cluster describes an experiment comparing LLM performance on a specific task, including methodology and observed behaviors, which aligns with research reporting.
AI-generated summary · Google Gemini · from 1 source.