I made Claude, GPT and Gemini predict the entire 2026 World Cup. Here's the experiment design.
The experiment benchmarked three leading LLMs (Claude Opus 4.8, GPT-5.2, and Gemini 3.1 Pro) on their ability to predict the 2026 World Cup. Each model was tested under three conditions: using only its internal knowledge, with access to web browsing, and with a standardized dataset of FIFA rankings and Elo ratings. The design aimed to isolate whether performance differences stemmed from the models' inherent knowledge or from their data retrieval and processing capabilities. The models' predictions were inconsistent depending on the information provided, with GPT-5.2 inventing football rules and Claude misinterpreting the schema documentation.
AI IMPACT: This experiment highlights LLM limitations in consistency and rule adherence, suggesting that complex predictive tasks need better prompt engineering and data handling.
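The three-model, three-condition design described above can be pictured as a simple run grid. The sketch below is only an illustration: the model names come from the summary, while the condition labels and the `build_runs` helper are hypothetical and not taken from the original experiment.

```python
from itertools import product

# Model names are taken from the summary above.
MODELS = ["Claude Opus 4.8", "GPT-5.2", "Gemini 3.1 Pro"]

# Hypothetical labels for the three test conditions.
CONDITIONS = ["internal_knowledge", "web_browsing", "standardized_dataset"]


def build_runs(models, conditions):
    """Return one run spec per (model, condition) pair."""
    return [{"model": m, "condition": c} for m, c in product(models, conditions)]


runs = build_runs(MODELS, CONDITIONS)
for run in runs:
    print(f"{run['model']:<16} {run['condition']}")
```

Crossing every model with every condition gives nine runs, so predictions can be compared within a single model across conditions as well as between models under the same condition.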