A new research paper introduces ForecastBench-Sim (FBSim), a benchmark designed to evaluate language models on forecasting tasks with superlinear growth and regime-change risk. The study found that more capable language models, including Llama-3.1, tend to produce worse distributional forecasts on these specific types of problems. This inverse scaling effect, where increased capability leads to decreased accuracy in certain scenarios, was observed across simulated epidemics and real-world data from finance and public health.
AI IMPACT Highlights a potential limitation in LLM forecasting capabilities, suggesting current evaluation metrics may mask performance issues in high-risk scenarios.
RANK_REASON The cluster contains a new academic paper detailing a novel benchmark and findings about LLM performance.
AI-generated summary · Google Gemini · from 2 sources.