Is Capability a Liability? More Capable Language Models Make Worse Forecasts When It Matters Most
A new research paper introduces ForecastBench-Sim (FBSim), a benchmark that evaluates language models on forecasting tasks involving superlinear growth and the risk of regime change. The study finds that more capable language models, including Llama-3.1, tend to produce worse distributional forecasts on these problems. This inverse scaling effect, in which accuracy falls as capability rises in certain scenarios, was observed on simulated epidemics and on real-world data from finance and public health.
AI IMPACT: Highlights a potential limitation in LLM forecasting and suggests that current evaluation metrics may mask performance problems in high-risk scenarios.