PulseAugur

More capable LLMs make worse forecasts on specific risk-heavy tasks

A new research paper introduces ForecastBench-Sim (FBSim), a benchmark that evaluates language models on forecasting tasks whose time series show superlinear growth and carry a tail risk of regime change. The study found that more capable language models, including Llama-3.1, tend to produce worse distributional forecasts on these problems. This inverse scaling effect, in which greater capability comes with lower forecast quality, was observed across simulated epidemics and real-world data from finance and public health.
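To make "distributional forecast" and "regime change" concrete, here is a minimal Python sketch. It is not the paper's method: this summary does not say how FBSim scores forecasts, so the pinball (quantile) loss, the exponential toy series, and both hypothetical forecasters (`narrow`, `wide`) are assumptions chosen purely for illustration.

```python
import math

def pinball_loss(y, q_pred, tau):
    """Pinball (quantile) loss of a predicted tau-quantile q_pred against outcome y."""
    diff = y - q_pred
    return max(tau * diff, (tau - 1.0) * diff)

def mean_pinball(y, quantile_forecast):
    """Average pinball loss over a forecast given as {tau: predicted quantile}."""
    losses = [pinball_loss(y, q, tau) for tau, q in quantile_forecast.items()]
    return sum(losses) / len(losses)

def simulate_series(n_steps, growth=0.3, break_at=None, start=10.0):
    """Toy series (assumption): exponential growth that flips to decay from step break_at."""
    values = [start]
    for t in range(1, n_steps):
        rate = growth if break_at is None or t < break_at else -growth
        values.append(values[-1] * math.exp(rate))
    return values

# Ten observed steps of growth, then a forecast five steps ahead.
history = simulate_series(10)
trend_point = history[-1] * math.exp(0.3 * 5)  # pure trend extrapolation

# Hypothetical forecasters: one trusts the trend tightly, one hedges on a collapse.
narrow = {0.1: 0.9 * trend_point, 0.5: trend_point, 0.9: 1.1 * trend_point}
wide = {0.1: 0.2 * trend_point, 0.5: 0.8 * trend_point, 0.9: 1.2 * trend_point}

outcome_no_break = simulate_series(15)[-1]           # trend continues
outcome_break = simulate_series(15, break_at=10)[-1]  # regime change at step 10

for label, outcome in [("no break", outcome_no_break), ("break", outcome_break)]:
    print(label,
          "narrow:", round(mean_pinball(outcome, narrow), 2),
          "wide:", round(mean_pinball(outcome, wide), 2))
```

In this toy setup the narrow forecast scores better when the trend continues and worse when the regime changes, which is the kind of trade-off a distributional score can expose and a point-accuracy metric can hide.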

IMPACT Highlights a potential limitation in LLM forecasting capabilities, suggesting current evaluation metrics may mask performance issues in high-risk scenarios.

AI-generated summary · Google Gemini · from 2 sources.

COVERAGE [2]

  1. arXiv cs.AI TIER_1 English(EN) · Nick Merrill, Jaeho Lee, Ezra Karger

    Is Capability a Liability? More Capable Language Models Make Worse Forecasts When It Matters Most

    arXiv:2605.22672v2. Abstract: We document inverse scaling in LLMs on forecasting problems whose underlying time series exhibit superlinear growth and tail risk of regime change, a structure common in finance and epidemiology. On these tasks, more capable mod…

  2. arXiv cs.AI TIER_1 English(EN) · Ezra Karger

    Is Capability a Liability? More Capable Language Models Make Worse Forecasts When It Matters Most

    We document inverse scaling in LLMs on forecasting problems whose underlying time series exhibit superlinear growth and tail risk of regime change, a structure common in finance and epidemiology. On these tasks, more capable models produce worse distributional forecasts. The patt…