PulseAugur

New Benchmark Reveals LLM Reasoning Decay Under Depth Scaling

Researchers have introduced the Complexity Ceiling Benchmark (CCB) to measure how language models' sequential reasoning degrades as task depth increases. Across six thousand trials with five frontier and open-weight LLMs, performance decayed geometrically as the number of sequential steps grew. Top models maintained high accuracy on spatial state-tracking and symbolic manipulation tasks up to 50 steps, but collapsed on transitive relational inference, where the best models' success rate fell to 50% at around 4.7 steps. The study also found that a significant portion of correct answers were reached through incorrect intermediate reasoning, and that the mean step at which reasoning first diverges predicts long-horizon reasoning performance better than model parameter count.
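To make the "geometric decay" and "50% at around 4.7 steps" figures concrete, here is a minimal sketch. It assumes the simplest geometric model, in which every step succeeds independently with the same probability, so a chain of d steps succeeds with probability p^d. That model and the function names below are illustrative assumptions, not the paper's stated method.

```python
import math


def success_at_depth(per_step_accuracy: float, depth: float) -> float:
    """Chance that every one of `depth` steps succeeds.

    Assumption: steps are independent and equally reliable (p ** depth).
    """
    return per_step_accuracy ** depth


def depth_at_half_success(per_step_accuracy: float) -> float:
    """Depth at which the success rate falls to 50% under the same assumption."""
    return math.log(0.5) / math.log(per_step_accuracy)


def per_step_accuracy_from_half_depth(depth_50: float) -> float:
    """Invert the model: per-step accuracy implied by a given 50% depth."""
    return 0.5 ** (1.0 / depth_50)


# Per-step accuracy implied by a 50% success rate at about 4.7 steps.
p = per_step_accuracy_from_half_depth(4.7)
print(round(p, 3))  # 0.863
```

Under this assumed model, a 50% depth of 4.7 steps corresponds to roughly 86% accuracy per step, which shows how a modest per-step error rate compounds quickly over a chain of inferences.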

IMPACT This benchmark provides a new method for evaluating LLM reasoning depth, potentially guiding future model development towards more robust sequential processing.

RANK_REASON This is a research paper introducing a new benchmark for evaluating LLM capabilities.


AI-generated summary · Google Gemini · from 1 source.


COVERAGE [1]

  1. arXiv cs.AI TIER_1 English (EN) · Shubh Chapra, Dhruv Kumar, Murari Mandal, Yash Sinha

    The Complexity Ceiling Benchmark: A Multi-Domain Evaluation of Sequential Reasoning Under Depth Scaling

    arXiv:2606.29278v1. Abstract: We introduce the Complexity Ceiling Benchmark (CCB), a controlled evaluation of how language-model reasoning decays as the number of required sequential steps grows. CCB fixes the semantic content of a task and varies only its depth…