New Benchmark Reveals LLM Reasoning Decay Under Depth Scaling

By PulseAugur Editorial · [1 source] · 2026-06-30 04:00

Researchers have introduced the Complexity Ceiling Benchmark (CCB), which measures how language models' sequential reasoning degrades as task depth increases. Across six thousand trials on five frontier and open-weight LLMs, performance decayed geometrically and consistently as the number of sequential steps grew. Top models maintained high accuracy on spatial state-tracking and symbolic manipulation tasks up to 50 steps, but collapsed on transitive relational inference, where the best models' success rate fell to 50% at around 4.7 steps. The study also found that a significant portion of correct answers were reached through incorrect intermediate reasoning, and that the mean step at which reasoning first diverges predicts long-horizon reasoning performance better than model parameter count does.
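
To make the 4.7-step figure concrete, here is a minimal sketch of what a geometric decay would imply. It assumes the simplest form, accuracy(d) = p ** d, with a constant per-step success probability p; the summary does not state that the paper fits exactly this curve, and the function names are illustrative only.

```python
import math


def per_step_success(half_depth: float) -> float:
    """Per-step success probability p such that p ** half_depth == 0.5.

    Assumes accuracy(d) = p ** d, i.e. each step succeeds independently
    with the same probability. This is an illustrative model, not
    necessarily the one fitted in the paper.
    """
    return 0.5 ** (1.0 / half_depth)


def accuracy_at_depth(p: float, depth: float) -> float:
    """Expected end-to-end accuracy after `depth` sequential steps."""
    return p ** depth


def depth_at_accuracy(p: float, target: float) -> float:
    """Depth at which accuracy falls to `target` under the same model."""
    return math.log(target) / math.log(p)


if __name__ == "__main__":
    # 4.7 is the 50%-success depth reported for transitive relational inference.
    p = per_step_success(4.7)
    print(f"implied per-step success: {p:.3f}")
    for depth in (1, 5, 10, 20, 50):
        print(f"depth {depth:>2}: accuracy {accuracy_at_depth(p, depth):.4f}")
```

Under this assumption, a 50% depth of 4.7 steps corresponds to a per-step success probability of roughly 0.86, which shows how a per-step error rate that looks modest compounds into a collapse over longer chains.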

IMPACT This benchmark provides a new method for evaluating LLM reasoning depth, potentially guiding future model development towards more robust sequential processing.

RANK_REASON This is a research paper introducing a new benchmark for evaluating LLM capabilities.

AI-generated summary · Google Gemini · from 1 source.

COVERAGE [1]

arXiv cs.AI TIER_1 English(EN) · Shubh Chapra, Dhruv Kumar, Murari Mandal, Yash Sinha · 2026-06-30 04:00

The Complexity Ceiling Benchmark: A Multi-Domain Evaluation of Sequential Reasoning Under Depth Scaling

arXiv:2606.29278v1 Announce Type: new Abstract: We introduce the Complexity Ceiling Benchmark (CCB), a controlled evaluation of how language-model reasoning decays as the number of required sequential steps grows. CCB fixes the semantic content of a task and varies only its depth…

