Researchers have introduced the Complexity Ceiling Benchmark (CCB) to evaluate how language models' sequential reasoning abilities degrade with increasing task depth. Across six thousand trials involving five frontier and open-weight LLMs, the benchmark revealed a consistent geometric decay in performance as the number of sequential steps increased. While top models maintained high accuracy on spatial state-tracking and symbolic manipulation tasks up to 50 steps, their performance collapsed on transitive relational inference tasks, where even the best models fell to a 50% success rate at around 4.7 steps. The study also found that a significant portion of correct answers were reached through incorrect intermediate reasoning, and that the mean step at which reasoning first diverges predicts long-horizon reasoning performance better than model parameter count does.
AI IMPACT This benchmark provides a new method for evaluating LLM reasoning depth, potentially guiding future model development towards more robust sequential processing.
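To make the "geometric decay" and "50% at around 4.7 steps" figures concrete, here is a minimal sketch. It assumes the simplest possible reading: every step succeeds independently with the same probability p, so success over n steps is p ** n. That model, and the function names below, are illustrative assumptions and not the benchmark's own method; only the 4.7-step figure comes from the summary.

```python
import math


def per_step_accuracy(half_success_steps: float) -> float:
    """Per-step success probability p such that p ** half_success_steps == 0.5.

    Assumes pure geometric decay (independent, identical steps).
    """
    if half_success_steps <= 0:
        raise ValueError("half_success_steps must be positive")
    return 0.5 ** (1.0 / half_success_steps)


def success_rate(p: float, steps: float) -> float:
    """Probability of completing `steps` sequential steps without an error."""
    return p ** steps


def half_success_depth(p: float) -> float:
    """Depth at which the success rate drops to 50% for per-step accuracy p."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must be strictly between 0 and 1")
    return math.log(0.5) / math.log(p)


if __name__ == "__main__":
    # 4.7 is the reported 50% point for transitive relational inference.
    p = per_step_accuracy(4.7)
    print(f"implied per-step accuracy: {p:.3f}")  # roughly 0.86
    for n in (1, 5, 10, 20):
        print(f"steps={n:2d}  success={success_rate(p, n):.3f}")
```

Under this assumption, a 50% point at 4.7 steps corresponds to a per-step accuracy of only about 86%, which is why success falls off so quickly as chains get longer.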
RANK_REASON This is a research paper introducing a new benchmark for evaluating LLM capabilities.
AI-generated summary · Google Gemini · from 1 source.