The Complexity Ceiling Benchmark: A Multi-Domain Evaluation of Sequential Reasoning Under Depth Scaling

New benchmark shows LLM reasoning decays as depth scales

By the PulseAugur editorial team · [1 source] · 2026-06-30 04:00

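Researchers have introduced the Complexity Ceiling Benchmark (CCB) to measure how language models' sequential reasoning degrades as task depth increases. Across 6,000 trials covering five frontier and open-source LLMs, the benchmark shows a consistent geometric decay in performance as the number of sequential steps grows. Top models stay highly accurate on spatial state-tracking and symbolic manipulation tasks at up to 50 steps, but performance drops sharply on transitive relational reasoning, where the best model reaches only a 50% success rate at about 4.7 steps. The study also finds that a substantial share of correct answers are reached through incorrect intermediate reasoning, and that the mean step at which reasoning first diverges predicts long-horizon reasoning performance better than parameter count does.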
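To make the two quantities in the summary concrete, here is a minimal Python sketch. It assumes a pure geometric model, accuracy(d) = p ** d, with a single per-step success rate p. This is a simplification for illustration, not the paper's fitting procedure. The function names (`per_step_accuracy`, `expected_accuracy`, `first_divergence`) and the representation of a reasoning trace as a list of step strings are hypothetical and do not come from the source.

```python
import math
from typing import Optional, Sequence


def per_step_accuracy(half_depth: float) -> float:
    """Per-step success rate p such that p ** half_depth == 0.5.

    Assumes accuracy decays geometrically with depth (illustrative model).
    """
    if half_depth <= 0:
        raise ValueError("half_depth must be positive")
    return 0.5 ** (1.0 / half_depth)


def expected_accuracy(p: float, depth: int) -> float:
    """Accuracy after `depth` sequential steps under accuracy(d) = p ** d."""
    return p ** depth


def first_divergence(trace: Sequence[str],
                     reference: Sequence[str]) -> Optional[int]:
    """1-based index of the first step where `trace` departs from `reference`.

    Returns None when the two traces are identical. A trace that is a strict
    prefix of the other diverges at the first missing step.
    """
    for i, (got, want) in enumerate(zip(trace, reference), start=1):
        if got != want:
            return i
    if len(trace) != len(reference):
        return min(len(trace), len(reference)) + 1
    return None


if __name__ == "__main__":
    # Half-depth of about 4.7 steps, as reported for the best model on
    # transitive relational reasoning.
    p = per_step_accuracy(4.7)
    print(f"implied per-step accuracy: {p:.3f}")
    for depth in (1, 5, 10, 20):
        print(depth, f"{expected_accuracy(p, depth):.3f}")

    # A trace can end on the right answer yet diverge earlier.
    reference = ["A>B", "B>C", "A>C"]
    trace = ["A>B", "C>B", "A>C"]
    print("first divergence at step:", first_divergence(trace, reference))
```

Under this assumed model, a 50% success rate at about 4.7 steps corresponds to a per-step accuracy of roughly 0.86, which shows how a seemingly high per-step rate still compounds into steep losses at depth. The `first_divergence` helper also flags the case the summary describes, where the final answer matches but an intermediate step does not.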

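Impact: The benchmark offers a new way to evaluate the reasoning depth of LLMs and may steer future model development toward more robust sequential processing.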

Ranking rationale: This is a research paper introducing a new benchmark of LLM capabilities.


AI-generated summary · Google Gemini · from 1 source.

Sources [1]

arXiv cs.AI · Shubh Chapra, Dhruv Kumar, Murari Mandal, Yash Sinha · 2026-06-30 04:00

The Complexity Ceiling Benchmark: A Multi-Domain Evaluation of Sequential Reasoning Under Depth Scaling

arXiv:2606.29278v1 Announce Type: new Abstract: We introduce the Complexity Ceiling Benchmark (CCB), a controlled evaluation of how language-model reasoning decays as the number of required sequential steps grows. CCB fixes the semantic content of a task and varies only its depth…
