LambdaPO: A Lambda Style Policy Optimization for Reasoning Language Models
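New research examines LLM reasoning, instruction following, and self-correction
By the PulseAugur editorial team · [8 sources]
Several recent research papers examine the internal mechanisms and reasoning abilities of large reasoning models (LRMs). One, since withdrawn, proposes Entropy-Gradient Inversion and a related optimization technique (CorR-PO), which aims to improve reasoning by relating token entropy to logit gradients. Another withdrawn paper, LambdaPO, aims to strengthen reinforcement learning alignment by rethinking advantage estimation to obtain finer-grained preference signals. A third paper introduces Convex Compositional Energy Minimization (CCEM) to address non-convexity in compositional reasoning models, allowing them to transfer to larger problem instances. Finally, a study of "hidden critique ability" in LRMs identifies a "critique vector" that improves error detection and self-correction without additional training.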
AI
arXiv:2605.17770v2 Announce Type: replace Abstract: The advancement of Large Reasoning Models (LRMs) has catalyzed a paradigm shift from reactive "fast thinking" text generation to systematic, step-by-step "slow thinking" reasoning, unlocking state-of-the-art performance in c…
arXiv:2605.19416v2 Announce Type: replace Abstract: Group Relative Policy Optimization (GRPO) has become a cornerstone of modern reinforcement learning alignment, prized for its efficacy in foregoing an explicit value-critic by leveraging reward normalization across sampled trajec…
arXiv cs.LG
Meir Roketlishvili, Semyon Semenov, Maksim Bobrin, Viktor Kovalchuk, Albert Baichorov, Abduragim Shtanchaev, Fakhri Karray, Dmitry V. Dylov, Martin Takáč, Arip Asadulaev
arXiv:2605.23395v1 Announce Type: new Abstract: Compositional energy-based models can generalize to larger combinatorial reasoning problems by reusing a learned factor energy across many local constraints. In our paper, we show that a key bottleneck in compositional reasoning is …
arXiv cs.LG
Hoang Phan, Quang H. Nguyen, Hung T. Q. Le, Xiusi Chen, Heng Ji, Khoa D. Doan
arXiv:2603.16331v2 Announce Type: replace Abstract: Large Reasoning Models (LRMs) exhibit backtracking and self-verification mechanisms that enable them to revise intermediate steps and reach correct solutions, yielding strong performance on complex logical benchmarks. We hypothe…
Compositional energy-based models can generalize to larger combinatorial reasoning problems by reusing a learned factor energy across many local constraints. In our paper, we show that a key bottleneck in compositional reasoning is not composition itself, but the non-convex geome…
Group Relative Policy Optimization (GRPO) has become a cornerstone of modern reinforcement learning alignment, prized for its efficacy in foregoing an explicit value-critic by leveraging reward normalization across sampled trajectory cohorts. However, the method's reliance on a mo…
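For readers unfamiliar with the reward normalization that the GRPO abstract refers to, the following is a minimal sketch of the commonly described group-relative advantage: each sampled response's reward is standardized against the other responses drawn for the same prompt, so no learned value critic is needed. It illustrates standard GRPO only, not LambdaPO's modified advantage estimation, which the truncated abstract does not specify. The function name `group_relative_advantages`, the `eps` stabilizer, and the use of the population standard deviation are assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of responses to the same prompt.

    Hypothetical helper for illustration: the advantage of response i is
    (r_i - group mean) / (group std + eps). The population standard
    deviation and the eps value are assumptions, not taken from the papers.
    """
    if not rewards:
        return []
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps keeps the division finite when every reward in the group is equal.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled responses to one prompt, scored 1.0 (correct) or 0.0 (wrong).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

In this sketch, correct responses receive positive advantages and incorrect ones negative advantages of equal magnitude. When all rewards in a group are identical, every advantage is zero and the group contributes no learning signal.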