New research explores LLM reasoning, instruction following, and self-correction

By the PulseAugur editorial team · [8 sources] · 2025-10-22 00:00

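Several recent research papers examine the internal mechanisms and reasoning abilities of large reasoning models (LRMs). One paper, since withdrawn, proposes Entropy-Gradient Inversion and a related optimization technique (CorR-PO) that improves reasoning by relating token entropy to logit gradients. Another withdrawn paper, LambdaPO, aims to strengthen reinforcement learning alignment by rethinking advantage estimation to obtain finer-grained preference signals. A third paper introduces Convex Compositional Energy Minimization (CCEM) to address non-convexity in compositional reasoning models, allowing them to transfer to larger problem instances. Finally, a study of the "hidden critique ability" of LRMs identifies a "critique vector" that improves error detection and self-correction without additional training.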

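Impact: The new research explores ways to improve LLM reasoning, instruction following, and self-correction, which could lead to more reliable and more controllable AI systems.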

Ranking rationale: Multiple arXiv papers detail new methods and analyses for large reasoning models.

AI-generated summary · Google Gemini · from 8 sources.

Sources [8]

arXiv cs.AI TIER_1 English(EN) · Junyao Yang, Chen Qian, Kun Wang, Linfeng Zhang, Quanshi Zhang, Yong Liu, Dongrui Liu · 2026-05-25 04:00

Entropy-Gradient Inversion: Moving Toward Internal Mechanism of Large Reasoning Models

arXiv:2605.17770v2 · Announce Type: replace · Abstract: The advancement of Large Reasoning Models (LRMs) has catalyzed a paradigm shift from reactive "fast thinking" text generation to systematic, step-by-step "slow thinking" reasoning, unlocking state-of-the-art performance in c…

arXiv cs.CL TIER_1 English(EN) · Zhe Yuan, Yipeng Zhou, Jinghan Li, Xinyuan Chen, Bowen Deng, Zhiqian Chen, Liang Zhao · 2026-05-25 04:00

LambdaPO: A Lambda Style Policy Optimization for Reasoning Language Models

arXiv:2605.19416v2 · Announce Type: replace · Abstract: Group Relative Policy Optimization (GRPO) has become a cornerstone of modern reinforcement learning alignment, prized for its efficacy in foregoing an explicit value-critic by leveraging reward normalization across sampled trajec…

arXiv cs.LG TIER_1 English(EN) · Meir Roketlishvili, Semyon Semenov, Maksim Bobrin, Viktor Kovalchuk, Albert Baichorov, Abduragim Shtanchaev, Fakhri Karray, Dmitry V. Dylov, Martin Takáč, Arip Asadulaev · 2026-05-25 04:00

Convex Compositional Reasoning Models

arXiv:2605.23395v1 · Announce Type: new · Abstract: Compositional energy-based models can generalize to larger combinatorial reasoning problems by reusing a learned factor energy across many local constraints. In our paper, we show that a key bottleneck in compositional reasoning is …

arXiv cs.LG TIER_1 English(EN) · Hoang Phan, Quang H. Nguyen, Hung T. Q. Le, Xiusi Chen, Heng Ji, Khoa D. Doan · 2026-05-25 04:00

Decoding the Critique Mechanism in Large Reasoning Models

arXiv:2603.16331v2 · Announce Type: replace · Abstract: Large Reasoning Models (LRMs) exhibit backtracking and self-verification mechanisms that enable them to revise intermediate steps and reach correct solutions, yielding strong performance on complex logical benchmarks. We hypothe…

arXiv cs.LG TIER_1 English(EN) · Arip Asadulaev · 2026-05-22 09:04

Convex Compositional Reasoning Models

Compositional energy-based models can generalize to larger combinatorial reasoning problems by reusing a learned factor energy across many local constraints. In our paper, we show that a key bottleneck in compositional reasoning is not composition itself, but the non-convex geome…

Hugging Face Daily Papers TIER_1 English(EN) · 2026-05-20 00:00

Equilibrium Reasoners: Learning Attractors Enables Scalable Reasoning

Equilibrium Reasoners enable scalable reasoning through task-conditioned attractors that guide latent dynamical systems toward valid solutions, achieving significant accuracy improvements through iterative test-time computation.

Hugging Face Daily Papers TIER_1 English(EN) · 2026-05-19 06:10

LambdaPO: A Lambda Style Policy Optimization for Reasoning Language Models

Group Relative Policy Optimization (GRPO) has become a cornerstone of modern reinforcement learning alignment, prized for its efficacy in foregoing an explicit value-critic by leveraging reward normalization across sampled trajectory cohorts. However, the method's reliance on a mo…

Together AI blog TIER_1 English(EN) · 2025-10-22 00:00

Large Reasoning Models Fail to Follow Instructions During Reasoning: A Benchmark Study

ReasonIF finds frontier LRMs fail to follow reasoning instructions >75% of the time; introduces a benchmark across languages, formatting, and length.
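
Illustrative note: the LambdaPO abstract listed above describes GRPO as doing without an explicit value critic by normalizing rewards across a group of sampled trajectories. A minimal sketch of that baseline normalization follows, assuming scalar rewards for several completions of one prompt and a population standard deviation with a small epsilon. The function name and the reward values are hypothetical, and this is not LambdaPO's own advantage estimator, which the truncated abstract does not specify.

```python
from statistics import fmean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own sampled group (no learned critic).

    Assumption: population standard deviation plus a small epsilon;
    real implementations may differ in these details.
    """
    if not rewards:
        raise ValueError("rewards must be non-empty")
    mean = fmean(rewards)
    std = pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical rewards for four sampled completions of a single prompt.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

With binary rewards like these, every completion in the same outcome class receives the same advantage, and a group whose rewards are all equal yields zero advantages throughout.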
