PulseAugur

New research probes LLM reasoning, instruction following, and self-correction

Several recent papers examine the internal mechanisms and reasoning capabilities of Large Reasoning Models (LRMs). One paper, since withdrawn, proposed Entropy-Gradient Inversion and a related optimization technique (CorR-PO) that correlates token entropy with logit gradients to improve reasoning. Another withdrawn paper, LambdaPO, aimed to improve reinforcement learning alignment by reworking advantage estimation to yield finer-grained preference signals. A third paper introduced Convex Compositional Energy Minimization (CCEM) to address non-convexity in compositional reasoning models, enabling transfer to larger problem instances. A study of the "hidden critique ability" in LRMs identified a "critique vector" that can improve error detection and self-correction without additional training. Finally, the ReasonIF benchmark study found that frontier LRMs fail to follow reasoning instructions more than 75% of the time.

IMPACT New research explores methods to improve LLM reasoning, instruction following, and self-correction capabilities, potentially leading to more reliable and controllable AI systems.

RANK_REASON Multiple arXiv papers detailing new methods and analyses for large reasoning models.

AI-generated summary · Google Gemini · from 8 sources.

COVERAGE [8]

  1. arXiv cs.AI TIER_1 English(EN) · Junyao Yang, Chen Qian, Kun Wang, Linfeng Zhang, Quanshi Zhang, Yong Liu, Dongrui Liu ·

    Entropy-Gradient Inversion: Moving Toward Internal Mechanism of Large Reasoning Models

    arXiv:2605.17770v2 Announce Type: replace Abstract: The advancement of Large Reasoning Models (LRMs) has catalyzed a paradigm shift from reactive "fast thinking" text generation to systematic, step-by-step "slow thinking" reasoning, unlocking state-of-the-art performance in c…

  2. arXiv cs.CL TIER_1 English(EN) · Zhe Yuan, Yipeng Zhou, Jinghan Li, Xinyuan Chen, Bowen Deng, Zhiqian Chen, Liang Zhao ·

    LambdaPO: A Lambda Style Policy Optimization for Reasoning Language Models

    arXiv:2605.19416v2 Announce Type: replace Abstract: Group Relative Policy Optimization (GRPO) has become a cornerstone of modern reinforcement learning alignment, prized for its efficacy in foregoing an explicit value-critic by leveraging reward normalization across sampled trajec…

  3. arXiv cs.LG TIER_1 English(EN) · Meir Roketlishvili, Semyon Semenov, Maksim Bobrin, Viktor Kovalchuk, Albert Baichorov, Abduragim Shtanchaev, Fakhri Karray, Dmitry V. Dylov, Martin Takáč, Arip Asadulaev ·

    Convex Compositional Reasoning Models

    arXiv:2605.23395v1 Announce Type: new Abstract: Compositional energy-based models can generalize to larger combinatorial reasoning problems by reusing a learned factor energy across many local constraints. In our paper, we show that a key bottleneck in compositional reasoning is …

  4. arXiv cs.LG TIER_1 English(EN) · Hoang Phan, Quang H. Nguyen, Hung T. Q. Le, Xiusi Chen, Heng Ji, Khoa D. Doan ·

    Decoding the Critique Mechanism in Large Reasoning Models

    arXiv:2603.16331v2 Announce Type: replace Abstract: Large Reasoning Models (LRMs) exhibit backtracking and self-verification mechanisms that enable them to revise intermediate steps and reach correct solutions, yielding strong performance on complex logical benchmarks. We hypothe…

  5. arXiv cs.LG TIER_1 English(EN) · Arip Asadulaev ·

    Convex Compositional Reasoning Models

    Compositional energy-based models can generalize to larger combinatorial reasoning problems by reusing a learned factor energy across many local constraints. In our paper, we show that a key bottleneck in compositional reasoning is not composition itself, but the non-convex geome…

  6. Hugging Face Daily Papers TIER_1 English(EN) ·

    Equilibrium Reasoners: Learning Attractors Enables Scalable Reasoning

    Equilibrium Reasoners enable scalable reasoning through task-conditioned attractors that guide latent dynamical systems toward valid solutions, achieving significant accuracy improvements through iterative test-time computation.

  7. Hugging Face Daily Papers TIER_1 English(EN) ·

    LambdaPO: A Lambda Style Policy Optimization for Reasoning Language Models

    Group Relative Policy Optimization (GRPO) has become a cornerstone of modern reinforcement learning alignment, prized for its efficacy in foregoing an explicit value-critic by leveraging reward normalization across sampled trajectory cohorts. However, the method's reliance on a mo…

  8. Together AI blog TIER_1 English(EN) ·

    Large Reasoning Models Fail to Follow Instructions During Reasoning: A Benchmark Study

    ReasonIF finds frontier LRMs fail to follow reasoning instructions >75% of the time; introduces a benchmark across languages, formatting, and length.
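
The LambdaPO abstracts in items 2 and 7 describe GRPO as replacing an explicit value critic with reward normalization across a group of sampled trajectories. A minimal Python sketch of that baseline normalization follows. It is not LambdaPO's method, which the truncated abstracts do not spell out. The function name `group_relative_advantages`, the `eps` stabilizer, and the use of the population standard deviation are assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Score each sampled trajectory relative to its own group.

    Hypothetical helper: subtracts the group mean reward and divides by the
    group standard deviation, so no learned value critic is needed.
    """
    mu = mean(rewards)
    # Population standard deviation is an assumption; implementations differ.
    sigma = pstdev(rewards)
    # eps avoids division by zero when every reward in the group is equal.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled answers to one prompt, scored 1.0 if correct, else 0.0.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Correct answers receive a positive advantage and incorrect ones a negative advantage of equal magnitude. When every answer in a group earns the same reward, all advantages are zero, so that group contributes no learning signal.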