New research probes LLM reasoning, instruction following, and self-correction
ByPulseAugur Editorial·[25 sources]·
Several recent research papers explore the internal mechanisms and reasoning capabilities of Large Reasoning Models (LRMs). One paper, since withdrawn, proposed Entropy-Gradient Inversion and a related optimization technique (CorR-PO) to correlate token entropy with logit gradients for improved reasoning. Another withdrawn paper, LambdaPO, aimed to enhance reinforcement learning alignment by re-conceptualizing advantage estimation for finer-grained preference signals. A third paper introduced Convex Compositional Energy Minimization (CCEM) to address non-convexity in compositional reasoning models, enabling transfer to larger problem instances. Finally, a study on the "hidden critique ability" in LRMs identified a "critique vector" that can improve error detection and self-correction without additional training.
AI
IMPACT
New research explores methods to improve LLM reasoning, instruction following, and self-correction capabilities, potentially leading to more reliable and controllable AI systems.
RANK_REASON
Multiple arXiv papers detailing new methods and analyses for large reasoning models.
arXiv:2511.20236v3 Announce Type: replace Abstract: Counterfactual explanations improve the actionable interpretability of machine learning models by identifying minimal changes required to achieve a desired outcome. However, existing methods often neglect dependencies among feat…
arXiv:2602.01914v2 Announce Type: replace Abstract: Token attribution methods provide intuitive explanations for language model outputs by identifying causally important input tokens. However, as modern LLMs increasingly rely on extended reasoning chains, existing schemes face tw…
arXiv:2601.14249v5 Announce Type: replace Abstract: Long chain-of-thought (CoT) trajectories provide rich supervision signals for distilling reasoning from teacher to student LLMs. However, both prior work and our experiments show that trajectories from stronger teachers do not n…
arXiv cs.CL
TIER_1English(EN)·Lisa Alazraki, Lihu Chen, Ana Brassard, Joe Stacey, Hossein A. Rahmani, Marek Rei·
arXiv:2508.19988v3 Announce Type: replace Abstract: Large Language Models (LLMs) have achieved high accuracy on complex commonsense and mathematical problems that involve the composition of multiple reasoning steps. However, current compositional benchmarks testing these skills t…
arXiv cs.CL
TIER_1English(EN)·Hui Xie, Jie Liu, Ziyue Qiao, Joaquin Vanschore·
arXiv:2605.25745v1 Announce Type: new Abstract: Explicit chain-of-thought (CoT) reasoning substantially improves the reasoning ability of large language models (LLMs), but incurs high inference cost due to lengthy autoregressive traces. Existing latent reasoning methods offer a p…
arXiv:2605.25443v1 Announce Type: new Abstract: Post-training has significantly enhanced the reasoning capability of Large Reasoning Models (LRMs), especially with Reinforcement Learning (RL) like Group Relative Policy Optimization (GRPO). However, GRPO-style RL methods in multi-…
arXiv cs.CL
TIER_1Norsk(NO)·Qihuang Zhong, Liang Ding, Juhua Liu, Bo Du, Leszek Rutkowski, Dacheng Tao·
arXiv:2605.24998v1 Announce Type: new Abstract: Self-improvement training enables the large reasoning models (LRMs) to improve themselves by self-generating reasoning trajectories as training data without external supervision. However, we find that this method often falls short i…
arXiv:2602.18956v3 Announce Type: replace Abstract: We introduce INDUCTION, a benchmark for finite structure concept synthesis in first order logic. Given small finite relational worlds with extensionally labeled target predicates, models must output a single first order logical …
arXiv cs.AI
TIER_1English(EN)·Andreas Opedal, Francesco Ignazio Re, Abulhair Saparov, Mrinmaya Sachan, Bernhard Sch\"olkopf, Ryan Cotterell·
arXiv:2605.24597v1 Announce Type: new Abstract: Many applications of large language models (LLMs) require deductive reasoning, yet models frequently produce incorrect or redundant inference steps. We frame natural language inference as a search problem where the final answer is t…
arXiv cs.AI
TIER_1English(EN)·Andrew Corbett, Archit Sood, Anna Tzatzopoulou, Sai-Aakash Ramesh, Tim Dodwell·
arXiv:2605.25230v1 Announce Type: new Abstract: Recent work on recursive architectures has shown that tiny neural networks can be surprisingly powerful on structured reasoning tasks. The trick is to model reasoning trajectories with a latent dynamical system. We argue that the in…
arXiv:2605.25354v1 Announce Type: new Abstract: While LLMs excel at reasoning over prompts using static pretrained knowledge, they struggle significantly with context learning-the ability to dynamically extract, internalize, and apply new knowledge from complex, task-specific con…
arXiv:2605.24873v1 Announce Type: cross Abstract: Despite the importance of causal reasoning, training LLMs to reason causally remains underexplored. Existing data efforts mostly focus on benchmarking LLMs on specific aspects of causality, making them less suitable for training g…
arXiv cs.AI
TIER_1English(EN)·Thomas A. Buckley, Riccardo Conci, Peter G. Brodeur, Jason Gusdorf, Sourik Beltr\'an, Bita Behrouzi, Byron Crowe, Jacob Dockterman, Muzzammil Muhammad, Sarah Ohnigian, Andrew Sanchez, James A. Diao, Aashna P. Shah, Daniel Restrepo, Eric S. Rosenberg, And…·
arXiv:2509.12194v2 Announce Type: replace Abstract: Differential diagnosis is an iterative process that integrates patient information with broader medical knowledge. Clinical case series such as the NEJM Clinicopathologic Conferences (CPCs), published continuously since 1923, fe…
arXiv:2511.15407v4 Announce Type: replace Abstract: Humans learn by observing, interacting with environments, and internalizing physics and causality. Here, we aim to ask whether an agent can similarly acquire human-like reasoning from interaction and keep improving with more exp…
Explicit chain-of-thought (CoT) reasoning substantially improves the reasoning ability of large language models (LLMs), but incurs high inference cost due to lengthy autoregressive traces. Existing latent reasoning methods offer a promising alternative, yet they often treat reaso…
arXiv:2605.17770v2 Announce Type: replace Abstract: The advancement of Large Reasoning Models (LRMs) has catalyzed a paradigm shift from reactive ``fast thinking'' text generation to systematic, step-by-step ``slow thinking'' reasoning, unlocking state-of-the-art performance in c…
arXiv cs.LG
TIER_1English(EN)·Hoang Phan, Quang H. Nguyen, Hung T. Q. Le, Xiusi Chen, Heng Ji, Khoa D. Doan·
arXiv:2603.16331v2 Announce Type: replace Abstract: Large Reasoning Models (LRMs) exhibit backtracking and self-verification mechanisms that enable them to revise intermediate steps and reach correct solutions, yielding strong performance on complex logical benchmarks. We hypothe…
arXiv cs.LG
TIER_1English(EN)·Meir Roketlishvili, Semyon Semenov, Maksim Bobrin, Viktor Kovalchuk, Albert Baichorov, Abduragim Shtanchaev, Fakhri Karray, Dmitry V. Dylov, Martin Tak\'a\v{c}, Arip Asadulaev·
arXiv:2605.23395v1 Announce Type: new Abstract: Compositional energy-based models can generalize to larger combinatorial reasoning problems by reusing a learned factor energy across many local constraints. In our paper, we show that a key bottleneck in compositional reasoning is …
arXiv:2605.19416v2 Announce Type: replace Abstract: Group Relative Policy Optimization(GRPO) has become a cornerstone of modern reinforcement learning alignment, prized for its efficacy in foregoing an explicit value-critic by leveraging reward normalization across sampled trajec…
Compositional energy-based models can generalize to larger combinatorial reasoning problems by reusing a learned factor energy across many local constraints. In our paper, we show that a key bottleneck in compositional reasoning is not composition itself, but the non-convex geome…
Large Reasoning Models demonstrate hidden critique abilities that allow error recovery through internal mechanisms, identified via interpretable critique vectors that enhance error detection without additional training.
Group Relative Policy Optimization(GRPO) has become a cornerstone of modern reinforcement learning alignment, prized for its efficacy in foregoing an explicit value-critic by leveraging reward normalization across sampled trajectory cohorts. However, the method's reliance on a mo…
arXiv cs.CV
TIER_1English(EN)·Fanhu Zeng, Zhicong Luo, Zefan Wang, You Li, Chi Chen, Maosong Sun·
arXiv:2605.25437v1 Announce Type: new Abstract: Visual reasoning through reinforcement learning with verifiable rewards (RLVR) has achieved remarkable progress. However, when dealing with multi-source inputs, existing approaches tend to treat them as a mere accumulation of inform…