AI research explores hierarchical reasoning, counterfactuals, and efficient training methods · 10 sources…

arXiv cs.AI TIER_1 English(EN) · Gabriel Sarch, Linrong Cai, Qunzhong Wang, Haoyang Wu, Danqi Chen, Zhuang Liu · 2026-06-19 04:00

Vero: An Open RL Recipe for General Visual Reasoning

arXiv:2604.04917v3 Announce Type: replace-cross Abstract: What does it take to build a visual reasoner that works across charts, science, spatial understanding, and open-ended tasks? The strongest vision-language models (VLMs) suggest that broad visual reasoning is within reach, …

arXiv cs.AI TIER_1 English(EN) · Zhaoyang Zhang, Shuli Jiang, Yantao Shen, Yuting Zhang, Dhananjay Ram, Shuo Yang, Zhuowen Tu, Wei Xia, Stefano Soatto · 2026-06-19 04:00

Reinforcement-aware Knowledge Distillation for LLM Reasoning

arXiv:2602.22495v3 Announce Type: replace-cross Abstract: Reinforcement learning (RL) post-training has recently driven major gains in long chain-of-thought reasoning large language models (LLMs), but the high inference cost of such models motivates distillation into smaller stud…

arXiv cs.AI TIER_1 English(EN) · Hoang Phan, Xianjun Yang, Yuanshun Yao, Jingyu Zhang, Shengjie Bi, Xiaocheng Tang, Madian Khabsa, Lijuan Liu, Deren Lei · 2026-06-19 04:00

Beyond Reasoning Gains: Mitigating General-Capability Forgetting in Large Reasoning Models

arXiv:2510.21978v2 Announce Type: replace-cross Abstract: Reinforcement learning with verifiable rewards (RLVR) has delivered impressive gains in mathematical and multimodal reasoning and has become a standard post-training paradigm for contemporary language and vision-language m…

arXiv cs.AI TIER_1 English(EN) · Ya Wang, Adrian Paschke · 2026-06-19 04:00

Concept Flow Models: Anchoring Concept-Based Reasoning with Hierarchical Bottlenecks

arXiv:2606.19489v1 Announce Type: cross Abstract: Concept Bottleneck Models (CBMs) enhance interpretability by projecting learned features into a human-understandable concept space. Recent approaches leverage vision-language models to generate concept embeddings, reducing the nee…

arXiv cs.AI TIER_1 English(EN) · Saimun Habib, Vaishak Belle, Fengxiang He · 2026-06-19 04:00

DeepSWIP: Quotient-WMC Counterfactuals for Neural Probabilistic Logic Programs

arXiv:2606.20526v1 Announce Type: new Abstract: Neurosymbolic systems such as DeepProbLog combine neural perception with probabilistic logic, but standard inference is associational. Counterfactual reasoning additionally requires a causal semantics for interventions and evidence.…

arXiv cs.AI TIER_1 English(EN) · Fengxiang He · 2026-06-18 17:39

DeepSWIP: Quotient-WMC Counterfactuals for Neural Probabilistic Logic Programs

Neurosymbolic systems such as DeepProbLog combine neural perception with probabilistic logic, but standard inference is associational. Counterfactual reasoning additionally requires a causal semantics for interventions and evidence. We introduce DeepSWIP, a single-world counterfa…

arXiv cs.AI TIER_1 English(EN) · Gilad Yehudai, Clayton Sanford, Maya Bechler-Speicher, Orr Fischer, Ran Gilad-Bachrach, Amir Globerson · 2026-06-18 04:00

Depth-Width tradeoffs in Algorithmic Reasoning of Graph Tasks with Transformers

arXiv:2503.01805v3 Announce Type: replace-cross Abstract: Transformers have revolutionized the field of machine learning. In particular, they can be used to solve complex algorithmic problems, including graph-based tasks. In such algorithmic tasks a key question is what is the mi…

arXiv cs.AI TIER_1 English(EN) · Chenyu Zhou, Qiliang Jiang, Shuning Wu, Xu Zhou · 2026-06-18 04:00

Mechanism-Guided Selective Unlearning for RLVR-Induced Reasoning

arXiv:2606.19222v1 Announce Type: cross Abstract: We propose MAST (Mechanism-Aligned Selective Targeting), a mechanism-guided method for unlearning RLVR-induced reasoning with substantially lower collateral damage than standard full-parameter updates. In matched SFT/RLVR checkpoi…

arXiv cs.CL TIER_1 English(EN) · Zhuoran Li, Rui Xu, Jian Yang, Junnan Liu, Zhijun Chen, Qianren Mao, Hongcheng Guo, Jiaheng Liu, Likang Xiao, Ming Li, Xiaojie Wang · 2026-06-18 04:00

Enhancing Multilingual Reasoning via Steerable Model Merging

arXiv:2606.19002v1 Announce Type: new Abstract: Model merging is an effective technique for composing the capabilities of a multilingual model and a reasoning model. It has achieved promising generalization in multilingual reasoning tasks by aligning feature spaces of different m…

arXiv cs.CL TIER_1 English(EN) · Yuliang Zhan, Xinyu Tang, Jian Li, Dandan Zheng, Weilong Chai, Jingdong Chen, Jun Zhou, Ge Wu, Wenyue Tang, Hao Sun · 2026-06-18 04:00

GraphPO: Graph-based Policy Optimization for Reasoning Models

arXiv:2606.18954v1 Announce Type: new Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has become a standard paradigm for enhancing the capability of large reasoning models. RLVR typically samples responses independently and optimizes the policy using rewards from final an…

arXiv cs.CL TIER_1 English(EN) · Jihyung Park, Minchao Huang, Leqi Liu, Elias Stengel-Eskin · 2026-06-18 04:00

PragReST: Self-Reinforcing Counterfactual Reasoning for Pragmatic Language Understanding

arXiv:2606.18624v1 Announce Type: new Abstract: Natural language understanding often depends on meanings that are implied rather than explicitly stated, requiring pragmatic reasoning. Despite strong performance on math and logical reasoning, large language models (LLMs) still str…

Hugging Face Daily Papers TIER_1 English(EN) · 2026-06-17 15:59

Mechanism-Guided Selective Unlearning for RLVR-Induced Reasoning

We propose MAST (Mechanism-Aligned Selective Targeting), a mechanism-guided method for unlearning RLVR-induced reasoning with substantially lower collateral damage than standard full-parameter updates. In matched SFT/RLVR checkpoints on Qwen2.5-Math-1.5B and Qwen3-1.7B-Base, the …

arXiv cs.AI TIER_1 English(EN) · Xu Zhou · 2026-06-17 15:59

Mechanism-Guided Selective Unlearning for RLVR-Induced Reasoning

We propose MAST (Mechanism-Aligned Selective Targeting), a mechanism-guided method for unlearning RLVR-induced reasoning with substantially lower collateral damage than standard full-parameter updates. In matched SFT/RLVR checkpoints on Qwen2.5-Math-1.5B and Qwen3-1.7B-Base, the …

arXiv cs.CL TIER_1 English(EN) · Xiaojie Wang · 2026-06-17 12:28

Enhancing Multilingual Reasoning via Steerable Model Merging

Model merging is an effective technique for composing the capabilities of a multilingual model and a reasoning model. It has achieved promising generalization in multilingual reasoning tasks by aligning feature spaces of different models. However, the merged single model often fa…

arXiv cs.CL TIER_1 English(EN) · Hao Sun · 2026-06-17 11:37

GraphPO: Graph-based Policy Optimization for Reasoning Models

Reinforcement Learning with Verifiable Rewards (RLVR) has become a standard paradigm for enhancing the capability of large reasoning models. RLVR typically samples responses independently and optimizes the policy using rewards from final answers. This paradigm has two limitations. First,…

arXiv cs.AI TIER_1 English(EN) · Bihao Zhan, Zongsheng Cao, Jie Zhou, Bo Zhang, Liang He · 2026-06-17 04:00

FlowRAG: Synergizing Explicit Reasoning via Frequency-Aware Multi-Granularity Graph Flow

arXiv:2606.17856v1 Announce Type: new Abstract: Graph-based retrieval-augmented generation (GraphRAG) is effective for knowledge-intensive and multi-hop query tasks; however, many existing methods primarily seed entity-based graphs and rely on implicit semantic relevance propagat…

arXiv cs.AI TIER_1 English(EN) · Sajad Movahedi, Vera Milovanović, Shlomo Libo Feigin, Alexander Theus, Thomas Hofmann, Valentina Boeva, T. Konstantin Rusch, Antonio Orvieto · 2026-06-17 04:00

Fixed-Point Reasoners: Stable and Adaptive Deep Looped Transformers

arXiv:2606.18206v1 Announce Type: new Abstract: Looped architectures provide an inductive bias toward learning step-by-step procedures for tasks that require compositional reasoning. The number of effective layers reached by looping determines the quality of the solution these mo…

arXiv cs.AI TIER_1 English(EN) · Baishali Chaudhury, Mengdie Flora Wang, Hyunji Hayley Park, Rahul Ghosh, Sungmin Hong, Jae Oh Woo · 2026-06-17 04:00

Quantifying Consistency in LLM Logical Reasoning via Structural Uncertainty

arXiv:2606.17312v1 Announce Type: new Abstract: Large language models can arrive at the same answer through reasoning paths that are unstable, contradictory, or difficult to rank consistently, a failure mode especially prevalent in multi-step deductive reasoning. Existing metho…

arXiv cs.LG TIER_1 English(EN) · Chia-Hsuan Hsu, Jui-Ming Yao · 2026-06-17 04:00

Learning to Refine Hidden States for Reliable LLM Reasoning

arXiv:2606.17524v1 Announce Type: new Abstract: Large language models show strong reasoning ability, but their internal reasoning process can remain unstable in complex multi-step settings, where early hidden-state errors may propagate to incorrect predictions. We propose ReLAR, …

arXiv cs.CL TIER_1 English(EN) · Jinyang Wu, Guocheng Zhai, Ruihan Jin, Jiahao Yuan, Yuhao Shen, Shuai Zhang, Zhengqi Wen, Jianhua Tao · 2026-06-17 04:00

Atlas: Orchestrating Heterogeneous Models and Tools for Multi-Domain Complex Reasoning

arXiv:2601.03872v2 Announce Type: replace Abstract: The integration of large language models (LLMs) with external tools has significantly expanded the capabilities of AI agents. However, as the diversity of both LLMs and tools increases, selecting the optimal model-tool combinati…

arXiv cs.CL TIER_1 English(EN) · Aryasomayajula Ram Bharadwaj · 2026-06-17 04:00

Adaptive Activation Steering for Efficient LLM Reasoning via Closed-Loop PID Control

arXiv:2506.18831v3 Announce Type: replace Abstract: Reasoning LLMs trained with long chain-of-thought often overthink: they spend tokens on redundant reflection and transitions that inflate cost without improving accuracy. Static activation steering (e.g. SEAL) suppresses such c…

arXiv cs.CL TIER_1 English(EN) · Peixian Zhou, Yuxu Chen, Chaorui Zhang, Wei Han, Bo Bai, Xueyan Niu · 2026-06-17 04:00

ChLogic: Evaluating Robustness of Logical Reasoning in Chinese Expressions

arXiv:2606.17905v1 Announce Type: new Abstract: Large language models perform increasingly well on standardized logical reasoning benchmarks, but whether this ability remains robust beyond English is unclear. We introduce ChLogic, an English-Chinese aligned benchmark that tests …

arXiv cs.CL TIER_1 English(EN) · Zihao Wei, Wenjie Shi, Liang Pang, Jingcheng Deng, Shicheng Xu, Shasha Guo, Zenghao Duan, Jiahao Liu, Jingang Wang, Huawei Shen, Xueqi Cheng · 2026-06-17 04:00

Dynamic Rollout Editing for Reducing Overthinking in RL-Trained Reasoning Models

arXiv:2606.17890v1 Announce Type: new Abstract: Long-form chain-of-thought reasoning can improve LLM performance on complex tasks, but models often continue generating unnecessary reasoning after a correct answer has emerged. We refer to this behavior as overthinking. We study th…

arXiv cs.AI TIER_1 English(EN) · Arshad Beg, Diarmuid O'Donoghue, Rosemary Monahan · 2026-06-17 04:00

Learning-Infused Formal Reasoning: From Contract Synthesis to Artifact Reuse and Formal Semantics

arXiv:2602.02881v2 Announce Type: replace-cross Abstract: This paper articulates a long-term research vision for formal methods at the intersection with artificial intelligence, outlining multiple conceptual and technical dimensions and reporting on our ongoing work toward realis…

arXiv cs.AI TIER_1 English(EN) · Jiahao Wang, Bingyu Liang, Chenhao Hu, Longhui Zhang, Xuebo Liu, Min Zhang, Jing Li, Xuelong Li · 2026-06-17 04:00

SuCo: Sufficiency-guided Continuous Adaptive Reasoning

arXiv:2606.17687v1 Announce Type: cross Abstract: Despite remarkable performance on complex tasks, Large Reasoning Models (LRMs) often generate excessively long Chain-of-Thoughts (CoT), inflating computational costs even for simple queries. Existing efforts to mitigate this ineff…

arXiv cs.CL TIER_1 English(EN) · Elias Stengel-Eskin · 2026-06-17 02:41

PragReST: Self-Reinforcing Counterfactual Reasoning for Pragmatic Language Understanding

Natural language understanding often depends on meanings that are implied rather than explicitly stated, requiring pragmatic reasoning. Despite strong performance on math and logical reasoning, large language models (LLMs) still struggle with making pragmatic inferences, often ch…

arXiv cs.AI TIER_1 English(EN) · Antonio Orvieto · 2026-06-16 17:36

Fixed-Point Reasoners: Stable and Adaptive Deep Looped Transformers

Looped architectures provide an inductive bias toward learning step-by-step procedures for tasks that require compositional reasoning. The number of effective layers reached by looping determines the quality of the solution these models find. Like deep architectures, looped archi…

arXiv cs.CL TIER_1 English(EN) · Xueyan Niu · 2026-06-16 13:28

ChLogic: Evaluating Robustness of Logical Reasoning in Chinese Expressions

Large language models perform increasingly well on standardized logical reasoning benchmarks, but whether this ability remains robust beyond English is unclear. We introduce ChLogic, an English-Chinese aligned benchmark that tests whether models preserve logical reasoning perfor…

arXiv cs.CL TIER_1 English(EN) · Xueqi Cheng · 2026-06-16 13:10

Dynamic Rollout Editing for Reducing Overthinking in RL-Trained Reasoning Models

Long-form chain-of-thought reasoning can improve LLM performance on complex tasks, but models often continue generating unnecessary reasoning after a correct answer has emerged. We refer to this behavior as overthinking. We study this phenomenon from the perspective of GRPO-style…

arXiv cs.AI TIER_1 English(EN) · Liang He · 2026-06-16 12:28

FlowRAG: Synergizing Explicit Reasoning via Frequency-Aware Multi-Granularity Graph Flow

Graph-based retrieval-augmented generation (GraphRAG) is effective for knowledge-intensive and multi-hop query tasks; however, many existing methods primarily seed entity-based graphs and rely on implicit semantic relevance propagation. This often (i) under-retrieves when user qu…

arXiv cs.CL TIER_1 English(EN) · Xuelong Li · 2026-06-16 08:52

SuCo: Sufficiency-guided Continuous Adaptive Reasoning

Despite remarkable performance on complex tasks, Large Reasoning Models (LRMs) often generate excessively long Chain-of-Thoughts (CoT), inflating computational costs even for simple queries. Existing efforts to mitigate this inefficiency typically rely on discrete reasoning modes…

arXiv cs.LG TIER_1 English(EN) · Chuxue Cao, Jinluan Yang, Haoran Li, Kunhao Pan, Zijian Zhao, Zhengyu Chen, Yuchen Tian, Lijun Wu, Conghui He, Sirui Han, Yike Guo · 2026-06-16 04:00

Pushing the Boundaries of Natural Reasoning: Interleaved Bonus from Formal-Logic Verification

arXiv:2601.22642v2 Announce Type: replace Abstract: Large Language Models (LLMs) show remarkable capabilities, yet their stochastic next-token prediction creates logical inconsistencies and reward hacking that formal symbolic systems avoid. To bridge this gap, we introduce a form…

arXiv cs.LG TIER_1 English(EN) · David Huang, Lianlei Shan · 2026-06-16 04:00

DLWM: Diverse Latent World Models for Efficient Multimodal Reasoning

arXiv:2606.15160v1 Announce Type: cross Abstract: Reasoning capabilities of multimodal large language models (MLLMs) have improved considerably in recent years. Existing approaches typically rely on explicit chain-of-thought or continuous latent-space trajectories to enhance mult…

arXiv cs.LG TIER_1 English(EN) · Lukas Fesser, Hanlin Zhang, Michelle M. Li, Eric Wang, Bryan Perozzi, Shekoofeh Azizi, Sham M. Kakade, Marinka Zitnik · 2026-06-16 04:00

How Post-Training Shapes Biological Reasoning Models

arXiv:2606.16517v1 Announce Type: new Abstract: Scientific reasoning models for biology combine language models with foundation models trained on multimodal biological data, including DNA, RNA, and proteins. These models are built through post-training, yet how each stage shapes …

arXiv cs.LG TIER_1 English(EN) · Xian Sun, Wei Gao, Yingshuo Wang, Lingdong Kong, Yanhang Li, Zhichao Fan, Zexin Zhuang, Wenlong Dong, Zhiyuan Zheng, Hrishikesh Paranjape, Abhishek Mandal, Johnny R. Zhang · 2026-06-16 04:00

Beyond Accuracy: Measuring Bias Acknowledgment in Chain-of-Thought Reasoning for Responsible AI Evaluation

arXiv:2606.15127v1 Announce Type: new Abstract: Reasoning models are increasingly used in settings where the final answer is not the only object of review: educational tools may show students intermediate steps, decision-support systems may require human oversight, and audit work…

arXiv cs.CL TIER_1 English(EN) · Juming Xiong, Kevin Guo, Congning Ni, Wexin Liu, Chao Yan, Katherine Brown, Avinash Baidya, Xiang Gao, Bradley Malin, Zhijun Yin · 2026-06-16 04:00

Learning When to Sample: Confidence-Aware Selective Sampling for Efficient Chain-of-Thought Reasoning

arXiv:2603.08999v3 Announce Type: replace Abstract: Large language models (LLMs) can achieve strong reasoning performance through chain-of-thought (CoT) reasoning, yet they often generate unnecessarily long reasoning paths that incur high inference cost. Self-consistency-based ap…

arXiv cs.CL TIER_1 English(EN) · Jaehui Hwang, Byeongho Heo, Sangdoo Yun, Dongyoon Han · 2026-06-16 04:00

Oops, Wait: Discourse Tokens Matter in Reasoning Model

arXiv:2601.17421v2 Announce Type: replace Abstract: Recent studies suggest that even data-efficient training with (≃1K) reasoning trajectories can induce non-trivial reasoning capabilities in large language models through post-training. Such training corpora often contain …

arXiv cs.CL TIER_1 English(EN) · Hoang Pham, Dong Le, Anh Tuan Luu · 2026-06-16 04:00

GRACE: Step-Level Benchmark for Faithful Reasoning over Context

arXiv:2606.16151v1 Announce Type: new Abstract: Many reasoning tasks require models to reason over input context, from document-grounded question answering to rule-based deduction. Chain-of-Thought (CoT) prompting produces traces that appear transparent, yet individual steps can …

arXiv cs.CL TIER_1 English(EN) · Jingru Guo, Xiangyuan Xue, Lian Zhang, Wanghan Xu, Siki Chen, Philip Torr, Wanli Ouyang, Lei Bai, Zhenfei Yin · 2026-06-16 04:00

SciOrch: Learning to Orchestrate Expert LLMs for Solving Frontier Multimodal Scientific Reasoning Tasks

arXiv:2606.15872v1 Announce Type: new Abstract: Frontier scientific reasoning remains a major challenge for large language models (LLMs), where even the strongest commercial systems fall short of expert-level performance. A closer look at model behavior reveals substantial comple…

arXiv cs.CL TIER_1 English(EN) · Jiakai Li, Ke Qin, Rongzheng Wang, Yizhuo Ma, Qizhi Chen, Muquan Li, Shuang Liang · 2026-06-16 04:00

Stop When Further Reasoning Won't Help: Attention-State Adaptive Generation in Reasoning Models

arXiv:2606.15070v1 Announce Type: new Abstract: By incorporating test-time compute scaling, large reasoning models (LRMs) can solve complex problems through explicit chain-of-thought (CoT) reasoning processes. However, they often suffer from overthinking, resulting in redundant t…

arXiv cs.CL TIER_1 English(EN) · Juming Xiong, Weixin Liu, Kevin Guo, Congning Ni, Junchao Zhu, Chongyu Qu, Chao Yan, Katherine Brown, Avinash Baidya, Xiang Gao, Bradley Malin, Zhijun Yin · 2026-06-16 04:00

CoRA: Confidence-Rationale Alignment for Reliable Chain-of-Thought Reasoning

arXiv:2606.14961v1 Announce Type: new Abstract: Chain-of-thought (CoT) reasoning can improve LLM performance, but high answer confidence may be misleading when the accompanying CoT rationale is plausible yet incomplete or poorly supported. We study confidence-rationale alignment…

arXiv cs.AI TIER_1 English(EN) · Alex Bogdan · 2026-06-16 04:00

Free Energy Heuristics: Fast-And-Frugal Cognition as Active Inference Under Uncertain Precision

arXiv:2606.15877v1 Announce Type: cross Abstract: Chain-of-thought (CoT) improves large language models' performance in math and symbolic reasoning. But on planning, contested ethics, and tasks where the model cannot check itself, more reasoning makes things worse. Both effects a…

arXiv cs.AI TIER_1 English(EN) · Zhenyu Yu · 2026-06-16 04:00

Vernier: Probing Representational Misalignment Behind Lexical Gaps in Causal Reasoning

arXiv:2606.15733v1 Announce Type: cross Abstract: Instruction-tuned language models can answer the same causal-reasoning question differently after its English variable names are replaced by type-preserving placeholders, although the structural causal model and the gold answer ar…

arXiv cs.AI TIER_1 English(EN) · Yu Li, Shu Hong, Tian Lan · 2026-06-16 04:00

Localizing Credit at the Divergence: Path-Conditioned Self-Distillation for LLM Reasoning

arXiv:2606.15576v1 Announce Type: cross Abstract: Reinforcement learning from verifiable rewards assigns a single scalar to each rollout, leaving token-level credit assignment underspecified in long reasoning traces. On-policy self-distillation addresses this by letting the same …

arXiv cs.AI TIER_1 English(EN) · Dayeon Ki, Kevin Duh, Marine Carpuat · 2026-06-16 04:00

AdaMame: A Training Recipe for Adaptive Multilingual Reasoning

arXiv:2606.15080v1 Announce Type: cross Abstract: While Large Reasoning Models (LRMs) show strong performance in English, they often fail to reason in the language of the query, a phenomenon known as language collapse. Existing RL-based fixes typically add a binary language fidel…

arXiv cs.AI TIER_1 English(EN) · Keizo Kato, Chenhui Chu, Yugo Murawaki, Sado Kurohashi · 2026-06-16 04:00

Scaling LLM Reasoning from Minimal Labels: A Semi-Supervised Framework with a Lightweight Verifier

arXiv:2606.16811v1 Announce Type: new Abstract: For the development of Large language models (LLMs), recent approaches to generating pseudo intermediate reasoning have shown remarkable progress. But they typically rely on large numbers of correctly annotated answers to assess rea…

arXiv cs.AI TIER_1 English(EN) · Ke Miao, Jiaxin Li, Hongliang Chen, Yuke Hu, Zhan Qin · 2026-06-16 04:00

Adaptive and Explicit safe: Triggering Latent Safety Awareness in Large Reasoning Models

arXiv:2606.16808v1 Announce Type: new Abstract: While Large Reasoning Models (LRMs) excel at complex tasks, they remain highly vulnerable to sophisticated jailbreaks and direct harmful queries. To address this vulnerability, prior works depend heavily on external manual data anno…

arXiv cs.AI TIER_1 English(EN) · Yaoting Huang, Yifu Yuan, Linqi Han, Chengwen Li, Shuoheng Zhang, Xianze Yao, Hongyao Tang, Yan Zheng, Jianye Hao · 2026-06-16 04:00

RoboPIN: Grounded Embodied Reasoning via Pinned Chain-of-Thought

arXiv:2606.15753v1 Announce Type: new Abstract: Embodied reasoning requires models to perceive task-relevant objects and spaces in physical environments and maintain consistent visual grounding throughout multi-step reasoning. However, current vision-language models rely on text-…

arXiv cs.AI TIER_1 English(EN) · Gowrav Mannem, Chowdhury Marzia Mahjabin, Jason Chen, Shivank Garg, Kevin Zhu · 2026-06-16 04:00

Recurrent Reasoning on Symbolic Puzzles with Sequence Models

arXiv:2606.15686v1 Announce Type: new Abstract: Large language models often appear strong on symbolic and algorithmic tasks, yet this apparent strength can hide brittle behaviour when problems become longer, harder, or slightly out of distribution. A major limitation of current r…

Hugging Face Daily Papers TIER_1 English(EN) · 2026-06-16 00:00

ChLogic: Evaluating Robustness of Logical Reasoning in Chinese Expressions

ChLogic benchmark reveals persistent performance gaps between English and Chinese logical reasoning in large language models, influenced by surface realization differences and translation artifacts.

arXiv cs.AI TIER_1 English(EN) · Sado Kurohashi · 2026-06-15 14:55

Scaling LLM Reasoning from Minimal Labels: A Semi-Supervised Framework with a Lightweight Verifier

For the development of Large language models (LLMs), recent approaches to generating pseudo intermediate reasoning have shown remarkable progress. But they typically rely on large numbers of correctly annotated answers to assess reasoning quality. This paper presents a semi-super…

arXiv cs.AI TIER_1 English(EN) · Zhan Qin · 2026-06-15 14:51

Adaptive and Explicit safe: Triggering Latent Safety Awareness in Large Reasoning Models

While Large Reasoning Models (LRMs) excel at complex tasks, they remain highly vulnerable to sophisticated jailbreaks and direct harmful queries. To address this vulnerability, prior works depend heavily on external manual data annotation for safety alignment. However, we observe…

arXiv cs.AI TIER_1 English(EN) · Alex Schutz, Victor-Alexandru Darvariu, Efimia Panagiotaki, Bruno Lacerda, Nick Hawes · 2026-06-15 04:00

Tackling GNARLy Problems: Graph Neural Algorithmic Reasoning Reimagined through Reinforcement Learning

arXiv:2509.18930v3 Announce Type: replace-cross Abstract: Neural algorithmic reasoning (NAR) is a paradigm that trains neural networks to execute classic algorithms by supervised learning. Despite its successes, important limitations remain: inability to construct valid solutions…

arXiv cs.AI TIER_1 English(EN) · Pratham Singla, Shivank Garg, Vihan Singh · 2026-06-15 04:00

Poker Arena: Multi-Axis Profiling of Strategic Reasoning and Memory in LLMs

arXiv:2606.13815v1 Announce Type: new Abstract: Strategic reasoning under uncertainty underpins consequential decisions in negotiation, finance, and policy, but prevailing game-play benchmarks collapse heterogeneous reasoning dimensions into a single scalar, leaving the capabilit…

arXiv cs.AI TIER_1 English(EN) · Zheyang Xiong, Shivam Garg, Max Yu, Vaishnavi Shrivastava, Haoyu Zhao, Anastasios Kyrillidis, Dimitris Papailiopoulos · 2026-06-15 04:00

SuperThoughts: Reasoning Tokens in Superposition

arXiv:2606.13862v1 Announce Type: cross Abstract: Long Chain-of-Thought (CoT) reasoning improves LLM problem-solving but is computationally expensive due to sequential token generation. While recent works explore reasoning in continuous latent spaces to bypass discrete token gene…

arXiv cs.AI TIER_1 English(EN) · Avni Mittal, Rauno Arike · 2026-06-15 04:00

C2-Faith: Benchmarking LLM Judges for Causal and Coverage Faithfulness in Chain-of-Thought Reasoning

arXiv:2603.05167v2 Announce Type: replace-cross Abstract: Large language models (LLMs) are increasingly used as judges of chain-of-thought (CoT) reasoning, yet it remains unclear whether they can reliably assess process faithfulness rather than merely answer plausibility. We intr…

arXiv cs.AI TIER_1 English(EN) · Dake Bu, Wei Huang, Andi Han, Atsushi Nitanda, Bo Xue, Qingfu Zhang, Hau-San Wong, Taiji Suzuki · 2026-06-15 04:00

Distributional Biases in Post-Training: A Markovian Analysis of Reasoning Trajectories

arXiv:2511.07368v3 Announce Type: replace-cross Abstract: Foundation models exhibit broad knowledge but limited task-specific reasoning, motivating post-training strategies such as RL with verifiable rewards (RLVR) and test-time scaling (TTS). While recent work highlights the rol…

arXiv cs.CL TIER_1 English(EN) · Anh Tuan Luu · 2026-06-15 03:11

GRACE: Step-Level Benchmark for Faithful Reasoning over Context

Many reasoning tasks require models to reason over input context, from document-grounded question answering to rule-based deduction. Chain-of-Thought (CoT) prompting produces traces that appear transparent, yet individual steps can silently deviate from the source evidence, even …

Hugging Face Daily Papers TIER_1 English(EN) · 2026-06-14 15:45

SciOrch: Learning to Orchestrate Expert LLMs for Solving Frontier Multimodal Scientific Reasoning Tasks

SciOrch is a framework that uses a lightweight orchestrator model to coordinate multiple frontier LLMs for scientific reasoning, achieving superior performance through MCTS-based training and GRPO-style optimization while reducing API costs.

arXiv cs.CL TIER_1 English(EN) · Darpan Aswal, Thomas Palmeira Ferraz, Yongxin Zhou, Maxime Peyrard · 2026-06-12 04:00

Observable Patterns Are Not Explanations: A Causal-Geometric Analysis of Latent Reasoning Models

arXiv:2606.12689v1 Announce Type: new Abstract: Latent reasoning models (LRMs) replace explicit chain-of-thought with continuous thoughts. Recent work treats observable latent-state patterns, such as BFS-like frontiers and decodable arithmetic computation, as evidence for interna…

arXiv cs.AI TIER_1 English(EN) · Yu Ying Chiu, Michael S. Lee, Rachel Calcott, Brandon Handoko, Paul de Font-Reaulx, Raphaël Millière, Paula Rodriguez, Chen Bo Calvin Zhang, Ziwen Han, Udari Madhushani Sehwag, Yash Maurya, Christina Q Knight, Harry R. Lloyd, Florence Bacus, Conor Do… · 2026-06-12 04:00

MoReBench: Evaluating Procedural and Pluralistic Moral Reasoning in Language Models, More than Outcomes

arXiv:2510.16380v2 Announce Type: replace-cross Abstract: As AI systems progress, we rely more on them to make decisions with us and for us. To ensure that such decisions are aligned with human values, it is imperative for us to understand not only what decisions they make but al…

arXiv cs.AI TIER_1 English(EN) · Zilin Xiao, Qi Ma, Chun-cheng Jason Chen, Xintao Chen, Avinash Atreya, Hanjie Chen, Vicente Ordonez · 2026-06-12 04:00

Learning to Reason by Analogy via Retrieval-Augmented Reinforcement Fine-Tuning

arXiv:2606.13680v1 Announce Type: cross Abstract: Retrieval-augmented generation (RAG) has become a standard mechanism for grounding language models in external knowledge, yet conventional retrieval based on lexical or semantic similarity is poorly suited for complex reasoning ta…

arXiv cs.AI TIER_1 English(EN) · Akshay Krishnamurthy, Audrey Huang, Nived Rajaraman · 2026-06-12 04:00

Select and Improve: Understanding the Mechanics of Post-Training for Reasoning

arXiv:2606.13125v1 Announce Type: cross Abstract: Reinforcement learning has rapidly emerged as a key component in the training of reasoning and coding models, yet it remains poorly understood from a mechanistic perspective. We study how and through what underlying processes capa…

arXiv cs.AI TIER_1 English(EN) · Daniel Scalena, Sara Candussio, Luca Bortolussi, Elisabetta Fersini, Malvina Nissim, Gabriele Sarti · 2026-06-12 04:00

Beyond the Commitment Boundary: Probing Epiphenomenal Chain-of-Thought in Large Reasoning Models

arXiv:2606.13603v1 Announce Type: cross Abstract: Chain-of-thought (CoT) reasoning is the dominant paradigm for inference-time scaling in language models, yet the causal influence of individual steps on the final answer is poorly understood. We estimate each step's causal importance…

arXiv cs.AI TIER_1 English(EN) · Sarah Elshabrawy, Rahul K. Dass, Ashok K. Goel · 2026-06-12 04:00

Constructing Evaluation Datasets for Procedural Reasoning: Balancing Naturalness, Grounding, and Multi-Hop Coverage

arXiv:2606.12767v1 Announce Type: new Abstract: Evaluating procedural reasoning in AI-supported learning systems requires question-answer datasets that are both learner-like and grounded in the instructional knowledge the system is expected to use. We study how TMK-based question…

arXiv cs.AI TIER_1 English(EN) · Pierre Beckmann, Marco Valentino, Andre Freitas · 2026-06-12 04:00

SciR: A Controllable Benchmark for Scientific Reasoning in LLMs

arXiv:2606.13020v1 Announce Type: new Abstract: Three paradigmatic forms of inference recur across scientific reasoning: deduction, induction, and causal abduction. Reliably evaluating LLMs on these in scientific settings is currently out of reach: scientific benchmarks built on …

arXiv cs.AI TIER_1 English(EN) · Xin Wang, Boyan Gao, Yibo Yang, David A. Clifton · 2026-06-12 04:00

Mental-R1: Aligning LLM Reasoning for Mental Health Assessment

arXiv:2606.13176v1 Announce Type: new Abstract: Mental health problems such as anxiety, depression, and suicide remain urgent global challenges, where timely and accurate assessment is critical for effective intervention. Recently, large language models have been explored for men…

arXiv cs.AI TIER_1 English(EN) · Fabrizio Marozzo, Pietro Liò · 2026-06-12 04:00

LLM-as-an-Investigator: Evidence-First Reasoning for Robust Interactive Problem Diagnosis

arXiv:2606.13220v1 Announce Type: new Abstract: Large language models (LLMs) are increasingly used as interactive assistants for technical problem solving. However, when users provide incomplete descriptions or plausible but unverified explanations, LLMs may prematurely align wit…

arXiv cs.AI TIER_1 English(EN) · Zach Studdiford, Gary Lupyan · 2026-06-12 04:00

Reasoning as Pattern Matching: Shared Mechanisms in Human and LLM Everyday Reasoning

arXiv:2606.13607v1 Announce Type: new Abstract: When large language models (LLMs) fail to generalize or make haphazard errors in reasoning, it is often taken as evidence that LLMs are not truly reasoning, but rather performing a kind of pattern matching. The implication is that p…

arXiv cs.CL TIER_1 English(EN) · Yaniv Nikankin, Martin Tutek, Tomer Ashuach, Jonathan Rosenfeld, Yonatan Belinkov · 2026-06-12 04:00

Reasoning Models Know What's Important, and Encode It in Their Activations

arXiv:2604.18307v2 Announce Type: replace Abstract: Language models often solve complex tasks by generating long reasoning chains, consisting of many steps with varying importance. While some steps are crucial for generating the final answer, others are removable. Determining whi…

arXiv cs.CL TIER_1 English(EN) · Nathaniel Bottman, Yinhong Liu, Kyle Richardson · 2026-06-12 04:00

Operadic consistency: a label-free signal for compositional reasoning failures in LLMs

arXiv:2606.13649v1 Announce Type: new Abstract: Detecting LLM reasoning failures at inference time without ground-truth labels has motivated a wide range of confidence baselines, including self-consistency, semantic entropy, and P(True), built on within-question sampling and self…

arXiv cs.CL TIER_1 English(EN) · Nathaniel Bottman, Kyle Richardson · 2026-06-12 04:00

Operads for compositional reasoning in LLMs

arXiv:2606.13634v1 Announce Type: new Abstract: Question decomposition, i.e. breaking a complex query into simpler sub-queries whose answers are composed to produce a final answer, is a widely used strategy for improving LLM reasoning, yet it currently lacks a rigorous mathematic…

arXiv cs.CL TIER_1 English(EN) · Shu Tong Luo, Wenqin Liu, Rui Liu, Mingming Gong, Jiaxian Guo · 2026-06-12 04:00

Multi-Turn Reasoning When Context Arrives in Pieces: Scalable Sharding and Memory-Augmented RL

arXiv:2606.12941v1 Announce Type: new Abstract: When a user reveals task-critical information across several conversation turns, LLM accuracy drops by up to 65% despite full context availability. We show that this Lost in Conversation degradation can be substantially mitigated by…

arXiv cs.CL TIER_1 English(EN) · Dimitris Papailiopoulos · 2026-06-11 19:42

SuperThoughts: Reasoning Tokens in Superposition

Long Chain-of-Thought (CoT) reasoning improves LLM problem-solving but is computationally expensive due to sequential token generation. While recent works explore reasoning in continuous latent spaces to bypass discrete token generation, they often struggle with training stabilit…

arXiv cs.CL TIER_1 English(EN) · Vihan Singh · 2026-06-11 18:39

Poker Arena: Multi-Axis Profiling of Strategic Reasoning and Memory in LLMs

Strategic reasoning under uncertainty underpins consequential decisions in negotiation, finance, and policy, but prevailing game-play benchmarks collapse heterogeneous reasoning dimensions into a single scalar, leaving the capability structure of frontier LLMs unexamined. We intr…

arXiv cs.AI TIER_1 English(EN) · Vicente Ordonez · 2026-06-11 17:59

Learning to Reason by Analogy via Retrieval-Augmented Reinforcement Fine-Tuning

Retrieval-augmented generation (RAG) has become a standard mechanism for grounding language models in external knowledge, yet conventional retrieval based on lexical or semantic similarity is poorly suited for complex reasoning tasks: a semantically similar problem may demand an …

arXiv cs.CL TIER_1 English(EN) · Kyle Richardson · 2026-06-11 17:50

Operadic consistency: a label-free signal for compositional reasoning failures in LLMs

Detecting LLM reasoning failures at inference time without ground-truth labels has motivated a wide range of confidence baselines, including self-consistency, semantic entropy, and P(True), built on within-question sampling and self-evaluation. Operad theory, the formalism for sy…

arXiv cs.CL TIER_1 English(EN) · Kyle Richardson · 2026-06-11 17:44

Operads for compositional reasoning in LLMs

Question decomposition, i.e. breaking a complex query into simpler sub-queries whose answers are composed to produce a final answer, is a widely used strategy for improving LLM reasoning, yet it currently lacks a rigorous mathematical foundation. In this paper, we propose operads…

arXiv cs.AI TIER_1 English(EN) · Gary Lupyan · 2026-06-11 17:23

Reasoning as Pattern Matching: Shared Mechanisms in Human and LLM Everyday Reasoning

When large language models (LLMs) fail to generalize or make haphazard errors in reasoning, it is often taken as evidence that LLMs are not truly reasoning, but rather performing a kind of pattern matching. The implication is that people's behavior does not exhibit the same types…

Hugging Face Daily Papers TIER_1 English(EN) · 2026-06-11 17:23

Reasoning as Pattern Matching: Shared Mechanisms in Human and LLM Everyday Reasoning

When large language models (LLMs) fail to generalize or make haphazard errors in reasoning, it is often taken as evidence that LLMs are not truly reasoning, but rather performing a kind of pattern matching. The implication is that people's behavior does not exhibit the same types…

arXiv cs.AI TIER_1 English(EN) · Gabriele Sarti · 2026-06-11 17:21

Beyond the Commitment Boundary: Probing Epiphenomenal Chain-of-Thought in Large Reasoning Models

Chain-of-thought (CoT) reasoning is the dominant paradigm for inference-time scaling in language models, yet the causal influence of individual steps on the final answer is poorly understood. We estimate each step's causal importance via early exit and use this measure to study how …

arXiv cs.MA (Multiagent) TIER_1 English(EN) · Pietro Liò · 2026-06-11 11:37

LLM-as-an-Investigator: Evidence-First Reasoning for Robust Interactive Problem Diagnosis

Large language models (LLMs) are increasingly used as interactive assistants for technical problem solving. However, when users provide incomplete descriptions or plausible but unverified explanations, LLMs may prematurely align with these assumptions and propose solutions before…

Hugging Face Daily Papers TIER_1 English(EN) · 2026-06-11 11:37

LLM-as-an-Investigator: Evidence-First Reasoning for Robust Interactive Problem Diagnosis

Large language models (LLMs) are increasingly used as interactive assistants for technical problem solving. However, when users provide incomplete descriptions or plausible but unverified explanations, LLMs may prematurely align with these assumptions and propose solutions before…

arXiv cs.LG TIER_1 English(EN) · Nived Rajaraman · 2026-06-11 09:51

Select and Improve: Understanding the Mechanics of Post-Training for Reasoning

Reinforcement learning has rapidly emerged as a key component in the training of reasoning and coding models, yet it remains poorly understood from a mechanistic perspective. We study how and through what underlying processes capabilities are acquired or enhanced via reinforcemen…

arXiv cs.CL TIER_1 English(EN) · Jiaxian Guo · 2026-06-11 06:07

Multi-Turn Reasoning When Context Arrives in Pieces: Scalable Sharding and Memory-Augmented RL

When a user reveals task-critical information across several conversation turns, LLM accuracy drops by up to 65% despite full context availability. We show that this Lost in Conversation degradation can be substantially mitigated by training models to maintain a compact rolling m…

arXiv cs.AI TIER_1 English(EN) · Valentin Noël · 2026-06-11 04:00

Geometry of Reason: Spectral Signatures of Valid Mathematical Reasoning

arXiv:2601.00791v2 Announce Type: replace-cross Abstract: Verifying whether a language model is genuinely reasoning or pattern-matching remains an open problem: learned verifiers are expensive, and output-based heuristics are brittle. We show that valid mathematical reasoning ind…

arXiv cs.AI TIER_1 English(EN) · Jana Zeller, Thaddäus Wiedemer, Fanfei Li, Thomas Klein, Prasanna Mayilvahanan, Matthias Bethge, Felix Wichmann, Ryan Cotterell, Wieland Brendel · 2026-06-11 04:00

MentisOculi: Revealing the Limits of Reasoning with Mental Imagery

arXiv:2602.02465v2 Announce Type: replace Abstract: Frontier models are transitioning from multimodal large language models (MLLMs) that merely ingest visual information to unified multimodal models (UMMs) capable of native interleaved generation. This shift has sparked interest …

arXiv cs.AI TIER_1 English(EN) · Chao Lei, Guang Hu, Meng Yang, Yanbei Jiang, Nir Lipovetzky · 2026-06-11 04:00

Mind the Perspective: Let's Reason Recursively for Theory of Mind

arXiv:2606.11724v1 Announce Type: new Abstract: Theory of Mind (ToM) reasoning requires inferring agents' beliefs from partial and asymmetric observations, which remains an open challenge for LLMs. Existing prompting-based approaches improve ToM reasoning through observable-event…

arXiv cs.AI TIER_1 English(EN) · Rikard Rosenbacke, Carl Rosenbacke, Victor Rosenbacke, Martin McKee · 2026-06-11 04:00

From Consumption to Reflection: Designing Human-AI Relations for Stable Reasoning

arXiv:2606.11195v1 Announce Type: cross Abstract: Large language models (LLMs) have transformed how humans access information, but not how we reason with it. Their fluency accelerates consumption while bypassing the slow, reflective processes that underpin sound judgment. This pa…

arXiv cs.AI TIER_1 English(EN) · Prakul Sunil Hiremath, Harshit R. Hiremath · 2026-06-11 04:00

Calibration Drift Under Reasoning: How Chain-of-Thought Budgets Induce Overconfidence in Large Language Models

arXiv:2606.11211v1 Announce Type: cross Abstract: The ability of large language models (LLMs) to express calibrated uncertainty is important for safe deployment. Chain-of-thought (CoT) reasoning is widely used to improve accuracy and reliability, but its effect on calibration is …

arXiv cs.AI TIER_1 English(EN) · Subbarao Kambhampati, Karthik Valmeekam, Siddhant Bhambri, Vardhan Palod, Lucas Saldyt, Kaya Stechly, Soumya Rani Samineni, Durgesh Kalwar, Upasana Biswas · 2026-06-11 04:00

Position: Stop Anthropomorphizing Intermediate Tokens as Reasoning/Thinking Traces!

arXiv:2504.09762v4 Announce Type: replace Abstract: Intermediate token generation (ITG), where a model produces output before the solution, has become a standard method to improve the performance of language models on reasoning tasks. These intermediate tokens have been called \s…

arXiv cs.AI TIER_1 English(EN) · Jiahao Yu, Zelei Cheng, Xian Wu, Xinyu Xing · 2026-06-11 04:00

GPO: Learning from Critical Steps to Improve LLM Reasoning

arXiv:2509.16456v3 Announce Type: replace Abstract: Large language models (LLMs) are increasingly used in various domains, showing impressive potential on different tasks. Recently, reasoning LLMs have been proposed to improve the reasoning or thinking capabilit…

arXiv cs.CL TIER_1 English(EN) · Hao Xiang, Qiaoyu Tang, Le Yu, Yaojie Lu, Xianpei Han, Ben He, Le Sun, Bowen Yu, Peng Wang, Hongyu Lin, Dayiheng Liu · 2026-06-11 04:00

Verifiable Environments Are LEGO Bricks: Recursive Composition for Reasoning Generalization

arXiv:2606.12373v1 Announce Type: new Abstract: Reinforcement Learning (RL) with verifiable environments has emerged as a powerful approach for enhancing the reasoning capabilities of Large Language Models (LLMs). While prior research demonstrates that scaling environment quantit…

arXiv cs.CL TIER_1 English(EN) · Yijie Deng, He Zhu, Wen Wang, Junyou Su, Minxin Chen, Wenjia Zhang · 2026-06-11 04:00

Can AI Reason Like an Urban Planner? Benchmarking Large Language Models Against Professional Judgment

arXiv:2606.11678v1 Announce Type: new Abstract: Problem, Research Strategy, and Findings: The rise of large language models (LLMs) raises a key question for urban planning: which forms of professional planning knowledge can AI replicate, and which still require human judgment? Al…

arXiv cs.CL TIER_1 English(EN) · Avinash Anand, Mahisha Ramesh, Avni Mittal, Ashutosh Kumar, Erik Cambria, Zhengkui Wang, Timothy Liu, Aik Beng Ng, Simon See, Rajiv Ratn Shah · 2026-06-11 04:00

The Periodic Table of LLM Reasoning: A Structured Survey of Reasoning Paradigms, Methods, and Failure Modes

arXiv:2606.11470v1 Announce Type: new Abstract: Large Language Models (LLMs) have achieved strong performance across natural language processing tasks, yet reliable reasoning remains an open challenge. Although modern LLMs show progress in structured inference, multi-step problem…

arXiv cs.LG TIER_1 English(EN) · Hongyi Liu, Frederic Sala, Thomas Reps, Adithya Murali · 2026-06-11 04:00

Counterexample Guided Learning in the Large using Reasoning Agents

arXiv:2606.11521v1 Announce Type: new Abstract: LLMs and LLM agents should improve when given feedback, but identifying when they are able to do so is difficult: feedback is heterogeneous, domain-specific, and difficult to control. We approach this challenge by asking LLMs to per…

arXiv cs.CL TIER_1 English(EN) · Dayiheng Liu · 2026-06-10 17:39

Verifiable Environments Are LEGO Bricks: Recursive Composition for Reasoning Generalization

Reinforcement Learning (RL) with verifiable environments has emerged as a powerful approach for enhancing the reasoning capabilities of Large Language Models (LLMs). While prior research demonstrates that scaling environment quantity improves RL performance, existing manual or in…

arXiv cs.CL TIER_1 English(EN) · Wenjia Zhang · 2026-06-10 05:42

Can AI Reason Like an Urban Planner? Benchmarking Large Language Models Against Professional Judgment

Problem, Research Strategy, and Findings: The rise of large language models (LLMs) raises a key question for urban planning: which forms of professional planning knowledge can AI replicate, and which still require human judgment? Although AI tools are increasingly used in plannin…

Hugging Face Daily Papers TIER_1 English(EN) · 2026-06-10 05:42

Can AI Reason Like an Urban Planner? Benchmarking Large Language Models Against Professional Judgment

Problem, Research Strategy, and Findings: The rise of large language models (LLMs) raises a key question for urban planning: which forms of professional planning knowledge can AI replicate, and which still require human judgment? Although AI tools are increasingly used in plannin…

arXiv cs.LG TIER_1 English(EN) · Evgenii Kortukov, Piotr Komorowski, Florian Klein, Paula Engl, Gabriele Sarti, Seong Joon Oh, Sebastian Lapuschkin, Wojciech Samek · 2026-06-10 04:00

Predicting Future Behaviors in Reasoning Models Enables Better Steering

arXiv:2606.11172v1 Announce Type: new Abstract: Deployed large reasoning models (LRMs) often behave unexpectedly. Test-time steering controls LRM outputs by intervening on their hidden representations, but it can degrade output quality. We argue that prior steering work implicitl…

arXiv cs.CL TIER_1 English(EN) · Adi Gabay, Gabriel Stanovsky, Liat Peterfreund · 2026-06-10 04:00

Beyond Memorization: Distinguishing Between Pattern-Based and Epistemic Reasoning in LLMs Using Epistemic Puzzles

arXiv:2603.21350v2 Announce Type: replace Abstract: Epistemic reasoning requires agents to infer the state of the world from partial observations and information about other agents' knowledge. Prior work evaluating LLMs on epistemic puzzles often frames failures as memorization r…

arXiv cs.CL TIER_1 English(EN) · Alexander Gurung, Esmeralda S. Whitammer, Mirella Lapata · 2026-06-10 04:00

Lightweight Latent Reasoning for Narrative Tasks

arXiv:2512.02240v2 Announce Type: replace Abstract: Large language models (LLMs) tackle complex tasks by generating long chains of thought or "reasoning traces" that act as latent variables in the generation of an output given a query. A model's ability to generate such traces ca…

arXiv cs.CL TIER_1 English(EN) · Zhichen Dong, Yang Li, Yuhan Sun, Weixun Wang, Yijia Luo, Zinian Peng, Taiheng Ye, Chao Yang, Wenbo Su, Yu Cheng, Bo Zheng, Junchi Yan · 2026-06-10 04:00

How Does Reasoning Flow? Tracing Attention-Induced Information Flow for Targeted RL in LLMs

arXiv:2606.10646v1 Announce Type: cross Abstract: Token-level credit assignment remains a key obstacle for reinforcement learning (RL) in large language models (LLMs), where RL recipes typically treat all tokens equally, failing to distinguish decisive reasoning steps from routin…

arXiv cs.CL TIER_1 English(EN) · Prajakta Kini, Avinash Reddy, Souradip Chakraborty, Satya Sai Srinath Namburi GNVV, Furong Huang, Amrit Singh Bedi, Alvaro Velasquez · 2026-06-10 04:00

Does Reasoning Preserve Alignment? On the Trustworthiness of Large Reasoning Models

arXiv:2606.11046v1 Announce Type: new Abstract: Instruction-tuned LLMs are increasingly converted into reasoning models through post-training to improve multi-step task performance. This conversion is usually optimized for reasoning accuracy, without explicitly preserving the ali…

arXiv cs.CL TIER_1 English(EN) · Sanghee Park, Geewook Kim, Kee-Eung Kim · 2026-06-10 04:00

KCSAT-ML: Probing Reasoning Models with Nationwide-Cohort Human Difficulty

arXiv:2606.10403v1 Announce Type: new Abstract: Math reasoning benchmarks have proliferated, yet most lack a per-item difficulty signal grounded in actual human performance. We introduce KCSAT-ML, a decade (2014-2025) of Korean College Scholastic Ability Test (KCSAT; Suneung) mat…

arXiv cs.AI TIER_1 English(EN) · Daeyong Kwon, Soyoung Yoon, Seung-won Hwang · 2026-06-10 04:00

SAFE: An LLM-as-Verifier Framework for Evidence-Grounded Multi-Hop Reasoning

arXiv:2604.01993v2 Announce Type: replace-cross Abstract: Multi-hop QA benchmarks often reward Large Language Models (LLMs) for spurious correctness, where models reach correct answers through invalid intermediate reasoning. We propose SAFE, an LLM-as-verifier framework for evide…

arXiv cs.AI TIER_1 English(EN) · Yubo Li, Lu Zhang, Tianchong Jiang, Ramayya Krishnan, Rema Padman · 2026-06-10 04:00

The Model Says Walk: How Surface Heuristics Override Implicit Constraints in LLM Reasoning

arXiv:2603.29025v3 Announce Type: replace-cross Abstract: Large language models fail when a salient surface cue conflicts with an unstated feasibility constraint. We introduce the Heuristic Override Benchmark (HOB): 500 instances spanning 4 heuristic families and 5 constraint fam…

arXiv cs.AI TIER_1 English(EN) · Daniel Herbst, Lea Karbevska, Divyanshu Kumar, Akanksha Ahuja, Fatemeh Gholamzadeh Nasrabadi, Fabrizio Frasca · 2026-06-10 04:00

Lost in Serialization: Invariance and Generalization of LLM Graph Reasoners

arXiv:2511.10234v3 Announce Type: replace-cross Abstract: While promising, graph reasoners based on Large Language Models (LLMs) lack built-in invariance to symmetries in graph representations. Operating on sequential graph serializations, LLMs can produce different outputs under…

arXiv cs.AI TIER_1 English(EN) · Wooil Jung · 2026-06-10 04:00

Dropout-GRPO: Variational Stochasticity for Continuous Latent Reasoning

arXiv:2606.10184v1 Announce Type: cross Abstract: Group Relative Policy Optimization (GRPO) relies on the diversity of $K$ rollouts within each group; otherwise, the group-mean advantage $A^{(k)} = r^{(k)} - \mu_r$ collapses to zero. This presents a structural challenge for laten…
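
The collapse described in this abstract can be checked with a few lines of arithmetic. The sketch below is not the paper's implementation; it only evaluates the quoted formula $A^{(k)} = r^{(k)} - \mu_r$ on two made-up reward groups. The function name `group_mean_advantages` and the reward values are assumptions chosen for illustration.

```python
from statistics import mean


def group_mean_advantages(rewards):
    """Return A^(k) = r^(k) - mu_r for each of the K rollouts in one group.

    Hypothetical helper: it implements only the formula quoted in the
    abstract, not the full GRPO objective.
    """
    mu_r = mean(rewards)  # group-mean reward mu_r
    return [r - mu_r for r in rewards]


# Illustrative rewards (assumed values): K = 4 rollouts per group.
diverse = group_mean_advantages([1.0, 0.0, 1.0, 0.0])
identical = group_mean_advantages([1.0, 1.0, 1.0, 1.0])

print(diverse)    # rollouts that differ give non-zero advantages
print(identical)  # rollouts with equal reward give all-zero advantages
```

When every rollout in a group earns the same reward, each advantage is zero and the group contributes no learning signal, which is why the abstract stresses rollout diversity.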

arXiv cs.AI TIER_1 English(EN) · Sai Kartheek Reddy Kasu, Nils Lukas, Samuele Poppi · 2026-06-10 04:00

When the Chain of Thought Knows Better: Failure Modes in Multi-Turn Reasoning Models

arXiv:2606.10740v1 Announce Type: new Abstract: Failures in multi-turn reasoning models are largely invisible to terminal-score evaluation. A model can lock onto an unsafe stance early in a long dialogue, yet its final-turn refusal rate may appear indistinguishable from a robustl…

arXiv cs.AI TIER_1 English(EN) · Yiteng Mao, Kenan Xu, Yijia Lyu, Wenhao Li, Jianlong Chen, Xiangfeng Wang · 2026-06-10 04:00

RealMath-Eval: Why SOTA Judges Struggle with Real Human Reasoning

arXiv:2606.10254v1 Announce Type: new Abstract: While Large Language Models (LLMs) have achieved near-perfect performance in solving high-school mathematics, their ability to evaluate the diverse reasoning processes of real human students remains under-examined. To …

Hugging Face Daily Papers TIER_1 English(EN) · 2026-06-10 00:00

Verifiable Environments Are LEGO Bricks: Recursive Composition for Reasoning Generalization

Recursive automated composition framework enables scalable reinforcement learning for language models by automatically combining verifiable environments through compositional operators.

arXiv cs.LG TIER_1 English(EN) · Wojciech Samek · 2026-06-09 17:49

Predicting Future Behaviors in Reasoning Models Enables Better Steering

Deployed large reasoning models (LRMs) often behave unexpectedly. Test-time steering controls LRM outputs by intervening on their hidden representations, but it can degrade output quality. We argue that prior steering work implicitly relies on internal features that detect behavi…

arXiv cs.CL TIER_1 English(EN) · Alvaro Velasquez · 2026-06-09 16:14

Does Reasoning Preserve Alignment? On the Trustworthiness of Large Reasoning Models

Instruction-tuned LLMs are increasingly converted into reasoning models through post-training to improve multi-step task performance. This conversion is usually optimized for reasoning accuracy, without explicitly preserving the alignment behavior of the instruction-tuned model, …

Hugging Face Daily Papers TIER_1 English(EN) · 2026-06-09 11:50

When the Chain of Thought Knows Better: Failure Modes in Multi-Turn Reasoning Models

Multi-turn reasoning models exhibit hidden alignment failures that are masked by traditional evaluation methods, revealing vulnerabilities through a trace-level diagnostic framework that identifies distinct failure modes including context-injection failures.

arXiv cs.AI TIER_1 English(EN) · Samuele Poppi · 2026-06-09 11:50

When the Chain of Thought Knows Better: Failure Modes in Multi-Turn Reasoning Models

Failures in multi-turn reasoning models are largely invisible to terminal-score evaluation. A model can lock onto an unsafe stance early in a long dialogue, yet its final-turn refusal rate may appear indistinguishable from a robustly aligned baseline. To expose these hidden tempo…

arXiv cs.CL TIER_1 English(EN) · Junchi Yan · 2026-06-09 09:56

How Does Reasoning Flow? Tracing Attention-Induced Information Flow for Targeted RL in LLMs

Token-level credit assignment remains a key obstacle for reinforcement learning (RL) in large language models (LLMs), where RL recipes typically treat all tokens equally, failing to distinguish decisive reasoning steps from routine formatting or fluent filler. Recent attempts lev…

Hugging Face Daily Papers TIER_1 English(EN) · 2026-06-09 09:56

How Does Reasoning Flow? Tracing Attention-Induced Information Flow for Targeted RL in LLMs

FlowTracer is an RL framework that uses attention-induced graphs to trace reasoning flows and assign token-level credit based on global information propagation structures.

arXiv cs.CL TIER_1 English(EN) · Kee-Eung Kim · 2026-06-09 04:25

KCSAT-ML: Probing Reasoning Models with Nationwide-Cohort Human Difficulty

Math reasoning benchmarks have proliferated, yet most lack a per-item difficulty signal grounded in actual human performance. We introduce KCSAT-ML, a decade (2014-2025) of Korean College Scholastic Ability Test (KCSAT; Suneung) mathematics: 664 problems with a 339-item core set …

arXiv cs.AI TIER_1 English(EN) · Sanjay Kariyappa, G. Edward Suh · 2026-06-09 04:00

Where Instruction Hierarchy Breaks: Diagnosing and Repairing Failures in Reasoning Language Models

arXiv:2606.07808v1 Announce Type: new Abstract: Reasoning language models deployed in agentic workflows must follow an instruction hierarchy: when instructions from different sources conflict, the model should obey the highest-privilege applicable instruction. Existing benchmarks…

arXiv cs.AI TIER_1 English(EN) · Mujtaba Farhan, Maheep Chaudhary · 2026-06-09 04:00

Why Limit the Residual Stream to Layers and Not Tokens? Persistent Memory for Continuous Latent Reasoning

arXiv:2606.07720v1 Announce Type: new Abstract: Large language models (LLMs) have demonstrated remarkable reasoning abilities on mathematical and multi-hop planning tasks. The CoCoNuT (Chain of Continuous Thought) paradigm (Hao et al., 2024) extends this by enabling models to …

arXiv cs.LG TIER_1 English(EN) · Yang Li, Zhichen Dong, Yuhan Sun, Weixun Wang, Shaopan Xiong, Yijia Luo, Jiashun Liu, Han Lu, Jiamang Wang, Wenbo Su, Bo Zheng, Junchi Yan · 2026-06-09 04:00

Attention Illuminates LLM Reasoning: The Preplan-and-Anchor Rhythm Enables Fine-Grained Policy Optimization

arXiv:2510.13554v2 Announce Type: replace-cross Abstract: The reasoning pattern of Large language models (LLMs) remains opaque, and reinforcement learning (RL) typically applies uniform credit across an entire generation, blurring the distinction between pivotal and routine steps…

arXiv cs.LG TIER_1 English(EN) · Zhanke Zhou, Xiangyu Lu, Chentao Cao, Brando Miranda, Tongliang Liu, Bo Han, Sanmi Koyejo · 2026-06-09 04:00

The Easy, the Hard, and the Learnable: Confidence and Difficulty-Adaptive Policy Optimization for LLM Reasoning

arXiv:2606.07950v1 Announce Type: new Abstract: RL with verifiable rewards can substantially improve LLM reasoning, yet standard GRPO-style training often treats easy, hard, and learnable questions alike through uniform sampling and weighting, leading to inefficient compute alloc…

arXiv cs.AI TIER_1 English(EN) · Xiaoou Liu, Tiejin Chen, Dengjia Zhang, Yaqing Wang, Lu Cheng, Hua Wei · 2026-06-09 04:00

Diagnosing Multi-step Reasoning Failures in Black-box LLMs via Stepwise Confidence Attribution

arXiv:2605.19228v2 Announce Type: replace-cross Abstract: Large Language Models have achieved strong performance on reasoning tasks with objective answers by generating step-by-step solutions, but diagnosing where a multi-step reasoning trace might fail remains difficult. Confide…

arXiv cs.AI TIER_1 English(EN) · Javier Marín · 2026-06-09 04:00

How Transformers Reject Wrong Answers: Rotational Dynamics of Factual Constraint Processing

arXiv:2603.13259v2 Announce Type: replace-cross Abstract: When a decoder-only transformer is forced to process matched correct and incorrect single-token continuations of a factual query, the two pathways through hidden-state space diverge in a specific way: displacement vectors …

arXiv cs.AI TIER_1 English(EN) · Shivam Adarsh, Maria Maistro, Christina Lioma · 2026-06-09 04:00

How Context Shapes Truth: Geometric Transformations of Statement-level Truth Representations in LLMs

arXiv:2601.06599v2 Announce Type: replace-cross Abstract: Large Language Models (LLMs) often encode whether a statement is true as a vector in their residual stream activations. These vectors, also known as truth vectors, have been studied in prior work, however how they change w…

arXiv cs.AI TIER_1 English(EN) · Onat Ozer, Yuchen Wang, Grace Wu, Daniel Dosti, Honghao Zhang, Vivi De La Rue · 2026-06-09 04:00

MAR: Multi-Agent Reflexion Improves Reasoning Abilities in LLMs

arXiv:2512.20845v2 Announce Type: replace Abstract: LLMs have shown the capacity to improve their performance on reasoning tasks through reflecting on their mistakes, and acting with these reflections in mind. However, continual reflections of the same LLM onto itself exhibit deg…

arXiv cs.AI TIER_1 English(EN) · Junkai Zhang, Jingru Gan, Xiaoxuan Wang, Zian Jia, Changquan Gu, Jianpeng Chen, Yanqiao Zhu, Mingyu Derek Ma, Dawei Zhou, Ling Li, Wei Wang · 2026-06-09 04:00

MatSciBench: Benchmarking the Reasoning Ability of Large Language Models in Materials Science

arXiv:2510.12171v2 Announce Type: replace Abstract: Large Language Models have shown strong scientific reasoning ability, but their performance on materials science problems remains less studied. To fill this gap, we introduce MatSciBench, a comprehensive college-level benchmark …

arXiv cs.AI TIER_1 English(EN) · Bradley P. Allen, Prateek Chhikara, Thomas Macaulay Ferguson, Filip Ilievski, Paul Groth · 2026-06-09 04:00

Sound and Complete Neurosymbolic Reasoning with LLM-Grounded Interpretations

arXiv:2507.09751v3 Announce Type: replace Abstract: Large language models (LLMs) have demonstrated impressive capabilities in natural language understanding and generation, but exhibit problems with logical consistency in their output. How can we harness LLMs' broad-coverage para…

arXiv cs.AI TIER_1 English(EN) · Subramanyam Sahoo · 2026-06-09 04:00

Calibration of Structured Ignorance Certificates for Diagnosing Unknown Unknowns in Reasoning Models

arXiv:2606.08571v1 Announce Type: cross Abstract: Large language models frequently fail in a characteristic way: rather than acknowledging ignorance, they produce fluent but incorrect answers to questions that lie beyond their knowledge boundaries. We introduce Structured…

arXiv cs.AI TIER_1 English(EN) · Hengxin Fan · 2026-06-09 04:00

Capacity, Not Format: Rethinking Structured Reasoning Failures

arXiv:2606.09410v1 Announce Type: new Abstract: Prior work treats structured output as a reasoning tax, but this framing is incomplete: the cost of formatting depends strongly on a model's spare capacity. Using information-matched prose controls and a four-level schema complexity…

arXiv cs.AI TIER_1 English(EN) · Xinyue Liang, Yizhe Yang, Yu Bai, Bin Xu, Jiawei Li, Yang Gao · 2026-06-09 04:00

Diverse Thinking Schemata Elicit Better Reasoning in Large Language Models

arXiv:2606.08974v1 Announce Type: new Abstract: Large reasoning models (LRMs) have attracted increasing attention for their ability to solve complex mathematical problems by generating extended reasoning chains. In this work, we focus on two critical yet underexplored aspects of …

arXiv cs.AI TIER_1 English(EN) · Syed Rifat Raiyan, Mohsinul Kabir, Hasan Mahmud, Md Kamrul Hasan · 2026-06-09 04:00

Artificial Intelligence for Mathematical Reasoning: An Integrated Survey of Language Models, Neuro-symbolic Systems, and Verified Discovery

arXiv:2606.08728v1 Announce Type: new Abstract: Mathematical reasoning has long served as a stringent test of machine intelligence; over the past decade, it has moved from a niche problem within NLP to one of the most consequential AI frontiers. This survey provides a unified acc…

arXiv cs.AI TIER_1 English(EN) · Beiwen Zhang, Yongheng Liang, Guowei Zou, Haitao Wang, Hejun Wu · 2026-06-09 04:00

Distilling LLM Reasoning into an Interpretable Policy Tree for Human-AI Collaboration

arXiv:2606.08596v1 Announce Type: new Abstract: Constructing efficient and reliable policies to assist humans is indispensable for human-AI collaboration. Existing methods mainly follow two lines of work. Most prior work relies on multi-agent reinforcement learning (MARL) to lear…

arXiv cs.MA (Multiagent) TIER_1 English(EN) · Foutse Khomh · 2026-06-09 02:18

Game-Theoretic Multi-Agent Control for Robust Contextual Reasoning in LLMs

Large Language Models (LLMs) in multi-turn interactions maintain evolving context rather than generating isolated responses, making them vulnerable to prompt-injection and context-poisoning attacks in which locally plausible adversarial fragments gradually distort reasoning traje…

arXiv cs.CL TIER_1 English(EN) · André Freitas · 2026-06-08 12:57

Reasoning without Gold Standards: A Proxy-Judge Theory of Autoformalization

Complex reasoning tasks increasingly require systems to produce outputs whose correctness cannot be judged by exact match against a single reference. Autoformalization (AF) is a representative example; it asks a model to translate informal mathematical or logical reasoning into a…

arXiv cs.AI TIER_1 English(EN) · Hengxin Fan · 2026-06-08 12:26

Capacity, Not Format: Rethinking Structured Reasoning Failures

Prior work treats structured output as a reasoning tax, but this framing is incomplete: the cost of formatting depends strongly on a model's spare capacity. Using information-matched prose controls and a four-level schema complexity gradient, we separate format-specific effects f…

arXiv cs.CL TIER_1 English(EN) · Christina Niklaus · 2026-06-08 09:23

TruthSplit: Operationalizing Conditional Validity in Arguments Through Multi-Perspective Reasoning

We present TruthSplit, an interactive system for multi-perspective argument analysis. Existing argumentation tools typically analyze properties of the argument itself, such as structure, quality, stance, or persuasiveness, while leaving perspective-specific background knowledge i…

arXiv cs.CL TIER_1 English(EN) · Huajun Chen · 2026-06-08 08:30

Symbolic and Abstractive Reasoning with Complex Visual Queries

Understanding and reasoning over abstract visual content remains a challenge for current multi-modal large language models (MLLMs). In this paper, we explore a novel abstract data type termed complex visual query (CVQ), designed to probe symbolic and abstractive reasoning, which …

arXiv cs.CL TIER_1 English(EN) · Liang Wang · 2026-06-08 05:01

CRANE: Knowledge Editing for Reasoning MLLMs

The emergence of reasoning multimodal large language models (MLLMs), which generate explicit chain-of-thought (CoT) reasoning before producing answers, has introduced a new challenge for knowledge editing: methods that appear successful under traditional metrics (teacher-forcing …

arXiv cs.AI TIER_1 English(EN) · Tanvi Thoria, Kiana Jafari, Marc R. Schlichting, Mykel J. Kochenderfer · 2026-06-08 04:00

How Language Models Fail: Token-Level Signatures of Committed and Persistent Reasoning Failures

arXiv:2606.06635v1 Announce Type: cross Abstract: Failures in language model reasoning emerge through distinct processes that leave identifiable signatures in the reasoning trace. We characterize these failures using token-level uncertainty signals, finding they arise through two…

arXiv cs.AI TIER_1 English(EN) · Vladislav Smirnov (MBZUAI), Chieu Nguyen (MBZUAI), Sergey Senichev (Independent Researcher), Minh Ngoc Ta (MBZUAI), Ekaterina Fadeeva (ETH Zürich), Artem Vazhentsev (MBZUAI), Daria Galimzianova (MBZUAI), Nikolai Rozanov (MBZUAI, Imperial College London… · 2026-06-08 04:00

ThinkBooster: A Unified Framework for Seamless Test-Time Scaling of LLM Reasoning

arXiv:2606.06915v1 Announce Type: cross Abstract: Test-time compute (TTC) scaling has emerged as a powerful paradigm for improving large language model (LLM) reasoning by allocating additional compute during inference, e.g., via multi-sample generation and verifier-based rerankin…

arXiv cs.AI TIER_1 English(EN) · Debjyoti Saha Roy, Byron C. Wallace, Javed A. Aslam · 2026-06-08 04:00

Characterize Then Distill: Mechanistic Reasoning in Large Output Spaces

arXiv:2606.06840v1 Announce Type: cross Abstract: Modern reasoning models offer surprisingly strong zero-shot performance on challenging multi-label tasks that require selecting a small set of relevant options from hundreds of thousands to millions of candidate labels. We investi…

arXiv cs.AI TIER_1 English(EN) · Tengyao Tu, Yulin Li, Hui-Ling Zhen, Libo Qin, Zhoujun Wei, Jinghua Piao, Zhuotao Tian, Yong Li, Min Zhang · 2026-06-08 04:00

DyCon: Dynamic Reasoning Control via Evolving Difficulty Modeling

arXiv:2606.07108v1 Announce Type: new Abstract: Recent advances in Large Reasoning Models (LRMs) demonstrate remarkable performance improvements by iteratively reflecting, exploring, and executing complex tasks, yet suffer from inefficiencies due to redundant reasoning, known as …

arXiv cs.CL TIER_1 English(EN) · Donald Ye, Max Loffgren, Om Kotadia, Linus Wong, Jonas Rohweder · 2026-06-08 04:00

Mechanistic Evidence for Faithfulness Decay in Chain-of-Thought Reasoning

arXiv:2602.11201v2 Announce Type: replace Abstract: Chain-of-Thought (CoT) explanations are widely used to interpret how language models solve complex problems, yet it remains unclear whether these step-by-step explanations reflect how the model actually reaches its answer, or me…

arXiv cs.CL TIER_1 English(EN) · Xinze Li, Yuqing Lan, Zhenghao Liu, Haidong Xin, Yukun Yan, Shuo Wang, Zheni Zeng, Sen Mei, Ge Yu, Maosong Sun · 2026-06-08 04:00

SEEK: Steering LLM Reasoning for RAG via Internal Reasoning Sketches

arXiv:2601.09402v2 Announce Type: replace Abstract: Retrieval-Augmented Generation (RAG) enhances Large Language Models (LLMs) by incorporating external knowledge into the generation process. Benefiting from the reasoning capabilities of LLMs, existing methods have leveraged such…

arXiv cs.CL TIER_1 English(EN) · Yongliang Miao, Fengyuan Liu, Wei Shi, Yanguang Liu, Fei Sun, Na Zou, Mengnan Du · 2026-06-08 04:00

RASFT: Rollout-Adaptive Supervised Fine-Tuning for Reasoning

arXiv:2606.07006v1 Announce Type: cross Abstract: Supervised fine-tuning (SFT) is a prevailing method for adapting large language models to reasoning tasks by imitating offline expert demonstrations, often treating a single expert trajectory as the target behavior. However, reaso…

arXiv cs.CL TIER_1 English(EN) · Zhixuan He, Yue Feng · 2026-06-08 04:00

When to Think Deeply: Inhibitory Deliberation for LLM Reasoning

arXiv:2606.06745v1 Announce Type: new Abstract: Reasoning Large Language Models can improve problem-solving performance through deliberative inference, but invoking slow reasoning for every input is computationally expensive and often unnecessary. We propose IDPR, a framework for…

arXiv cs.AI TIER_1 English(EN) · Raman Saparkhan, Majd Hawasly, Md Rizwan Parvez, Mohammad Raza · 2026-06-08 04:00

Self-Consistency from Only Two Samples: CoT-PoT Ensembling for Efficient LLM Reasoning

arXiv:2604.17433v2 Announce Type: replace-cross Abstract: Self-consistency (SC) is a popular technique for improving the reasoning accuracy of large language models by aggregating multiple sampled outputs, but it comes at a high computational cost due to extensive sampling. We in…

arXiv cs.AI TIER_1 English(EN) · Yuxiang Chen, Jun Wang · 2026-06-08 04:00

A Comprehensive Anatomy of Human and DeepSeek-R1 LLM Mathematical Reasoning

arXiv:2606.07410v1 Announce Type: cross Abstract: The emergence of "Aha moments" in large language models, particularly DeepSeek-R1-0120, has raised the question of whether these systems genuinely reason or merely imitate the appearance of reasoning. We conduct a comprehensive em…

arXiv cs.AI TIER_1 English(EN) · Rahul Nair, Chun Tao · 2026-06-08 04:00

The Fine-Tuning Trap: Evaluating Negative Transfer and the Role of PEFT in Sub-1B Mathematical Reasoning

arXiv:2606.06920v1 Announce Type: cross Abstract: Deploying Small Language Models (SLMs) on edge devices requires efficient fine-tuning strategies that adapt models to new tasks without degrading their general capabilities. In this study, we benchmark five sub-1B models (135M-1B)…

arXiv cs.AI TIER_1 English(EN) · Hejun Wu · 2026-06-07 12:20

Distilling LLM Reasoning into an Interpretable Policy Tree for Human-AI Collaboration

Constructing efficient and reliable policies to assist humans is indispensable for human-AI collaboration. Existing methods mainly follow two lines of work. Most prior work relies on multi-agent reinforcement learning (MARL) to learn black-box policies, which limits interpretabil…

arXiv cs.AI TIER_1 English(EN) · Subramanyam Sahoo · 2026-06-07 11:01

Calibration of Structured Ignorance Certificates for Diagnosing Unknown Unknowns in Reasoning Models

Large language models frequently fail in a characteristic way: rather than acknowledging ignorance, they produce fluent but incorrect answers to questions that lie beyond their knowledge boundaries. We introduce Structured Ignorance Certificates (SICs), a JSON-formatted …

arXiv cs.CL TIER_1 English(EN) · Xueru Zhang · 2026-06-06 18:31

TLRD: Teaching LLMs to Reason over Tabular Data with Tri-Level Rationale Distillation

Tabular data is a primary medium for storing real-world information, driving many industrial applications of machine learning. Traditional predictors achieve strong predictive performance but do not provide readable, case-specific explanations essential for decision-making. Large…

arXiv cs.AI TIER_1 English(EN) · Xiaopeng Yuan, Haibo Jin, Ye Yu, Peng Kuang, Lijun Yu, Yushun Dong, Haohan Wang · 2026-06-06 04:00

Closing the Loop on Latent Reasoning via Test-Time Reconstruction

arXiv:2606.06252v1 Announce Type: new Abstract: Recent work moves intermediate reasoning from natural-language traces into latent or cache-level representations to reduce token overhead and avoid a discrete communication bottleneck. However, this shift also removes a key advantag…

arXiv cs.AI TIER_1 English(EN) · Jiate Liu, Zebin Chen, Shaobo Qiao, Mingchen Ju, Danting Zhang, Bocheng Han, Shuyue Yu, Xin Shu, Jinglin Wu, Dong Wen, Xin Cao, Guanfeng Liu, Zhengyi Yang · 2026-06-06 04:00

A2RAG: Adaptive Agentic Graph Retrieval for Cost-Aware and Reliable Reasoning

arXiv:2601.21162v2 Announce Type: replace-cross Abstract: Graph Retrieval-Augmented Generation (Graph-RAG) enhances multihop question answering by organizing corpora into knowledge graphs and routing evidence through relational structure. However, practical deployments face two p…

arXiv cs.AI TIER_1 English(EN) · Hamed Nejat, Alexander Maier, Jesse Spencer-Smith, André M. Bastos · 2026-06-06 04:00

Ontology-constrained multi-LLM scoring of hypothesis support in the predictive processing literature

arXiv:2606.05206v1 Announce Type: cross Abstract: Fragmentation is common in interdisciplinary fields with diverse methods and theoretical commitments. Predictive coding neuroscience is a clear example: its literature spans computational theory, electrophysiology, imaging, behavi…

arXiv cs.LG TIER_1 English(EN) · Jun Wang · 2026-06-05 15:57

A Comprehensive Anatomy of Human and DeepSeek-R1 LLM Mathematical Reasoning

The emergence of "Aha moments" in large language models, particularly DeepSeek-R1-0120, has raised the question of whether these systems genuinely reason or merely imitate the appearance of reasoning. We conduct a comprehensive empirical comparison between model and human reasoni…

arXiv cs.AI TIER_1 English(EN) · Min Zhang · 2026-06-05 10:02

DyCon: Dynamic Reasoning Control via Evolving Difficulty Modeling

Recent advances in Large Reasoning Models (LRMs) demonstrate remarkable performance improvements by iteratively reflecting, exploring, and executing complex tasks, yet suffer from inefficiencies due to redundant reasoning, known as "overthinking". Existing methods to mitigate thi…

arXiv cs.CL TIER_1 English(EN) · Mengnan Du · 2026-06-05 07:52

RASFT: Rollout-Adaptive Supervised Fine-Tuning for Reasoning

Supervised fine-tuning (SFT) is a prevailing method for adapting large language models to reasoning tasks by imitating offline expert demonstrations, often treating a single expert trajectory as the target behavior. However, reasoning is not simple path imitation: rigidly followi…

arXiv cs.AI TIER_1 English(EN) · Chun Tao · 2026-06-05 05:34

The Fine-Tuning Trap: Evaluating Negative Transfer and the Role of PEFT in Sub-1B Mathematical Reasoning

Deploying Small Language Models (SLMs) on edge devices requires efficient fine-tuning strategies that adapt models to new tasks without degrading their general capabilities. In this study, we benchmark five sub-1B models (135M-1B) on mathematical reasoning tasks and uncover a cri…

arXiv cs.CL TIER_1 English(EN) · Artem Shelmanov · 2026-06-05 05:28

ThinkBooster: A Unified Framework for Seamless Test-Time Scaling of LLM Reasoning

Test-time compute (TTC) scaling has emerged as a powerful paradigm for improving large language model (LLM) reasoning by allocating additional compute during inference, e.g., via multi-sample generation and verifier-based reranking. Existing TTC scaling strategies and reasoning s…

arXiv cs.LG TIER_1 English(EN) · Nirit Nussbaum-Hoffer, Nitay Calderon, Liat Ein-Dor, Roi Reichart · 2026-06-05 04:00

LLM Explainability with Counterfactual Chains and Causal Graphs

arXiv:2606.05972v1 Announce Type: new Abstract: Causal graphs provide a high-level language for making mechanisms transparent. Recent work uses Large Language Models (LLMs) to recover causal graphs of external-world processes. Instead, in this paper, we use causal graphs to model…

arXiv cs.LG TIER_1 English(EN) · Locke Cai, Max Ryabinin, Ivan Provilkov · 2026-06-05 04:00

Escaping the Verifier: Learning to Reason via Demonstrations

arXiv:2511.21667v4 Announce Type: replace Abstract: Training Large Language Models (LLMs) to reason often relies on Reinforcement Learning (RL) with task-specific verifiers. However, many real-world reasoning-intensive tasks lack verifiers, despite offering abundant expert demons…

arXiv cs.LG TIER_1 English(EN) · Mykyta Ielanskyi, Kajetan Schweighofer, Lukas Aichberger, Sepp Hochreiter · 2026-06-05 04:00

RREDCoT: Segment-Level Reward Redistribution for Reasoning Models

arXiv:2606.06475v1 Announce Type: new Abstract: Recent advancements in reasoning language models have been driven by Reinforcement Learning (RL) fine-tuning. Most often, these rely on the Group Relative Policy Optimization (GRPO) algorithm or modifications thereof to steer the mo…

arXiv cs.LG TIER_1 English(EN) · Rohan Siva, Neel P. Bhatt, Yunhao Yang, Seoyoung Lee, Nishant Gadde, Christian Ellis, Alvaro Velasquez, Zhangyang Wang, Ufuk Topcu · 2026-06-05 04:00

What Objects Enable, Not What They Are: Functional Latent Spaces for Affordance Reasoning

arXiv:2606.05533v1 Announce Type: new Abstract: Existing robot planning systems rely on appearance-based reasoning, where visual observations are encoded into latent spaces organized around object appearances (e.g., recognizing a "cart" based on how it looks). However, planning r…

arXiv cs.CL TIER_1 English(EN) · Ashima Suvarna, Kendrick Phan, Mehrab Beikzadeh, Hritik Bansal, Saadia Gabriel · 2026-06-05 04:00

SUPERNOVA: Eliciting General Reasoning in LLMs with Reinforcement Learning on Natural Instructions

arXiv:2604.08477v2 Announce Type: replace-cross Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has substantially improved reasoning in formal domains such as mathematics and code, but extending these gains beyond STEM remains challenging. Extending RLVR beyond ST…

arXiv cs.CL TIER_1 English(EN) · Chengwei Wei, Jung-jae Kim, Longyin Zhang, Shengkai Chen, Nancy F. Chen · 2026-06-05 04:00

InfoDensity: Rewarding Information-Dense Traces for Efficient Reasoning

arXiv:2603.17310v2 Announce Type: replace-cross Abstract: Large Language Models (LLMs) with extended reasoning capabilities often generate verbose and redundant reasoning traces, incurring unnecessary computational cost. While existing reinforcement learning approaches address th…

arXiv cs.CL TIER_1 English(EN) · Zhenyuan Guo, Tong Chen, Wenlong Meng, Chen Gong, Xin Yu, Chengkun Wei, Wenzhi Chen · 2026-06-05 04:00

Dynamic Thinking-Token Selection for Efficient Reasoning in Large Reasoning Models

arXiv:2601.18383v2 Announce Type: replace-cross Abstract: Large Reasoning Models (LRMs) excel at solving complex problems by explicitly generating a reasoning trace before deriving the final answer. However, these extended generations incur substantial memory footprint and comput…

arXiv cs.CL TIER_1 English(EN) · Ayoung Lee, Ryan Sungmo Kwon, Peter Railton, Lu Wang · 2026-06-05 04:00

CLASH: Evaluating Language Models on Judging High-Stakes Dilemmas from Multiple Perspectives

arXiv:2504.10823v4 Announce Type: replace Abstract: Navigating dilemmas involving conflicting values is challenging even for humans in high-stakes domains, let alone for AI, yet prior work has been limited to everyday scenarios. To close this gap, we introduce CLASH (Character pe…

arXiv cs.CL TIER_1 English(EN) · Maxime Griot, Paul Steven Scotti, Tanishq Mathew Abraham · 2026-06-05 04:00

Compress-Distill: Reasoning Trace Compression for Efficient Knowledge Distillation

arXiv:2606.05988v1 Announce Type: cross Abstract: Reasoning models produce long chain-of-thought traces that are costly to distill and encourage verbose student outputs. We study post-hoc compression of such traces before knowledge distillation. Two teachers, Qwen3.5-397B-A17B an…

arXiv cs.CL TIER_1 English(EN) · Guancheng Tu, Xiangjun Fu, Suhao Yu, Yao Tang, Haoqiang Kang, Lianhui Qin, Yizhe Zhang, Jiatao Gu · 2026-06-05 04:00

Latent Reasoning with Normalizing Flows

arXiv:2606.06447v1 Announce Type: new Abstract: Large language models often improve reasoning by generating explicit chain-of-thought (CoT), demonstrating the importance of intermediate computation. However, textual CoT forces this computation through a discrete, serial, and comm…

arXiv cs.CL TIER_1 English(EN) · Jinyang Zhang, Hongxin Ding, Yue Fang, Weibin Liao, Muyang Ye, Junfeng Zhao, Yasha Wang · 2026-06-05 04:00

The Tell-Tale Norm: $\ell_2$ Magnitude as a Signal for Reasoning Dynamics in Large Language Models

arXiv:2606.06188v1 Announce Type: new Abstract: Recent work has sought to understand Large Language Models (LLMs) reasoning, yet a principled, model-intrinsic signal that captures its layer-wise reasoning dynamics remains underexplored. We bridge this gap by demonstrating that th…

arXiv cs.CL TIER_1 English(EN) · Liting Zhang, Shiwan Zhao, Xuyang Zhao, Zichen Xu, Jianye Wang, Qicheng Li · 2026-06-05 04:00

TARPO: Token-Wise Latent-Explicit Reasoning via Action-Routing Policy Optimization

arXiv:2606.05859v1 Announce Type: new Abstract: Latent reasoning has emerged as a promising alternative to discrete Chain-of-Thought (CoT) in large language models (LLMs), enabling more expressive reasoning by operating over continuous representations. However, the inherently det…

arXiv cs.CL TIER_1 English(EN) · Jinu Lee, Shivam Agarwal, Amruta Parulekar, Siddarth Madala, Dilek Hakkani-Tur, Julia Hockenmaier · 2026-06-05 04:00

ReasoningFlow: Discourse Structures for Understanding LLM Reasoning Traces

arXiv:2606.05402v1 Announce Type: new Abstract: Large reasoning models (LRMs) produce reasoning traces with non-linear structures, such as backtracking and self-correction, that complicate the evaluation and monitoring of the reasoning process. We introduce ReasoningFlow, a frame…

arXiv cs.CL TIER_1 English(EN) · Ryan Solgi, Jiayi Tian, Zheng Zhang · 2026-06-05 04:00

LoRi: Low-Rank Distillation for Implicit Reasoning

arXiv:2606.05315v1 Announce Type: new Abstract: Implicit chain-of-thought (iCoT) methods aim to internalize reasoning in large language models, but often underperform explicit CoT prompting. We empirically find that hidden-state reasoning trajectories exhibit low-rank structure. …

arXiv cs.CL TIER_1 English(EN) · Javed A. Aslam · 2026-06-05 02:32

Characterize Then Distill: Mechanistic Reasoning in Large Output Spaces

Modern reasoning models offer surprisingly strong zero-shot performance on challenging multi-label tasks that require selecting a small set of relevant options from hundreds of thousands to millions of candidate labels. We investigate how they achieve this mechanistically. We cha…

arXiv cs.CL TIER_1 English(EN) · Yue Feng · 2026-06-04 21:57

When to Think Deeply: Inhibitory Deliberation for LLM Reasoning

Reasoning Large Language Models can improve problem-solving performance through deliberative inference, but invoking slow reasoning for every input is computationally expensive and often unnecessary. We propose IDPR, a framework for response-conditioned inhibitory deliberation. I…

arXiv cs.CL TIER_1 English(EN) · Mykel J. Kochenderfer · 2026-06-04 18:36

How Language Models Fail: Token-Level Signatures of Committed and Persistent Reasoning Failures

Failures in language model reasoning emerge through distinct processes that leave identifiable signatures in the reasoning trace. We characterize these failures using token-level uncertainty signals, finding they arise through two empirically distinguishable processes. The first …

arXiv cs.AI TIER_1 English(EN) · Sepp Hochreiter · 2026-06-04 17:56

RREDCoT: Segment-Level Reward Redistribution for Reasoning Models

Recent advancements in reasoning language models have been driven by Reinforcement Learning (RL) fine-tuning. Most often, these rely on the Group Relative Policy Optimization (GRPO) algorithm or modifications thereof to steer the models to produce Chain-of-Thought (CoT) traces. T…

arXiv cs.CL TIER_1 English(EN) · Jiatao Gu · 2026-06-04 17:44

Latent Reasoning with Normalizing Flows

Large language models often improve reasoning by generating explicit chain-of-thought (CoT), demonstrating the importance of intermediate computation. However, textual CoT forces this computation through a discrete, serial, and communication-oriented token stream: each reasoning …

Hugging Face Daily Papers TIER_1 English(EN) · 2026-06-04 17:44

Latent Reasoning with Normalizing Flows

Large language models often improve reasoning by generating explicit chain-of-thought (CoT), demonstrating the importance of intermediate computation. However, textual CoT forces this computation through a discrete, serial, and communication-oriented token stream: each reasoning …

arXiv cs.AI TIER_1 English(EN) · Haohan Wang · 2026-06-04 14:54

Closing the Loop on Latent Reasoning via Test-Time Reconstruction

Recent work moves intermediate reasoning from natural-language traces into latent or cache-level representations to reduce token overhead and avoid a discrete communication bottleneck. However, this shift also removes a key advantage of textual reasoning: intermediate states are …

arXiv cs.CL TIER_1 English(EN) · Yasha Wang · 2026-06-04 13:59

The Tell-Tale Norm: $\ell_2$ Magnitude as a Signal for Reasoning Dynamics in Large Language Models

Recent work has sought to understand Large Language Models (LLMs) reasoning, yet a principled, model-intrinsic signal that captures its layer-wise reasoning dynamics remains underexplored. We bridge this gap by demonstrating that the l2 norm of hidden states serves as an endogeno…

arXiv cs.CL TIER_1 English(EN) · Tanishq Mathew Abraham · 2026-06-04 10:30

Compress-Distill: Reasoning Trace Compression for Efficient Knowledge Distillation

Reasoning models produce long chain-of-thought traces that are costly to distill and encourage verbose student outputs. We study post-hoc compression of such traces before knowledge distillation. Two teachers, Qwen3.5-397B-A17B and gpt-oss-120B, generate about 283k correct traces…

arXiv cs.CL TIER_1 English(EN) · Qicheng Li · 2026-06-04 08:30

TARPO: Token-Wise Latent-Explicit Reasoning via Action-Routing Policy Optimization

Latent reasoning has emerged as a promising alternative to discrete Chain-of-Thought (CoT) in large language models (LLMs), enabling more expressive reasoning by operating over continuous representations. However, the inherently deterministic nature of continuous representations …

arXiv cs.AI TIER_1 English(EN) · Anshul Nayak, Shahil Shaik, Yue Wang · 2026-06-04 04:00

Belief-Aware VLM Model for Human-like Reasoning

arXiv:2604.09686v2 Announce Type: replace Abstract: Traditional neural network models for intent inference rely heavily on observable states and struggle to generalize across diverse tasks and dynamic environments. Recent advances in Vision Language Models (VLMs) and Vision Langu…

arXiv cs.AI TIER_1 English(EN) · Wang Yang, Xiang Yue, Vipin Chaudhary, Xiaotian Han · 2026-06-04 04:00

Speculative Thinking: Enhancing Small-Model Reasoning with Large Model Guidance at Inference Time

arXiv:2504.12329v2 Announce Type: replace-cross Abstract: Recent advances leverage post-training to enhance model reasoning performance, which typically requires costly training pipelines and still suffers from inefficient, overly lengthy outputs. We introduce Speculative Thinkin…

arXiv cs.AI TIER_1 English(EN) · Zheng Du, Hao Kang, Song Han, Tushar Krishna, Ligeng Zhu · 2026-06-04 04:00

OckBench: Measuring the Efficiency of LLM Reasoning

arXiv:2511.05722v3 Announce Type: replace-cross Abstract: Large language models (LLMs) such as GPT-5 and Gemini 3 have pushed the frontier of automated reasoning and code generation. Yet current benchmarks emphasize accuracy and output quality, neglecting a critical dimension: ef…

arXiv cs.AI TIER_1 English(EN) · Wang Yang, Debargha Ganguly, Xinpeng Li, Chaoda Song, Shouren Wang, Vikash Singh, Vipin Chaudhary, Xiaotian Han · 2026-06-04 04:00

Mid-Think: Training-Free Intermediate-Budget Reasoning via Token-Level Triggers

arXiv:2601.07036v2 Announce Type: replace-cross Abstract: Hybrid reasoning language models are commonly controlled through high-level Think/No-think instructions to regulate reasoning behavior, yet we found that such mode switching is largely driven by a small set of trigger toke…

arXiv cs.AI TIER_1 English(EN) · Ethan Mendes, Jungsoo Park, Alan Ritter · 2026-06-04 04:00

Making Expert Reasoning Learnable with Self-Distillation

arXiv:2602.02405v2 Announce Type: replace-cross Abstract: Improving the reasoning capabilities of large language models (LLMs) typically relies either on the model's ability to sample a correct solution to be reinforced or the existence of a stronger model able to solve the probl…

arXiv cs.AI TIER_1 English(EN) · Jonas Petersen, Camilla Mazzoleni, Gian-Alessandro Lombardi, Federico Martelli, Riccardo Maggioni · 2026-06-04 04:00

What Structural Inductive Bias Helps Transformers Reason Over Knowledge Graphs? A Study with Tabula RASA

arXiv:2602.02834v4 Announce Type: replace-cross Abstract: What structural inductive bias helps transformers reason over knowledge graphs? Through controlled ablations of a minimal transformer modification with four independently removable components (sparse adjacency masking, edg…

arXiv cs.CL TIER_1 English(EN) · Chongyang He, Rui Zhang, Zixuan Wang, Xin Li · 2026-06-04 04:00

Learning What to Learn: Stage-Specific Data Sets for SFT-then-RL in Small Language Model Reasoning

arXiv:2606.04466v1 Announce Type: new Abstract: Post-training Small Language Models (SLMs) for reasoning typically follows an SFT-then-RL pipeline, yet existing work rarely considers what data should be learned at each stage. We argue that data strategy should be aligned with the…

arXiv cs.CL TIER_1 English(EN) · Haoran Zhang, Yafu Li, Zhi Wang, Zhilin Wang, Shunkai Zhang, Xiaoye Qu, Yu Cheng · 2026-06-04 04:00

Characterizing, Evaluating, and Optimizing Complex Reasoning

arXiv:2602.08498v2 Announce Type: replace Abstract: Large Reasoning Models (LRMs) increasingly rely on reasoning traces with complex internal structures. However, existing work lacks a unified answer to three fundamental questions: (1) what defines high-quality reasoning, (2) how…

arXiv cs.CL TIER_1 English(EN) · Siqi Fan, Minghao Li, Xiaoqian Ma, Xiusheng Huang, Zhuo Chen, Bowen Qin, Liujie Zhang, Shuo Shang, Weihang Chen · 2026-06-04 04:00

Hint Tuning: Less Data Makes Better Reasoners

arXiv:2605.08665v2 Announce Type: replace Abstract: Large reasoning models achieve high accuracy through extended chain-of-thought but generate 5--8 more tokens than necessary, applying verbose reasoning uniformly regardless of problem difficulty. We propose Hint Tuning, a data-e…

arXiv cs.CL TIER_1 English(EN) · Sanket Badhe, Deep Shah · 2026-06-04 04:00

Prompt-Level Distillation: A Non-Parametric Alternative to Model Fine-Tuning for Efficient Reasoning

arXiv:2602.21103v2 Announce Type: replace Abstract: Advanced reasoning typically requires Chain-of-Thought prompting, which is accurate but incurs prohibitive latency and substantial test-time inference costs. The standard alternative, fine-tuning smaller models, often sacrifices…

arXiv cs.CL TIER_1 English(EN) · Yelysei Bondarenko, Thomas Hehn, Rob Hesselink, Romain Lepert, Fabio Valerio Massoli, Evgeny Mironov, Leyla Mirvakhabova, Tribhuvanesh Orekondy, Spyridon Stasis, Andrey Kuzmin, Anna Kuzina, Markus Nagel, Ankita Nayak, Corrado Rainone, Ork de Rooij, Paul … · 2026-06-04 04:00

Efficient Reasoning on the Edge

arXiv:2603.16867v2 Announce Type: replace-cross Abstract: Large language models (LLMs) with chain-of-thought reasoning achieve state-of-the-art performance across complex problem-solving tasks, but their verbose reasoning traces and large context requirements make them impractica…

arXiv cs.LG TIER_1 English(EN) · Tiehua Mei, Minxuan Lv, Leiyu Pan, Zhenpeng Su, Hongru Hou, Hengrui Chen, Ao Xu, Deqing Yang · 2026-06-04 04:00

Good Reasoning Makes Good Demonstrations: Implicit Reasoning Quality Supervision via In-Context Reinforcement Learning

arXiv:2603.09803v2 Announce Type: replace Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) improves reasoning in large language models but treats all correct solutions equally, potentially reinforcing flawed traces that arrive at correct answers by chance. We obser…

arXiv cs.LG TIER_1 English(EN) · Gleb Rodionov, Roman Garipov, George Yakushev · 2026-06-04 04:00

Reasoning Shift: How Context Silently Shortens LLM Reasoning

arXiv:2604.01161v2 Announce Type: replace Abstract: Large language models (LLMs) exhibiting test-time scaling behavior, such as extended reasoning traces and self-verification, have demonstrated remarkable performance on complex, long-term reasoning tasks. However, the robustness…

arXiv cs.AI TIER_1 English(EN) · Jingbo Wen, Liang He, Ziqi He · 2026-06-04 04:00

Not All Errors Are Equal: Consequence-Aware Reasoning Compute Allocation

arXiv:2606.04402v1 Announce Type: new Abstract: Modern reasoning models can allocate different amounts of test-time computation, such as thinking tokens, model calls, or compute budget, to different tasks. Existing methods generally drive this allocation by predicted difficulty a…

arXiv cs.AI TIER_1 English(EN) · Yuhan Yang, Ruipu Li, Alexander Rodríguez · 2026-06-04 04:00

Simulate, Reason, Decide: Scientific Reasoning with LLMs for Simulation-Driven Decision Making

arXiv:2606.04505v1 Announce Type: new Abstract: Scientific simulators are increasingly being integrated into LLM-driven systems for high-stakes simulation-driven decision-making. However, existing frameworks primarily use LLMs to generate, calibrate, or execute simulators, treati…

arXiv cs.AI TIER_1 English(EN) · Leonardo Bertolazzi, Katya Tentori, Raffaella Bernardi · 2026-06-04 04:00

FALSIFYBENCH: Evaluating Inductive Reasoning in LLMs with Rule Discovery Games

arXiv:2606.04751v1 Announce Type: new Abstract: Large language models (LLMs) are increasingly deployed as autonomous agents in scientific tasks. Yet whether these systems can effectively engage in forms of inductive reasoning relevant to scientific discovery remains an open quest…

arXiv cs.AI TIER_1 English(EN) · Guangyao Dou, William Jurayj, Nils Holzenberger, Benjamin Van Durme · 2026-06-04 04:00

DAR: Deontic Reasoning with Agentic Harnesses

arXiv:2606.05009v1 Announce Type: cross Abstract: Deontic reasoning is the task of answering questions by applying explicit rules and policies to case-specific facts, for example computing tax liability under a statute or determining the outcome of an immigration appeal. A key te…

arXiv cs.AI TIER_1 English(EN) · Zehua Cheng, Wei Dai, Jiahao Sun · 2026-06-04 04:00

Invariant Gradient Alignment for Robust Reasoning Distillation

arXiv:2606.05025v1 Announce Type: cross Abstract: Large language models (LLMs) suffer from shortcut learning: they systematically fail on out-of-distribution (OOD) inputs whose semantic surface differs from training data, even when the logical structure is identical. This undermi…

arXiv cs.AI TIER_1 English(EN) · Wang Yang, Zirui Liu, Hongye Jin, Qingyu Yin, Vipin Chaudhary, Xiaotian Han · 2026-06-04 04:00

Longer Context, Deeper Thinking: Uncovering the Role of Long-Context Ability in Reasoning

arXiv:2505.17315v2 Announce Type: replace Abstract: Recent language models exhibit strong reasoning capabilities, yet the influence of long-context capacity on reasoning remains underexplored. In this work, we hypothesize that current limitations in reasoning stem, in part, from …

Hugging Face Daily Papers TIER_1 English(EN) · 2026-06-04 00:00

Latent Reasoning with Normalizing Flows

Latent reasoning framework using normalizing flows preserves autoregressive generation advantages while enabling efficient, probabilistic intermediate computation in large language models.

Hugging Face Daily Papers TIER_1 English(EN) · 2026-06-04 00:00

LLM Explainability with Counterfactual Chains and Causal Graphs

Causal graphs are used to model large language model inference processes, enabling transparent visualization of how models perceive and organize high-level concepts for predictions through a four-phase method involving concept discovery, mapping, and MCMC-inspired counterfactual …

Hugging Face Daily Papers TIER_1 English(EN) · 2026-06-04 00:00

Compress-Distill: Reasoning Trace Compression for Efficient Knowledge Distillation

Post-hoc compression of reasoning traces reduces computational costs and inference lengths while maintaining high accuracy, offering an accuracy-efficiency trade-off in knowledge distillation.

arXiv cs.LG TIER_1 English(EN) · Jiahao Sun · 2026-06-03 15:48

Invariant Gradient Alignment for Robust Reasoning Distillation

Large language models (LLMs) suffer from shortcut learning: they systematically fail on out-of-distribution (OOD) inputs whose semantic surface differs from training data, even when the logical structure is identical. This undermines knowledge distillation pipelines that transfer…

arXiv cs.CL TIER_1 English(EN) · Benjamin Van Durme · 2026-06-03 15:29

DAR: Deontic Reasoning with Agentic Harnesses

Deontic reasoning is the task of answering questions by applying explicit rules and policies to case-specific facts, for example computing tax liability under a statute or determining the outcome of an immigration appeal. A key technical challenge for LLM-based deontic reasoning …

arXiv cs.AI TIER_1 English(EN) · Raffaella Bernardi · 2026-06-03 11:33

FALSIFYBENCH: Evaluating Inductive Reasoning in LLMs with Rule Discovery Games

Large language models (LLMs) are increasingly deployed as autonomous agents in scientific tasks. Yet whether these systems can effectively engage in forms of inductive reasoning relevant to scientific discovery remains an open question. In this work, we introduce FALSIFYBENCH, an…

Hugging Face Daily Papers TIER_1 English(EN) · 2026-06-03 05:25

Learning What to Learn: Stage-Specific Data Sets for SFT-then-RL in Small Language Model Reasoning

Post-training Small Language Models (SLMs) for reasoning typically follows an SFT-then-RL pipeline, yet existing work rarely considers what data should be learned at each stage. We argue that data strategy should be aligned with the distinct roles of SFT and RL: SFT is better sui…

arXiv cs.CL TIER_1 English(EN) · Xin Li · 2026-06-03 05:25

Learning What to Learn: Stage-Specific Data Sets for SFT-then-RL in Small Language Model Reasoning

Post-training Small Language Models (SLMs) for reasoning typically follows an SFT-then-RL pipeline, yet existing work rarely considers what data should be learned at each stage. We argue that data strategy should be aligned with the distinct roles of SFT and RL: SFT is better sui…

arXiv cs.AI TIER_1 English(EN) · Clément Yvernes, Emilie Devijver, Marianne Clausel, Eric Gaussier · 2026-06-03 04:00

Unveiling the Structure of Do-Calculus Reasoning via Derivation Graphs

arXiv:2606.03719v1 Announce Type: new Abstract: The do-calculus defines a general system of inference for interventional queries, allowing causal quantities to be transformed through successive applications of its rules. This process induces a rich space of equivalent interventio…

arXiv cs.AI TIER_1 English(EN) · Zhengyi Zhao, Shubo Zhang, Huimin Wang, Zezhong Wang, Yutian Zhao, Yefeng Zheng, Binyang Li, Yulan He, Kam-Fai Wong, Xian Wu · 2026-06-03 04:00

Bridging Auxiliary Constraints to Resolve Instruction Following in Large Reasoning Models

arXiv:2606.03624v1 Announce Type: new Abstract: Large Reasoning Models (LRMs) have demonstrated impressive capabilities in many tasks, yet they struggle with reliably following multiple instructions, either by failing to satisfy individual constraints or by struggling to balance …

arXiv cs.AI TIER_1 English(EN) · Chuang Yu, Jinmiao Zhao, Mingxuan Zhao, Yunpeng Liu, Xiujun Shu, Yuanhao Feng, Bo Wang, Xiangyu Yue · 2026-06-03 04:00

MIND: Multi-rationale INtegrated Discriminative Reasoning Framework for Multi-modal Large Models

arXiv:2512.05530v2 Announce Type: replace Abstract: Recently, multimodal large language models (MLLMs) have been widely applied to reasoning tasks. However, they suffer from limited multi-rationale semantic modeling, insufficient logical robustness, and susceptibility to misleadi…

arXiv cs.AI TIER_1 English(EN) · Areeb Gani, Asal Meskin, Gabrielle Kaili-May Liu, Arman Cohan · 2026-06-03 04:00

Quantifying Faithful Confidence Expression in Large Reasoning Models

arXiv:2606.03969v1 Announce Type: cross Abstract: Reliable uncertainty communication is critical to the trustworthiness of LLMs, yet faithful calibration (FC)--the alignment between models' intrinsic and (linguistically) expressed confidence--is a persistent failure mode. This ch…

arXiv cs.AI TIER_1 English(EN) · Yu Xia, Zhouhang Xie, Xin Xu, Byungkyu Kang, Prarit Lamba, Xiang Gao, Julian McAuley · 2026-06-03 04:00

Agentic Chain-of-Thought Steering for Efficient and Controllable LLM Reasoning

arXiv:2606.03965v1 Announce Type: cross Abstract: Large language models improve final-answer accuracy through extended chain-of-thought reasoning, but often spend tokens inefficiently and offer little inference-time control. Existing efficient reasoning methods control thinking l…

arXiv cs.AI TIER_1 English(EN) · Dongwon Jung, Peng Shi, Yi Zhang, Junshan Zhang, Muhao Chen · 2026-06-03 04:00

Adaptive Latent Agentic Reasoning

arXiv:2606.02871v1 Announce Type: cross Abstract: Large reasoning models improve performance by generating extended chain-of-thought (CoT) reasoning, but this behavior becomes inefficient when applied to LLM agents. Current LLM agents often generate verbose textual reasoning at e…

arXiv cs.AI TIER_1 English(EN) · Eric Cho, Shawn Huang, Alice Lu, Andy Lyu · 2026-06-03 04:00

Hedge-Bench: Benchmarking Agents on Hard, Realistic Tasks Pertaining to Financial Reasoning

arXiv:2606.03918v1 Announce Type: new Abstract: AI agents can increasingly handle the mechanical tasks of financial analysis: retrieving documents, calculating formulas, updating spreadsheets. The harder, more valuable challenge is reasoning through the open-ended questions that …

arXiv cs.AI TIER_1 English(EN) · Ayushi Chadha · 2026-06-03 04:00

When to Re-Plan: Subgoal Persistence in Hierarchical Latent Reasoning

arXiv:2606.03741v1 Announce Type: new Abstract: Long-horizon reasoning requires a system to commit to medium-horizon intent without becoming rigid: re-plan too often and computation never coheres into multi-step structure; commit too long and the plan goes stale. We study this st…

arXiv cs.AI TIER_1 English(EN) · Hongyu Guo, Hao Li, He Cao, Gongbo Zhang, Li Yuan · 2026-06-03 04:00

From Answers to States: Verifiable Process-Level Evaluation of Chemical Reasoning in Large Language Models

arXiv:2606.03660v1 Announce Type: new Abstract: Large language models are increasingly used as chemistry assistants, yet most chemistry benchmarks still score only final answers. This masks a critical failure mode: a model may output the correct molecule, product, or option while…

arXiv cs.AI TIER_1 English(EN) · Simone Caldarella, Davide Talon, Rahaf Aljundi, Elisa Ricci, Massimiliano Mancini · 2026-06-03 04:00

Thinking Past the Answer: Evaluating Harmful Overthinking in Large Reasoning Models

arXiv:2606.02835v1 Announce Type: new Abstract: Large Reasoning Models (LRMs) improve performance by generating explicit intermediate reasoning traces through increased test-time compute, yet the assumption that longer reasoning is consistently beneficial remains under-examined. …

arXiv cs.AI TIER_1 English(EN) · Zhihan Lei, Jiarui Yan, Joshua Momo, William W. Cohen · 2026-06-03 04:00

Inducing Reasoning Primitives from Agent Traces

arXiv:2606.02994v1 Announce Type: new Abstract: ReAct-style LLM agents often rediscover the same reasoning routines across problems, yet leave those routines trapped in transient scratchpads. We introduce Reasoning Primitive Induction, a single-pass method that mines successful R…

arXiv cs.AI TIER_1 English(EN) · Ziyan Liu, Xueda Shen, Yuzhe Gu, Songyang Gao, Kuikun Liu, Guangran Cheng, Chengqi Lyu, Dahua Lin, Wenwei Zhang, Kai Chen · 2026-06-03 04:00

ThoughtFold: Folding Reasoning Chains via Introspective Preference Learning

arXiv:2606.03503v1 Announce Type: new Abstract: Large Reasoning Models (LRMs) have achieved remarkable progress thanks to Reinforcement Learning with Verifiable Rewards (RLVR) on Chain-of-Thoughts (CoTs). However, since long CoTs naturally contain trial and error and mainstream …

arXiv cs.LG TIER_1 English(EN) · Aijia Cheng, Kailong Wang, Ling Shi, Yongxin Zhao · 2026-06-03 04:00

R2IF: Aligning Reasoning with Decisions via Composite Rewards for Interpretable LLM Function Calling

arXiv:2604.20316v2 Announce Type: replace Abstract: Function calling empowers large language models (LLMs) to interface with external tools, yet existing RL-based approaches suffer from misalignment between reasoning processes and tool-call decisions. We propose R2IF, a reasoning…

arXiv cs.LG TIER_1 English(EN) · Ziyue Wang, Aomufei Yuan, Yongfu Zhu, Shuai Dong, Wenpu Liu, Yiran Yao, Weichu Xie, Yuqi Xu, Caoyuan Ma, Wenqi Shao, Xiaoying Zhang, Nan Duan, Jiaqi Wang · 2026-06-03 04:00

Right Makes Might: Aligning Verified Hidden States Empowers RL Reasoning

arXiv:2606.03234v1 Announce Type: new Abstract: Reinforcement Learning from Verifiable Rewards (RLVR) has become the dominant approach for improving mathematical reasoning in large language models, yet current methods reduce each correct rollout to a single reward bit, ignoring t…

arXiv cs.CL TIER_1 English(EN) · Jiaxi Bi, Tongxu Luo, Wenyu Du, Zhengyang Tang, Benyou Wang · 2026-06-03 04:00

Cut Your Losses! Learning to Prune Paths Early for Efficient Parallel Reasoning

arXiv:2604.16029v2 Announce Type: replace Abstract: Parallel reasoning enhances Large Reasoning Models (LRMs) but incurs prohibitive costs due to futile paths caused by early errors. To mitigate this, path pruning at the prefix level is essential, yet existing research remains fr…

arXiv cs.CL TIER_1 English(EN) · Yucheng Zhou, Wei Tao, Yiwen Guo, Jianbing Shen · 2026-06-03 04:00

World Models Meet Language Models: On the Complementarity of Concrete and Abstract Reasoning

arXiv:2606.03603v1 Announce Type: cross Abstract: World models and multimodal large language models (MLLMs) provide complementary capabilities for predicting future outcomes from static visual observations. World models can generate concrete visual rollouts of possible futures, w…

arXiv cs.AI TIER_1 English(EN) · Dani Roytburg, Shreya Sridhar, Daphne Ippolito · 2026-06-03 04:00

Measuring Weak-to-Strong Legibility of Reasoning Models

arXiv:2603.20508v2 Announce Type: replace-cross Abstract: Reasoning language models (RLMs) and the intermediate chains of thought they emit play an increasingly central role in multi-agent setups such as inter-model monitoring or distillation into smaller models. When agents at d…

arXiv cs.AI TIER_1 English(EN) · Xinwu Ye, Yicheng Mao, Yuxuan Liao, Jia Zhang, Yimeng Liu, Li Hao, Fang Wu, Zhiwei Li, Zehong Wang, Zhiyuan Liu, Zhenfei Yin, Li Yuan, Philip Torr, Huan Sun, Xiangxiang Zeng, Mengdi Wang, Le Cong, Shenghua Gao, Xiangru Tang · 2026-06-03 04:00

LatentChem: From Textual CoT to Latent Thinking in Chemical Reasoning

arXiv:2602.07075v5 Announce Type: replace-cross Abstract: Current chemical large language models (LLMs) predominantly rely on explicit Chain-of-Thought (CoT) to solve complex reasoning problems. However, forcing nonverbal tacit chemical logic into discrete natural language impose…

arXiv cs.AI TIER_1 English(EN) · Yuchen Yan, Liang Jiang, Jin Jiang, Shuaicheng Li, Zujie Wen, Zhiqiang Zhang, Jun Zhou, Jian Shao, Yueting Zhuang, Yongliang Shen · 2026-06-03 04:00

InftyThink+: Effective and Efficient Infinite-Horizon Reasoning via Reinforcement Learning

arXiv:2602.06960v3 Announce Type: replace-cross Abstract: Large reasoning models achieve strong performance by scaling inference-time chain-of-thought, but this paradigm suffers from quadratic cost, context length limits, and degraded reasoning due to lost-in-the-middle effects. …

Hugging Face Daily Papers TIER_1 English(EN) · 2026-06-03 00:00

DAR: Deontic Reasoning with Agentic Harnesses

Deontic reasoning tasks require applying complex rules and policies, and an agentic approach enables models to dynamically access statutes, showing mixed performance improvements across different model strengths.

arXiv cs.AI TIER_1 English(EN) · Arman Cohan · 2026-06-02 17:53

Quantifying Faithful Confidence Expression in Large Reasoning Models

Reliable uncertainty communication is critical to the trustworthiness of LLMs, yet faithful calibration (FC)--the alignment between models' intrinsic and (linguistically) expressed confidence--is a persistent failure mode. This challenge is key for large reasoning models (LRMs), …

arXiv cs.AI TIER_1 English(EN) · Julian McAuley · 2026-06-02 17:51

Agentic Chain-of-Thought Steering for Efficient and Controllable LLM Reasoning

Large language models improve final-answer accuracy through extended chain-of-thought reasoning, but often spend tokens inefficiently and offer little inference-time control. Existing efficient reasoning methods control thinking length by shortening, early-stopping, or compressin…

arXiv cs.AI TIER_1 English(EN) · Andy Lyu · 2026-06-02 17:11

Hedge-Bench: Benchmarking Agents on Hard, Realistic Tasks Pertaining to Financial Reasoning

AI agents can increasingly handle the mechanical tasks of financial analysis: retrieving documents, calculating formulas, updating spreadsheets. The harder, more valuable challenge is reasoning through the open-ended questions that define expert Analyst work. Existing benchmarks …

arXiv cs.AI TIER_1 English(EN) · Ayushi Chadha · 2026-06-02 14:55

When to Re-Plan: Subgoal Persistence in Hierarchical Latent Reasoning

Long-horizon reasoning requires a system to commit to medium-horizon intent without becoming rigid: re-plan too often and computation never coheres into multi-step structure; commit too long and the plan goes stale. We study this stability-adaptivity tradeoff in the latent reason…

arXiv cs.AI TIER_1 English(EN) · Eric Gaussier · 2026-06-02 14:40

Unveiling the Structure of Do-Calculus Reasoning via Derivation Graphs

The do-calculus defines a general system of inference for interventional queries, allowing causal quantities to be transformed through successive applications of its rules. This process induces a rich space of equivalent interventional expressions, but combining and ordering thes…

arXiv cs.AI TIER_1 English(EN) · Li Yuan · 2026-06-02 13:47

From Answers to States: Verifiable Process-Level Evaluation of Chemical Reasoning in Large Language Models

Large language models are increasingly used as chemistry assistants, yet most chemistry benchmarks still score only final answers. This masks a critical failure mode: a model may output the correct molecule, product, or option while its reasoning violates chemical logic. Existing…

arXiv cs.AI TIER_1 English(EN) · Xian Wu · 2026-06-02 13:23

Bridging Auxiliary Constraints to Resolve Instruction Following in Large Reasoning Models

Large Reasoning Models (LRMs) have demonstrated impressive capabilities in many tasks, yet they struggle with reliably following multiple instructions, either by failing to satisfy individual constraints or by struggling to balance competing constraints simultaneously. We formali…

Hugging Face Daily Papers TIER_1 English(EN) · 2026-06-02 13:23

Bridging Auxiliary Constraints to Resolve Instruction Following in Large Reasoning Models

Large Reasoning Models (LRMs) have demonstrated impressive capabilities in many tasks, yet they struggle with reliably following multiple instructions, either by failing to satisfy individual constraints or by struggling to balance competing constraints simultaneously. We formali…

Hugging Face Daily Papers TIER_1 English(EN) · 2026-06-02 13:07

World Models Meet Language Models: On the Complementarity of Concrete and Abstract Reasoning

Controlled concrete reasoning combines visual simulation with abstract reasoning through a training method that uses privileged future information to improve prediction accuracy and robustness.

Hugging Face Daily Papers TIER_1 English(EN) · 2026-06-02 13:07

World Models Meet Language Models: On the Complementarity of Concrete and Abstract Reasoning

World models and multimodal large language models (MLLMs) provide complementary capabilities for predicting future outcomes from static visual observations. World models can generate concrete visual rollouts of possible futures, while MLLMs can reason abstractly over questions, g…

arXiv cs.CL TIER_1 English(EN) · Jianbing Shen · 2026-06-02 13:07

World Models Meet Language Models: On the Complementarity of Concrete and Abstract Reasoning

World models and multimodal large language models (MLLMs) provide complementary capabilities for predicting future outcomes from static visual observations. World models can generate concrete visual rollouts of possible futures, while MLLMs can reason abstractly over questions, g…

Hugging Face Daily Papers TIER_1 English(EN) · 2026-06-02 11:21

ThoughtFold: Folding Reasoning Chains via Introspective Preference Learning

ThoughtFold addresses over-thinking in large reasoning models by using fine-grained preference learning to identify and eliminate redundant explorations in chain-of-thought reasoning processes.

arXiv cs.CL TIER_1 English(EN) · Shuochen Chang, Tong Bai, Xiaofeng Zhang, Qianli Ma, Qingyang Liu, Zhaohe Liao, Yibo Miao, Li Niu · 2026-06-02 04:00

Unlocking the Black Box of Latent Reasoning: An Interpretability-Guided Approach to Intervention

arXiv:2606.01243v1 Announce Type: new Abstract: Latent reasoning enables Large Language Models (LLMs) to perform multi-step inference within continuous hidden states, offering efficiency gains over explicit Chain-of-Thought (CoT). However, the opacity of these continuous thought …

arXiv cs.LG TIER_1 English(EN) · Xuan Yang, Jiayu Liu, Yuhang Lai, Hao Xu, Zhenya Huang, Ning Miao · 2026-06-02 04:00

Step-Level Sparse Autoencoder for Reasoning Process Interpretation

arXiv:2603.03031v2 Announce Type: replace Abstract: Large Language Models (LLMs) have achieved strong complex reasoning capabilities through Chain-of-Thought (CoT) reasoning. However, their reasoning patterns remain too complicated to analyze. While Sparse Autoencoders (SAEs) hav…

arXiv cs.LG TIER_1 English(EN) · Sanae Lotfi, Polina Kirichenko, Steven Li, Zechun Liu · 2026-06-02 04:00

Quantized Reasoning Models Think They Need to Think Longer, but They Do Not

arXiv:2606.00206v1 Announce Type: new Abstract: Post-training quantization (PTQ) is widely used to deploy large language models efficiently, but its effect on reasoning models is not well understood. Across math, coding, and science QA, we find that aggressive PTQ reduces accurac…

arXiv cs.LG TIER_1 English(EN) · Arif Hassan Zidan, Yi Pan, Hanqi Jiang, Ruiyu Yan, Wei Ruan, Zihao Wu, Lifeng Chen, Weihang You, Xinliang Li, Bowen Chen, Huawen Hu, Peilong Wang, Sizhuang Liu, Jing Zhang, Siyuan Li, Zhengliang Liu, Yu Bao, Lin Zhao, Lichao Sun, Dajiang Zhu, Xiang Li, J… · 2026-06-02 04:00

World Models: A Comprehensive Survey of Architectures, Methodologies, Reasoning Paradigms, and Applications

arXiv:2606.00133v1 Announce Type: new Abstract: World models, internal simulators that learn the structure and dynamics of an environment, have emerged as a central paradigm in the pursuit of artificial general intelligence, enabling agents to predict, plan, and reason within lea…

arXiv cs.CL TIER_1 English(EN) · Kasidit Sermsri, Teerapong Panboonyuen · 2026-06-02 04:00

GateKD: Confidence-Gated Closed-Loop Distillation for Robust Reasoning

arXiv:2605.13136v2 Announce Type: replace Abstract: Distilling multi-step reasoning abilities from large language models (LLMs) into compact student models remains challenging due to noisy rationales, hallucinated supervision, and static teacher-student interactions. Existing rea…

arXiv cs.CL TIER_1 English(EN) · Songze Li, Zhiqiang Liu, Zhaoyan Gong, Xiaoke Guo, Zhongpu Bo, Zhengke Gui, Lei Liang, Huajun Chen, Wen Zhang · 2026-06-02 04:00

Last Layer Logits to Logic: Empowering LLMs with Logic-Consistent Structured Knowledge Reasoning

arXiv:2511.07910v2 Announce Type: replace Abstract: Large Language Models (LLMs) achieve excellent performance in natural language reasoning tasks through pre-training on vast unstructured text, enabling them to understand the logic in natural language and generate logic-consiste…

arXiv cs.CL TIER_1 English(EN) · Tyler A. Chang, Catherine Arnett, Abdelrahman Sadallah, Abdelrahman Eldesokey, Abeer Kashar, Abolade Daud, Abosede Grace Olanihun, Adamu Labaran Mohammed, Adeyemi Praise, Adhikarimayum Meerajita Sharma, Aditi Gupta, Adril Putra Merin, Adwoa Bremang, Afit… · 2026-06-02 04:00

Global PIQA: Evaluating Commonsense Reasoning Across 100+ Languages and Cultures

arXiv:2510.24081v2 Announce Type: replace Abstract: To date, there exist almost no culturally-specific evaluation benchmarks for large language models (LLMs) that cover a large number of languages and cultures. In this paper, we present Global PIQA, a participatory commonsense re…

arXiv cs.CL TIER_1 English(EN) · Liang Chen, Xueting Han, Li Shen, Jing Bai, Kam-Fai Wong · 2026-06-02 04:00

Beyond Two-Stage Training: Cooperative SFT and RL for LLM Reasoning

arXiv:2509.06948v3 Announce Type: replace Abstract: Supervised fine-tuning (SFT) and reinforcement learning with verifiable rewards (RLVR) are two widely used post-training paradigms for improving the reasoning ability of large language models (LLMs). Recent methods attempt to in…

arXiv cs.CL TIER_1 English(EN) · Sharath Sathish · 2026-06-02 04:00

Pramana: Fine-Tuning Large Language Models for Epistemic Reasoning through Navya-Nyaya

arXiv:2604.04937v1 Announce Type: cross Abstract: Large language models produce fluent text but struggle with systematic reasoning, often hallucinating confident but unfounded claims. When Apple researchers added irrelevant context to mathematical problems, LLM performance degrad…

arXiv cs.CL TIER_1 English(EN) · Shashi Kumar, Yacouba Kaloga, Petr Motlicek, Ina Kodrasi, Andrea Cavallaro · 2026-06-02 04:00

Geometric Latent Reasoning Induces Shorter Generations in LLMs

arXiv:2606.02248v1 Announce Type: new Abstract: Large language models solve complex problems by generating lengthy chains of explicit reasoning tokens. While effective, this makes reasoning expensive, length-sensitive, and constrained to (discrete) natural language. While latent …

arXiv cs.CL TIER_1 English(EN) · Chengtao Gan, Zhiqiang Liu, Long Jin, Yushan Zhu, Lei Liang, Wen Zhang · 2026-06-02 04:00

CRAFTQA: A Code-Driven Adaptive Framework for Complex Structured Data Reasoning

arXiv:2606.02170v1 Announce Type: new Abstract: Real-world scenarios involve massive heterogeneous structured data (e.g., tables, knowledge graphs), making effective reasoning over such diverse data increasingly important. Unified structured data question answering has emerged as…

arXiv cs.CL TIER_1 English(EN) · Ahmed Elhady, Eneko Agirre, Mikel Artetxe · 2026-06-02 04:00

Cross-lingual Self-Consistency for Multilingual Reasoning with Language Models

arXiv:2606.01464v1 Announce Type: new Abstract: Despite expanding their multilingual coverage, the advanced reasoning capabilities of LLMs remain largely confined to a few high-resource languages like English. To address this, we propose an unsupervised Reinforcement Learning (RL…

arXiv cs.CL TIER_1 English(EN) · Mengmeng Ji, Ravi Shanker Raju, Jonathan Lingjie Li, Chen Wu · 2026-06-02 04:00

LongAttnComp: Cross-Family Context Compression for Long-Context Reasoning

arXiv:2606.01336v1 Announce Type: new Abstract: As real-world applications increasingly require processing inputs of 100k+ tokens, the gap between context length and inference efficiency has become a critical bottleneck. Context compression offers a way to reduce prefill costs wh…

arXiv cs.CL TIER_1 English(EN) · Ruiqi Zhang, Lingxiang Wang, Hainan Zhang, Zhiming Zheng · 2026-06-02 04:00

Robust Reasoning via Dynamic Token Selection for Distribution-Aligned Self-Distillation

arXiv:2606.00628v1 Announce Type: new Abstract: Self-distillation improves learning efficiency by rewriting reference answers as training data that better matches the model's own distribution. However, reference answers also introduce strong stylistic biases, causing the generati…

arXiv cs.AI TIER_1 English(EN) · Arip Asadulaev, Rayan Banerjee, Fakhri Karray, Martin Takac · 2026-06-02 04:00

Latent Reasoning in TRMs is Secretly a Policy Improvement Operator

arXiv:2511.16886v5 Announce Type: replace-cross Abstract: Recently, small models with latent recursion have obtained promising results on complex reasoning tasks. These results are typically explained by the theory that such recursion increases a network's depth, allowing it to co…

arXiv cs.AI TIER_1 English(EN) · Yoonjeon Kim, Doohyuk Jang, Eunho Yang · 2026-06-02 04:00

Verifying Meta-Awareness via Predictive Rewards in Reasoning Models

arXiv:2510.03259v2 Announce Type: replace-cross Abstract: Recent research on reasoning models explores the meta-awareness of language models, including their ability to determine optimal thinking duration, recognize knowledge boundaries, and structure concept-level thinking. Whil…

arXiv cs.AI TIER_1 English(EN) · Jiwoong Sohn, Tomasz Sternal, Kenneth Styppa, Torsten Hoefler, Michael Moor · 2026-06-02 04:00

Process Reward Agents for Steering Knowledge-Intensive Reasoning

arXiv:2604.09482v2 Announce Type: replace Abstract: Reasoning in knowledge-intensive domains remains challenging as intermediate steps are often not locally verifiable: unlike math or code, evaluating step correctness may require synthesizing clues across large external knowledge…

arXiv cs.AI TIER_1 English(EN) · Nearchos Potamitis, Vansh Ramani, Har Ashish Arora, Dhairya Kuchhal, Lars Klein, Akhil Arora · 2026-06-02 04:00

ReasonBENCH: Benchmarking the (In)Stability of LLM Reasoning

arXiv:2512.07795v2 Announce Type: replace Abstract: Benchmark scores for LLM reasoning systems are reported as single numbers, yet the same model, strategy, and task can produce meaningfully different answers and costs across repeated executions, even under greedy decoding (T = 0…

arXiv cs.AI TIER_1 English(EN) · Yaoming Li, Guangxiang Zhao, Qilong Shi, Lin Sun, Xiangzheng Zhang, Tong Yang · 2026-06-02 04:00

A Primer in Post-Training Reasoning Data: What We Know About How It Works

arXiv:2606.02113v1 Announce Type: cross Abstract: Post-training has become a primary driver of recent progress in large reasoning models, and reasoning data are often the key variable determining whether this stage succeeds. Work on post-training reasoning data has grown rapidly,…

arXiv cs.AI TIER_1 English(EN) · Dhruv Saini, Rohan Pandey · 2026-06-02 04:00

ThinkSwitch: Context Distillation with LoRA and Weight Interpolation for Specific-Purpose Reasoning Tasks

arXiv:2606.01080v1 Announce Type: cross Abstract: Large language models often improve on difficult tasks by spending inference-time compute on a reasoning trace before producing the final answer. That extra computation can be useful, but it also raises latency, token cost, and de…

arXiv cs.AI TIER_1 English(EN) · Zihan Chen, Yiming Zhang, Wenxiang Geng, Zenghui Ding, Yining Sun · 2026-06-02 04:00

The Paradox of Outcome Optimization: A Causal Information-Theoretic Bound on Reasoning Shortcuts in LLMs

arXiv:2606.00674v1 Announce Type: cross Abstract: Large Language Models (LLMs) aligned via outcome-based Reinforcement Learning (RL) frequently exhibit a critical failure mode: they achieve high performance on in-distribution benchmarks while demonstrating brittle reasoning capab…

arXiv cs.AI TIER_1 English(EN) · Jiafu Huang, Chao Peng, Chenyang Xu, Zhengfeng Yang, Kecheng Cai, Chenhao Zhang, Yi Wang, Yiwei Gong, Wanqin Zhou, Irene Zheng · 2026-06-02 04:00

Richer Representations for Neural Algorithmic Reasoning via Auxiliary Reconstruction

arXiv:2606.00559v1 Announce Type: cross Abstract: Neural algorithmic reasoning has emerged as a popular research direction. It aims to train neural networks to mimic the step-by-step behavior of classical rule-based algorithms. More specifically, the execution of such algorithms …

arXiv cs.AI TIER_1 English(EN) · Ekaterina Alimaskina, Darya Rudas, Denis Shveykin, Gleb Molodtsov, Pavel Vasiliev, Aleksandr Beznosikov · 2026-06-02 04:00

Extreme Low-Bit Inference in Reasoning Models: Failure Modes and Targeted Recovery

arXiv:2606.02011v1 Announce Type: new Abstract: Large Reasoning Models (LRMs) rely on long reasoning traces, making inference expensive. While low-bit quantization reduces per-token decoding cost, we show that aggressive 2-bit inference can fail to deliver end-to-end speedup beca…

arXiv cs.AI TIER_1 English(EN) · Shayan Shokri · 2026-06-02 04:00

TERRA: Task-Embedded Reasoning and Representation Architecture for Cross-Domain Applications

arXiv:2606.01520v1 Announce Type: new Abstract: A single action-conditioned latent predictive architecture can in principle be trained on the structured state of a driving scene, a robot workspace, or a financial order book. The ingredients for doing so within any one domain alre…

arXiv cs.AI TIER_1 English(EN) · Mingzhong Sun, Teresa Yeo, Armando Solar-Lezama, Tan Zhi-Xuan · 2026-06-02 04:00

An Enigma of Artificial Reason: Investigating the Production-Evaluation Gap in Large Reasoning Models

arXiv:2606.01462v1 Announce Type: new Abstract: Studies of human reasoning have shown that people are typically stronger at evaluating reasoning than producing it from scratch. In contrast, large reasoning models (LRMs) are trained to excel at producing long chains of reasoning t…

arXiv cs.AI TIER_1 English(EN) · Teddy Ferdinan, Bartłomiej Koptyra, Mikołaj Langner, Tomasz Adamczyk, Łukasz Radliński, Maciej Markiewicz, Aleksander Szczęsny, Stanisław Woźniak, Tymoteusz Romanowicz, Dzmitry Pihulski, Mateusz Zbrocki, Mateusz Śmigielski, Michał… · 2026-06-02 04:00

Reasoning4Sciences: Bridging Reasoning Language Models to All Scientific Branches

arXiv:2606.01145v1 Announce Type: new Abstract: While Reasoning Language Models (RLMs) are rapidly emerging as powerful tools for scientific research, their impact is primarily concentrated in "hard science" fields. The slow -- or lack of -- adoption of RLMs in other branches of …

arXiv cs.AI TIER_1 English(EN) · Jiakang Li, Guanyu Zhu, Can Jin, Chenxi Huang, Dexu Yu, Ronghao Chen, Yang Zhou, Hongwu Peng, Xuanqi Lan, Dimitris N. Metaxas, Youhua Li · 2026-06-02 04:00

Latent Reward Steering: An Adaptive Inference-Time Framework that Implicitly Promotes Cognitive Behaviors in Reasoning LLMs

arXiv:2606.00726v1 Announce Type: new Abstract: Strong reasoning depends not only on model knowledge but also on how effectively cognitive behaviors are deployed during generation. Existing methods often rely on explicit behavior-level control, making them insufficiently adaptive…

arXiv cs.AI TIER_1 English(EN) · Alessio Bruno · 2026-06-02 04:00

AXIOM: A Trust-First Neuro-Symbolic Execution Architecture for Verifiable Mathematical Reasoning

arXiv:2606.00671v1 Announce Type: new Abstract: We present AXIOM, a trust-first neuro-symbolic execution architecture for natural-language mathematical reasoning. In AXIOM, the language model functions strictly as a canonicalizer: it rewrites informal problem text into a narrow s…

arXiv cs.AI TIER_1 English(EN) · Jayant Parashar, Suchendra M. Bhandarkar · 2026-06-02 04:00

KACE: Knowledge-Adaptive Context Engineering for Mathematical Reasoning

arXiv:2606.00532v1 Announce Type: new Abstract: Context engineering can improve large language models without updating their weights, but mathematical reasoning exposes a key limitation: feedback accumulated in one growing prompt causes context bloat and limits the amount of lear…

arXiv cs.AI TIER_1 English(EN) · Dongxin Guo, Jikun Wu, Siu Ming Yiu · 2026-06-02 04:00

The Deterministic Horizon: When Extended Reasoning Fails and Tool Delegation Becomes Necessary

arXiv:2606.00376v1 Announce Type: new Abstract: Extended chain-of-thought reasoning can degrade performance on deterministic state-tracking tasks, not due to preference biases, but limits rooted in the information-theoretic capacity of decoder-only attention. We establish: (1) an…

arXiv cs.AI TIER_1 English(EN) · Shunchi Zhang, Jin Lu, Chuanyang Jin, Yichao Zhou, Zhining Zhang, Tianmin Shu · 2026-06-02 04:00

MindZero: Learning Online Mental Reasoning With Zero Annotations

arXiv:2606.00240v1 Announce Type: new Abstract: Effective real-world assistance requires AI agents with robust Theory of Mind (ToM): inferring human mental states from their behavior. Despite recent advances, several key challenges remain, including (1) online inference with robu…

arXiv cs.AI TIER_1 English(EN) · Mingyuan Fan, Weiguang Han, Daixin Wang, Cen Chen, Zhiqiang Zhang, Jun Zhou · 2026-06-02 04:00

Evaluating Interactive Reasoning in Large Language Models: A Hierarchical Benchmark with Executable Games

arXiv:2606.00103v1 Announce Type: new Abstract: We introduce a multi-turn interactive framework for reasoning evaluation that treats reasoning as active evidence acquisition and belief updating. Wherein, LLMs receive only the task rules, must issue targeted queries to a hidden en…

arXiv cs.AI TIER_1 English(EN) · Gregory Magarshak · 2026-06-02 04:00

Grokers: Bottom-Up Inductive Comprehension and Write-Time Intelligence over Typed Knowledge Graphs

arXiv:2606.00050v1 Announce Type: new Abstract: We present Grokers, an architecture for building persistent, structured comprehension of typed knowledge graphs through bottom-up inductive traversal of dependency subgraphs. Unlike retrieval-augmented generation (RAG), which pays f…

Hugging Face Daily Papers TIER_1 English(EN) · 2026-06-02 01:11

Inducing Reasoning Primitives from Agent Traces

ReAct-style LLM agents often rediscover the same reasoning routines across problems, yet leave those routines trapped in transient scratchpads. We introduce Reasoning Primitive Induction, a single-pass method that mines successful ReAct traces, clusters recurrent reasoning moves,…

Hugging Face Daily Papers TIER_1 English(EN) · 2026-06-02 00:00

Agentic Chain-of-Thought Steering for Efficient and Controllable LLM Reasoning

Agentic Chain-of-Thought Steering (ACTS) formulates reasoning steering as a Markov decision process to enable efficient, controllable chain-of-thought reasoning with token savings.

Hugging Face Daily Papers TIER_1 English(EN) · 2026-06-02 00:00

Prompt-Level Distillation: A Non-Parametric Alternative to Model Fine-Tuning for Efficient Reasoning

Prompt-Level Distillation extracts reasoning patterns from teacher models to enhance student model performance while maintaining interpretability and reducing latency.

arXiv cs.CL TIER_1 English(EN) · Andrea Cavallaro · 2026-06-01 13:40

Geometric Latent Reasoning Induces Shorter Generations in LLMs

Large language models solve complex problems by generating lengthy chains of explicit reasoning tokens. While effective, this makes reasoning expensive, length-sensitive, and constrained to (discrete) natural language. While latent reasoning offers a continuous alternative, deter…

arXiv cs.CL TIER_1 English(EN) · Wen Zhang · 2026-06-01 12:29

CRAFTQA: A Code-Driven Adaptive Framework for Complex Structured Data Reasoning

Real-world scenarios involve massive heterogeneous structured data (e.g., tables, knowledge graphs), making effective reasoning over such diverse data increasingly important. Unified structured data question answering has emerged as a prominent research trend, aiming to answer na…

arXiv cs.AI TIER_1 English(EN) · Tong Yang · 2026-06-01 11:45

A Primer in Post-Training Reasoning Data: What We Know About How It Works

Post-training has become a primary driver of recent progress in large reasoning models, and reasoning data are often the key variable determining whether this stage succeeds. Work on post-training reasoning data has grown rapidly, yet this literature remains scattered across data…

Hugging Face Daily Papers TIER_1 English(EN) · 2026-06-01 10:04

Extreme Low-Bit Inference in Reasoning Models: Failure Modes and Targeted Recovery

Large Reasoning Models (LRMs) rely on long reasoning traces, making inference expensive. While low-bit quantization reduces per-token decoding cost, we show that aggressive 2-bit inference can fail to deliver end-to-end speedup because instability in the generation process inflat…

arXiv cs.AI TIER_1 English(EN) · Yu Zhao, Hao Guan, Yongcheng Jing, Ying Zhang, Dacheng Tao · 2026-06-01 04:00

MedCoG: Maximizing LLM Inference Density in Medical Reasoning via Meta-Cognitive Regulation

arXiv:2602.07905v2 Announce Type: replace Abstract: Large Language Models (LLMs) have shown strong potential in complex medical reasoning yet face diminishing gains under inference scaling laws. While existing studies augment LLMs with various knowledge types, it remains unclear …

arXiv cs.AI TIER_1 English(EN) · Arya Fayyazi, Mehdi Kamal, Massoud Pedram · 2026-06-01 04:00

COFT: Counterfactual-Conformal Decoding for Fair Chain-of-Thought Reasoning in Large Language Models

arXiv:2605.30641v1 Announce Type: cross Abstract: Large language models (LLMs) can reveal and amplify societal biases during chain-of-thought (CoT) generation. We present COFT (Chain of Fair Thought), a training-free decoding method that applies token-level fairness control at de…

arXiv cs.AI TIER_1 English(EN) · Tom Pecher · 2026-06-01 04:00

Social Reasoning in Machines: Investigating Collective Truth-Seeking Dynamics in Large Language Model Debate

arXiv:2605.30391v1 Announce Type: cross Abstract: Human reasoning has long been theorised to operate socially, not through isolated individual cognition, but through collective adversarial discourse, a framework known as the Argumentative Theory of Reasoning (ATR). Rather than re…

arXiv cs.AI TIER_1 English(EN) · Saku Peltonen, August Bøgh Rønberg, Andreas Plesner, Roger Wattenhofer · 2026-06-01 04:00

GraphARC: A Comprehensive Benchmark for Graph-Based Abstract Reasoning

arXiv:2605.31031v1 Announce Type: new Abstract: Relational reasoning lies at the heart of intelligence, but existing benchmarks are typically confined to formats such as grids or text. We introduce GraphARC, a benchmark for abstract reasoning on graph-structured data. GraphARC ge…

arXiv cs.AI TIER_1 English(EN) · Tianrun Yu, Kaixiang Zhao, Chih-Chun Chen, Amanda Hughes, Taylor W. Killian, Fenglong Ma, Weitong Zhang, Porter Jenkins · 2026-06-01 04:00

LARK: Learnability-Grounded Trajectory Selection for Efficient Reasoning Distillation

arXiv:2605.30651v1 Announce Type: cross Abstract: We study trajectory selection for reasoning distillation, where teacher-generated reasoning trajectories are selectively used as supervision for a student model. Existing methods rely on heuristics such as trajectory quality or mo…

arXiv cs.AI TIER_1 English(EN) · Archiki Prasad, Mandar Joshi, Kenton Lee, Mohit Bansal, Peter Shaw · 2026-06-01 04:00

Effective Reasoning Chains Reduce Intrinsic Dimensionality

arXiv:2602.09276v2 Announce Type: replace-cross Abstract: Chain-of-thought (CoT) reasoning and its variants have substantially improved the performance of language models on complex reasoning tasks, yet the precise mechanisms by which different strategies facilitate generalizatio…

arXiv cs.AI TIER_1 English(EN) · Elchanan Mossel · 2026-06-01 04:00

The Refutability Gap: Challenges in Validating Reasoning by Large Language Models

arXiv:2601.02380v4 Announce Type: replace-cross Abstract: Recent reports claim that Large Language Models (LLMs) have achieved the ability to derive new science and exhibit human-level general intelligence. We argue that such claims are not rigorous scientific claims, as they do …

arXiv cs.AI TIER_1 English(EN) · Yunhe Li, Hao Shi, Bowen Deng, Wei Wang, Mengzhe Ruan, Hanxu Hou, Zhongxiang Dai, Siyang Gao, Chao Wang, Shuang Qiu, Linqi Song · 2026-06-01 04:00

Learning to Reason with Insight for Informal Theorem Proving

arXiv:2604.16278v2 Announce Type: replace Abstract: Although most of the automated theorem-proving approaches depend on formal proof systems, informal theorem proving can align better with large language models' (LLMs) strength in natural language processing. In this work, we ide…

Hugging Face Daily Papers TIER_1 English(EN) · 2026-06-01 00:00

Geometric Latent Reasoning Induces Shorter Generations in LLMs

Geometric Latent Reasoning formulates latent reasoning as a geometric path-approximation problem in pretrained token-embedding space, enabling continuous intermediate reasoning states that reduce generation length while maintaining accuracy.

Hugging Face Daily Papers TIER_1 English(EN) · 2026-05-31 00:00

LongAttnComp: Cross-Family Context Compression for Long-Context Reasoning

LongAttnComp adapts AttnComp for long-context processing by fine-tuning lightweight attention layers and implementing token-level chunking and positional reordering techniques.

Hugging Face Daily Papers TIER_1 English(EN) · 2026-05-31 00:00

An Enigma of Artificial Reason: Investigating the Production-Evaluation Gap in Large Reasoning Models

Large reasoning models exhibit a significant gap between their ability to produce and evaluate reasoning, with models showing answer confirmation bias that prevents accurate reasoning evaluation.

arXiv cs.MA (Multiagent) TIER_1 English(EN) · Tianmin Shu · 2026-05-29 18:14

MindZero: Learning Online Mental Reasoning With Zero Annotations

Effective real-world assistance requires AI agents with robust Theory of Mind (ToM): inferring human mental states from their behavior. Despite recent advances, several key challenges remain, including (1) online inference with robust uncertainty updates over multiple hypotheses;…

arXiv cs.CL TIER_1 English(EN) · Yueyang Wang, Jiawei Fu, Baolong Bi, Xili Wang, Xiaoqing Liu · 2026-05-29 04:00

HE-SNR: Uncovering Latent Logic via Entropy for Guiding Mid-Training on SWE-bench

arXiv:2601.20255v3 Announce Type: replace-cross Abstract: SWE-bench has emerged as the premier benchmark for evaluating Large Language Models on complex software engineering tasks. While these capabilities are fundamentally acquired during the mid-training phase and subsequently …

arXiv cs.AI TIER_1 English(EN) · Xin Chen, Feng Jiang, Yiqian Zhang, Hardy Chen, Shuo Yan, Wenya Xie, Min Yang, Shujian Huang · 2026-05-29 04:00

Reasoning While Asking: Transforming Reasoning Large Language Models from Passive Solvers to Proactive Inquirers

arXiv:2601.22139v2 Announce Type: replace-cross Abstract: Reasoning-oriented Large Language Models (LLMs) have achieved remarkable progress with Chain-of-Thought (CoT) prompting, yet they remain fundamentally limited by a blind self-thinking paradigm: performing extensive …

arXiv cs.AI TIER_1 English(EN) · Jiayi Dai, Randy Goebel · 2026-05-29 04:00

Learn from A Rationalist: Distilling Intermediate Interpretable Rationales

arXiv:2601.22531v2 Announce Type: replace-cross Abstract: Because of the pervasive use of deep neural networks (DNNs), especially in high-stakes domains, the interpretability of DNNs has received increased attention. The general idea of rationale extraction (RE) is to provide an …

arXiv cs.AI TIER_1 English(EN) · Kiran Tomlinson, Tobias Schnabel, Adith Swaminathan, Jennifer Neville · 2026-05-29 04:00

Reasoning about Reasoning: BAPO Bounds on Chain-of-Thought Token Complexity in LLMs

arXiv:2602.02909v2 Announce Type: replace Abstract: Inference-time scaling via chain-of-thought (CoT) reasoning is a major driver of state-of-the-art LLM performance, but it comes with substantial latency and compute costs. We address a fundamental theoretical question: how many …

arXiv cs.AI TIER_1 English(EN) · Samuele Marro, Jialin Yu, Emanuele La Malfa, Oishi Deb, Jiawei Li, Yibo Yang, Ebey Abraham, Sunando Sengupta, Eric Sommerlade, Michael Wooldridge, Philip Torr · 2026-05-29 04:00

Benchmarking at the Edge of Comprehension

arXiv:2602.14307v3 Announce Type: replace Abstract: As frontier Large Language Models (LLMs) increasingly saturate new benchmarks shortly after they are published, benchmarking itself is at a juncture: if frontier models keep improving, it will become increasingly hard for humans…

arXiv cs.AI TIER_1 English(EN) · Yang Ouyang, Shuhang Lin, Jung-Eun Kim · 2026-05-29 04:00

DenseSteer: Steering Small Language Models towards Dense Math Reasoning

arXiv:2605.29247v1 Announce Type: new Abstract: Large language models (LLMs) demonstrate strong chain-of-thought (CoT) reasoning abilities, while smaller models (<= 3B parameters) significantly underperform on multi-step reasoning tasks. Based on empirical analyses of the Qwen-2.…

arXiv cs.AI TIER_1 English(EN) · Yubo Li, Ramayya Krishnan, Rema Padman · 2026-05-29 04:00

The Chain Holds, the Answer Folds: Trace-Answer Dissociation in Reasoning Models Under Adversarial Pressure

arXiv:2605.29087v1 Announce Type: new Abstract: Reasoning models are evaluated on single-turn benchmarks but deployed in multi-turn dialogue, where users push back on correct answers. Under sustained adversarial pressure we find a previously undocumented failure mode: the chain-o…

arXiv cs.AI TIER_1 English(EN) · Pedro Orvalho, Marta Kwiatkowska, Guillem Alenyà, Felip Manyà · 2026-05-29 04:00

Reliable Reasoning with Large Language Models via Preference-Based Maximum Satisfiability

arXiv:2605.29687v1 Announce Type: new Abstract: Large Language Models (LLMs) excel at understanding natural language but struggle with optimisation tasks involving multiple constraints and user-defined preferences, which commonly arise in domains such as robotics. We propose a hy…

arXiv cs.AI TIER_1 English(EN) · Venkat Akhil Lakkapragada · 2026-05-29 04:00

CosmicFish-HRM: Adaptive Reasoning via Hierarchical Recurrent Mechanisms in Compact Language Models

arXiv:2605.28919v1 Announce Type: cross Abstract: Large language models have achieved strong reasoning capabilities, though often at the cost of massive parameter counts and expensive inference. In this work, we explore a different direction: adaptive reasoning depth in compact l…

arXiv cs.AI TIER_1 English(EN) · Nishal Thomas, Noel Thomas · 2026-05-29 04:00

FormInv: A Measurement Protocol for Semantic Invariance in Mathematical Reasoning Benchmarks

arXiv:2605.29001v1 Announce Type: cross Abstract: A paraphrase-quality audit of MathCheck (ICLR 2025) detected 4 semantically incorrect paraphrases in 129 groups (3.1%); removing them drops GPT-4o from rank 2 to rank 4 and elevates Claude Haiku and DeepSeek V3 above it; these ran…

arXiv cs.AI TIER_1 English(EN) · Shreyas Fadnavis, Praitayini Kanakaraj, Felix Wyss · 2026-05-29 04:00

When and How Long? The Readout-Mediator Angle in Temporal Reasoning

arXiv:2605.29126v1 Announce Type: cross Abstract: A linear probe can decode a representation almost perfectly and yet be completely irrelevant to how the model uses it. On calendar-date duration reasoning in language models, a sin/cos probe recovers day-of-year from a layer…

arXiv cs.AI TIER_1 English(EN) · Lukas Aichberger, Sepp Hochreiter · 2026-05-29 04:00

Unlocking the Working Memory of Large Language Models for Latent Reasoning

arXiv:2605.30343v1 Announce Type: cross Abstract: To improve the reasoning capabilities of large language models, test-time compute is typically scaled by generating intermediate tokens before the final answer. However, this couples reasoning to autoregressive generation and ther…

arXiv cs.AI TIER_1 English(EN) · G M Shahariar, Erfan Shayegani, Ali Nazari, Nael Abu-Ghazaleh · 2026-05-29 04:00

Modeling Hierarchical Thinking in Large Reasoning Models

arXiv:2510.22437v2 Announce Type: replace Abstract: Large Reasoning Models (LRMs) solve complex tasks by generating long Chain-of-Thought (CoT) sequences; however, the emergent dynamics governing reasoning trajectories are not well understood and can lead to inconsistencies and r…

arXiv cs.AI TIER_1 English(EN) · Xinyu Liu, Xin Liu, Bo Jin, Runsong Zhao, Pengcheng Huang, Junhao Ruan, Bei Li, Chunyang Xiao, Chenglong Wang, Tong Xiao, Jingbo Zhu · 2026-05-29 04:00

MemoSight: Unifying Context Compression and Multi Token Prediction for Reasoning Acceleration

arXiv:2604.14889v2 Announce Type: replace Abstract: While chain-of-thought (CoT) reasoning enables LLMs to solve challenging reasoning tasks, the linear growth of the KV cache leads to substantial memory and inference overhead. Existing approaches such as context compression and …

arXiv cs.AI TIER_1 English(EN) · Zhicheng Yang, Zhijiang Guo, Yifan Song, Minrui Xu, Yongxin Wang, Yiwei Wang, Xiaodan Liang, Jing Tang · 2026-05-29 04:00

Prune-OPD: Efficient and Reliable On-Policy Distillation for Long-Horizon Reasoning

arXiv:2605.07804v2 Announce Type: replace-cross Abstract: On-policy distillation (OPD) leverages dense teacher rewards to enhance reasoning models. However, scaling OPD to long-horizon tasks exposes a critical flaw: as the student's generated prefix inevitably diverges from the t…

arXiv cs.CL TIER_1 English(EN) · Mayug Maniparambil, Arjun Karuvally, Terrence Sejnowski, Fergal Reid · 2026-05-29 04:00

When RL Suppresses Its Own Vocabulary: Recovering Reasoning Diversity in Puzzle-to-Math Transfer

arXiv:2605.29190v1 Announce Type: cross Abstract: Reinforcement learning using verifiable rewards (RLVR) improves LLM reasoning, but the conditions under which it transfers across domains -- and why it does so -- remain under-explored. We study cross-domain transfer in a 7B model…

arXiv cs.CL TIER_1 English(EN) · Jun Rao, Zixiong Yu, Xuebo Liu, Guhan Chen, Jing Li, Hejin Wang, Jiansheng Wei, Xiaojun Meng, Min Zhang · 2026-05-29 04:00

Mining or Synthesis? Rethinking Exploration Efficiency in Iterative Alignment of Mathematical Reasoning

arXiv:2602.05370v3 Announce Type: replace Abstract: Iterative Direct Preference Optimization (DPO) has emerged as a widely used paradigm for aligning Large Language Models on reasoning tasks. Existing approaches typically rely on Best-of-N sampling (N ≥ 8) to mine positive tra…

arXiv cs.LG TIER_1 English(EN) · Jonathan Williams, Esin Tureci · 2026-05-29 04:00

Prioritize the Process, Not Just the Outcome: Rewarding Latent Thought Trajectories Improves Reasoning in Looped Language Models

arXiv:2602.10520v3 Announce Type: replace Abstract: Looped Language Models (LoopLMs) perform multi-step latent reasoning prior to token generation and outperform conventional LLMs on reasoning benchmarks at smaller parameter budgets. However, attempts to further improve LoopLM re…

arXiv cs.CL TIER_1 English(EN) · Jia-Chen Zhang, Yu-Jie Xiong, Zheng Zhou · 2026-05-29 04:00

Cognitive Loop of Thought: Reversible Hierarchical Markov Chain for Efficient Mathematical Reasoning

arXiv:2604.06805v2 Announce Type: replace Abstract: Multi-step Chain-of-Thought (CoT) has significantly advanced the mathematical reasoning capabilities of LLMs by leveraging explicit reasoning steps. However, the widespread adoption of Long CoT often results in sequence lengths …

Hugging Face Daily Papers TIER_1 English(EN) · 2026-05-29 00:00

MindZero: Learning Online Mental Reasoning With Zero Annotations

MindZero presents a self-supervised reinforcement learning framework that enables multimodal large language models to perform efficient and robust online mental reasoning without requiring explicit mental state annotations.

arXiv cs.AI TIER_1 English(EN) · Sepp Hochreiter · 2026-05-28 17:59

Unlocking the Working Memory of Large Language Models for Latent Reasoning

To improve the reasoning capabilities of large language models, test-time compute is typically scaled by generating intermediate tokens before the final answer. However, this couples reasoning to autoregressive generation and thereby conflates internal computation with external c…

arXiv cs.AI TIER_1 English(EN) · Quanquan C. Liu · 2026-05-28 17:57

Reasoning with Sampling: Cutting at Decision Points

Frontier reasoning models are produced by posttraining base language models with reinforcement learning. Recent work has challenged this by showing that sampling from a sharpened version of the base model's distribution, a so-called power distribution, elicits comparable reasonin…

arXiv cs.AI TIER_1 English(EN) · Guha Balakrishnan · 2026-05-28 15:31

Conformal Certification of Reasoning Trace Prefixes

Language model reasoning traces are rarely all-or-nothing; they frequently contain valid intermediate steps before a critical error occurs. Existing uncertainty quantification methods typically certify final answers or entire responses, failing to provide statistical guarantees f…

arXiv cs.MA (Multiagent) TIER_1 English(EN) · Tom Pecher · 2026-05-28 12:07

Social Reasoning in Machines: Investigating Collective Truth-Seeking Dynamics in Large Language Model Debate

Human reasoning has long been theorised to operate socially, not through isolated individual cognition, but through collective adversarial discourse, a framework known as the Argumentative Theory of Reasoning (ATR). Rather than relying on individual "intellectualist reasoners" as…

arXiv cs.AI TIER_1 English(EN) · Xue Wen Tan, Nathaniel Tan, Galen Lee, Stanley Kok · 2026-05-28 04:00

The Shape of Reasoning: Topological Analysis of Reasoning Traces in Large Language Models

arXiv:2510.20665v3 Announce Type: replace Abstract: Evaluating the quality of reasoning traces from large language models remains understudied, labor-intensive, and unreliable: current practice relies on expert rubrics, manual annotation, and slow pairwise judgments. Automated ef…

arXiv cs.AI TIER_1 English(EN) · Linas Nasvytis, Simon Jerome Han, Ben Prystawski, Satchel Grant, Noah D. Goodman, Judith E. Fan · 2026-05-28 04:00

CORE: Contrastive Reflection Enables Rapid Improvements in Reasoning

arXiv:2605.28742v1 Announce Type: new Abstract: Language models can use verifiable rewards to improve at a wide variety of reasoning tasks. However, both parametric (e.g. RLVR) and non-parametric (e.g. prompt optimization) approaches to doing so typically require hundreds of trai…

arXiv cs.AI TIER_1 English(EN) · Biagio La Rosa, Leilani H. Gilpin · 2026-05-28 04:00

Guaranteed Optimal Compositional Explanations for Neurons

arXiv:2511.20934v2 Announce Type: replace Abstract: Compositional explanations are a family of methods that aim to describe the spatial alignment between neurons' receptive field activations and concepts through logical rules, typically computed via a search over all possible con…

arXiv cs.CL TIER_1 English(EN) · Shengmin Piao, Sanghyun Park · 2026-05-28 04:00

GeneralThinker: Domain-General Reasoning through Likelihood-Guided Answer-Conditioned Optimization

arXiv:2605.27934v1 Announce Type: new Abstract: Reinforcement learning with verifiable rewards improves language model reasoning, but its reliance on domain-specific verifiers, sparse outcome rewards, and coarse-grained credit assignment limits its applicability. We introduce Gen…

arXiv cs.AI TIER_1 English(EN) · Guoxin Ma, Yibing Liu, Chengzhengxu Li, Yu Liang, Yan Wang, Yueyang Zhang, Kecheng Chen, Zhaohan Zhang, Zhiyuan Sun, Daiting Shi · 2026-05-28 04:00

Thinking as Compression: Your Reasoning Model is Secretly a Context Compressor

arXiv:2605.28713v1 Announce Type: new Abstract: Context compression aims to shorten long context inputs with minimal information loss for LLM inference acceleration. While existing methods have shown promise, they typically rely on complex compression modules or compression-speci…

arXiv cs.AI TIER_1 English(EN) · Leizhen Zhang, Shuhan Chen, Sheng Chen · 2026-05-28 04:00

Satisfiability Solving with LLMs: A Matched-Pair Evaluation of Reasoning Capability

arXiv:2605.28602v1 Announce Type: new Abstract: Large language models (LLMs) are increasingly used for tasks that implicitly reduce to Boolean satisfiability (SAT), yet their reasoning ability on SAT remains unclear. We present a systematic study of LLMs on 2-SAT and 3-SAT, toget…

arXiv cs.CL TIER_1 English(EN) · Ziqi Zhao, Xinyu Ma, Liu Yang, Yujie Feng, Daiting Shi, Jingzhou He, Xin Xin, Zhaochun Ren, Xiao-Ming Wu · 2026-05-28 04:00

ROSD: Reflective On-Policy Self-Distillation for Language Model Reasoning across Domains

arXiv:2605.28014v1 Announce Type: new Abstract: On-policy self-distillation (OPSD) improves the reasoning performance of large language models (LLMs) by providing dense token-level supervision for on-policy rollouts. However, existing OPSD methods often yield limited gains on in-…

arXiv cs.CL TIER_1 English(EN) · Yukyung Lee, Yumeng Shen, Jinhyeong Park, Hyein Yang, Jun-Hyung Park · 2026-05-28 04:00

CIRF: Tokenizing Chain-of-Thoughts into Reusable Functional Units for Efficient Latent Reasoning in Large Language Models

arXiv:2605.28292v1 Announce Type: new Abstract: Implicit Chain-of-Thought (CoT) reduces the inference cost of large language models by internalizing the explicit rationales. However, existing approaches typically lack alignment with explicit rationales and adaptivity to example c…

arXiv cs.LG TIER_1 English(EN) · Avidan Shah, Jannik Brinkmann, Rico Angell · 2026-05-28 04:00

Mitigating Adaptive Attacks against Reasoning Models with Activation Consistency Training

arXiv:2605.28467v1 Announce Type: new Abstract: As LLMs gain stronger reasoning capabilities, their extended chain-of-thought introduces new degrees of complexity for defending against adversarial jailbreaks and prompt injection. We study consistency training, a family of fine-tu…

arXiv cs.AI TIER_1 English(EN) · Phuong Minh Nguyen, Tien Huu Dang, Naoya Inoue · 2026-05-28 04:00

Revealing Algorithmic Deductive Circuits for Logical Reasoning

arXiv:2605.27824v1 Announce Type: new Abstract: Recent studies have shown that Large Language Models (LLMs) can achieve strong reasoning performance by incorporating functional symbolic representations that abstractly describe graph traversal algorithms and step-by-step reasoning…

arXiv cs.AI TIER_1 English(EN) · Pauline Bourigault, Xiaotong Ji, Matthieu Zimmer, Rasul Tutunov, Haitham Bou Ammar · 2026-05-28 04:00

Risk-Controlled Lean-as-Judge for Natural-Language Mathematical Reasoning

arXiv:2605.28365v1 Announce Type: new Abstract: Lean is increasingly used to judge natural-language mathematical answers, but its signal is partial: many answers never formalize, and a failed proof may reflect an ill-typed statement or a missing library fact, not a wrong answer. …

arXiv cs.AI TIER_1 English(EN) · Renjie Gu, Jiaxu Li, Yihao Wang, Yun Yue, Hansong Xiao, Yefei Chen, Yuan Wang, Chunxiao Guo, Pei Wei, Jinjie Gu, Yixin Cao · 2026-05-28 04:00

Bridging the Detection-to-Abstention Gap in Reasoning Models under Insufficient Information

arXiv:2605.28070v1 Announce Type: new Abstract: We highlight a failure mode of large reasoning models on questions with insufficient information: models may recognize that a problem is under-specified, yet still continue reasoning and produce unsupported final answers instead of …

arXiv cs.AI TIER_1 English(EN) · Navid Rezazadeh, Arash Gholami Davoodi · 2026-05-28 04:00

The Shape of Overthinking: Backtracking Bursts in Long Reasoning Traces

arXiv:2605.27965v1 Announce Type: new Abstract: Reasoning models often generate long traces in which useful self-correction and unproductive revision are hard to distinguish. We study this distinction through backtracking dynamics: local reconsideration, retraction, or re-derivat…

arXiv cs.AI TIER_1 English(EN) · Kohsei Matsutani, Gouki Minegishi, Takeshi Kojima, Yusuke Iwasawa, Yutaka Matsuo · 2026-05-28 04:00

Zipping the Thought: When and How Compressed Reasoning Data Works in LLM Post-Training

arXiv:2605.28008v1 Announce Type: new Abstract: Large language models (LLMs) can now solve complex problems through long chain-of-thought (CoT) reasoning, but the trade-off between performance and token cost remains a central challenge. To address this issue, supervised fine-tuni…

arXiv cs.AI TIER_1 English(EN) · Chien-Ping Lu · 2026-05-28 04:00

The Computational Boundary of Inference: Capability Internalization, Training, and the Turing Jump

arXiv:2605.27381v1 Announce Type: cross Abstract: Claims about recursive self-improvement in AI often slide from repeated internal revision to the possibility of qualitatively stronger capability without clearly distinguishing the underlying computational regimes. This paper give…

arXiv cs.AI TIER_1 English(EN) · Taylor Olson, Roberto Salas-Damian, Kenneth D. Forbus · 2026-05-28 04:00

Reasoning and Planning with Dynamically Changing Norms

arXiv:2605.27622v1 Announce Type: new Abstract: To safely interact with humans, AI agents must both know our norms and consider them during planning. However, such norm-guided planning has been less explored, only within communities of artificial agents, and has ignored the dynam…

arXiv cs.AI TIER_1 English(EN) · Judith E. Fan · 2026-05-27 17:01

CORE: Contrastive Reflection Enables Rapid Improvements in Reasoning

Language models can use verifiable rewards to improve at a wide variety of reasoning tasks. However, both parametric (e.g. RLVR) and non-parametric (e.g. prompt optimization) approaches to doing so typically require hundreds of training samples and thousands of model rollouts, ma…

arXiv cs.AI TIER_1 English(EN) · Daiting Shi · 2026-05-27 16:36

Thinking as Compression: Your Reasoning Model is Secretly a Context Compressor

Context compression aims to shorten long context inputs with minimal information loss for LLM inference acceleration. While existing methods have shown promise, they typically rely on complex compression modules or compression-specific training, leaving the intrinsic capabilities…

arXiv cs.AI TIER_1 English(EN) · Sheng Chen · 2026-05-27 15:18

Satisfiability Solving with LLMs: A Matched-Pair Evaluation of Reasoning Capability

Large language models (LLMs) are increasingly used for tasks that implicitly reduce to Boolean satisfiability (SAT), yet their reasoning ability on SAT remains unclear. We present a systematic study of LLMs on 2-SAT and 3-SAT, together with two canonical reductions, Vertex Cover …

arXiv cs.LG TIER_1 English(EN) · Rico Angell · 2026-05-27 13:33

Mitigating Adaptive Attacks against Reasoning Models with Activation Consistency Training

As LLMs gain stronger reasoning capabilities, their extended chain-of-thought introduces new degrees of complexity for defending against adversarial jailbreaks and prompt injection. We study consistency training, a family of fine-tuning objectives that enforce identical behavior …

arXiv cs.CL TIER_1 English(EN) · Haitham Bou Ammar · 2026-05-27 11:59

Risk-Controlled Lean-as-Judge for Natural-Language Mathematical Reasoning

Lean is increasingly used to judge natural-language mathematical answers, but its signal is partial: many answers never formalize, and a failed proof may reflect an ill-typed statement or a missing library fact, not a wrong answer. On MATH-500 we show this signal is (i) sharply c…

arXiv cs.CL TIER_1 English(EN) · Jun-Hyung Park · 2026-05-27 10:40

CIRF: Tokenizing Chain-of-Thoughts into Reusable Functional Units for Efficient Latent Reasoning in Large Language Models

Implicit Chain-of-Thought (CoT) reduces the inference cost of large language models by internalizing the explicit rationales. However, existing approaches typically lack alignment with explicit rationales and adaptivity to example complexity. In this work, we propose CIRF (\texti…

Hugging Face Daily Papers TIER_1 English(EN) · 2026-05-27 07:28

Bridging the Detection-to-Abstention Gap in Reasoning Models under Insufficient Information

We highlight a failure mode of large reasoning models on questions with insufficient information: models may recognize that a problem is under-specified, yet still continue reasoning and produce unsupported final answers instead of abstaining. We formalize this mismatch as the de…

Hugging Face Daily Papers TIER_1 English(EN) · 2026-05-27 06:09

ROSD: Reflective On-Policy Self-Distillation for Language Model Reasoning across Domains

On-policy self-distillation (OPSD) improves the reasoning performance of large language models (LLMs) by providing dense token-level supervision for on-policy rollouts. However, existing OPSD methods often yield limited gains on in-domain reasoning and generalize poorly to out-of…

Hugging Face Daily Papers TIER_1 English(EN) · 2026-05-27 06:02

Zipping the Thought: When and How Compressed Reasoning Data Works in LLM Post-Training

Large language models (LLMs) can now solve complex problems through long chain-of-thought (CoT) reasoning, but the trade-off between performance and token cost remains a central challenge. To address this issue, supervised fine-tuning (SFT) often uses compressed reasoning data, w…

Hugging Face Daily Papers TIER_1 English(EN) · 2026-05-27 04:07

GeneralThinker: Domain-General Reasoning through Likelihood-Guided Answer-Conditioned Optimization

Reinforcement learning with verifiable rewards improves language model reasoning, but its reliance on domain-specific verifiers, sparse outcome rewards, and coarse-grained credit assignment limits its applicability. We introduce GeneralThinker, an on-policy framework that reformu…

arXiv cs.CL TIER_1 English(EN) · Lisong Sun, Li Wang, Chen Zhang, Jinyang Wu, Kui Zhang, Tianhao Peng, Wenjun Wu · 2026-05-27 04:00

Learning to Adapt SFT Data for Better Reasoning Generalization

arXiv:2605.26924v1 Announce Type: new Abstract: Large language models (LLMs) have achieved remarkable progress, with post-training playing a crucial role in enhancing their reasoning capabilities. Among post-training paradigms, supervised fine-tuning (SFT) is widely used: it leve…

arXiv cs.AI TIER_1 English(EN) · Seonghoon Yu, Dongjun Nam, Byung-Kwan Lee, Jeany Son · 2026-05-27 04:00

Hide to See: Reasoning-prefix Masking for Visual-anchored Thinking in VLM Distillation

arXiv:2605.11651v4 Announce Type: replace-cross Abstract: Recent think-answer approaches in VLMs, such as Qwen3-VL-Thinking, boost reasoning performance by leveraging intermediate thinking steps before the final answer, but their computational cost becomes substantial, especially…

arXiv cs.AI TIER_1 English(EN) · Xuhang Chen, Zhifan Song, Deyi Ji, Shuo Gao, Lanyun Zhu · 2026-05-27 04:00

Self-signals Driven Multi-LLM Debate for Efficient and Accurate Reasoning

arXiv:2510.06843v2 Announce Type: replace-cross Abstract: Large Language Models (LLMs) have exhibited impressive capabilities across diverse application domains. Recent work has explored Multi-LLM Agent Debate (MAD) as a way to enhance performance by enabling multiple LLMs to dis…

arXiv cs.AI TIER_1 English(EN) · Hans Peter Lyngsøe Raaschou-Jensen, Constanza Fierro, Anders Søgaard · 2026-05-27 04:00

Real-Time Progress Prediction in Reasoning Language Models

arXiv:2506.23274v4 Announce Type: replace-cross Abstract: Recent reasoning language models, particularly those that employ long latent chains of thought, achieve strong performance on complex agentic tasks. However, as these models operate over increasingly long time horizons, th…

arXiv cs.AI TIER_1 English(EN) · Meghyn Bienvenu, Camille Bourgaux · 2026-05-27 04:00

Querying and Repairing Inconsistent Prioritized Knowledge Bases: Complexity Analysis and Links with Abstract Argumentation

arXiv:2003.05746v4 Announce Type: replace-cross Abstract: In this paper, we explore the issue of inconsistency handling over prioritized knowledge bases (KBs), which consist of an ontology, a set of facts, and a priority relation between conflicting facts. In the database setting…

arXiv cs.AI TIER_1 English(EN) · Yihua Zhu, Qianying Liu, Fei Cheng, Jiaxin Wang, Akiko Aizawa, Sadao Kurohashi, Hidetoshi Shimodaira · 2026-05-27 04:00

Reasoning Depth and Environment Complexity: A Controlled Study of RLVR Data Allocation across Logical Reasoning Tasks

arXiv:2605.26934v1 Announce Type: cross Abstract: Reinforcement learning with verifiable rewards (RLVR) has become central to post-training reasoning models, yet a key limitation of existing studies is their narrow view of the reasoning space: difficulty is treated as reasoning d…

arXiv cs.LG TIER_1 English(EN) · Alex Ayoub, Kavosh Asadi, Dale Schuurmans, Csaba Szepesvári, Karim Bouyarmane · 2026-05-27 04:00

Learning to Reason Efficiently with Discounted Reinforcement Learning

arXiv:2510.23486v2 Announce Type: replace Abstract: Large reasoning models (LRMs) often consume excessive tokens, inflating computational cost and latency. More broadly, in goal reaching sequential decision problems we often want to reach the goal quickly, and LRM reasoning can b…

arXiv cs.AI TIER_1 English(EN) · Xiao-Wen Yang, Ziyu Han, Xi-Hua Zhang, Wen-Da Wei, Jie-Jing Shao, Lan-Zhe Guo, Yu-Feng Li · 2026-05-27 04:00

Stabilizing Recurrent Dynamics for Test-Time Scalable Latent Reasoning in Looped Language Models

arXiv:2605.26733v1 Announce Type: cross Abstract: Looped Language Models (LoopLMs) enable efficient latent reasoning through depth recurrence, yet exhibit unreliable test-time scaling behavior: performance often peaks at a certain iteration depth and then collapses with further r…

arXiv cs.AI TIER_1 English(EN) · Shanghao Li, Jinda Han, Yibo Wang, Yuanjie Zhu, Zihe Song, Langzhou He, Kenan Kamel A Alghythee, Philip S. Yu · 2026-05-27 04:00

Why LLMs Hallucinate on Structured Knowledge: A Mechanistic Analysis of Reasoning over Linearized Representations

arXiv:2605.26362v1 Announce Type: cross Abstract: In many reasoning tasks, large language models (LLMs) rely on structured external knowledge, such as graphs and tables, which is typically linearized into sequential token representations. However, even when sufficient knowledge i…

arXiv cs.AI TIER_1 English(EN) · Zhe Yu, Wenpeng Xing, Yunzhao Wei, Jie Chen, Hongzhi Wang, Xuyang Teng, Meng Han · 2026-05-27 04:00

Composition Collapse: Stable Factual Knowledge Does Not Imply Compositional Reasoning

arXiv:2605.26789v1 Announce Type: new Abstract: Post-training is routinely evaluated through aggregate benchmark scores that treat multi-hop reasoning as a single capability -- as if a model that answers more questions correctly must be better at assembling facts. We show that th…

Hugging Face Daily Papers TIER_1 English(EN) · 2026-05-27 01:30

Revealing Algorithmic Deductive Circuits for Logical Reasoning

Recent studies have shown that Large Language Models (LLMs) can achieve strong reasoning performance by incorporating functional symbolic representations that abstractly describe graph traversal algorithms and step-by-step reasoning in few-shot learning settings. However, it rema…

Hugging Face Daily Papers TIER_1 English(EN) · 2026-05-27 00:00

The Chain Holds, the Answer Folds: Trace-Answer Dissociation in Reasoning Models Under Adversarial Pressure

Research reveals a new failure mode in reasoning models where correct chain-of-thought reasoning leads to incorrect final answers under adversarial conditions, demonstrated through controlled experiments across multiple datasets and models.

Hugging Face Daily Papers TIER_1 English(EN) · 2026-05-27 00:00

Revealing Algorithmic Deductive Circuits for Logical Reasoning

Large language models use specialized attention heads for retrieving factual information and integrating multi-step reasoning, with distinct neural mechanisms for local reasoning steps versus global strategy coordination.

Hugging Face Daily Papers TIER_1 English(EN) · 2026-05-27 00:00

CORE: Contrastive Reflection Enables Rapid Improvements in Reasoning

Contrastive Reflection (CORE) improves language model reasoning by analyzing differences between successful and unsuccessful attempts to generate concise, interpretable insights that enable faster and more efficient self-improvement compared to traditional parametric and non-para…

arXiv cs.AI TIER_1 English(EN) · Hidetoshi Shimodaira · 2026-05-26 12:28

Reasoning Depth and Environment Complexity: A Controlled Study of RLVR Data Allocation across Logical Reasoning Tasks

Reinforcement learning with verifiable rewards (RLVR) has become central to post-training reasoning models, yet a key limitation of existing studies is their narrow view of the reasoning space: difficulty is treated as reasoning depth alone, and reward is concentrated on forward …

arXiv cs.CL TIER_1 English(EN) · Wenjun Wu · 2026-05-26 12:20

Learning to Adapt SFT Data for Better Reasoning Generalization

Large language models (LLMs) have achieved remarkable progress, with post-training playing a crucial role in enhancing their reasoning capabilities. Among post-training paradigms, supervised fine-tuning (SFT) is widely used: it leverages external data to provide dense supervision…

arXiv cs.CL TIER_1 English(EN) · Yuming Yang, Mingyoung Lai, Wanxu Zhao, Xiaoran Fan, Zhiheng Xi, Mingqi Wu, Chiyue Huang, Jun Zhao, Haijun Lv, Jian Tong, Yunhua Zhou, Yicheng Zou, Qipeng Guo, Tao Gui, Qi Zhang, Xuanjing Huang · 2026-05-26 04:00

Which Reasoning Trajectories Teach Students to Reason Better? A Simple Metric of Informative Alignment

arXiv:2601.14249v5 Announce Type: replace Abstract: Long chain-of-thought (CoT) trajectories provide rich supervision signals for distilling reasoning from teacher to student LLMs. However, both prior work and our experiments show that trajectories from stronger teachers do not n…

arXiv cs.LG TIER_1 English(EN) · Wenbo Pan, Zhichao Liu, Xianlong Wang, Haining Yu, Xiaohua Jia · 2026-05-26 04:00

Towards Long-Horizon Interpretability: Efficient and Faithful Multi-Token Attribution for Reasoning LLMs

arXiv:2602.01914v2 Announce Type: replace Abstract: Token attribution methods provide intuitive explanations for language model outputs by identifying causally important input tokens. However, as modern LLMs increasingly rely on extended reasoning chains, existing schemes face tw…

arXiv cs.CL TIER_1 English(EN) · Lisa Alazraki, Lihu Chen, Ana Brassard, Joe Stacey, Hossein A. Rahmani, Marek Rei · 2026-05-26 04:00

AgentCoMa: A Compositional Benchmark Mixing Commonsense and Mathematical Reasoning in Real-World Scenarios

arXiv:2508.19988v3 Announce Type: replace Abstract: Large Language Models (LLMs) have achieved high accuracy on complex commonsense and mathematical problems that involve the composition of multiple reasoning steps. However, current compositional benchmarks testing these skills t…

arXiv cs.CL TIER_1 English(EN) · Hui Xie, Jie Liu, Ziyue Qiao, Joaquin Vanschore · 2026-05-26 04:00

Selective Latent Thinking: Adaptive Compression of LLM Reasoning Chains

arXiv:2605.25745v1 Announce Type: new Abstract: Explicit chain-of-thought (CoT) reasoning substantially improves the reasoning ability of large language models (LLMs), but incurs high inference cost due to lengthy autoregressive traces. Existing latent reasoning methods offer a p…

arXiv cs.CL TIER_1 English(EN) · Zongji Yu, Wenshui Luo, Yiliu Sun, Hao Fang, Runmin Cong, Chaochao Lu, Chen Gong · 2026-05-26 04:00

Harmony in Diversity: Multi-domain Contrastive Policy Optimization for Large Reasoning Models

arXiv:2605.25443v1 Announce Type: new Abstract: Post-training has significantly enhanced the reasoning capability of Large Reasoning Models (LRMs), especially with Reinforcement Learning (RL) like Group Relative Policy Optimization (GRPO). However, GRPO-style RL methods in multi-…

arXiv cs.CL TIER_1 English(EN) · Qihuang Zhong, Liang Ding, Juhua Liu, Bo Du, Leszek Rutkowski, Dacheng Tao · 2026-05-26 04:00

Better, Faster: Harnessing Self-Improvement in Large Reasoning Models

arXiv:2605.24998v1 Announce Type: new Abstract: Self-improvement training enables the large reasoning models (LRMs) to improve themselves by self-generating reasoning trajectories as training data without external supervision. However, we find that this method often falls short i…

arXiv cs.AI TIER_1 English(EN) · Serafim Batzoglou · 2026-05-26 04:00

INDUCTION: Finite-Structure Concept Synthesis in First-Order Logic

arXiv:2602.18956v3 Announce Type: replace Abstract: We introduce INDUCTION, a benchmark for finite structure concept synthesis in first order logic. Given small finite relational worlds with extensionally labeled target predicates, models must output a single first order logical …

arXiv cs.AI TIER_1 English(EN) · Szymon Bobek, Łukasz Bałec, Grzegorz J. Nalepa · 2026-05-26 04:00

Actionable and diverse counterfactual explanations incorporating domain knowledge and plausibility constraints

arXiv:2511.20236v3 Announce Type: replace Abstract: Counterfactual explanations improve the actionable interpretability of machine learning models by identifying minimal changes required to achieve a desired outcome. However, existing methods often neglect dependencies among feat…

arXiv cs.AI TIER_1 English(EN) · Mingyu Zhang, Lifeng Zhuo, Tianxi Tan, Guocan Xie, Xian Nie, Yan Li, Renjie Zhao, Zizhu He, Ziyu Wang, Jiting Cai, Yong-Lu Li · 2026-05-26 04:00

IPR-1: Interactive Physical Reasoner

arXiv:2511.15407v4 Announce Type: replace Abstract: Humans learn by observing, interacting with environments, and internalizing physics and causality. Here, we aim to ask whether an agent can similarly acquire human-like reasoning from interaction and keep improving with more exp…

arXiv cs.AI TIER_1 English(EN) · Thomas A. Buckley, Riccardo Conci, Peter G. Brodeur, Jason Gusdorf, Sourik Beltrán, Bita Behrouzi, Byron Crowe, Jacob Dockterman, Muzzammil Muhammad, Sarah Ohnigian, Andrew Sanchez, James A. Diao, Aashna P. Shah, Daniel Restrepo, Eric S. Rosenberg, And… · 2026-05-26 04:00

Teaching large language models to reason like expert diagnosticians

arXiv:2509.12194v2 Announce Type: replace Abstract: Differential diagnosis is an iterative process that integrates patient information with broader medical knowledge. Clinical case series such as the NEJM Clinicopathologic Conferences (CPCs), published continuously since 1923, fe…

arXiv cs.AI TIER_1 English(EN) · Qirun Dai, Xiao Liu, Jiawei Zhang, Dylan Zhang, Hao Peng, Chenhao Tan · 2026-05-26 04:00

Towards a Universal Causal Reasoner

arXiv:2605.24873v1 Announce Type: cross Abstract: Despite the importance of causal reasoning, training LLMs to reason causally remains underexplored. Existing data efforts mostly focus on benchmarking LLMs on specific aspects of causality, making them less suitable for training g…

arXiv cs.AI TIER_1 English(EN) · Hongbo Jin, Mingnan Zhu, Jingqi Tian, Xu Jiang, Zhongjing Du, Haoran Tang, Siyi Xie, Qiaoman Zhang, Jiayu Ding · 2026-05-26 04:00

Context-CoT: Enhancing Context Learning via High-Quality Reasoning Synthesis

arXiv:2605.25354v1 Announce Type: new Abstract: While LLMs excel at reasoning over prompts using static pretrained knowledge, they struggle significantly with context learning, the ability to dynamically extract, internalize, and apply new knowledge from complex, task-specific con…

arXiv cs.AI TIER_1 English(EN) · Andrew Corbett, Archit Sood, Anna Tzatzopoulou, Sai-Aakash Ramesh, Tim Dodwell · 2026-05-26 04:00

Boosting Inference with Guided Reasoning: Stochastic Exploration for Recursive Models

arXiv:2605.25230v1 Announce Type: new Abstract: Recent work on recursive architectures has shown that tiny neural networks can be surprisingly powerful on structured reasoning tasks. The trick is to model reasoning trajectories with a latent dynamical system. We argue that the in…

arXiv cs.AI TIER_1 English(EN) · Andreas Opedal, Francesco Ignazio Re, Abulhair Saparov, Mrinmaya Sachan, Bernhard Schölkopf, Ryan Cotterell · 2026-05-26 04:00

Learning to Reason Efficiently with A* Post-Training

arXiv:2605.24597v1 Announce Type: new Abstract: Many applications of large language models (LLMs) require deductive reasoning, yet models frequently produce incorrect or redundant inference steps. We frame natural language inference as a search problem where the final answer is t…

Hugging Face Daily Papers TIER_1 English(EN) · 2026-05-25 11:57

Selective Latent Thinking: Adaptive Compression of LLM Reasoning Chains

Explicit chain-of-thought (CoT) reasoning substantially improves the reasoning ability of large language models (LLMs), but incurs high inference cost due to lengthy autoregressive traces. Existing latent reasoning methods offer a promising alternative, yet they often treat reaso…

arXiv cs.CL TIER_1 English(EN) · Joaquin Vanschore · 2026-05-25 11:57

Selective Latent Thinking: Adaptive Compression of LLM Reasoning Chains

Explicit chain-of-thought (CoT) reasoning substantially improves the reasoning ability of large language models (LLMs), but incurs high inference cost due to lengthy autoregressive traces. Existing latent reasoning methods offer a promising alternative, yet they often treat reaso…

arXiv cs.CL TIER_1 English(EN) · Chen Gong · 2026-05-25 05:42

Harmony in Diversity: Multi-domain Contrastive Policy Optimization for Large Reasoning Models

Post-training has significantly enhanced the reasoning capability of Large Reasoning Models (LRMs), especially with Reinforcement Learning (RL) like Group Relative Policy Optimization (GRPO). However, GRPO-style RL methods in multi-domain settings often fail to achieve consistent…

arXiv cs.LG TIER_1 English(EN) · Meir Roketlishvili, Semyon Semenov, Maksim Bobrin, Viktor Kovalchuk, Albert Baichorov, Abduragim Shtanchaev, Fakhri Karray, Dmitry V. Dylov, Martin Takáč, Arip Asadulaev · 2026-05-25 04:00

Convex Compositional Reasoning Models

arXiv:2605.23395v1 Announce Type: new Abstract: Compositional energy-based models can generalize to larger combinatorial reasoning problems by reusing a learned factor energy across many local constraints. In our paper, we show that a key bottleneck in compositional reasoning is …

arXiv cs.LG TIER_1 English(EN) · Hoang Phan, Quang H. Nguyen, Hung T. Q. Le, Xiusi Chen, Heng Ji, Khoa D. Doan · 2026-05-25 04:00

Decoding the Critique Mechanism in Large Reasoning Models

arXiv:2603.16331v2 Announce Type: replace Abstract: Large Reasoning Models (LRMs) exhibit backtracking and self-verification mechanisms that enable them to revise intermediate steps and reach correct solutions, yielding strong performance on complex logical benchmarks. We hypothe…

arXiv cs.CL TIER_1 English(EN) · Zhe Yuan, Yipeng Zhou, Jinghan Li, Xinyuan Chen, Bowen Deng, Zhiqian Chen, Liang Zhao · 2026-05-25 04:00

LambdaPO: A Lambda Style Policy Optimization for Reasoning Language Models

arXiv:2605.19416v2 Announce Type: replace Abstract: Group Relative Policy Optimization (GRPO) has become a cornerstone of modern reinforcement learning alignment, prized for its efficacy in forgoing an explicit value-critic by leveraging reward normalization across sampled trajec…

arXiv cs.AI TIER_1 English(EN) · Junyao Yang, Chen Qian, Kun Wang, Linfeng Zhang, Quanshi Zhang, Yong Liu, Dongrui Liu · 2026-05-25 04:00

Entropy-Gradient Inversion: Moving Toward Internal Mechanism of Large Reasoning Models

arXiv:2605.17770v2 Announce Type: replace Abstract: The advancement of Large Reasoning Models (LRMs) has catalyzed a paradigm shift from reactive "fast thinking" text generation to systematic, step-by-step "slow thinking" reasoning, unlocking state-of-the-art performance in c…

Hugging Face Daily Papers TIER_1 English(EN) · 2026-05-25 00:00

Does Seeing More Mean Knowing More? Mono-Anchored Advantage Normalization for Multi-Source Visual Reasoning

A novel mono-anchored multi-source reasoning framework that uses dynamic anchors to quantify information gain and regulate modality interactions during reinforcement learning with verifiable rewards.

arXiv cs.LG TIER_1 English(EN) · Arip Asadulaev · 2026-05-22 09:04

Convex Compositional Reasoning Models

Compositional energy-based models can generalize to larger combinatorial reasoning problems by reusing a learned factor energy across many local constraints. In our paper, we show that a key bottleneck in compositional reasoning is not composition itself, but the non-convex geome…

Hugging Face Daily Papers TIER_1 English(EN) · 2026-05-22 00:00

Decoding the Critique Mechanism in Large Reasoning Models

Large Reasoning Models demonstrate hidden critique abilities that allow error recovery through internal mechanisms, identified via interpretable critique vectors that enhance error detection without additional training.

Hugging Face Daily Papers TIER_1 English(EN) · 2026-05-20 00:00

Equilibrium Reasoners: Learning Attractors Enables Scalable Reasoning

Equilibrium Reasoners enable scalable reasoning through task-conditioned attractors that guide latent dynamical systems toward valid solutions, achieving significant accuracy improvements through iterative test-time computation.

Hugging Face Daily Papers TIER_1 English(EN) · 2026-05-19 06:10

LambdaPO: A Lambda Style Policy Optimization for Reasoning Language Models

Group Relative Policy Optimization (GRPO) has become a cornerstone of modern reinforcement learning alignment, prized for its efficacy in forgoing an explicit value-critic by leveraging reward normalization across sampled trajectory cohorts. However, the method's reliance on a mo…
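The reward normalization that the abstract attributes to GRPO can be illustrated with a minimal sketch. It assumes the commonly described form, in which each rollout's reward is standardized against the other rollouts sampled for the same prompt. The function name `group_relative_advantages` and the `eps` stabilizer are illustrative choices, not taken from the paper, and this is not the LambdaPO method itself.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of rollouts for the same prompt.

    Hypothetical helper: the group mean acts as the baseline, so no learned
    value critic is needed. `eps` avoids division by zero when every rollout
    in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four rollouts for one prompt, scored by a binary verifier.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Rollouts that beat the group average receive a positive advantage and the rest a negative one. When all rewards in a group are equal, every advantage is zero and the group contributes no learning signal.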

arXiv cs.CV TIER_1 English(EN) · Hong Yang, Basura Fernando · 2026-06-17 04:00

ERQA-Plus: A Diagnostic Benchmark for Reasoning in Embodied AI

arXiv:2606.17639v1 Announce Type: cross Abstract: Generalist embodied agents require more than object recognition: they must reason about spatial relations, actions, procedures, human intentions, environmental constraints, and commonsense consequences from situated visual observa…

arXiv cs.CV TIER_1 English(EN) · Basura Fernando · 2026-06-16 07:56

ERQA-Plus: A Diagnostic Benchmark for Reasoning in Embodied AI

Generalist embodied agents require more than object recognition: they must reason about spatial relations, actions, procedures, human intentions, environmental constraints, and commonsense consequences from situated visual observations. Yet existing visual and embodied question a…

arXiv cs.CV TIER_1 English(EN) · Chaoyu Li, Deeparghya Dutta Barua, Fei Tao, Pooyan Fazli · 2026-06-16 04:00

CASHEW: Stabilizing Multimodal Reasoning via Iterative Trajectory Aggregation

arXiv:2601.08010v2 Announce Type: replace Abstract: Vision-language models achieve strong performance across a wide range of multimodal understanding and reasoning tasks, yet their multi-step reasoning remains unstable. Repeated sampling over the same input often produces diverge…

arXiv stat.ML TIER_1 English(EN) · Ousmane Amadou Dia · 2026-06-15 04:00

Adaptive Nucleus Truncation for Long-Form Reasoning

arXiv:2606.13982v1 Announce Type: new Abstract: Sampling plays an important role in long-form language-model reasoning. Over thousands of decoding steps, small changes in the candidate token set can compound into different reasoning trajectories, stability profiles, and final ans…

arXiv stat.ML TIER_1 English(EN) · Baohao Liao, Hanze Dong, Yuhui Xu, Doyen Sahoo, Christof Monz, Junnan Li, Caiming Xiong · 2026-06-15 04:00

Fractured Chain-of-Thought Reasoning

arXiv:2505.12992v4 Announce Type: replace-cross Abstract: Inference-time scaling techniques have significantly bolstered the reasoning capabilities of large language models (LLMs) by harnessing additional computational effort at inference without retraining. Similarly, Chain-of-T…

arXiv stat.ML TIER_1 English(EN) · Ousmane Amadou Dia · 2026-06-12 00:02

Adaptive Nucleus Truncation for Long-Form Reasoning

Sampling plays an important role in long-form language-model reasoning. Over thousands of decoding steps, small changes in the candidate token set can compound into different reasoning trajectories, stability profiles, and final answers. Existing truncation methods such as top-$p…
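For orientation, standard top-p (nucleus) truncation, the baseline the abstract mentions, can be sketched as follows. This shows only the fixed-threshold baseline, not the adaptive method the paper proposes. The function name `top_p_filter` and the plain-list representation of token probabilities are assumptions made for illustration.

```python
def top_p_filter(probs, p=0.9):
    """Keep the smallest set of most-probable tokens whose mass reaches p.

    Hypothetical helper: `probs` is a list of next-token probabilities.
    Tokens outside the nucleus get probability 0 and the kept tokens are
    renormalized so the result sums to 1.
    """
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, total = [], 0.0
    for i in order:
        kept.append(i)
        total += probs[i]
        if total >= p:
            break
    filtered = [0.0] * len(probs)
    for i in kept:
        filtered[i] = probs[i] / total
    return filtered


# With p=0.75 only the two most likely tokens survive.
nucleus = top_p_filter([0.5, 0.3, 0.15, 0.05], p=0.75)
```

Because the candidate set is recomputed at every decoding step, a small change in `p` can admit or exclude a token at one step and send the rest of the trace down a different path, which is the compounding effect the abstract describes.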

arXiv cs.CV TIER_1 English(EN) · Han Huang, Hao Wang, Mengqi Zhang, Shu Wu, Qiang Liu, Liang Wang · 2026-06-09 04:00

CRANE: Knowledge Editing for Reasoning MLLMs

arXiv:2606.09033v1 Announce Type: new Abstract: The emergence of reasoning multimodal large language models (MLLMs), which generate explicit chain-of-thought (CoT) reasoning before producing answers, has introduced a new challenge for knowledge editing: methods that appear succes…

arXiv cs.CV TIER_1 English(EN) · Leyi Wu, Yifan Zhao, Jinjie Zhang, Yinchuan Li, Ying-Cong Chen · 2026-06-02 04:00

The Right Inference Strategy Is All You Need: Nearly Training-Free Domain-Wise Inference for EgoCross Challenge

arXiv:2606.00829v1 Announce Type: new Abstract: EgoCross evaluates multimodal large language models on egocentric video question answering under substantial domain shift, where test videos come from surgery, industrial assembly, extreme sports, and animal-mounted cameras rather t…

arXiv stat.ML TIER_1 English(EN) · Felix Zhou, Anay Mehrotra, Quanquan C. Liu · 2026-05-29 04:00

Reasoning with Sampling: Cutting at Decision Points

arXiv:2605.30327v1 Announce Type: cross Abstract: Frontier reasoning models are produced by posttraining base language models with reinforcement learning. Recent work has challenged this by showing that sampling from a sharpened version of the base model's distribution, a so-call…

arXiv stat.ML TIER_1 English(EN) · Matt Y. Cheung, Ashok Veeraraghavan, Hanjie Chen, Guha Balakrishnan · 2026-05-29 04:00

Conformal Certification of Reasoning Trace Prefixes

arXiv:2605.30085v1 Announce Type: cross Abstract: Language model reasoning traces are rarely all-or-nothing; they frequently contain valid intermediate steps before a critical error occurs. Existing uncertainty quantification methods typically certify final answers or entire resp…

arXiv cs.CV TIER_1 English(EN) · Fanhu Zeng, Zhicong Luo, Zefan Wang, You Li, Chi Chen, Maosong Sun · 2026-05-26 04:00

Does Seeing More Mean Knowing More? Mono-Anchored Advantage Normalization for Multi-Source Visual Reasoning

arXiv:2605.25437v1 Announce Type: new Abstract: Visual reasoning through reinforcement learning with verifiable rewards (RLVR) has achieved remarkable progress. However, when dealing with multi-source inputs, existing approaches tend to treat them as a mere accumulation of inform…

Together AI blog TIER_1 English(EN) · 2025-10-22 00:00

Large Reasoning Models Fail to Follow Instructions During Reasoning: A Benchmark Study

ReasonIF finds frontier LRMs fail to follow reasoning instructions >75% of the time; introduces a benchmark across languages, formatting, and length.

MarkTechPost TIER_1 English(EN) · Asif Razzaq · 2026-06-19 22:06

VibeThinker-3B: A 3B Dense Reasoning Model Built on Qwen2.5-Coder-3B With the Spectrum-to-Signal Post-Training Pipeline

VibeThinker-3B, a 3B MIT-licensed reasoning model matching DeepSeek V3.2 and Kimi K2.5 on verifiable benchmarks.

Medium — fine-tuning tag TIER_1 English(EN) · Dave R - Microsoft Azure & AI MVP☁️ · 2026-06-18 15:28

Post-Training Open Source Reasoning Models in Microsoft Foundry: From Production Traces to a…

https://medium.com/codex/post-training-open-source-reasoning-models-in-microsoft-foundry-from-production-traces-to-a-0362349438a0?source=rss------fine_tuning-5

Towards AI TIER_1 English(EN) · Nehdiii · 2026-06-09 22:01

Can Reinforcement Learning Help LLMs Discover New Reasoning Strategies?

https://pub.towardsai.net/can-reinforcement-learning-help-llms-discover-new-reasoning-strategies-f50b1b054ec7?source=rss----98111c9905da---4

dev.to — MCP tag TIER_1 English(EN) · curatedmcp · 2026-06-04 10:37

Sequential Thinking MCP: Break Down Hard Problems Into Solvable Steps

Install guide and config at curatedmcp.com (https://curatedmcp.com/install/sequential-thinking-mcp/claude-desktop). …

Towards AI TIER_1 English(EN) · Faheem Munshi · 2026-06-03 12:01

Chain-of-Thought Prompting: Getting AI to Reason Step by Step — Prompt to Profit · Day 11 of 30

The single technique that separates AI users who get plausible answers from those who get genuinely intelligent ones. Welcome to Week 3. For the past two weeks, you’ve been building your foundation: prompting structure, templates, roles, workflows. Today we s…

Medium — Claude tag TIER_1 English(EN) · Chris Jones · 2026-06-02 12:34

Democratizing the Expression of Deterministic Logic with AI Weldr

https://medium.com/@jzace42011/democratizing-the-expression-of-deterministic-logic-with-ai-weldr-2d2699db4578?source=rss------claude-5

Mastodon — fosstodon.org TIER_1 English(EN) · [email protected] · 2026-06-21 12:46

Reasoning models like o1 and DeepSeek-R1 differ from standard LLMs by generating an explicit chain of thought at inference time — here is how that architecture

Reasoning models like o1 and DeepSeek-R1 differ from standard LLMs by generating an explicit chain of thought at inference time: here is how that architecture actually works. https://www.nerdheadz.com/blog/reasoning-models-explained-o1-deepseek-r1-rlms #ai #machinelearning

LINKS nerdheadz.com/…/reasoning-models-explaine…

Mastodon — fosstodon.org TIER_1 English(EN) · [email protected] · 2026-06-19 22:06

VibeThinker-3B: A 3B Dense Reasoning Model Built on Qwen2.5-Coder-3B With the Spectrum-to-Signal Post-Training Pipeline VibeThinker-3B, a 3B MIT-licensed reason

VibeThinker-3B: A 3B Dense Reasoning Model Built on Qwen2.5-Coder-3B With the Spectrum-to-Signal Post-Training Pipeline VibeThinker-3B, a 3B MIT-licensed reasoning model matching DeepSeek V3.2 and ... #AI #Paper #Summary #AI #Shorts #Applications #Artificial #Intelligence #Editor…

LINKS awakari.com/sub-details.html awakari.com/pub-msg.html

dev.to — LLM tag TIER_1 English(EN) · Michael "Mike" K. Saleme · 2026-06-15 15:05

When the guardrail becomes the target: reasoning-extension DoS against LLM safety layers

New research from HKUST (arXiv:2606.14517, https://arxiv.org/abs/2606.14517, June 12) turns the agent safety layer into the attack surface. What happened: Reasoning-based guardrails, the LLM safety layers that screen an …

r/MachineLearning TIER_1 English(EN) · /u/Future_Caregiver_643 · 2026-06-14 22:38

I built an open-source Knowledge Graph pipeline with hybrid retrieval to improve LLM multi-hop reasoning [P]

Hey everyone, I built an open-source full-stack pipeline (Django + React) that constructs a Knowledge Graph from raw text, detects thematic communities, and uses hybrid search to solve the "lost in the middle" problem in standard…

dev.to — LLM tag TIER_1 English(EN) · Gabriel Anhaia · 2026-06-13 10:46

Chain-of-Thought When It Hurts: 3 Tasks Where Reasoning Backfires

Book: Prompt Engineering Pocket Guide: Techniques for Getting the Most from LLMs (https://www.amazon.com/dp/B0GX38N645). Also by me: Thinking in Go (2-book series) …

Mastodon — fosstodon.org TIER_1 English(EN) · [email protected] · 2026-06-09 13:12

"Artificial Intelligence for Mathematical Reasoning: An Integrated Survey of Language Models, Neuro-symbolic Systems, and Verified Discovery" We critically asse

"Artificial Intelligence for Mathematical Reasoning: An Integrated Survey of Language Models, Neuro-symbolic Systems, and Verified Discovery" We critically assess failure modes: brittleness under perturbation, reward hacking, multimodal grounding failures, fragile formalization, …

LINKS arxiv.org/…/2606.08728 github.com/…/awesome-AI4Math

dev.to — LLM tag TIER_1 English(EN) · Alex Towell · 2026-06-07 03:20

Value Functions Over Reasoning Traces

In Latent Reasoning Traces (https://metafunctor.com/post/2024-10-15-latent-reasoning-traces/), I described a simple system: store successful reasoning traces, retrieve similar ones, use them to scaffold new problems. The traces serve as le…

dev.to — LLM tag TIER_1 English(EN) · Alex Towell · 2026-06-07 03:05

MCTS-Reasoning: Tree Search for LLM Reasoning

I've been working on applying Monte Carlo Tree Search to LLM reasoning. The idea: multi-step reasoning is a sequential decision problem, and MCTS is good at those. The Problem with Single-Shot Reasoning: When you ask an LLM a hard question, it generates one re…

dev.to — LLM tag TIER_1 English(EN) · Alex Towell · 2026-06-07 02:58

Latent Reasoning Traces: Memory as Learned Prior

Every time you ask an LLM a question, it reasons from scratch. All that computation (the chain of thought, the intermediate steps, the successful pattern that led to a correct answer) evaporates the moment the response is complete. The model doesn't learn from its own s…

dev.to — LLM tag TIER_1 English(EN) · keeper · 2026-06-05 00:51

Gemma 4 12B: The Hidden Reasoning Tax

Motivation: I recently acquired an RTX 5060 Ti 16GB for local LLM inference and wanted to find the best model for my use case: technical writing, code generation, and analysis in Chinese. Google's Gemma 4 12B seemed li…

dev.to — LLM tag TIER_1 English(EN) · pixelbank dev · 2026-06-01 23:10

Knowledge Distillation — Deep Dive + Problem: Roman to Integer

A daily deep dive into LLM topics, coding problems, and platform features from PixelBank (https://pixelbank.dev). Topic Deep Dive: Knowledge Distillation. From the Deployment & Optimization chapter…

r/MachineLearning TIER_1 English(EN) · /u/zdeneklapes · 2026-06-01 16:23

Finetuning a Reasoning LLM with Supervised or Reinforcement Learning? [D]

Hello, I have a task to fine-tune small LLMs on annotated conversational data. The dataset contains not only the final answers, but also reasoning traces and tool-calling decisions (i.e., when the model should think and when it should call…

dev.to — LLM tag TIER_1 English(EN) · Алексей Гормен · 2026-05-29 05:34

Your AI Has Two Brains: Fast Pattern Mode and the A11 Deep Reasoning Engine

In most tasks, a system relies on high-speed thinking driven by attention vectors; this is intuition. It is a fast, energy-efficient, pattern-oriented mode, which can be described as: Fast Pattern Heuristics …

r/MachineLearning TIER_1 English(EN) · /u/Sensitive_Air_5745 · 2026-05-26 15:33

Verbosity is not faithfulness: an architectural argument that reasoning models cannot perform faithful inference [D]

Essay argues that reasoning models cannot perform faithful inference because their reasoning trace and final answer come from the same operation. Engages with Lanham/Turpin/Mirzadeh in empirical critique, and with HRM, TRM, GRAM, AlphaProof, and …

COVERAGE [421]