PulseAugur

New Research Boosts LLM Reasoning Through Speculative Methods and Physical Insight

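Recent research explores new ways to improve the reasoning ability and efficiency of large language models (LLMs). Papers introduce techniques such as speculative exploration for Tree-of-Thought reasoning, which breaks a synchronization bottleneck and yields significant speedups. Other work improves tool-integrated reasoning by pruning erroneous tool calls at inference time, and develops robotics frameworks that reason about physics in latent space before executing physical actions. Further studies examine how effective reasoning protocols such as debate and voting are for LLMs, finding that some methods improve safety but do not always improve utility.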

Impact: New methods for efficient reasoning and tool integration could improve LLM performance and applicability on complex tasks.

Ranking rationale: Multiple arXiv papers and blog posts detail new research on LLM reasoning techniques and benchmarks.


AI-generated summary · Google Gemini · from 220 sources.


Sources [220]

  1. Hugging Face Blog TIER_1 English(EN) ·

    Apriel-H1: The Surprising Key to Distilling Efficient Reasoning Models

  2. Hugging Face Blog TIER_1 English(EN) ·

    Kimina-Prover: Applying Test-Time RL Search on Large Formal Reasoning Models

  3. Hugging Face Blog TIER_1 English(EN) ·

    DABStep: Data Agent Benchmark for Multi-Step Reasoning

  4. Hugging Face Blog TIER_1 English(EN) ·

    Welcome Llama 3 - Meta's New Open LLM

  5. Hugging Face Blog TIER_1 English(EN) ·

    NPHardEval Leaderboard: Unveiling the Reasoning Abilities of Large Language Models through Complexity Classes and Dynamic Updates

  6. arXiv cs.AI TIER_1 English(EN) · Jingbo Shang ·

    OpenDeepThink: Parallel Reasoning via Bradley-Terry Aggregation

    Test-time compute scaling is a primary axis for improving LLM reasoning. Existing methods primarily scale depth by extending a single reasoning trace. Scaling breadth by sampling multiple candidates in parallel is straightforward, but introduces a selection bottleneck: choosing t…

  7. arXiv cs.LG TIER_1 English(EN) · Bernd Finkbeiner ·

    Natural Synthesis: Large Reasoning Models Outperform Reactive Synthesis Tools

    Reactive synthesis, the problem of automatically constructing a hardware circuit from a logical specification, is a long-standing challenge in formal verification. It is elusive for two reasons: It is algorithmically hard, and writing formal specifications by hand is notoriously …

  8. Hugging Face Daily Papers TIER_1 English(EN) ·

    Unlocking Complex Visual Generation via Closed-Loop Verified Reasoning

    Despite rapid advancements, current text-to-image (T2I) models predominantly rely on a single-step generation paradigm, which struggles with complex semantics and faces diminishing returns from parameter scaling. While recent multi-step reasoning approaches show promise, they are…

  9. arXiv cs.AI TIER_1 English(EN) · Xiang Wang ·

    Scaling Retrieval-Augmented Reasoning with Parallel Search and Explicit Merging

    Deep search agents have proven effective in enhancing LLMs by retrieving external knowledge during multi-step reasoning. However, existing methods often generate a single query for retrieval at each reasoning step, limiting information coverage and introducing high noise. This ma…

  10. arXiv cs.CL TIER_1 English(EN) · Yu Cheng ·

    Achieving Gold-Medal-Level Olympiad Reasoning via Simple and Unified Scaling

    Recent progress in reasoning models has substantially advanced long-horizon mathematical and scientific problem solving, with several systems now reaching gold-medal-level performance on International Mathematical Olympiad (IMO) and International Physics Olympiad (IPhO) problems.…

  11. arXiv cs.CL TIER_1 English(EN) · Teerapong Panboonyuen ·

    GateKD: Confidence-Gated Closed-Loop Distillation for Robust Reasoning

    Distilling multi-step reasoning abilities from large language models (LLMs) into compact student models remains challenging due to noisy rationales, hallucinated supervision, and static teacher-student interactions. Existing reasoning distillation methods, including mentor-based …

  12. Hugging Face Daily Papers TIER_1 English(EN) ·

    GRACE: Gradient-Aligned Reasoning Data Curation for Efficient Fine-Tuning

    Existing reasoning data curation pipelines score whole samples, treating every intermediate step as equally valuable. In reality, steps within a trace contribute very unevenly, and selecting reasoning data well requires assessing them individually. We present GRACE, a gradient-al…

  13. arXiv cs.AI TIER_1 English(EN) · Paria Rashidinejad ·

    Solving the Loop: Attractor Models for Language and Reasoning

    Looped Transformers offer a promising alternative to purely feed-forward computation by iteratively refining latent representations, improving language modeling and reasoning. Yet recurrent architectures remain unstable to train, costly to optimize and deploy, and constrained to …

  14. arXiv cs.CL TIER_1 English(EN) · Jeany Son ·

    Hide to See: Reasoning-Prefix Masking for VLM Distillation with Visually Grounded Thinking

    Recent think-answer approaches in VLMs, such as Qwen3-VL-Thinking, boost reasoning performance by leveraging intermediate thinking steps before the final answer, but their high computational cost limits real-world deployment. To distill such capabilities into compact think-answer…

  15. arXiv cs.CL TIER_1 English(EN) · Jun Huang ·

    OmniThoughtVis: A Scalable Distillation Pipeline for Deployable Multimodal Reasoning Models

    Recent multimodal large language models (MLLMs) have shown strong chain-of-thought (CoT) reasoning ability on vision-language tasks, but their direct deployment in real-world systems is often limited by latency and resource constraints. In practice, smaller MLLMs are preferred fo…

  16. Hugging Face Daily Papers TIER_1 English(EN) ·

    The First Drop of Ink: The Nonlinear Impact of Misleading Information in Long-Context Reasoning

    As large language models are increasingly deployed in retrieval-augmented generation and agentic systems that accumulate extensive context, understanding how distracting information affects long-context performance becomes critical. Prior work shows that semantically relevant yet…

  17. arXiv cs.AI TIER_1 English(EN) · Kuan-Hao Huang ·

    The First Drop of Ink: The Nonlinear Impact of Misleading Information in Long-Context Reasoning

    As large language models are increasingly deployed in retrieval-augmented generation and agentic systems that accumulate extensive context, understanding how distracting information affects long-context performance becomes critical. Prior work shows that semantically relevant yet…

  18. arXiv cs.LG TIER_1 English(EN) · Meng Li ·

    Breaking the Reward Bottleneck: Accelerating Tree-of-Thought Reasoning via Speculative Exploration

    Tree-of-Thought (ToT) reasoning structures Large Language Model (LLM) inference as a tree-based search, demonstrating strong potential for solving complex mathematical and programming tasks. However, its efficiency is constrained by the reward dependency barrier -- a synchronizat…

  19. arXiv cs.CL TIER_1 English(EN) · Shuhao Zhang ·

    PruneTIR: Inference-Time Tool-Call Pruning for Effective and Efficient Tool-Integrated Reasoning

    Tool-integrated reasoning (TIR) enables large language models (LLMs) to enhance their capabilities by interacting with external tools, such as code interpreters (CI). Most recent studies focus on exploring various methods to equip LLMs with the ability to use tools. However, how …

  20. Hugging Face Daily Papers TIER_1 English(EN) ·

    PruneTIR: Inference-Time Tool-Call Pruning for Effective and Efficient Tool-Integrated Reasoning

    Tool-integrated reasoning (TIR) enables large language models (LLMs) to enhance their capabilities by interacting with external tools, such as code interpreters (CI). Most recent studies focus on exploring various methods to equip LLMs with the ability to use tools. However, how …

  21. arXiv cs.CL TIER_1 English(EN) · Hua Shen ·

    Pseudo-Reasoning in Language Models: When Reasoning Fails to Align Values with Actions

    Large language models (LLMs) are often evaluated based on their stated values, yet these do not reliably translate into their actions, a discrepancy termed "value-action gap." In this work, we argue that this gap persists even under explicit reasoning, revealing a deeper failure …

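  22. QbitAI (量子位) TIER_1 Chinese(ZH) · 思邈 ·

    The R1 Moment for Embodied Large Models: LIBERO Terminator and the New Physical-Reasoning Paradigm Behind 99.9%

    It has truly learned to "think physically" in latent space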

  23. arXiv cs.CL TIER_1 English(EN) · Kumar Lakshmipathi ·

    Statistical Scouting Finds Debate-Safe but Not Debate-Useful Cases: A Matched-Ceiling Study of Reasoning Protocols for Open-Weight LLMs

    When should a language model answer directly, sample and vote, or engage in multi-agent debate? Recent work shows voting often explains much of the gain attributed to debate, while selective-debate systems activate deliberation only on uncertain examples. We ask: under a matched …

  24. arXiv cs.CL TIER_1 English(EN) · Yue Zhao ·

    Hidden Error Awareness in Chain-of-Thought Reasoning: The Signal Is Diagnostic, Not Causal

    Chain-of-thought (CoT) prompting assumes that generated reasoning reflects a model's internal computation. We show this assumption is wrong in a specific, measurable way: models internally detect their own reasoning errors but outwardly express confidence in them. A linear probe …

  25. arXiv cs.CL TIER_1 English(EN) · Dajun Zhang ·

    Not All Thoughts Need HBM: A Semantics-Aware Memory Hierarchy for LLM Reasoning

    Reasoning LLMs produce thousands of chain-of-thought tokens whose KV cache must reside in scarce GPU HBM. The dominant response -- permanently evicting low-importance tokens -- is catastrophic for reasoning: accuracy collapses to 0-2.5% when half the cache is removed. We ask a di…

  26. arXiv cs.AI TIER_1 English(EN) · Dan O'Malley ·

    Rubric-Grounded RL: Structured Judge Rewards for Generalizable Reasoning

    We argue that decomposing reward into weighted, verifiable criteria and using an LLM judge to score them provides a partial-credit optimization signal: instead of a binary outcome or a single holistic score, each response is graded along multiple task-specific criteria. We formal…

  27. arXiv cs.AI TIER_1 English(EN) · Mark Coates ·

    Abductive Reasoning with Probabilistic Commonsense

    Recent efforts to improve the reasoning abilities of Large Language Models (LLMs) have focused on integrating formal logic solvers within neurosymbolic frameworks. A key challenge is that formal solvers lack commonsense world knowledge, preventing them from making reasoning steps…

  28. arXiv cs.AI TIER_1 English(EN) · Jing Tang ·

    Prune-OPD: Efficient and Reliable On-Policy Distillation for Long-Horizon Reasoning

    On-policy distillation (OPD) leverages dense teacher rewards to enhance reasoning models. However, scaling OPD to long-horizon tasks exposes a critical flaw: as the student's generated prefix inevitably diverges from the teacher's thought process, the teacher's dense reward loses…

  29. arXiv cs.AI TIER_1 English(EN) · Jes Frellsen ·

    Tracing Uncertainty in Language Model "Reasoning"

    Language model (LM) "reasoning", commonly described as Chain-of-Thought or test-time scaling, often improves benchmark performance, but the dynamics underlying this process remain poorly understood. We study these dynamics through the lens of uncertainty quantification by treatin…

  30. arXiv cs.CL TIER_1 English(EN) · Yunfang Wu ·

    Not All Tokens Learn Alike: Attention Entropy Reveals Heterogeneous Signals in RL Reasoning

    Reinforcement-learning-based post-training has become a key approach for improving the reasoning ability of large language models, but its token-level learning signals remain poorly understood. This work studies their heterogeneity through attention entropy, which measures how co…

  31. arXiv cs.CL TIER_1 English(EN) · Junpei Komiyama ·

    Reliable Chain-of-Thought via Prefix Consistency

    Large Language Models often improve accuracy on reasoning tasks by sampling multiple Chain-of-Thought (CoT) traces and aggregating them with majority voting (MV), a test-time technique called self-consistency. When we truncate a CoT partway through and regenerate the remainder, w…

  32. arXiv cs.CL TIER_1 English(EN) · Yujiu Yang ·

    Think-with-Rubrics: From External Evaluator to Internal Reasoning Guidance

    Rubrics have been extensively utilized for evaluating unverifiable, open-ended tasks, with recent research incorporating them into reward systems for reinforcement learning. However, existing frameworks typically treat rubrics only as external evaluator disjointed from the policy…

  33. arXiv cs.CL TIER_1 English(EN) · Junnan Zhu ·

    LaTER: Efficient Test-Time Reasoning via Latent Exploration and Explicit Verification

    Chain-of-thought (CoT) reasoning improves large language models (LLMs) on difficult tasks, but it also makes inference expensive because every intermediate step must be generated as a discrete token. Latent reasoning reduces visible token generation by propagating continuous stat…

  34. arXiv cs.CL TIER_1 English(EN) · Hung-yi Lee ·

    Rethinking Dense Sequential Chains: Reasoning Language Models Can Extract Answers from Sparse, Order-Shuffled Chain-of-Thought

    Modern reasoning language models generate dense, sequential chain-of-thought traces implicitly assuming that every token contributes and that steps must be consumed in order. We challenge both assumptions through a systematic intervention pipeline--removal, masking, shuffling, an…

  35. arXiv cs.CL TIER_1 English(EN) · Fan Huang ·

    ReFlect: An Effective Constraint System for Complex Long-Horizon LLM Reasoning

    arXiv:2605.05737v1. Current reasoning paradigms for LLMs include chain-of-thought, ReAct, and post-hoc self-critique. These paradigms rely on two assumptions that fail on long-horizon, multi-stage tasks. As a result, errors accumulate silently across…

  36. arXiv cs.AI TIER_1 English(EN) · Marc Boubnovski Martell, Josefa Lia Stoisser, Kaspar Märtens, Jialin Yu, Robert Kitchen, Philip Torr, Jesper Ferkinghoff-Borg ·

    Measuring Black-Box Confidence via Reasoning Trajectories: Geometry, Coverage, and Verbalization

    arXiv:2605.06308v1. Reliable confidence estimation enables safe deployment of chain-of-thought (CoT) reasoning through text-only APIs. Yet the dominant black-box baseline, self-consistency over K samples, is linearly expensive and ignores the geometry …

  37. arXiv cs.AI TIER_1 English(EN) · Richmond Sin Jing Xuan, Rishabh Bhardwaj, Soujanya Poria ·

    Post-Reasoning: Improving Non-Thinking Model Performance at Zero Cost

    arXiv:2605.06165v1. As the widespread adoption of Large Language Models (LLMs) accelerates, token consumption from intermediate reasoning traces increasingly contributes to inference latency and operational cost. Recent studies suggest that many real-w…

  38. arXiv cs.AI TIER_1 English(EN) · Xiaomin Li, Jianheng Hou, Zheyuan Deng, Zhiwei Zhang, Taoran Li, Binghang Lu, Bing Hu, Yunhan Zhao, Yuexing Hao ·

    Chain of Risk: Safety Failures in Large Reasoning Models and Mitigation via Adaptive Multi-Principle Steering

    arXiv:2605.05678v1. Large reasoning models (LRMs) increasingly expose chain-of-thought-like reasoning for transparency, verification, and deliberate problem solving. This creates a safety blind spot: harmful or policy-violating content may appear in re…

  39. arXiv cs.AI TIER_1 English(EN) · Sai Babu Patarlapalli, Surya Teja Avvaru ·

    BitCal-TTS: Bit-Calibrated Test-Time Scaling for Quantized Reasoning Models

    arXiv:2605.05561v1. Post-training quantization makes large reasoning models practical under tight memory and latency budgets, but it can distort the online signals that drive adaptive test-time compute allocation. Under a fixed cap on the number of new…

  40. arXiv cs.AI TIER_1 English(EN) · Mohamed Salim Aissi, Clemence Grislain, Clement Romac, Laure Soulier, Mohamed Chetouani, Olivier Sigaud, Nicolas Thome ·

    PRISM: Perception-Reasoning Interleaving for Sequential Decision Making

    arXiv:2605.05407v1. Scaling LLM-based embodied agents from text-only environments to complex multimodal settings remains a major challenge. Recent work identifies a perception-reasoning-decision gap in standalone Vision-Language Models (VLMs), which of…

  41. arXiv cs.CL TIER_1 English(EN) · David Wan, Han Wang, Ziyang Wang, Elias Stengel-Eskin, Hyunji Lee, Mohit Bansal ·

    Multimodal Fact-Level Attribution for Verifiable Reasoning

    arXiv:2602.11509v2. Multimodal large language models (MLLMs) are increasingly used for real-world tasks involving multi-step reasoning and long-form generation, where reliability requires grounding model outputs in heterogeneous input sources and v…

  42. arXiv cs.CL TIER_1 English(EN) · Tianle Wang, Zhaoyang Wang, Guangchen Lan, Xinpeng Wei, Sipeng Zhang, Guanwen Qiu, Abulhair Saparov ·

    Can Reinforcement Learning Teach Large Language Models Long-Horizon Reasoning? Expressiveness Is Key

    arXiv:2605.06638v1. Reinforcement learning (RL) has been applied to improve large language model (LLM) reasoning, yet the systematic study of how training scales with task difficulty has been hampered by the lack of controlled, scalable environments.…

  43. arXiv cs.CL TIER_1 English(EN) · Jaehoon Kim, Dongha Lee ·

    OPSD Compresses What RLVR Learns: A Post-RL Compression Stage for Reasoning Models

    arXiv:2605.06188v1. On-Policy Self-Distillation (OPSD) has recently emerged as an alternative to Reinforcement Learning with Verifiable Rewards (RLVR), promising higher accuracy and shorter responses through token-level credit assignment from a self-…

  44. arXiv cs.CL TIER_1 English(EN) · Qianjia Cheng, Yuchen Zhang, Zhilin Wang, Yuxin Zuo, Shunkai Zhang, Yuchen Fan, Yu Qiao, Bowen Zhou, Ning Ding, Yu Cheng, Yun Luo, Ganqu Cui ·

    Teaching Thinking Models to Reason with Tools: A Full-Pipeline Recipe for Tool-Integrated Reasoning

    arXiv:2605.06326v1. Tool-integrated reasoning (TIR) offers a direct way to extend thinking models beyond the limits of text-only reasoning. Paradoxically, we observe that tool-enabled evaluation can degrade reasoning performance even when the strong th…

  45. arXiv cs.CL TIER_1 English(EN) · Xinyu Wang, Changzhi Sun, Lian Cheng, Yuanbin Wu, Dell Zhang, Xiaoling Wang, Xuelong Li ·

    Logic-Regularized Verifiers Elicit Reasoning from Large Language Models

    arXiv:2605.05893v1. Verifiers are crucial components for enhancing modern LLMs' reasoning capability. Typical verifiers require resource-intensive supervised dataset construction, which is costly and faces limitations in data diversity. In this paper, we p…

  46. arXiv cs.CL TIER_1 English(EN) · Nicole Lincoln, Nick Whitehouse, Jaron Mar, Rivindu Perera ·

    Select Clauses: Comparing Large Language Models with a Domain-Trained Small Language Model on Structured Contract Extraction

    arXiv:2605.05532v1. This paper evaluates whether a domain trained Small Language Model (SLM) can outperform frontier Large Language Models on structured contract extraction at radically lower cost. We test Olava Extract, a self hosted legal domain Mixt…

  47. arXiv cs.LG TIER_1 English(EN) · Zijun Gao, Zhikun Xu, Xiao Ye, Ben Zhou ·

    CORE: Concept-Oriented Reinforcement Learning for Bridging the Definition-Application Gap in Mathematical Reasoning

    arXiv:2512.18857v3. Large language models (LLMs) often solve challenging math exercises yet fail to apply the concept right when the problem requires genuine understanding. Popular Reinforcement Learning with Verifiable Rewards (RLVR) pipelin…

  48. arXiv cs.LG TIER_1 English(EN) · Anna Kuzina, Maciej Pioro, Paul N. Whatmough, Babak Ehteshami Bejnordi ·

    KaVa: Latent Reasoning via Compressed KV-Cache Distillation

    arXiv:2510.02312v2. Large Language Models (LLMs) excel at multi-step reasoning problems with explicit chain-of-thought (CoT), but verbose traces incur significant computational costs and memory overhead, and often carry redundant, stylistic artifac…

  49. arXiv cs.LG TIER_1 English(EN) · Ivan Rodkin, Daniil Orel, Konstantin Smirnov, Arman Bolatov, Bilal Elbouardi, Besher Hassan, Yuri Kuratov, Aydar Bulatov, Preslav Nakov, Timothy Baldwin, Artem Shelmanov, Mikhail Burtsev ·

    Beyond Memorization: Extending Reasoning Depth with Recurrence, Memory and Test-Time Compute

    arXiv:2508.16745v3. Reasoning is a core capability of large language models, yet how multi-step reasoning is learned and executed remains unclear. We study this question in a controlled cellular-automata (1dCA) framework that excludes memorisation …

  50. arXiv cs.LG TIER_1 English(EN) · Langlin Huang, Chengsong Huang, Jinyuan Li, Donghong Cai, Yuyi Yang, Jiaxin Huang ·

    Nonsense Helps: Prompt-Space Perturbation Broadens Reasoning Exploration

    arXiv:2605.05566v1. Reinforcement learning with verifiable rewards, particularly Group Relative Policy Optimization (GRPO), has significantly advanced the reasoning capabilities of Large Language Models (LLMs). However, in complex tasks, GRPO frequen…

  51. arXiv cs.LG TIER_1 English(EN) · Aymen Echarghaoui, Dongxia Wu, Emily B. Fox ·

    BALAR: A Bayesian Agentic Loop for Active Reasoning

    arXiv:2605.05386v1. Large language models increasingly operate in interactive settings where solving a task requires multiple rounds of information exchange with a user. However, most current systems treat dialogue reactively and lack a principled me…

  52. arXiv cs.LG TIER_1 English(EN) · Yuhang Lai, Jiazhan Feng, Yee Whye Teh, Ning Miao ·

    Verifier-Backed Hard Problem Generation for Mathematical Reasoning

    arXiv:2605.06660v1. Large Language Models (LLMs) demonstrate strong capabilities for solving scientific and mathematical problems, yet they struggle to produce valid, challenging, and novel problems - an essential component for advancing LLM training a…

  53. arXiv cs.LG TIER_1 English(EN) · William T. Redman, Erik C. Johnson, Brian Robinson ·

    Shortcut Solutions Learned by Transformers Impair Continual Compositional Reasoning

    arXiv:2605.05495v1. Identifying and exploiting common features across domains is at the heart of the human ability to make analogies, and is believed to be crucial for the ability to continually learn. To do this successfully, general and flexible comp…

  54. arXiv cs.LG TIER_1 English(EN) · Pratik Deshmukh, Atirek Gupta ·

    On a Semantic-Loss Fine-Tuning Approach to Prevent Model Collapse in Causal Reasoning

    arXiv:2605.05438v1. Standard fine-tuning of transformer models on causal reasoning tasks leads to catastrophic model collapse, where models learn trivial solutions such as always predicting "Yes" or "No" regardless of input structure. We demonstrate th…

  55. arXiv cs.AI TIER_1 English(EN) · Ning Miao ·

    Verifier-Backed Hard Problem Generation for Mathematical Reasoning

    Large Language Models (LLMs) demonstrate strong capabilities for solving scientific and mathematical problems, yet they struggle to produce valid, challenging, and novel problems - an essential component for advancing LLM training and enabling autonomous scientific research. Exis…

  56. arXiv cs.AI TIER_1 English(EN) · Abulhair Saparov ·

    Can Reinforcement Learning Teach Large Language Models Long-Horizon Reasoning? Expressiveness Is Key

    Reinforcement learning (RL) has been applied to improve large language model (LLM) reasoning, yet the systematic study of how training scales with task difficulty has been hampered by the lack of controlled, scalable environments. We introduce ScaleLogic, a synthetic logical reas…

  57. arXiv cs.CL TIER_1 English(EN) · Ganqu Cui ·

    Teaching Thinking Models to Reason with Tools: A Full-Pipeline Recipe for Tool-Integrated Reasoning

    Tool-integrated reasoning (TIR) offers a direct way to extend thinking models beyond the limits of text-only reasoning. Paradoxically, we observe that tool-enabled evaluation can degrade reasoning performance even when the strong thinking models make almost no actual tool calls. …

  58. arXiv cs.CL TIER_1 English(EN) · Dongha Lee ·

    OPSD Compresses What RLVR Learns: A Post-RL Compression Stage for Reasoning Models

    On-Policy Self-Distillation (OPSD) has recently emerged as an alternative to Reinforcement Learning with Verifiable Rewards (RLVR), promising higher accuracy and shorter responses through token-level credit assignment from a self-teacher conditioned on privileged context. However…

  59. arXiv cs.CL TIER_1 English(EN) · Xuelong Li ·

    Logic-Regularized Verifiers Elicit Reasoning from Large Language Models

    Verifiers are crucial components for enhancing modern LLMs' reasoning capability. Typical verifiers require resource-intensive supervised dataset construction, which is costly and faces limitations in data diversity. In this paper, we propose LOVER, an unsupervised verifier regulariz…

  60. arXiv cs.LG TIER_1 English(EN) · Ole-Christoffer Granmo, Youmna Abdelwahab, Per-Arne Andersen, Karl Audun K. Borgersen, Paul F. A. Clarke, Kunal Dumbre, Ylva Grønningsæter, Vojtech Halenka, Runar Helin, Lei Jiao, Ahmed Khalid, Rebekka Omslandseter, Rupsa Saha, Mayur Shende, Xuan Z ·

    The Tsetlin Machine Goes Deep: Logical Learning and Reasoning with Graphs

    arXiv:2507.14874v2. Pattern recognition with concise and flat AND-rules makes the Tsetlin Machine (TM) both interpretable and efficient, while the power of Tsetlin automata enables accuracy comparable to deep learning on an increasing number of dat…

  61. arXiv cs.LG TIER_1 English(EN) · Igor Rivin ·

    Probing Structural Mathematical Reasoning in Language Models with Algebraic Traps

    arXiv:2605.04352v1. We introduce a benchmark suite for evaluating structural mathematical reasoning in language models, built on subgroup-construction problems in SL(3, Z) with cryptographic-style verifier-prover asymmetry. Each instance presents a fin…

  62. arXiv cs.LG TIER_1 English(EN) · Khouloud Saadi, Di Wang ·

    Validity-Calibrated Reasoning Distillation

    arXiv:2605.04078v1. Reasoning distillation aims to transfer multi-step reasoning capabilities from large language models to smaller, more efficient ones. While recent methods have shown promising gains, they typically rely on static teacher-student hie…

  63. arXiv cs.CL TIER_1 English(EN) · Yuquan Wang, Mi Zhang, Yining Wang, Geng Hong, Mi Wen, Xiaoyu You, Min Yang ·

    ReasoningGuard: Safeguarding Large Reasoning Models with Inference-Time Safety "Aha" Moments

    arXiv:2508.04204v2. Large Reasoning Models (LRMs) have demonstrated impressive performance in reasoning-intensive tasks, but they remain vulnerable to harmful content generation, particularly in the mid-to-late steps of their reasoning processes. C…

  64. arXiv cs.AI TIER_1 English(EN) · Yunjian Zhang, Sudong Wang, Yang Li, Peiran Xu, Conghao Zhou, Xiaoyue Ma, Jianing Li, Yao Zhu ·

    Resource-Efficient Reasoning LLMs via Dynamic One-Shot Policy Refinement

    arXiv:2602.00815v2. Large language models (LLMs) have exhibited remarkable performance on complex reasoning tasks, with reinforcement learning under verifiable rewards (RLVR) emerging as a principled framework for aligning model behavior with reaso…

  65. arXiv cs.AI TIER_1 English(EN) · Ryan Lucas, Kayhan Behdin, Zhipeng Wang, Qingquan Song, Shao Tang, Rahul Mazumder ·

    Reasoning Models Can Be Accurately Pruned via Chain-of-Thought Reconstruction

    arXiv:2509.12464v2. Reasoning language models such as DeepSeek-R1 produce long chain-of-thought traces during inference time which make them costly to deploy at scale. We show that using compression techniques such as neural network pruning produce…

  66. arXiv cs.AI TIER_1 English(EN) · Anselm Haak, Patrick Koopmann, Yasir Mahmood, Anni-Yasmin Turhan ·

    ABox Abduction for Inconsistent Knowledge Bases under Repair Semantics

    arXiv:2605.01341v1. Given a knowledge base (KB) with a non-entailed fact, the ABox abduction problem asks for possible extensions of the KB that would entail this fact. This problem has many applications, ranging from diagnosis to explainability and …

  67. arXiv cs.AI TIER_1 English(EN) · Kei Nishimura-Gasparian, Robert McCarthy, David Lindner ·

    Towards Understanding Specification Gaming in Reasoning Models

    arXiv:2605.02269v1. Specification gaming is a critical failure mode of LLM agents. Despite this, there has been little systematic research into when it arises and what drives it. To address this, we build and open source a diverse suite of tasks where …

  68. arXiv cs.AI TIER_1 English(EN) · Eric H. C. Chow ·

    Retrieval and Multi-Hop Reasoning in 1M-Token Context Windows: Evaluating LLMs on Classical Chinese Texts

    arXiv:2605.02173v1. We evaluate the long-context retrieval and reasoning capabilities of five frontier large language models with advertised 1M-token context windows on a classical Chinese corpus. Two complementary studies are reported. Test 1 measures…

  69. arXiv cs.AI TIER_1 English(EN) · Xinyan Jiang, Ninghao Liu, Di Wang, Lijie Hu ·

    Beyond Scalars: Evaluating and Understanding LLM Reasoning via Geometric Progress and Stability

    arXiv:2603.10384v2. Evaluating LLM reliability via scalar probabilities often fails to capture the structural dynamics of reasoning. We introduce TRACED, a framework that assesses reasoning quality through theoretically grounded geometric kinematic…

  70. arXiv cs.AI TIER_1 English(EN) · Caijun Xu, Changyi Xiao, Zhongyuan Peng, Xinrun Wang, Yixin Cao ·

    SCALER: Synthetic Scalable Adaptive Learning Environment for Reasoning

    arXiv:2601.04809v5. Reinforcement learning (RL) offers a principled way to enhance the reasoning capabilities of large language models, yet its effectiveness hinges on training signals that remain informative as models evolve. In practice, RL progr…

  71. arXiv cs.CL TIER_1 English(EN) · Negar Arabzadeh, Wenjie Ma, Sewon Min, Matei Zaharia ·

    RAG over Thinking Traces Can Improve Performance on Reasoning Tasks

    arXiv:2605.03344v1. Retrieval-augmented generation (RAG) has proven effective for knowledge-intensive tasks, but is widely believed to offer limited benefit for reasoning-intensive problems such as math and code generation. We challenge this assumpti…

  72. arXiv cs.CL TIER_1 English(EN) · Daniel Drucker, Kyle Mahowald ·

    The Counterexample Game: Iterated Conceptual Analysis and Repair in Language Models

    arXiv:2605.03936v1. Conceptual analysis -- proposing definitions and refining them through counterexamples -- is central to philosophical methodology. We study whether language models can perform this task through iterated analysis and repair chains: o…

  73. arXiv cs.CL TIER_1 English(EN) · Jiaqi Wei, Xuehang Guo, Pengfei Yu, Xiang Zhang, Wanli Ouyang, Siqi Sun, Qingyun Wang, Chenyu You ·

    When to Think, When to Speak: Learning Disclosure Policies for LLM Reasoning

    arXiv:2605.03314v1. In single-stream autoregressive interfaces, the same tokens both update the model state and constitute an irreversible public commitment. This coupling creates a silence tax: additional deliberation postpones the first …

  74. arXiv cs.CL TIER_1 English(EN) · Rose Sathyanathan, Kinshuk Vasisht, Danish Pruthi ·

    Evaluating Reasoning Models on Questions with Presuppositions

    arXiv:2605.03050v1. Millions of users turn to AI models for their information needs. It is conceivable that a large number of user queries contain assumptions that may be factually inaccurate. Prior work notes that large language models (LLMs) often fa…

  75. arXiv cs.LG TIER_1 English(EN) · Manuel Vargas Guzmán, Jakub Szymanik, Maciej Malicki ·

    Hybrid Models for Natural Language Reasoning: The Case of Syllogistic Logic

    arXiv:2510.09472v2. Despite the remarkable progress in neural models, their ability to generalize, a cornerstone for applications such as logical reasoning, remains a critical challenge. We delineate two fundamental aspects of this ability: c…

  76. arXiv cs.AI TIER_1 English(EN) · Ifdita Hasan Orney, Jubayer Ibn Hamid, Shreya S Ramanujam, Shirley Wu, Hengyuan Hu, Noah Goodman, Dorsa Sadigh, Chelsea Finn ·

    Poly-EPO: Training Exploratory Reasoning Models

    arXiv:2604.17654v3. Exploration is a cornerstone of learning from experience: it enables agents to find solutions to complex problems, generalize to novel ones, and scale performance with test-time compute. In this paper, we present a framework for…

  77. arXiv cs.AI TIER_1 English(EN) · Jianan Chen, Zhifang Zhang, Shuo He, Linan Yue, Lei Feng, Minling Zhang ·

    Towards Safer Large Reasoning Models by Promoting Safety Decision-Making before Chain-of-Thought Generation

    arXiv:2603.17368v2. Large reasoning models (LRMs) achieved remarkable performance via chain-of-thought (CoT), but recent studies showed that such enhanced reasoning capabilities are at the expense of significantly degraded safety capabilities. In t…

  78. arXiv cs.CL TIER_1 English(EN) · Kyle Mahowald ·

    The Counterexample Game: Iterated Conceptual Analysis and Repair in Language Models

    Conceptual analysis -- proposing definitions and refining them through counterexamples -- is central to philosophical methodology. We study whether language models can perform this task through iterated analysis and repair chains: one model instance generates counterexamples to a…

  79. arXiv cs.CL TIER_1 English(EN) · Matei Zaharia ·

    RAG over Thinking Traces Can Improve Performance on Reasoning Tasks

    Retrieval-augmented generation (RAG) has proven effective for knowledge-intensive tasks, but is widely believed to offer limited benefit for reasoning-intensive problems such as math and code generation. We challenge this assumption by showing that the limitation lies not in RAG …

  80. arXiv cs.CL TIER_1 English(EN) · Yongrui Chen, Yangyang Ma, Xiaoying Huang, Shenyu Zhang, Huajun Chen, Haofen Wang, Guilin Qi ·

    StressEval: Failure-Driven Dynamic Benchmarking of Large Language Models for Knowledge-Intensive Reasoning

    arXiv:2605.01939v1. Static benchmarks for LLMs are increasingly compromised by contamination and overfitting, especially on knowledge-intensive reasoning tasks. While recent dynamic benchmarks can alleviate staleness, they often increase difficulty at the…

  81. arXiv cs.AI TIER_1 English(EN) · Yiyang Wei, Tingyu Song, Siyue Zhang, Yilun Zhao ·

    A Survey of Reasoning-Intensive Retrieval: Progress and Challenges

    arXiv:2605.00063v1. Reasoning-Intensive Retrieval (RIR) targets retrieval settings where relevance is mediated by latent inferential links between a query and supporting evidence, rather than semantic similarity. Motivated by the emergent reasoning a…

  82. arXiv cs.AI TIER_1 English(EN) · Linhao Luo, Zicheng Zhao, Junnan Liu, Zhangchi Qiu, Junnan Dong, Serge Panev, Chen Gong, Thuy-Trang Vu, Gholamreza Haffari, Dinh Phung, Alan Wee-Chung Liew, Shirui Pan ·

    G-reasoner: Foundation Models for Unified Reasoning over Graph-Structured Knowledge

    arXiv:2509.24276v4. Large language models (LLMs) excel at complex reasoning but remain limited by static and incomplete parametric knowledge. Retrieval-augmented generation (RAG) mitigates this by incorporating external knowledge, yet existing RAGs…

  83. arXiv cs.AI TIER_1 English(EN) · Henry Han, Xiyang Liu, Xiaodong Wang, Fei Han, Xiaodong Li ·

    The Quantization Trap: Breaking Linear Scaling Laws in Multi-Hop Reasoning

    arXiv:2602.13595v2 Announce Type: replace Abstract: Neural scaling laws provide a predictable recipe for AI advancement: reducing numerical precision should linearly improve computational efficiency and energy profile ($E \propto \mathrm{bits}$). In this paper, we demonstrate tha…

  84. arXiv cs.CL TIER_1 (AF) · Sangkwon Park, Donghun Kang, Jisoo Mok, Sungroh Yoon ·

    Verbal-R3: Verbal Reranker as the Missing Bridge Between Retrieval and Reasoning

    arXiv:2605.01399v1 Announce Type: new Abstract: The conventional Retrieval-Augmented Generation (RAG) paradigm of injecting raw retrieved texts into the Large Language Model (LLM)'s context often results in suboptimal integration of retrieved information. This paper proposes to b…

  85. arXiv cs.CL TIER_1 English(EN) · Zebin Guo, Weidong Geng, Ruichen Mao ·

    FT-RAG: A Fine-Grained Retrieval-Augmented Generation Framework for Complex Table Reasoning

    arXiv:2605.01495v1 Announce Type: new Abstract: Retrieval-Augmented Generation (RAG) enhances Large Language Models (LLMs) by grounding responses in external knowledge during inference. However, conventional RAG systems under-perform on structured tabular data, largely due to coar…

  86. arXiv cs.CL TIER_1 English(EN) · Kwan Soo Shin ·

    The Reasoning Trap: Information-Theoretic Bounds on Closed-System Multi-Step LLM Reasoning

    arXiv:2605.01704v1 Announce Type: new Abstract: When copies of the same language model are prompted to debate, they produce diverse phrasings of one perspective rather than diverse perspectives. Multi-agent debate (MAD), and more broadly closed-system reasoning where agents itera…

  87. arXiv cs.CL TIER_1 English(EN) · Nikolaos Giarelis, Charalampos Mastrokostas, Nikos Karacapilidis ·

    Maistros: A Greek Large Language Model Adapted through Knowledge Distillation from Large Reasoning Models

    arXiv:2605.01870v1 Announce Type: new Abstract: Large Language Models (LLMs) have substantially advanced the field of Natural Language Processing (NLP), achieving state-of-the-art performance across a wide range of tasks. These improvements have been attributed, in part, to their…

  88. arXiv cs.CL TIER_1 English(EN) · Yilei Chen, Sharut Gupta, Yannis Paschalidis, Ayush Sekhari, Aldo Pacchiano ·

    Less Is More: Efficient Reasoning through Collaborative Inference

    arXiv:2605.01111v1 Announce Type: cross Abstract: In this work, we introduce DUET (Dual-model Efficient Two-stage inference), a collaborative inference framework in which a capable model and a lightweight model work together to solve a task. Relying on a single large model to per…

  89. arXiv cs.CL TIER_1 English(EN) · Munachiso Samuel Nwadike, Zangir Iklassov, Kareem Ali, Rifo Genadi, Kentaro Inui ·

    Measuring AI Reasoning: A Guide for Researchers

    arXiv:2605.02442v1 Announce Type: cross Abstract: In this paper, we offer a guide for researchers on evaluating reasoning in language models, building the case that reasoning should be assessed through evidence of adaptive, multi-step search rather than final-answer accuracy alon…

  90. arXiv cs.CL TIER_1 English(EN) · Tairan Fu, Javier Conde, Gonzalo Martínez, María Grandury, Pedro Reviriego ·

    Multiple Choice Questions: Reasoning Makes Large Language Models (LLMs) More Confident, Especially When They Are Wrong

    arXiv:2501.09775v3 Announce Type: replace Abstract: Multiple Choice Question (MCQ) tests are among the most used methods for evaluating large language models (LLMs). Besides checking the correctness of the selected answer, evaluations often consider the model's confidence through…

  91. arXiv cs.CL TIER_1 English(EN) · Xuan Shen, Yizhou Wang, Yufa Zhou, Xiangxi Shi, Pu Zhao, Yanzhi Wang, Jiuxiang Gu ·

    Efficient Reasoning with Hidden Thinking

    arXiv:2501.19201v2 Announce Type: replace Abstract: Chain-of-Thought (CoT) reasoning has become a powerful framework for improving complex problem-solving capabilities in Multimodal Large Language Models (MLLMs). However, the verbose nature of textual reasoning introduces signifi…

  92. arXiv cs.CL TIER_1 English(EN) · Ren Zhuang ·

    Adaptive GoGI-Skip: Combining Goal-Gradient Importance with Dynamic Uncertainty for Efficient Reasoning

    arXiv:2505.08392v3 Announce Type: replace Abstract: Chain-of-Thought (CoT) prompting trades inference speed for reasoning accuracy. Existing compressors force a compromise as static gradient techniques treat tokens independently, severing sequential logic, while uncertainty-based…

  93. arXiv cs.CL TIER_1 English(EN) · Shanglin Wu, Lihui Liu, Jinho D. Choi, Kai Shu ·

    Improving LLM Factuality through Inference-Time Knowledge Graph Construction

    arXiv:2509.03540v3 Announce Type: replace Abstract: Large Language Models (LLMs) often struggle with producing factually consistent answers due to limitations in their parametric memory. Retrieval-Augmented Generation (RAG) paradigms mitigate this issue by incorporating external …

  94. arXiv cs.CL TIER_1 English(EN) · Qiuyu Tian, Zequn Liu, Yiding Li, Fengyi Chen, Zequn Liu, Youyong Kong, Fan Guo, Yuyao Li, Jinjing Shen, Zhijing Xie, Yiyun Luo, Xin Zhang, Yingce Xia ·

    STAGE: A Full-Screenplay Benchmark for Reasoning over Evolving Stories

    arXiv:2601.08510v3 Announce Type: replace Abstract: Movie screenplays are rich long-form narratives that interleave complex character relationships, temporally ordered events, and dialogue-driven interactions. While prior benchmarks target individual subtasks such as question ans…

  95. arXiv cs.CL TIER_1 English(EN) · Vikash Singh, Darion Cassel, Nathaniel Weir, Nick Feng, Sam Bayless ·

    VERGE: A Formal Refinement and Guidance Engine for Verifiable LLM Reasoning

    arXiv:2601.20055v2 Announce Type: replace Abstract: Despite the syntactic fluency of Large Language Models (LLMs), ensuring their logical correctness in high-stakes domains remains a fundamental challenge. We present a neurosymbolic framework that combines LLMs with SMT solvers t…

  96. arXiv cs.CL TIER_1 English(EN) · Linjuan Wu, Haoran Wei, Jialong Tang, Shuang Luo, Baosong Yang, Yongliang Shen, Weiming Lu ·

    Language as a Latent Variable for Reasoning Optimization

    arXiv:2604.21593v2 Announce Type: replace Abstract: As LLMs reduce English-centric bias, a surprising trend emerges: non-English responses sometimes outperform English on reasoning tasks. We hypothesize that language functions as a latent variable that structurally modulates the …

  97. arXiv cs.CL TIER_1 English(EN) · Susmit Das ·

    TIME: A Temporally Intelligent Meta-Reasoning Engine for Context-Triggered Explicit Reasoning

    arXiv:2601.05300v2 Announce Type: replace-cross Abstract: Reasoning-oriented language models typically expose explicit reasoning as a long, front-loaded chain of "thinking" tokens before the main output, either always enabled or externally toggled at inference time. Although this…

  98. arXiv cs.LG TIER_1 English(EN) · Akash Bonagiri, Gerard Janno Anderias, Saee Patil, Angelina Lai, Devang Borkar, Gezheng Kang, Ishant Gandhi, Setareh Rafatirad, Houman Homayoun ·

    STABLEVAL: Disagreement-Aware and Stable Evaluation of AI Systems

    arXiv:2605.02122v1 Announce Type: new Abstract: Human evaluation remains the primary standard for assessing modern AI systems, yet annotator disagreement, bias, and variability make system rankings fragile under standard majority vote aggregation. Majority vote discards annotator…

  99. arXiv cs.LG TIER_1 English(EN) · Simone Papicchio, Simone Rossi, Luca Cagliero, Paolo Papotti ·

    Think2SQL: Enhancing the Text2SQL Reasoning Capabilities of LLMs

    arXiv:2504.15077v5 Announce Type: replace Abstract: Large Language Models (LLMs) can translate natural language into SQL, but small models struggle with multi-table and complex queries in Zero-Shot Learning (ZSL) settings. While Supervised Fine-Tuning (SFT) helps, it falls short …

  100. arXiv cs.LG TIER_1 English(EN) · Lucas Dionisopoulos, Nicklas Majamaki, Prithviraj Ammanabrolu ·

    How Reasoning Evolves from Post-Training Data: An Empirical Study Using Chess

    arXiv:2604.05134v2 Announce Type: replace Abstract: We study how reasoning evolves in a language model -- from supervised fine-tuning (SFT) to reinforcement learning (RL) -- by analyzing how a set of theoretically-inspired datasets influences language model performance in chess. …

  101. arXiv cs.CL TIER_1 English(EN) · Chenyu You ·

    When to Think, When to Speak: Learning Disclosure Policies for LLM Reasoning

    In single-stream autoregressive interfaces, the same tokens both update the model state and constitute an irreversible public commitment. This coupling creates a silence tax: additional deliberation postpones the first task-relevant content, while naive early stream…

  102. arXiv cs.CL TIER_1 English(EN) · Danish Pruthi ·

    Evaluating Reasoning Models on Questions with Presuppositions

    Millions of users turn to AI models for their information needs. It is conceivable that a large number of user queries contain assumptions that may be factually inaccurate. Prior work notes that large language models (LLMs) often fail to challenge such erroneous assumptions, and …

  103. arXiv cs.CL TIER_1 English(EN) · Kentaro Inui ·

    Measuring AI Reasoning: A Guide for Researchers

    In this paper, we offer a guide for researchers on evaluating reasoning in language models, building the case that reasoning should be assessed through evidence of adaptive, multi-step search rather than final-answer accuracy alone. Under an evaluation-oriented definition, reason…

  104. Hugging Face Daily Papers TIER_1 English(EN) ·

    Towards Understanding Specification Gaming in Reasoning Models

    Specification gaming is a critical failure mode of LLM agents. Despite this, there has been little systematic research into when it arises and what drives it. To address this, we build and open source a diverse suite of tasks where models can score highly by taking unintended act…

  105. arXiv cs.CL TIER_1 English(EN) · Wenyuan Zhang, Shuaiyi Nie, Xinghua Zhang, Zefeng Zhang, Tingwen Liu ·

    Exploring the System 1 Thinking Capability of Large Reasoning Models

    arXiv:2504.10368v4 Announce Type: replace Abstract: This paper explores the system 1 thinking capability of Large Reasoning Models (LRMs), the intuitive ability to respond efficiently with minimal token usage. While existing LRMs rely on long-chain reasoning and excel at complex …

  106. arXiv cs.LG TIER_1 English(EN) · Jugal Gajjar, Kamalasankari Subramaniakuppusamy ·

    RSAT: Structured Attribution Makes Small Language Models Faithful Table Reasoners

    arXiv:2605.00199v1 Announce Type: cross Abstract: When a language model answers a table question, users have no way to verify which cells informed which reasoning steps. We introduce RSAT, a method that trains small language models (SLMs, 1-8B) to produce step-by-step reasoning w…

  107. arXiv cs.LG TIER_1 English(EN) · Arunabh Srivastava, Mohammad A. (Amir) Khojastepour, Srimat Chakradhar, Sennur Ulukus ·

    RunAgent: Interpreting Natural-Language Plans through Constraint-Guided Execution

    arXiv:2605.00798v1 Announce Type: new Abstract: Humans solve problems by executing targeted plans, yet large language models (LLMs) remain unreliable for structured workflow execution. We propose RunAgent, a multi-agent plan execution platform that interprets natural-language pla…

  108. arXiv cs.CL TIER_1 English(EN) · Runquan Gui, Jie Wang, Zhihai Wang, Chi Ma, Jianye Hao, Feng Wu ·

    Short Chains, Deep Thoughts: Balancing Reasoning Efficiency and Intra-Segment Capability via Split-Merge Optimization

    arXiv:2602.03141v3 Announce Type: replace Abstract: While Large Reasoning Models (LRMs) have demonstrated impressive capabilities in solving complex tasks through the generation of long reasoning chains, this reliance on verbose generation results in significant latency and compu…

  109. arXiv cs.CL TIER_1 English(EN) · Diane Tchuindjo, Omar Khattab ·

    Reasoning-Intensive Regression

    arXiv:2508.21762v3 Announce Type: replace Abstract: AI researchers and practitioners increasingly apply large language models (LLMs) to what we call reasoning-intensive regression (RiR), i.e., deducing subtle numerical scores from text. Unlike standard language regression tasks s…

  110. arXiv cs.LG TIER_1 English(EN) · Yuxuan Gao, Megan Wang, Yi Ling Yu ·

    Token Arena: A Continuous Benchmark Unifying Energy and Cognition in AI Inference

    arXiv:2605.00300v1 Announce Type: cross Abstract: Public inference benchmarks compare AI systems at the model and provider level, but the unit at which deployment decisions are actually made is the endpoint: the (provider, model, stock-keeping-unit) tuple at which a specific quan…

  111. Hugging Face Daily Papers TIER_1 English(EN) ·

    STABLEVAL: Disagreement-Aware and Stable Evaluation of AI Systems

    Human evaluation remains the primary standard for assessing modern AI systems, yet annotator disagreement, bias, and variability make system rankings fragile under standard majority vote aggregation. Majority vote discards annotator reliability and item-level ambiguity, often yie…

  112. arXiv cs.CL TIER_1 English(EN) · Guilin Qi ·

    StressEval: Failure-Driven Dynamic Benchmarking of Large Language Models for Knowledge-Intensive Reasoning

    Static benchmarks for LLMs are increasingly compromised by contamination and overfitting, especially on knowledge-intensive reasoning tasks. While recent dynamic benchmarks can alleviate staleness, they often increase difficulty at the expense of answerability and controllability. In…

  113. arXiv cs.CL TIER_1 English(EN) · Nikos Karacapilidis ·

    Maistros: A Greek Large Language Model Adapted through Knowledge Distillation from Large Reasoning Models

    Large Language Models (LLMs) have substantially advanced the field of Natural Language Processing (NLP), achieving state-of-the-art performance across a wide range of tasks. These improvements have been attributed, in part, to their emerging reasoning capabilities, which are enab…

  114. arXiv cs.CL TIER_1 English(EN) · Kwan Soo Shin ·

    The Reasoning Trap: Information-Theoretic Bounds on Closed-System Multi-Step LLM Reasoning

    When copies of the same language model are prompted to debate, they produce diverse phrasings of one perspective rather than diverse perspectives. Multi-agent debate (MAD), and more broadly closed-system reasoning where agents iteratively transform each other's outputs, tends to …

  115. arXiv cs.CL TIER_1 English(EN) · Sennur Ulukus ·

    RunAgent: Interpreting Natural-Language Plans through Constraint-Guided Execution

    Humans solve problems by executing targeted plans, yet large language models (LLMs) remain unreliable for structured workflow execution. We propose RunAgent, a multi-agent plan execution platform that interprets natural-language plans while enforcing stepwise execution through co…

  116. arXiv cs.CL TIER_1 English(EN) · Chenyang Gu, Jiahao Cheng, Meicong Zhang, Pujun Zheng, Jinquan Zheng, Guoxiu He ·

    MoRI: Learning Motivation-Driven Reasoning for Scientific Ideation in Large Language Models

    arXiv:2603.19044v3 Announce Type: replace Abstract: Scientific ideation aims to propose novel solutions within a given scientific context. Existing LLM-based agentic approaches emulate human research workflows, yet inadequately model scientific reasoning, resulting in surface-lev…

  117. arXiv cs.LG TIER_1 English(EN) · Samuel Pastva, Van-Giang Trinh ·

    BAss: Symbolic Reasoning for Abstract Dialectical Frameworks

    arXiv:2604.27576v1 Announce Type: cross Abstract: We present BAss (BDD-based ADF symbolic solver), a novel analysis tool for Abstract Dialectical Frameworks (ADFs) based on Binary Decision Diagrams (BDDs). It supports the fully symbolic computation of all admissible, complete, an…

  118. arXiv cs.CL TIER_1 English(EN) · Jingcheng Deng, Zihao Wei, Liang Pang, Junhong Wu, Shicheng Xu, Zenghao Duan, Huawei Shen ·

    Latent-GRPO: Group Relative Policy Optimization for Latent Reasoning

    arXiv:2604.27998v1 Announce Type: cross Abstract: Latent reasoning offers a more efficient alternative to explicit reasoning by compressing intermediate reasoning into continuous representations and substantially shortening reasoning chains. However, existing latent reasoning met…

  119. arXiv cs.AI TIER_1 English(EN) · Arijit Ray, Ahmed Abdelkader, Chengzhi Mao, Bryan A. Plummer, Kate Saenko, Ranjay Krishna, Leonidas Guibas, Wen-Sheng Chu ·

    Mull-Tokens: Modality-Agnostic Latent Thinking

    arXiv:2512.10941v2 Announce Type: replace-cross Abstract: Reasoning goes beyond language; the real world requires reasoning about space, time, affordances, and much more that words alone cannot convey. Existing multimodal models exploring the potential of reasoning with images ar…

  120. arXiv cs.AI TIER_1 English(EN) · Yucheng Wang, Yifan Hou, Aydin Javadov, Mubashara Akhtar, Mrinmaya Sachan ·

    Composition and Fusion: Revisiting the Foundational Bottlenecks in Multimodal Reasoning

    arXiv:2509.23744v4 Announce Type: replace-cross Abstract: Multimodal large language models (MLLMs) promise enhanced reasoning by integrating diverse inputs such as text, vision, and audio. Yet cross-modal reasoning remains underexplored, with conflicting reports on whether added …

  121. arXiv cs.AI TIER_1 English(EN) · Chengcao Yang, Jun Chen ·

    ANCORA: Learning to Question via Manifold-Anchored Self-Play for Verifiable Reasoning

    arXiv:2604.27644v1 Announce Type: cross Abstract: We propose a paradigm shift from learning to answer to learning to question: can a language model generate verifiable problems, solve them, and turn the resulting feedback into self-improvement without human supervision? We introd…

  122. arXiv cs.AI TIER_1 English(EN) · Xingwei Tan, Marco Valentino, Mahmud Elahi Akhter, Yuxiang Zhou, Maria Liakata, Nikolaos Aletras ·

    Compliance versus Reasonableness: A Study of Reasoning Controllability in Large Language Models

    arXiv:2604.27251v1 Announce Type: cross Abstract: Large Language Models (LLMs) are known to acquire reasoning capabilities through shared inference patterns in pre-training data, which are further elicited via Chain-of-Thought (CoT) practices. However, whether fundamental reasoni…

  123. arXiv cs.AI TIER_1 English(EN) · Shouren Wang, Wang Yang, Chuang Ma, Debargha Ganguly, Vikash Singh, Chaoda Song, Xinpeng Li, Xianxuan Long, Vipin Chaudhary, Xiaotian Han ·

    Path-Lock Experts: Separating Reasoning Modes in Hybrid Thinking via Architecture-Level Separation

    arXiv:2604.27201v1 Announce Type: cross Abstract: Hybrid-thinking language models expose explicit think and no-think modes, but current designs do not separate them cleanly. Even in no-think mode, models often emit long and self-reflective responses, causing reasoning leakage. Ex…

  124. arXiv cs.AI TIER_1 English(EN) · Adam Ishay, Joohyung Lee ·

    LLMs as ASP Programmers: Self-Correction Enables Task-Agnostic Nonmonotonic Reasoning

    arXiv:2604.27960v1 Announce Type: new Abstract: Recent large language models (LLMs) have achieved impressive reasoning milestones but continue to struggle with high computational costs, logical inconsistencies, and sharp performance degradation on high-complexity problems. While …

  125. arXiv cs.AI TIER_1 English(EN) · Yang Zhang, Jiangyuan Zhao, Chenyou Fan, Fangzheng Yan, Tian Li, Haitong Tang, Sen Fu, Xuan'er Wu, Qizhen Weng, Weinan Zhang, Xiu Li, Chi Zhang, Chenjia Bai, Xuelong Li ·

    PRTS: A Primitive Reasoning and Tasking System via Contrastive Representations

    arXiv:2604.27472v1 Announce Type: new Abstract: Vision-Language-Action (VLA) models advance robotic control via strong visual-linguistic priors. However, existing VLAs predominantly frame pretraining as supervised behavior cloning, overlooking the fundamental nature of robot lear…

  126. arXiv cs.AI TIER_1 English(EN) · Yi Ling Yu ·

    Token Arena: A Continuous Benchmark Unifying Energy and Cognition in AI Inference

    Public inference benchmarks compare AI systems at the model and provider level, but the unit at which deployment decisions are actually made is the endpoint: the (provider, model, stock-keeping-unit) tuple at which a specific quantization, decoding strategy, region, and serving s…

  127. arXiv cs.CL TIER_1 English(EN) · Kamalasankari Subramaniakuppusamy ·

    RSAT: Structured Attribution Makes Small Language Models Faithful Table Reasoners

    When a language model answers a table question, users have no way to verify which cells informed which reasoning steps. We introduce RSAT, a method that trains small language models (SLMs, 1-8B) to produce step-by-step reasoning with cell-level citations grounded in table evidenc…

  128. arXiv cs.CL TIER_1 English(EN) · Huawei Shen ·

    Latent-GRPO: Group Relative Policy Optimization for Latent Reasoning

    Latent reasoning offers a more efficient alternative to explicit reasoning by compressing intermediate reasoning into continuous representations and substantially shortening reasoning chains. However, existing latent reasoning methods mainly focus on supervised learning, and rein…

  129. arXiv cs.AI TIER_1 English(EN) · Joohyung Lee ·

    Large Language Models as ASP Programmers: Self-Correction Enables Task-Agnostic Nonmonotonic Reasoning

    Recent large language models (LLMs) have achieved impressive reasoning milestones but continue to struggle with high computational costs, logical inconsistencies, and sharp performance degradation on high-complexity problems. While neuro-symbolic methods attempt to mitigate these…

  130. Hugging Face Daily Papers TIER_1 English(EN) ·

    Large Language Models as ASP Programmers: Self-Correction Enables Task-Agnostic Nonmonotonic Reasoning

    Recent large language models (LLMs) have achieved impressive reasoning milestones but continue to struggle with high computational costs, logical inconsistencies, and sharp performance degradation on high-complexity problems. While neuro-symbolic methods attempt to mitigate these…

  131. arXiv cs.LG TIER_1 English(EN) · Jun Chen ·

    ANCORA: Learning to Question via Manifold-Anchored Self-Play for Verifiable Reasoning

    We propose a paradigm shift from learning to answer to learning to question: can a language model generate verifiable problems, solve them, and turn the resulting feedback into self-improvement without human supervision? We introduce ANCORA, an anchored-curriculum framework in wh…

  132. arXiv cs.LG TIER_1 English(EN) · Van-Giang Trinh ·

    BAss: Symbolic Reasoning for Abstract Dialectical Frameworks

    We present BAss (BDD-based ADF symbolic solver), a novel analysis tool for Abstract Dialectical Frameworks (ADFs) based on Binary Decision Diagrams (BDDs). It supports the fully symbolic computation of all admissible, complete, and preferred interpretations, as well as two-valued…

  133. arXiv cs.LG TIER_1 English(EN) · Zhiquan Tan, Yinrong Hong ·

    PAINT: Partial-Solution Adaptive Interpolated Training for Self-Distilled Reasoners

    arXiv:2604.26573v1 Announce Type: new Abstract: Improving large language model (LLM) reasoning requires supervision that is both aligned with the model's own test-time states and informative at the token level. Reinforcement learning with verifiable rewards provides on-policy exp…

  134. arXiv cs.AI TIER_1 English(EN) · Ioannis Konstantoulas, Dimosthenis Tsimas, Pavlos Peppas, Kyriakos Sgarbas ·

    Autoregressive Reasoning

    arXiv:2604.26507v1 Announce Type: new Abstract: Background & Objectives: In the last decade, machine learning research has grown rapidly, but large models are reaching their soft limits, demonstrating diminishing returns, and still lack solid reasoning abilities. These limits could…

  135. arXiv cs.CL TIER_1 English(EN) · Dongxin Guo, Jikun Wu, Siu Ming Yiu ·

    When to Retrieve During Reasoning: Adaptive Retrieval for Large Reasoning Models

    arXiv:2604.26649v1 Announce Type: cross Abstract: Large reasoning models such as DeepSeek-R1 and OpenAI o1 generate extended chains of thought spanning thousands of tokens, yet their integration with retrieval-augmented generation (RAG) remains fundamentally misaligned. Current R…

  136. arXiv cs.CL TIER_1 English(EN) · Nikolaos Aletras ·

    Compliance versus Reasonableness: A Study of Reasoning Controllability in Large Language Models

    Large Language Models (LLMs) are known to acquire reasoning capabilities through shared inference patterns in pre-training data, which are further elicited via Chain-of-Thought (CoT) practices. However, whether fundamental reasoning patterns, such as induction, deduction, and abd…

  137. arXiv cs.CL TIER_1 English(EN) · Xiaotian Han ·

    Path-Lock Experts: Separating Reasoning Modes in Hybrid Thinking via Architecture-Level Separation

    Hybrid-thinking language models expose explicit think and no-think modes, but current designs do not separate them cleanly. Even in no-think mode, models often emit long and self-reflective responses, causing reasoning leakage. Existing work reduces this issue through better data…

  138. arXiv cs.CL TIER_1 English(EN) · Siu Ming Yiu ·

    When to Retrieve During Reasoning: Adaptive Retrieval for Large Reasoning Models

    Large reasoning models such as DeepSeek-R1 and OpenAI o1 generate extended chains of thought spanning thousands of tokens, yet their integration with retrieval-augmented generation (RAG) remains fundamentally misaligned. Current RAG systems optimize for providing context before r…

  139. arXiv cs.LG TIER_1 English(EN) · Yinrong Hong ·

    PAINT: Partial-Solution Adaptive Interpolated Training for Self-Distilled Reasoners

    Improving large language model (LLM) reasoning requires supervision that is both aligned with the model's own test-time states and informative at the token level. Reinforcement learning with verifiable rewards provides on-policy exploration but offers sparse, high-variance credit…

  140. Hugging Face Daily Papers TIER_1 English(EN) ·

    Grounding vs. Compositionality: On the Non-Complementarity of Reasoning in Neuro-Symbolic Systems

    Compositional generalization remains a foundational weakness of modern neural networks, limiting their robustness and applicability in domains requiring out-of-distribution reasoning. A central, yet unverified, assumption in neuro-symbolic AI is that compositional reasoning will …

  141. arXiv cs.AI TIER_1 English(EN) · Kyriakos Sgarbas ·

    Autoregressive Reasoning

    Background & Objectives: In the last decade, machine learning research has grown rapidly, but large models are reaching their soft limits, demonstrating diminishing returns, and still lack solid reasoning abilities. These limits could be surpassed through synergistic combination of…

  142. arXiv cs.CL TIER_1 English(EN) · Jackson Petty, Michael Y. Hu, Wentao Wang, Shauli Ravfogel, William Merrill, Tal Linzen ·

    RELIC: Evaluating Complex Reasoning by Recognizing Languages In-Context

    arXiv:2506.05205v2 Announce Type: replace Abstract: Large language models (LLMs) are increasingly used to solve complex tasks where they must retrieve and compose many pieces of in-context information in long reasoning chains. For many real-world tasks it is hard to accurately ga…

  143. arXiv cs.LG TIER_1 English(EN) · Chu-Cheng Lin, Eugene Ie ·

    How Fast Should a Model Commit to Supervision? Training Reasoning Models on the Tsallis Loss Continuum

    arXiv:2604.25907v1 Announce Type: new Abstract: Adapting reasoning models to new tasks during post-training with only output-level supervision stalls under reinforcement learning from verifiable rewards (RLVR) when the initial success probability $p_0$ is small. Using the Tsallis…

  144. arXiv cs.LG TIER_1 English(EN) · Maixent Chenebaux ·

    Nautile-370M: Combining Spectral Memory and Attention in a Small Reasoning Model

    arXiv:2604.24809v1 Announce Type: new Abstract: We present Nautile-370M, a 371-million-parameter small language model designed for efficient reasoning under strict parameter and inference budgets. Nautile-370M uses a hybrid backbone in which two SeqCond Attention (SCA) layers, a …

  145. arXiv cs.CL TIER_1 English(EN) · Pratham Singla, Shivank Garg, Ayush Singh, Ishan Garg, Ketan Suhaas Saichandran ·

    Thinking About Thinking: Evaluating the Reasoning Abilities of Post-Trained Language Models

    arXiv:2510.16340v2 Announce Type: replace Abstract: Recent advances in post-training techniques have endowed Large Language Models (LLMs) with enhanced capabilities for tackling complex, logic-intensive tasks through the generation of supplementary planning tokens. This developme…

  146. arXiv cs.CL TIER_1 English(EN) · Yixiao Zhou, Dongzhou Cheng, Zhiliang Wu, Yi Yang, Yu Cheng, Hehe Fan ·

    One Refiner to Bind Them All: Inference-Time Reasoning Elicitation via Reinforced Query Refinement

    arXiv:2604.25444v1 Announce Type: new Abstract: Large Language Models (LLMs) often fail to utilize their latent reasoning capabilities due to a distributional mismatch between ambiguous human inquiries and the structured logic required for machine activation. Existing alignment m…

  147. arXiv cs.CL TIER_1 English(EN) · Oliver Kraus, Yash Sarrof, Yuekun Yao, Alexander Koller, Michael Hahn ·

    Barriers to Universal Reasoning with Transformers (and How to Overcome Them)

    arXiv:2604.25800v1 Announce Type: cross Abstract: Chain-of-Thought (CoT) has been shown to empirically improve Transformers' performance, and theoretically increase their expressivity to Turing completeness. However, whether Transformers can learn to generalize to CoT traces long…

  148. arXiv cs.CL TIER_1 English(EN) · Xiangxiang Zhang, Caijun Jia, Siyuan Li, Dingyu He, Xiya Xiong, Zheng Sun, Honghao He, Yuchen Wu, Bihui Yu, Linzhuang Sun, Cheng Tan, Jingxuan Wei ·

    How RL Unlocks the "Aha Moment" in Geometric Interleaved Reasoning

    arXiv:2603.01070v2 Announce Type: replace Abstract: Solving complex geometric problems inherently requires interleaved reasoning: a tight alternation between constructing diagrams and performing logical deductions. Although recent Multimodal Large Language Models (MLLMs) have dem…

  149. arXiv cs.CL TIER_1 English(EN) · Soyeong Jeong, Taehee Jung, Sung Ju Hwang, Joo-Kyung Kim, Dongyeop Kang ·

    When Thoughts Meet Facts: Reusable Reasoning for Long-Context Language Models

    arXiv:2510.07499v2 Announce Type: replace Abstract: Recent Long-Context Language Models (LCLMs) can process hundreds of thousands of tokens in a single prompt, enabling new opportunities for knowledge-intensive multi-hop reasoning by integrating large sets of retrieved documents …

  150. arXiv cs.AI TIER_1 English(EN) · Eugene Ie ·

    How Fast Should a Model Commit to Supervision? Training Reasoning Models on the Tsallis Loss Continuum

    Adapting reasoning models to new tasks during post-training with only output-level supervision stalls under reinforcement learning from verifiable rewards (RLVR) when the initial success probability $p_0$ is small. Using the Tsallis $q$-logarithm, we define a loss family $J_Q$ th…

  151. arXiv cs.CL TIER_1 English(EN) · Michael Hahn ·

    Barriers to Universal Reasoning with Transformers (and How to Overcome Them)

    Chain-of-Thought (CoT) has been shown to empirically improve Transformers' performance, and theoretically increase their expressivity to Turing completeness. However, whether Transformers can learn to generalize to CoT traces longer than those seen during training is understudied…

  152. Hugging Face Daily Papers TIER_1 English(EN) ·

    One Refiner to Bind Them All: Inference-Time Reasoning Elicitation via Reinforced Query Refinement

    Large Language Models (LLMs) often fail to utilize their latent reasoning capabilities due to a distributional mismatch between ambiguous human inquiries and the structured logic required for machine activation. Existing alignment methods either incur prohibitive $O(N)$ costs by …

  153. arXiv cs.CL TIER_1 English(EN) · Hehe Fan ·

    One Refiner to Bind Them All: Inference-Time Reasoning Elicitation via Reinforced Query Refinement

    Large Language Models (LLMs) often fail to utilize their latent reasoning capabilities due to a distributional mismatch between ambiguous human inquiries and the structured logic required for machine activation. Existing alignment methods either incur prohibitive $O(N)$ costs by …

  154. arXiv cs.CL TIER_1 English(EN) · Yuxuan Jiang, Dawei Li, Francis Ferraro ·

    DRP: Distilled Reasoning Pruning with Skill-Aware Step Decomposition for Efficient Large Reasoning Models

    arXiv:2505.13975v4 Announce Type: replace Abstract: While Large Reasoning Models (LRMs) have demonstrated success in complex reasoning tasks through long chain-of-thought (CoT) reasoning, their inference often involves excessively verbose reasoning traces, resulting in substantia…

  155. arXiv cs.AI TIER_1 English(EN) · Guangxiang Zhao, Qilong Shi, Xusen Xiao, Xiangzheng Zhang, Tong Yang, Lin Sun ·

    Thinking with Reasoning Skills: Fewer Tokens, Higher Accuracy

    arXiv:2604.21764v2 Announce Type: replace Abstract: Reasoning LLMs often spend substantial tokens on long intermediate reasoning traces (e.g., chain-of-thought) when solving new problems. We propose to summarize and store reusable reasoning skills distilled from extensive deliber…

  156. arXiv cs.AI TIER_1 English(EN) · Dahlia Shehata, Ming Li ·

    Beyond the Attention Stability Boundary: Agentic Self-Synthesizing Reasoning Protocols

    arXiv:2604.24512v1 Announce Type: new Abstract: As LLM agents transition to autonomous digital coworkers, maintaining deterministic goal-directedness in non-linear multi-turn conversations emerged as an architectural bottleneck. We identify and formalize a systemic failure mode t…

  157. arXiv cs.AI TIER_1 English(EN) · Sinin Zhang, Yunfei Xie, Yuxuan Cheng, Haoyu Zhang, Tong Zhang ·

    PhysNote: Self-Aware Notes for Evolvable Physical Reasoning in Vision-Language Models

    arXiv:2604.24443v1 Announce Type: new Abstract: Vision-Language Models (VLMs) have demonstrated strong performance on textbook-style physics problems, yet they frequently fail when confronted with dynamic real-world scenarios that require temporal consistency and causal reasoning…

  158. arXiv cs.AI TIER_1 English(EN) · Zichuan Fu, Xian Wu, Guojing Li, Yejing Wang, Yijun Chen, Zihao Zhao, Yixuan Luo, Hanyu Yan, Yefeng Zheng, Xiangyu Zhao ·

    Tandem: Pairing Large and Small Language Models for Efficient Reasoning

    arXiv:2604.23623v1 Announce Type: new Abstract: Recent advancements in large language models (LLMs) have catalyzed the rise of reasoning-intensive inference paradigms, where models perform explicit step-by-step reasoning before generating final answers. While such approaches impr…

  159. arXiv cs.AI TIER_1 English(EN) · Yijiashun Qi, Xiang Xu, Yuxuan Li ·

    The Negative Effects of Corrective Prompts: Prompt Design for Reasoner-Guided Repair of LLM Over-Caution on Entailed Negations under OWL 2 DL

    arXiv:2604.23398v1 Announce Type: new Abstract: We report a reproducible error pattern in GPT-5.4 on OWL 2 DL compliance queries: the model frequently answers "unknown" when the reasoner-entailed answer is "no" under FunctionalProperty closure or class disjointne…

  160. arXiv cs.AI TIER_1 English(EN) · Akihiro Takemura, Katsumi Inoue, Masaaki Nishino ·

    Constraint-Based Analysis of Reasoning Shortcuts in Neurosymbolic Learning

    arXiv:2604.23377v1 Announce Type: new Abstract: Neurosymbolic systems can satisfy logical constraints during learning without achieving the intended concept-label correspondence; this is a problem known as reasoning shortcuts. We formalize reasoning shortcuts as a constraint sati…

  161. arXiv cs.CL TIER_1 English(EN) · Anej Svete, Ashish Sabharwal ·

    The Reasoning Abilities of Masked Diffusion Language Models

    arXiv:2510.13117v3. Masked diffusion models (MDMs) for text offer a compelling alternative to traditional autoregressive language models. Parallel generation makes them efficient, but their computational capabilities and the limitations inher…

  162. arXiv cs.CL TIER_1 English(EN) · Zhiyuan Lu, Chenliang Li, Yingcheng Shi, Weizhou Shen, Ming Yan, Fei Huang ·

    CorpusQA: A 10-Million-Token Benchmark for Corpus-Level Analysis and Reasoning

    arXiv:2601.14952v2. While large language models now handle million-token contexts, their capacity for reasoning across entire document repositories remains largely untested. Existing benchmarks are inadequate, as they are mostly limited to single l…

  163. arXiv cs.CL TIER_1 English(EN) · Tianyu Fu, Yichen You, Zekai Chen, Guohao Dai, Huazhong Yang, Yu Wang ·

    Think-at-Hard: Selective Latent Iterations to Improve Reasoning Language Models

    arXiv:2511.08577v2. Improving reasoning abilities of Large Language Models (LLMs), especially under parameter constraints, is crucial for real-world applications. Looped transformers address this by performing multiple latent iterations to refine e…

  164. arXiv cs.CL TIER_1 English(EN) · Sharan Ramjee ·

    Ulterior Motives: Detecting Misaligned Reasoning in Continuous Thought Models

    arXiv:2604.23460v1. Chain-of-Thought (CoT) reasoning has emerged as a key technique for eliciting complex reasoning in Large Language Models (LLMs). Although interpretable, its dependence on natural language limits the model's expressive bandwidth. C…

  165. arXiv cs.CL TIER_1 English(EN) · Zixuan Wang, Xingyu Dang, Jason D. Lee, Kaifeng Lyu ·

    The Power of Power Law: Asymmetry Enables Compositional Reasoning

    arXiv:2604.22951v1. Natural language data follows a power-law distribution, with most knowledge and skills appearing at very low frequency. While a common intuition suggests that reweighting or curating data towards a uniform distribution may help mo…

  166. arXiv cs.CL TIER_1 English(EN) · Sercan Karakaş, Yusuf Şimşek ·

    Benchmarking Source-Sensitive Reasoning in Turkish: Humans and LLMs under Evidential Trust Manipulation

    arXiv:2604.24665v1. This paper investigates whether source trustworthiness shapes Turkish evidential morphology and whether large language models (LLMs) track this sensitivity. We study the past-domain contrast between -DI and -mIs in controlled cloze …

  167. arXiv cs.CL TIER_1 English(EN) · Han Wang, Xiaodong Yu, Jialian Wu, Jiang Liu, Ximeng Sun, Mohit Bansal, Zicheng Liu ·

    Stabilizing Efficient Reasoning via Step-Level Advantage Selection

    arXiv:2604.24003v1. Large language models (LLMs) achieve strong reasoning performance by allocating substantial computation at inference time, often generating long and verbose reasoning traces. While recent work on efficient reasoning reduces this ove…

  168. arXiv cs.CL TIER_1 English(EN) · Zixuan Wang, Yuanyuan Lei ·

    Knowledge Vectors for Logical Reasoning in Large Language Models

    arXiv:2604.23877v1. Logical reasoning serves as a central capability in LLMs and includes three main forms: deductive, inductive, and abductive reasoning. In this work, we study the knowledge representations of these reasoning types in LLMs and analyze …

  169. arXiv cs.CL TIER_1 English(EN) · Yusuf Şimşek ·

    Benchmarking Source-Sensitive Reasoning in Turkish: Humans and LLMs under Evidential Trust Manipulation

    This paper investigates whether source trustworthiness shapes Turkish evidential morphology and whether large language models (LLMs) track this sensitivity. We study the past-domain contrast between -DI and -mIs in controlled cloze contexts where the information source is overtly…

  170. Hugging Face Daily Papers TIER_1 English(EN) ·

    Benchmarking Source-Sensitive Reasoning in Turkish: Humans and LLMs under Evidential Trust Manipulation

    This paper investigates whether source trustworthiness shapes Turkish evidential morphology and whether large language models (LLMs) track this sensitivity. We study the past-domain contrast between -DI and -mIs in controlled cloze contexts where the information source is overtly…

  171. arXiv cs.AI TIER_1 English(EN) · Ming Li ·

    Beyond the Attention Stability Boundary: Agentic Self-Synthesizing Reasoning Protocols

    As LLM agents transition to autonomous digital coworkers, maintaining deterministic goal-directedness in non-linear multi-turn conversations emerged as an architectural bottleneck. We identify and formalize a systemic failure mode termed the Attention Latch in decoder-only autore…

  172. arXiv cs.AI TIER_1 English(EN) · Tong Zhang ·

    PhysNote: Self-Knowledge Notes for Evolvable Physical Reasoning in Vision-Language Models

    Vision-Language Models (VLMs) have demonstrated strong performance on textbook-style physics problems, yet they frequently fail when confronted with dynamic real-world scenarios that require temporal consistency and causal reasoning across frames. We identify two fundamental chal…

  173. arXiv cs.CL TIER_1 English(EN) · Keshav Ramji, Tahira Naseem, Ramón Fernandez Astudillo ·

    Thinking Without Words: Efficient Latent Reasoning with Abstract Chain-of-Thought

    arXiv:2604.22709v1. While long, explicit chains-of-thought (CoT) have proven effective on complex reasoning tasks, they are costly to generate during inference. Non-verbal reasoning methods have emerged with shorter generation lengths by leveraging con…

  174. arXiv cs.CL TIER_1 English(EN) · Grigory Sapunov ·

    Universal Transformers Need Memory: Depth-State Trade-offs in Adaptive Recursive Reasoning

    arXiv:2604.21999v1. We study learned memory tokens as computational scratchpad for a single-block Universal Transformer (UT) with Adaptive Computation Time (ACT) on Sudoku-Extreme, a combinatorial reasoning benchmark. We find that memory tokens are e…

  175. arXiv cs.CL TIER_1 English(EN) · Karthic Palaniappan ·

    Incentivizing Neuro-Symbolic Language Reasoning in Vision-Language Models via Reinforcement Learning

    arXiv:2604.22062v1. There are 7,407 languages in the world. But, what about the languages that are not there in the world? Are humans so narrow minded that we don't care about the languages aliens communicate in? Aliens are humans too! In the 2016 movi…

  176. arXiv cs.CL TIER_1 English(EN) · Zicheng Liu ·

    Stabilizing Efficient Reasoning via Step-Level Advantage Selection

    Large language models (LLMs) achieve strong reasoning performance by allocating substantial computation at inference time, often generating long and verbose reasoning traces. While recent work on efficient reasoning reduces this overhead through length-based rewards or pruning, m…

  177. arXiv cs.CL TIER_1 English(EN) · Yuanyuan Lei ·

    Knowledge Vectors for Logical Reasoning in Large Language Models

    Logical reasoning serves as a central capability in LLMs and includes three main forms: deductive, inductive, and abductive reasoning. In this work, we study the knowledge representations of these reasoning types in LLMs and analyze the correlations among them. Our analysis shows …

  178. arXiv cs.CL TIER_1 English(EN) · Ramón Fernandez Astudillo ·

    Thinking Without Words: Efficient Latent Reasoning with Abstract Chain-of-Thought

    While long, explicit chains-of-thought (CoT) have proven effective on complex reasoning tasks, they are costly to generate during inference. Non-verbal reasoning methods have emerged with shorter generation lengths by leveraging continuous representations, yet their performance l…

  179. arXiv cs.CL TIER_1 English(EN) · Karthic Palaniappan ·

    Incentivizing Neuro-Symbolic Language Reasoning in Vision-Language Models via Reinforcement Learning

    There are 7,407 languages in the world. But, what about the languages that are not there in the world? Are humans so narrow minded that we don't care about the languages aliens communicate in? Aliens are humans too! In the 2016 movie Arrival, Amy Adams plays a linguist, Dr. Louis…

  180. arXiv cs.CL TIER_1 English(EN) · Grigory Sapunov ·

    Universal Transformers Need Memory: Depth-State Trade-offs in Adaptive Recursive Reasoning

    We study learned memory tokens as computational scratchpad for a single-block Universal Transformer (UT) with Adaptive Computation Time (ACT) on Sudoku-Extreme, a combinatorial reasoning benchmark. We find that memory tokens are empirically necessary: across all configurations te…

  181. arXiv cs.AI TIER_1 English(EN) · Lin Sun ·

    Thinking with Reasoning Skills: Fewer Tokens, Higher Accuracy

    Reasoning LLMs often spend substantial tokens on long intermediate reasoning traces (e.g., chain-of-thought) when solving new problems. We propose to summarize and store reusable reasoning skills distilled from extensive deliberation and trial-and-error exploration, and to retrie…

  182. arXiv cs.AI TIER_1 English(EN) · Csaba Szepesvári ·

    Seeing the Unseen: Generalization of Transformers in Symbolic Reasoning

    We investigate the ability of decoder-only transformer models to perform abstract symbolic reasoning; specifically solving propositional logic reasoning problems given in-context. Previous work demonstrated that models fail to generalize to problems involving variable names that …

  183. Hugging Face Daily Papers TIER_1 English(EN) ·

    Seeing the Unseen: Generalization of Transformers in Symbolic Reasoning

    We investigate the ability of decoder-only transformer models to perform abstract symbolic reasoning; specifically solving propositional logic reasoning problems given in-context. Previous work demonstrated that models fail to generalize to problems involving variable names that …

  184. arXiv cs.CL TIER_1 English(EN) · Weiming Lu ·

    Language as a Latent Variable for Reasoning Optimization

    As LLMs reduce English-centric bias, a surprising trend emerges: non-English responses sometimes outperform English on reasoning tasks. We hypothesize that language functions as a latent variable that structurally modulates the model's internal inference pathways, rather than mer…

  185. Ahead of AI (Sebastian Raschka) TIER_1 English(EN) · Sebastian Raschka, PhD ·

    First Look at Reasoning From Scratch: Chapter 1

    Welcome to the next stage of large language models (LLMs): reasoning. LLMs have transformed how we process and generate text, but their success has been largely driven by statistical pattern recognition. However, new advances in reasoning methodologies now enable LLMs to tackle m…

  186. arXiv cs.CV TIER_1 English(EN) · Jun Du ·

    Unlocking Complex Visual Generation via Closed-Loop Verified Reasoning

    Despite rapid advancements, current text-to-image (T2I) models predominantly rely on a single-step generation paradigm, which struggles with complex semantics and faces diminishing returns from parameter scaling. While recent multi-step reasoning approaches show promise, they are…

  187. arXiv stat.ML TIER_1 English(EN) · Masoud Asgharian ·

    Pause and Reflect: Conformal Aggregation for Chain-of-Thought Reasoning

    Chain-of-thought (CoT) reasoning with self-consistency improves performance by aggregating multiple sampled reasoning paths. In this setting, correctness is no longer tied to a single reasoning trace but to the aggregation rule over a pool of candidate paths, making aggregation u…

  188. arXiv cs.CV TIER_1 English(EN) · Yanzhi Wang ·

    PhyGround: Benchmarking Physical Reasoning in Generative World Models

    Generative world models are increasingly used for video generation, where learned simulators are expected to capture the physical rules that govern real-world dynamics. However, evaluating whether generated videos actually follow these rules remains challenging. Existing physics-…

  189. arXiv cs.CV TIER_1 English(EN) · Wentao Zhang ·

    Uni-Synergy: Bridging Understanding and Generation for Personalized Reasoning via Cooperative Reinforcement Learning

    Unified Multimodal Models (UMMs) excel in general tasks but struggle to bridge the gap between personalized understanding and generation. Prior works largely rely on implicit token-level alignment via supervised fine-tuning, which fails to fully capture the potential synergy betw…

  190. arXiv stat.ML TIER_1 English(EN) · Naoto Iwase, Yuki Ichihara, Mohammad Atif Quamar, Junpei Komiyama ·

    Reliable Chain-of-Thought via Prefix Consistency

    arXiv:2605.07654v1. Large Language Models often improve accuracy on reasoning tasks by sampling multiple Chain-of-Thought (CoT) traces and aggregating them with majority voting (MV), a test-time technique called self-consistency. When we truncate a CoT…

  191. arXiv cs.CV TIER_1 English(EN) · Yun Xing, Xiaobin Hu, Qingdong He, Jiangning Zhang, Shuicheng Yan, Shijian Lu, Yu-Gang Jiang ·

    Boosting Reasoning in Large Multimodal Models via Activation Replay

    arXiv:2511.19972v3. Recently, Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as an effective approach to incentivizing reasoning capability in Large Multimodal Models (LMMs), while the underlying mechanisms behind this post-train…

  192. arXiv stat.ML TIER_1 English(EN) · Hen Davidov, Nachshon Cohen, Oren Kalinsky, Yaron Fairstein, Guy Kushilevitz, Ram Yazdi, Patrick Rebeschini ·

    When to Quit: A Principled Framework for Dynamic Abstention in LLM Reasoning

    arXiv:2604.18419v2. LLMs utilizing chain-of-thought reasoning often waste substantial compute by producing long, incorrect responses. Abstention can mitigate this by withholding outputs unlikely to be correct. While most abstention methods de…

  193. arXiv cs.CV TIER_1 English(EN) · Jinghui Lu, Jiayi Guan, Zhijian Huang, Jinlong Li, Guang Li, Lingdong Kong, Yingyan Li, Han Wang, Shaoqing Xu, Yuechen Luo, Fang Li, Chenxu Dang, Junli Wang, Tao Xu, Jing Wu, Jianhua Wu, Xiaoshuai Hao, Wen Zhang, Tianyi Jiang, Lingfeng Zhang, Lei Zhou, Yi ·

    OneVL: One-Step Latent Reasoning and Planning with Vision-Language Explanations

    arXiv:2604.18486v2. Chain-of-Thought (CoT) reasoning has become a powerful driver of trajectory prediction in VLA-based autonomous driving, yet its autoregressive nature imposes a latency cost that is prohibitive for real-time deployment. Latent Co…

  194. arXiv cs.CV TIER_1 English(EN) · Xiaoyu Yang, En Yu, Wei Duan, Jie Lu ·

    Turning Drift into Constraints: Robust Reasoning Alignment in Non-Stationary Environments

    arXiv:2510.04142v2. This paper identifies a critical yet underexplored challenge in reasoning alignment from multiple multi-modal large language models (MLLMs): In non-stationary environments, the diverse reasoning distributions of source models of…

  195. LessWrong (AI tag) TIER_1 English(EN) · Sturb ·

    Sanity-Checking “Incompressible Knowledge Probes”

    Or, did a chief scientist of an AI assistant startup conclusively show that GPT-5.5 has 9.7T parameters? Introduction: Recently, a paper was circulated on Twitter claiming to have reverse engineered the parameter count of man…

  196. arXiv cs.CV TIER_1 English(EN) · Mahnoor Shahid, Hannes Rothe ·

    Grounding vs. Compositionality: On the Non-Complementarity of Reasoning in Neuro-Symbolic Systems

    arXiv:2604.26521v1. Compositional generalization remains a foundational weakness of modern neural networks, limiting their robustness and applicability in domains requiring out-of-distribution reasoning. A central, yet unverified, assumption in neuro…

  197. arXiv cs.CV TIER_1 English(EN) · Hannes Rothe ·

    Grounding vs. Compositionality: On the Non-Complementarity of Reasoning in Neuro-Symbolic Systems

    Compositional generalization remains a foundational weakness of modern neural networks, limiting their robustness and applicability in domains requiring out-of-distribution reasoning. A central, yet unverified, assumption in neuro-symbolic AI is that compositional reasoning will …

  198. arXiv cs.CV TIER_1 English(EN) · Zhiheng Wu, Tong Wang, Shuning Wang, Naiming Liu, Yumeng Zhang ·

    See Further, Think Deeper: Advancing VLM Reasoning with Low-Level Visual Cues and Reflection

    arXiv:2604.24339v1. Recent advances in Vision-Language Models (VLMs) have benefited from Reinforcement Learning (RL) for enhanced reasoning. However, existing methods still face critical limitations, including the lack of low-level visual information a…

  199. arXiv cs.CV TIER_1 English(EN) · Chao Huang, Zeliang Zhang, Jiang Liu, Ximeng Sun, Jialian Wu, Xiaodong Yu, Ze Wang, Chenliang Xu, Emad Barsoum, Zicheng Liu ·

    DRIFT: Transferring Reasoning Priors for Efficient MLLM Fine-Tuning

    arXiv:2510.15050v2. Multimodal large language models (MLLMs) have made rapid progress, yet their reasoning ability often lags behind strong text-only LLMs. Bridging this gap typically requires large-scale multimodal reasoning data or reinforcement …

  200. arXiv cs.CV TIER_1 English(EN) · Yumeng Zhang ·

    See Further, Think Deeper: Advancing VLM Reasoning with Low-Level Visual Cues and Reflection

    Recent advances in Vision-Language Models (VLMs) have benefited from Reinforcement Learning (RL) for enhanced reasoning. However, existing methods still face critical limitations, including the lack of low-level visual information and effective visual feedback. To address these p…

  201. arXiv cs.CV TIER_1 English(EN) · Yinglun Zhu, Jiancheng Zhang, Fuzhi Tang ·

    Test-Time Matching: Unlocking Compositional Reasoning in Multimodal Models

    arXiv:2510.07632v2. Frontier AI models have achieved remarkable progress, yet recent studies suggest they struggle with compositional reasoning, often performing at or below random chance on established benchmarks. We revisit this problem and…

  202. arXiv cs.CV TIER_1 English(EN) · Anoop Cherian, Radu Corcodel, Siddarth Jain, Diego Romeres ·

    LLMPhy: Parameter-Identifiable Physical Reasoning Combining Large Language Models and Physics Engines

    arXiv:2411.08027v3. Most learning-based approaches to complex physical reasoning sidestep the crucial problem of parameter identification (e.g., mass, friction) that governs scene dynamics, despite its importance in real-world applications su…

  203. Smol AINews TIER_1 English(EN) ·

    Bespoke-Stratos + Sky-T1: The Vicuna+Alpaca Moment for Reasoning

    **Reasoning Distillation** has emerged as a key technique, with Berkeley/USC researchers releasing **Sky-T1-32B-Preview**, a finetuned model of **Qwen 2.5 32B** using 17k reasoning traces for just **$450**, matching benchmarks of **o1-preview**. **DeepSeek** introduced **R1**, a …

  204. Smol AINews TIER_1 English(EN) ·

    Qwen with Questions: 32B Open-Weight Reasoning Model Approaches o1 on GPQA/AIME/Math500

    **DeepSeek r1** leads the race for "open o1" models but has yet to release weights, while **Justin Lin** released **QwQ**, a **32B open weight model** that outperforms **GPT-4o** and **Claude 3.5 Sonnet** on benchmarks. QwQ appears to be a fine-tuned version of **Qwen 2.5**, emph…

  205. Smol AINews TIER_1 English(EN) ·

    o1: OpenAI's New General Reasoning Model

    **OpenAI** has released the **o1** model family, including **o1-preview** and **o1-mini**, focusing on test-time reasoning with extended output token limits over 30k tokens. The models show strong performance, ranking in the 89th percentile on competitive programming, excelling i…

  206. The Gradient TIER_1 English(EN) · Petar Veličković ·

    Neural Algorithmic Reasoning

    In this article, we will talk about classical computation: the kind of computation typically found in an undergraduate Computer Science course on Algorithms and Data Structures [1]. Think shortest path-finding, sorting, clever ways to break problems down into simpler …

  207. HN — AI infrastructure stories TIER_1 English(EN) · ksec ·

    Measuring the Environmental Impact of AI Inference

  208. Pandaily TIER_1 English(EN) · Pandaily ·

    LaST-R1: New Physical Reasoning Paradigm Achieves 99.9% Success Rate on the LIBERO Benchmark

    A joint research effort from Zojian Power, Peking University, and CUHK proposes LaST-R1, a new embodied AI paradigm that achieves 99.9% success on the LIBERO benchmark and scores 22.5% higher than π0.5 in real-world tasks.

  209. HN — claude cli stories TIER_1 English(EN) · Bayram ·

    Show HN: Retain – A Unified Knowledge Base for All Your AI Coding Conversations

  210. Towards AI TIER_1 Deutsch(DE) · Kaushik Rajan ·

    When Reasoning Hurts: 4 Tasks Where Smaller Models Win

    https://pub.towardsai.net/when-reasoning-hurts-4-tasks-where-smaller-models-win-88486b883896?source=rss----98111c9905da---4

  211. Towards AI TIER_1 English(EN) · R. Thompson (PhD) ·

    The Hive Mind Unleashed: How Swarms Slash Compute While Improving Reasoning

    https://pub.towardsai.net/the-hive-mind-unleashed-how-swarms-slash-compute-while-improving-reasoning-764757579924?source=rss----98111c9905da---4

  212. dev.to — Anthropic tag TIER_1 English(EN) · Gabriel Anhaia ·

    Claude Opus 4.7 Adaptive Thinking: When Reasoning Tokens Pay Off

    Book: Prompt Engineering Pocket Guide: Techniques for Getting the Most from LLMs (https://www.amazon.com/dp/B0GX38N645). Also by me: Thinking in Go (2-book series) …

  213. dev.to — LLM tag TIER_1 English(EN) · LyricalString ·

    Solving the LLM Black-Box Problem with Structured Reasoning

    The "black box" problem in Large Language Models is often discussed as a philosophical hurdle, but for engineers building high-stakes vertical applications, it is a hard technical bottleneck. In domains like legal tech, medical diagnosis, or financial auditing, a correct answe…

  214. r/LocalLLaMA TIER_1 English(EN) · /u/Thrumpwart ·

    Structured CoT: Shorter Reasoning with Grammar Files

    Submitted by /u/Thrumpwart. Link: https://andthattoo.dev/blog/structured_cot. Thread: https://www.reddit.com/r/LocalLLaMA/comments/1svtsm1/structured_cot_shorter_reaso…

  215. Mastodon — mastodon.social TIER_1 English(EN) · [email protected] ·

    A tutorial explores how to parse, analyse and visualise reasoning traces from the lambda/hermes-agent-reasoning-traces dataset. It covers understanding how auto…

    A tutorial explores how to parse, analyse and visualise reasoning traces from the lambda/hermes-agent-reasoning-traces dataset. It covers understanding how autonomous AI agents use tools and generate responses across multi-turn conversations. The guide shows how to prepare data f…

  216. Mastodon — mastodon.social TIER_1 English(EN) · aihaberleri ·

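    📰 Top 5 Agentic Reasoning Benchmarks for LLMs in 2026 That Predict Real-World Performance. As AI agents transition from demos to enterprise use, traditional metrics…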

    📰 Top 5 Agentic Reasoning Benchmarks for LLMs in 2026 That Predict Real-World Performance As AI agents transition from demos to enterprise use, traditional metrics like MMLU fall short. The most critical benchmarks now measure real-world agentic reasoning—navigating complex tasks…

  217. Mastodon — mastodon.social TIER_1 Türkçe(TR) · aihaberleri ·

    📰 The 7 Most Important Benchmarks for Agentic Reasoning: The Real Tests of LLMs. The agentic reasoning capabilities of LLMs have moved beyond purely academic interest...

    📰 The 7 Most Important Benchmarks for Agentic Reasoning: The Real Tests of LLMs. The agentic reasoning capabilities of LLMs have now moved beyond mere academic interest and become a critical advantage in industrial applications. Data from 2025-2026 show how the 7 core benchmarks that measure these capabilities…

  218. Mastodon — mastodon.social TIER_1 Deutsch(DE) · aihaberleri ·

    📰 AI Agents in Software Development 2025: 5 New Disciplines That Do Not Replace Developers. AI agents are changing software development not through replacement…

    📰 AI Agents in Software Development 2025: 5 New Disciplines That Do Not Replace Developers. AI agents are changing software development not through replacement but by introducing new disciplines. Researchers from Chalmers University and the Volvo Group show that…

  219. Mastodon — mastodon.social TIER_1 Türkçe(TR) · aihaberleri ·

    📰 AI Agents Are Not Replacing Developers: What Are the New Professions in 2026? (Chalmers Study) AI agents will not wipe out software developers…

    📰 AI Agents Are Not Replacing Developers: What Are the New Professions in 2026? (Chalmers Study) The narrative claiming that AI agents will wipe out software developers is misleading, according to new research from Chalmers University and Volvo Group. The reality is that technology…

  220. r/cursor TIER_2 English(EN) · /u/Specialist_Solid523 ·

    slop CLI (v1.0.0) First Major Release: A Tool to Prevent Reasoning Drift

    Submitted by /u/Specialist_Solid523. Link: /r/LLMDevs/comments/1t4sr9z/slop_cli_major_release_v100/. Thread: https://www.reddit.com/r/cursor/comments/1t4u9mw/…