PulseAugur

New method stabilizes LLM reasoning training by recovering near-boundary signals

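Researchers have identified a key bottleneck in Reinforcement Learning with Verifiable Rewards (RLVR) that hinders the optimization of LLM reasoning. The study attributes it to the rigid clipping decision in standard hard clipping, which discards valuable signal near the clipping threshold. To address this, they propose Near-boundary Stochastic Recovery (NSR), a simple modification that randomly retains tokens lying slightly outside the boundary, improving training stability and performance across model sizes and architectures.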
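The idea can be illustrated with a small sketch. This is not the paper's implementation: the summary does not give the exact recovery rule, so the `margin` width, the `keep_prob` value, and both function names below are assumptions chosen for illustration. The only part taken as given is the standard PPO/GRPO-style behavior, where a token stops contributing gradient once its probability ratio leaves the range [1 - eps, 1 + eps] in the direction favored by its advantage.

```python
import random


def hard_clip_mask(ratio, advantage, eps=0.2):
    """True if the token still contributes gradient under standard hard clipping."""
    if advantage > 0 and ratio > 1.0 + eps:
        return False  # already pushed up past the upper boundary
    if advantage < 0 and ratio < 1.0 - eps:
        return False  # already pushed down past the lower boundary
    return True


def near_boundary_mask(ratio, advantage, rng, eps=0.2, margin=0.05, keep_prob=0.5):
    """Hypothetical stochastic recovery: keep some tokens just past the boundary.

    `margin` and `keep_prob` are illustrative assumptions, not values from the paper.
    """
    if hard_clip_mask(ratio, advantage, eps):
        return True  # inside the clip range: unchanged
    # How far the ratio overshoots the boundary that clipped it.
    if advantage > 0:
        overshoot = ratio - (1.0 + eps)
    else:
        overshoot = (1.0 - eps) - ratio
    if overshoot <= margin:
        return rng.random() < keep_prob  # near the boundary: keep at random
    return False  # far outside the boundary: still dropped


if __name__ == "__main__":
    rng = random.Random(0)
    # (ratio, advantage) pairs: in range, slightly over, far over, slightly under.
    tokens = [(1.10, 1.0), (1.23, 1.0), (1.50, 1.0), (0.78, -1.0)]
    for ratio, adv in tokens:
        print(ratio, adv,
              "hard:", hard_clip_mask(ratio, adv),
              "recovered:", near_boundary_mask(ratio, adv, rng))
```

In this sketch, tokens inside the clip range and tokens far outside it behave exactly as under hard clipping; only the narrow band just past the threshold is treated differently, which matches the summary's description of a small modification.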

Impact: Improves training stability and performance on LLM reasoning tasks, which could lead to more capable models.

Ranking rationale: This cluster contains one academic paper describing a new method for improving LLM training stability.

Read on arXiv cs.LG →

AI-generated summary · Google Gemini · from 2 sources.

Sources [2]

  1. arXiv cs.LG TIER_1 English(EN) · Shuo Yang, Jinda Lu, Chiyu Ma, Kexin Huang, Haoming Meng, Qihui Zhang, Yuyang Liu, Bolin Ding, Guoyin Wang, Li Yuan, Jingren Zhou ·

    Clipping Bottleneck: Stabilizing RLVR via Stochastic Recovery of Near-Boundary Signals

    arXiv:2605.22703v1 Announce Type: new Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a central paradigm for scaling LLM reasoning, yet its optimization often suffers from training instability and suboptimal convergence. Through a systematic dissect…

  2. arXiv cs.LG TIER_1 English(EN) · Jingren Zhou ·

    Clipping Bottleneck: Stabilizing RLVR via Stochastic Recovery of Near-Boundary Signals

    Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a central paradigm for scaling LLM reasoning, yet its optimization often suffers from training instability and suboptimal convergence. Through a systematic dissection of clipping-based GRPO-style objectives, we …