PulseAugur
English(EN) EVPO: Explained Variance Policy Optimization for Adaptive Critic Utilization in LLM Post-Training

DVPO and EVPO advance LLM post-training with novel RL optimization techniques

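Researchers have introduced DVPO, a new reinforcement learning framework intended to improve the post-training of large language models (LLMs), particularly when supervision signals are noisy or incomplete. DVPO uses distributional value modeling and asymmetric risk regularization to balance robustness and generalization, aiming to avoid the overly conservative policies that existing methods can produce. Experiments on dialogue, mathematical reasoning, and scientific question-answering tasks show that DVPO outperforms standard methods such as PPO and GRPO under noisy conditions.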

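Impact: Introduces new, more stable and more generalizable methods for LLM post-training, especially under challenging real-world data conditions.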

Ranking rationale: This cluster contains two academic papers detailing novel reinforcement learning techniques for LLM post-training.

Read on Hugging Face Daily Papers →

AI-generated summary · Google Gemini · from 2 sources. How we write summaries →


Sources [2]

  1. arXiv cs.LG TIER_1 English(EN) · Dingwei Zhu, Zhiheng Xi, Shihan Dou, Yuhui Wang, Sixian Li, Junjie Ye, Honglin Guo, Shichun Liu, Chenhao Huang, Yajie Yang, Junlin Shang, Senjie Jin, Ming Zhang, Jiazheng Zhang, Caishuang Huang, Yunke Zhang, Yuran Wang, Tao Gui ·

    DVPO: Distributional Value Modeling-based Policy Optimization for LLM Post-Training

    arXiv:2512.03847v2 Announce Type: replace Abstract: Reinforcement learning (RL) has shown strong performance in LLM post-training, but real-world deployment often involves noisy or incomplete supervision. In such settings, complex and unreliable supervision signals can destabiliz…

  2. Hugging Face Daily Papers TIER_1 English(EN) ·

    EVPO: Explained Variance Policy Optimization for Adaptive Critic Utilization in LLM Post-Training

    Reinforcement learning (RL) for LLM post-training faces a fundamental design choice: whether to use a learned critic as a baseline for policy optimization. Classical theory favors critic-based methods such as PPO for variance reduction, yet critic-free alternatives like GRPO have…
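The EVPO abstract above is truncated, so the paper's actual rule for using the critic is not shown here. As a rough illustration of the design choice it describes, the sketch below computes the standard explained-variance diagnostic for a critic, 1 - Var(returns - values) / Var(returns), and then switches between a critic baseline and a batch-mean baseline in the spirit of critic-free methods. The function names, the `ev_threshold` parameter, and the switching rule itself are assumptions made for illustration, not the authors' algorithm.

```python
from statistics import fmean, pvariance


def explained_variance(returns, values):
    """Standard critic diagnostic: 1 - Var(returns - values) / Var(returns)."""
    var_returns = pvariance(returns)
    if var_returns == 0.0:
        # Undefined when all returns are equal; report 0.0 by convention here.
        return 0.0
    residuals = [r - v for r, v in zip(returns, values)]
    return 1.0 - pvariance(residuals) / var_returns


def advantages(returns, values, ev_threshold=0.0):
    """Hypothetical switch (an assumption, not the paper's rule).

    Use the critic as the baseline only when it explains more of the
    return variance than `ev_threshold`; otherwise fall back to the
    batch-mean return as the baseline.
    """
    if explained_variance(returns, values) > ev_threshold:
        return [r - v for r, v in zip(returns, values)]
    mean_return = fmean(returns)
    return [r - mean_return for r in returns]
```

An explained variance near 1 means the critic tracks returns closely, while a value at or below 0 means subtracting the critic removes no variance (or adds some) compared with a constant baseline. That is the intuition behind gating on it in this sketch.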