
DVPO and EVPO advance LLM post-training with novel RL optimization techniques

Researchers have introduced DVPO, a new reinforcement learning framework for Large Language Model (LLM) post-training that targets settings with noisy or incomplete supervision signals. DVPO uses distributional value modeling together with asymmetric risk regularization to balance robustness and generalization, aiming to avoid the overly conservative policies that existing methods can produce. Experiments across dialogue, math reasoning, and scientific QA tasks show DVPO outperforming standard approaches such as PPO and GRPO under noisy conditions. A companion paper introduces EVPO (Explained Variance Policy Optimization), which adaptively decides whether a learned critic should serve as the baseline during policy optimization.

Summary written by gemini-2.5-flash-lite from 2 sources.
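The summary names DVPO's two ingredients, distributional value modeling and asymmetric risk regularization, but the excerpt does not give the exact losses. Below is a minimal, hypothetical sketch of how those ingredients often look in practice: a quantile-regression value head plus a lower-tail penalty. The module name, quantile parameterization, and penalty form are illustrative assumptions, not DVPO's actual objective.

```python
# Illustrative sketch only: quantile-based distributional value head with an
# asymmetric lower-tail regularizer. Not DVPO's published formulation.
import torch
import torch.nn as nn

class DistributionalValueHead(nn.Module):
    """Predicts N quantiles of the return instead of a single scalar value."""
    def __init__(self, hidden_dim: int, n_quantiles: int = 32):
        super().__init__()
        self.proj = nn.Linear(hidden_dim, n_quantiles)
        # Fixed quantile levels tau_i = (i + 0.5) / N.
        self.register_buffer("taus", (torch.arange(n_quantiles) + 0.5) / n_quantiles)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return self.proj(h)  # (batch, n_quantiles)

def quantile_loss(pred_q: torch.Tensor, target: torch.Tensor,
                  taus: torch.Tensor) -> torch.Tensor:
    """Pinball loss: fits each output unit to its quantile level of the return."""
    err = target.unsqueeze(1) - pred_q            # (B, N)
    return torch.max(taus * err, (taus - 1.0) * err).mean()

def asymmetric_risk_penalty(pred_q: torch.Tensor, alpha: float = 0.25,
                            coef: float = 0.1) -> torch.Tensor:
    """Assumed asymmetric regularizer: penalizes how far the lower tail
    (bottom alpha fraction of quantiles) falls below the distribution's mean,
    discouraging collapse into an overly pessimistic value estimate."""
    n = pred_q.shape[1]
    k = max(1, int(alpha * n))
    lower_tail = pred_q[:, :k].mean(dim=1)
    mean_value = pred_q.mean(dim=1)
    return coef * torch.relu(mean_value - lower_tail).mean()

# Usage with dummy features and noisy scalar returns:
head = DistributionalValueHead(hidden_dim=64)
h = torch.randn(8, 64)
returns = torch.randn(8)
q = head(h)
loss = quantile_loss(q, returns, head.taus) + asymmetric_risk_penalty(q)
loss.backward()
```

Modeling a return distribution rather than a point estimate is one common way to keep value learning stable under noisy rewards, which matches the robustness-versus-generalization framing in the summary above.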

IMPACT Introduces new methods for more stable and generalizable LLM post-training, especially in challenging real-world data conditions.

RANK_REASON The cluster contains two academic papers detailing novel reinforcement learning techniques for LLM post-training.

Read on Hugging Face Daily Papers →

COVERAGE [2]

  1. arXiv cs.LG TIER_1 · Dingwei Zhu, Zhiheng Xi, Shihan Dou, Yuhui Wang, Sixian Li, Junjie Ye, Honglin Guo, Shichun Liu, Chenhao Huang, Yajie Yang, Junlin Shang, Senjie Jin, Ming Zhang, Jiazheng Zhang, Caishuang Huang, Yunke Zhang, Yuran Wang, Tao Gui

    DVPO: Distributional Value Modeling-based Policy Optimization for LLM Post-Training

    arXiv:2512.03847v2 · Abstract: Reinforcement learning (RL) has shown strong performance in LLM post-training, but real-world deployment often involves noisy or incomplete supervision. In such settings, complex and unreliable supervision signals can destabiliz…

  2. Hugging Face Daily Papers TIER_1

    EVPO: Explained Variance Policy Optimization for Adaptive Critic Utilization in LLM Post-Training

    Reinforcement learning (RL) for LLM post-training faces a fundamental design choice: whether to use a learned critic as a baseline for policy optimization. Classical theory favors critic-based methods such as PPO for variance reduction, yet critic-free alternatives like GRPO have…
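The EVPO abstract frames the design choice between a critic-based baseline (PPO-style) and a critic-free group baseline (GRPO-style) but is truncated before describing the method. The sketch below illustrates one plausible reading of "adaptive critic utilization": gate on the critic's explained variance. The threshold rule and helper names are assumptions for illustration, not the paper's actual objective.

```python
# Illustrative sketch only: switch between critic and group-mean baselines
# based on how well the critic explains observed returns.
import torch

def explained_variance(values: torch.Tensor, returns: torch.Tensor) -> float:
    """EV = 1 - Var(returns - values) / Var(returns); near 1 means the critic
    predicts returns well, near 0 (or negative) means it is uninformative."""
    var_ret = returns.var()
    if var_ret < 1e-8:
        return 0.0
    return (1.0 - (returns - values).var() / var_ret).item()

def adaptive_advantages(values: torch.Tensor, returns: torch.Tensor,
                        ev_threshold: float = 0.5):
    """Use the critic as a baseline when it explains the returns well;
    otherwise fall back to a GRPO-style group-mean baseline."""
    ev = explained_variance(values, returns)
    if ev >= ev_threshold:
        adv = returns - values            # critic baseline (PPO-like)
    else:
        adv = returns - returns.mean()    # group baseline (GRPO-like)
    return (adv - adv.mean()) / (adv.std() + 1e-8), ev

# Usage with dummy per-sample returns and critic values:
returns = torch.randn(16) + 1.0
values = returns + 0.1 * torch.randn(16)  # a well-fit critic
adv, ev = adaptive_advantages(values, returns)
print(f"explained variance = {ev:.2f}")
```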