PulseAugur
research · [5 sources]

New research explores advanced reward modeling for LLMs and diffusion models

Several new research papers explore advances in reward modeling for AI alignment, particularly for large language models and diffusion models. One paper introduces SelectiveRM, a framework that uses optimal transport to handle noisy human preferences in reward modeling. Another, CAMEL, proposes a confidence-gated reflection method that selectively invokes reflection for low-confidence instances, achieving state-of-the-art accuracy with fewer parameters. A new benchmark, RMGAP, evaluates how well reward models generalize across diverse user preferences, revealing significant limitations in current models. Finally, ArenaPO leverages Arena scores for efficient, fine-grained preference optimization in diffusion models without explicit reward modeling.
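
As a concrete illustration of the confidence-gating pattern attributed to CAMEL above, here is a minimal Python sketch. Everything in it (score_with_confidence, reflect_and_rescore, the threshold tau) is hypothetical scaffolding rather than the paper's implementation: a cheap scalar scorer answers by default, and the expensive reflection pass runs only for low-confidence cases.

```python
# Minimal sketch of confidence-gated scoring. All function names and the
# confidence heuristic are hypothetical -- this shows the gating pattern
# described in the summary, not CAMEL's actual method.

from dataclasses import dataclass


@dataclass
class Judgment:
    score: float       # scalar preference score for a candidate response
    confidence: float  # estimated confidence in that score, in [0, 1]


def score_with_confidence(prompt: str, response: str) -> Judgment:
    # Placeholder for a fast discriminative reward model. A real system
    # might derive confidence from the score margin or predictive entropy.
    return Judgment(score=0.62, confidence=0.55)


def reflect_and_rescore(prompt: str, response: str, first: Judgment) -> float:
    # Placeholder for the expensive path: a generative "reflection" pass
    # that reasons about the response before emitting a revised score.
    return first.score  # stub


def gated_score(prompt: str, response: str, tau: float = 0.7) -> float:
    """Score cheaply by default; reflect only when confidence < tau."""
    first = score_with_confidence(prompt, response)
    if first.confidence >= tau:
        return first.score  # confident: keep the fast answer
    return reflect_and_rescore(prompt, response, first)  # uncertain: reflect


if __name__ == "__main__":
    print(gated_score("Explain RLHF.", "RLHF fine-tunes a model by ..."))
```

The appeal of this structure, as the summary describes it, is cost: most inputs take only the cheap path, so reflection capacity is spent where the scorer is least sure.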

Summary written by gemini-2.5-flash-lite from 5 sources.

IMPACT New techniques and benchmarks aim to improve AI alignment and efficiency, potentially leading to more capable and reliable models.

RANK_REASON Multiple new arXiv papers introduce novel methods and benchmarks for improving reward modeling in AI.

Read on arXiv cs.CL →
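
For context on what "reward modeling" in these papers concretely optimizes: pairwise reward models are typically trained with the Bradley-Terry objective, which pushes the chosen response's score above the rejected one's. The sketch below is standard background (assuming PyTorch is available), not code from any of the papers; noisy-preference methods like the optimal-transport paper in the coverage list change how pairs are weighted or filtered under this kind of objective.

```python
# Standard Bradley-Terry loss for pairwise preference data:
# loss = -log(sigmoid(r_chosen - r_rejected)), averaged over a batch.
import torch
import torch.nn.functional as F


def bradley_terry_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    # Each tensor holds scalar reward-model outputs for a batch of pairs.
    return -F.logsigmoid(r_chosen - r_rejected).mean()


# Toy usage: pairs the model already ranks correctly contribute a small loss.
r_w = torch.tensor([1.2, 0.8, 0.5])
r_l = torch.tensor([0.3, 0.9, -0.1])
print(bradley_terry_loss(r_w, r_l))  # ~0.51; the misranked middle pair dominates
```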

COVERAGE [5]

  1. arXiv cs.LG TIER_1 · Licheng Pan, Haochen Yang, Haoxuan Li, Yunsheng Lu, Yongqi Tong, Yinuo Wang, Shijian Wang, Zhixuan Chu, Lei Shen, Yuan Lu, Hao Wang

    Optimal Transport for LLM Reward Modeling from Noisy Preference

    arXiv:2605.06036v1 Announce Type: new Abstract: Reward models are fundamental to Reinforcement Learning from Human Feedback (RLHF), yet real-world datasets are inevitably corrupted by noisy preference. Conventional training objectives tend to overfit these errors, while existing …

  2. arXiv cs.LG TIER_1 · Jeongjae Lee, Jinho Chang, Jeongsol Kim, Jong Chul Ye

    Reward Score Matching: Unifying Reward-based Fine-tuning for Flow and Diffusion Models

    arXiv:2604.17415v2 Announce Type: replace Abstract: Reward-based fine-tuning steers a pretrained diffusion or flow-based generative model toward higher-reward samples while remaining close to the pretrained model. Although existing methods are derived from different perspectives,…

  3. arXiv cs.CL TIER_1 · Zirui Zhu, Hailun Xu, Yang Luo, Yong Liu, Kanchan Sarkar, Kun Xu, Yang You

    CAMEL: Confidence-Gated Reflection for Reward Modeling

    arXiv:2602.20670v2 Announce Type: replace Abstract: Reward models play a fundamental role in aligning large language models with human preferences. Existing methods predominantly follow two paradigms: scalar discriminative preference models, which are efficient but lack interpret…

  4. arXiv cs.CL TIER_1 · Yangyang Zhou, Yi-Chen Li

    RMGAP: Benchmarking the Generalization of Reward Models across Diverse Preferences

    arXiv:2605.01831v1 Announce Type: new Abstract: Reinforcement Learning from Human Feedback has become the standard paradigm for language model alignment, where reward models directly determine alignment effectiveness. In this work, we focus on how to evaluate the generalizability…

  5. arXiv cs.CV TIER_1 · Zhikai Li, Yue Zhao, Edward Zhongwei Zhang, Xuewen Liu, Jing Zhang, Qingyi Gu, Zhen Dong

    Arena as Offline Reward: Efficient Fine-Grained Preference Optimization for Diffusion Models

    arXiv:2605.06070v1 Announce Type: new Abstract: Reinforcement learning from human feedback (RLHF) effectively promotes preference alignment of text-to-image (T2I) diffusion models. To improve computational efficiency, direct preference optimization (DPO), which avoids explicit re…
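
Since this entry leans on DPO, here is the standard DPO objective for reference (the formulation from Rafailov et al., not ArenaPO's variant): instead of training an explicit reward model, it scores preference pairs by the policy's log-probability ratios against a frozen reference model. A minimal PyTorch sketch with made-up numbers:

```python
# Standard DPO loss: -log sigmoid(beta * [(log pi(y_w) - log pi_ref(y_w))
#                                        - (log pi(y_l) - log pi_ref(y_l))])
import torch
import torch.nn.functional as F


def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta: float = 0.1):
    # logp_*: policy log-likelihoods of chosen/rejected responses;
    # ref_logp_*: the same sequences under the frozen reference model.
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return -F.logsigmoid(beta * margin).mean()


# One toy preference pair with invented log-probabilities.
loss = dpo_loss(torch.tensor([-12.0]), torch.tensor([-15.0]),
                torch.tensor([-13.0]), torch.tensor([-14.0]))
print(loss)  # -logsigmoid(0.1 * 2.0) = ~0.60
```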
