Chinese (ZH) Training Reward Models from an Optimal Transport Perspective: Enabling RLHF to Learn to 'Ignore Incorrect Preferences' | ICML 2026

SelectiveRM framework trains reward models to ignore noisy human feedback

By PulseAugur Editorial · [1 source] · 2026-06-15 07:39

Researchers from Zhejiang University, Xiaohongshu, and Peking University have developed SelectiveRM, a framework for training the reward models used with large language models. It targets noisy or inaccurate preference data, which is common in both human and AI-generated feedback. Instead of forcing the reward model to fit every observed preference, SelectiveRM uses partial optimal transport to align distributions selectively, identifying and excluding conflicting or erroneous data points. This leads to more reliable reward functions and improved safety in downstream reinforcement learning from human feedback (RLHF).
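To make the idea of partial optimal transport concrete, here is a minimal sketch on toy one-dimensional data. It is a generic illustration, not the SelectiveRM algorithm: the function name `partial_ot`, the squared-distance cost, the uniform masses, and the 0.75 mass budget are all assumptions made for this example. The point it shows is that when only part of the mass has to be transported, the cheapest matches are kept and the conflicting point is left unmatched.

```python
import numpy as np
from scipy.optimize import linprog


def partial_ot(a, b, cost, mass):
    """Solve discrete partial optimal transport exactly as a linear program.

    Minimises sum(cost * plan) subject to plan >= 0, row sums <= a,
    column sums <= b, and total transported mass == mass.
    (Hypothetical helper for illustration only.)
    """
    n, m = cost.shape
    # The plan is flattened row-major, so entry (i, j) sits at index i * m + j.
    # Row caps: source point i ships at most a[i].
    row_caps = np.kron(np.eye(n), np.ones((1, m)))
    # Column caps: target point j receives at most b[j].
    col_caps = np.kron(np.ones((1, n)), np.eye(m))
    result = linprog(
        c=cost.ravel(),
        A_ub=np.vstack([row_caps, col_caps]),
        b_ub=np.concatenate([a, b]),
        A_eq=np.ones((1, n * m)),  # total transported mass is fixed
        b_eq=[mass],
        bounds=(0, None),
        method="highs",
    )
    if not result.success:
        raise RuntimeError(result.message)
    return result.x.reshape(n, m)


# Toy data (assumed): the last source point is a deliberate outlier.
source = np.array([0.0, 1.0, 2.0, 10.0])
target = np.array([0.1, 1.1, 2.1, 3.0])
a = np.full(4, 0.25)  # uniform mass on source points
b = np.full(4, 0.25)  # uniform mass on target points
cost = (source[:, None] - target[None, :]) ** 2

# Transport only 75% of the mass; the solver chooses what to leave out.
plan = partial_ot(a, b, cost, mass=0.75)
kept_mass = plan.sum(axis=1)  # how much of each source point was matched
print(np.round(kept_mass, 3))
```

In this toy case the outlier at 10.0 receives no transported mass, because matching it to any target is far more expensive than leaving it out. In a reward-model setting, one could imagine the rows standing for preference samples and the cost for disagreement with the observed labels, but that mapping is an assumption here and is not taken from the paper.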

IMPACT This research offers a more principled approach to training reward models, potentially leading to safer and more reliable AI systems by filtering out erroneous feedback.

RANK_REASON The cluster describes a new research paper proposing a novel framework for training reward models in LLMs.

AI-generated summary · Google Gemini · from 1 source.

COVERAGE [1]

雷峰网 (Leiphone) · TIER_1 · Chinese (ZH) · 2026-06-15 07:39

Training Reward Models from an Optimal Transport Perspective: Enabling RLHF to Learn to 'Ignore Incorrect Preferences' | ICML 2026
