Researchers from Zhejiang University, Xiaohongshu, and Peking University have developed SelectiveRM, a framework for training reward models for large language models. It targets noisy or inaccurate preference data, which is common in both human and AI-generated feedback. Instead of forcing the model to fit every observed preference, SelectiveRM uses partial optimal transport to align distributions selectively, identifying and excluding conflicting or erroneous data points. This leads to more reliable reward functions and improved safety in downstream reinforcement learning from human feedback (RLHF).
AI IMPACT This research offers a more principled approach to training reward models, potentially leading to safer and more reliable AI systems by filtering out erroneous feedback.
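To make the idea of selective alignment concrete, here is a minimal toy sketch of partial matching, not SelectiveRM's actual algorithm. It assumes uniform weights, in which case transporting a fraction k/n of the mass can be done by a one-to-one matching of k pairs, and it brute-forces that matching on a handful of points. The function name `partial_match`, the squared-difference cost, and the values in `source` and `target` are all hypothetical and chosen only for illustration.

```python
from itertools import combinations, permutations


def partial_match(source, target, k):
    """Brute-force the cheapest one-to-one matching of k source points to k target points.

    Toy stand-in for partial optimal transport with uniform weights;
    only practical for very small inputs.
    """
    best_cost, best_pairs = float("inf"), []
    # Choose which k source points take part in the matching...
    for rows in combinations(range(len(source)), k):
        # ...and which k distinct target points they are assigned to.
        for cols in permutations(range(len(target)), k):
            cost = sum((source[i] - target[j]) ** 2 for i, j in zip(rows, cols))
            if cost < best_cost:
                best_cost, best_pairs = cost, list(zip(rows, cols))
    return best_cost, best_pairs


# Hypothetical 1-D values: three source points sit near the targets, one is an outlier.
source = [0.1, 0.9, 0.2, 5.0]
target = [0.0, 1.0, 0.25, 0.15]

# Match only 3 of the 4 source points; the leftover point is treated as "excluded".
cost, pairs = partial_match(source, target, k=3)
matched = {i for i, _ in pairs}
excluded = [i for i in range(len(source)) if i not in matched]
print(pairs, excluded)
```

With these toy values, the point left unmatched is the outlier at index 3. That mirrors the summary's description of excluding data that conflicts with the rest instead of fitting it. A real implementation would use an optimal transport solver instead of enumeration.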
RANK_REASON The cluster describes a new research paper proposing a novel framework for training reward models in LLMs.
- GRPO
- HarmBench
- Optimal Transport
- Peking University
- Qwen2.5
- RLHF
- SelectiveRM
- Xiaohongshu
- Zhejiang University
AI-generated summary · Google Gemini · from 1 source.