Brief · PulseAugur

RESEARCH · arXiv cs.CL English(EN) · 1mo

Reward Modeling from Natural Language Human Feedback

Researchers have introduced Reward Modeling from Natural Language Human Feedback (RM-NLHF), a method for improving the training of Generative Reward Models (GRMs). When GRMs are trained on pairwise preference data, they can learn to guess the correct outcome without genuine understanding, which introduces noise into the training signal. RM-NLHF addresses this by using natural language critiques written by humans to provide more accurate process reward signals for training GRMs. The approach also includes a Meta Reward Model (MetaRM) that generalizes from a limited set of human critiques to larger datasets.

IMPACT Improves training signal accuracy for reward models, potentially leading to more robust and reliable AI systems.

RM-NLHF
GRM
Reinforcement Learning with Verifiable Reward
Zongqi Wang