Researchers have developed a new framework called FiMi-RM to address length bias in reward models used for Reinforcement Learning from Human Feedback (RLHF). This bias causes reward models to favor longer responses, even if they are not of higher quality. FiMi-RM works in three stages: training a standard reward model, using a lightweight model to capture non-linear length-reward relationships, and then integrating this learned bias into the reward model to decouple length from reward. Experiments show that FiMi-RM leads to a more balanced length-reward distribution and improves alignment algorithms like Direct Preference Optimization (DPO) by reducing verbosity without sacrificing performance.
AI IMPACT Addresses a key limitation in RLHF, potentially leading to more aligned and concise LLM responses.
RANK_REASON Academic paper detailing a new method for mitigating bias in RLHF reward models.
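The three stages described in the summary can be illustrated with a toy example. This is only a minimal sketch under loud assumptions, not the FiMi-RM implementation: the rewards are synthetic, the "lightweight model" is a hypothetical piecewise-constant fit (`fit_length_bias`), and "integrating the learned bias" is approximated by simply subtracting the fitted length effect from the raw reward. The actual method may differ in every one of these choices.

```python
import math
import random
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def fit_length_bias(lengths, rewards, n_bins=10):
    """Hypothetical lightweight model: mean reward per length bin.

    Returns a function mapping a length to the estimated length-driven
    reward offset (bin mean minus overall mean). A piecewise-constant
    fit can follow a non-linear length-reward curve.
    """
    lo, hi = min(lengths), max(lengths)
    width = (hi - lo) / n_bins or 1.0  # guard against identical lengths

    def bin_of(length):
        return min(max(int((length - lo) / width), 0), n_bins - 1)

    buckets = [[] for _ in range(n_bins)]
    for length, reward in zip(lengths, rewards):
        buckets[bin_of(length)].append(reward)

    overall = mean(rewards)
    bin_means = [mean(b) if b else overall for b in buckets]
    return lambda length: bin_means[bin_of(length)] - overall


random.seed(0)

# Stage 1 stand-in (assumption): a reward model whose score mixes true
# quality with a saturating, non-linear bonus for longer responses.
lengths = [random.randint(20, 800) for _ in range(2000)]
quality = [random.gauss(0.0, 1.0) for _ in lengths]
raw_rewards = [q + 1.5 * math.log1p(n / 100) for q, n in zip(quality, lengths)]

# Stage 2 stand-in: fit the length-reward relationship.
length_bias = fit_length_bias(lengths, raw_rewards)

# Stage 3 stand-in: remove the fitted length effect from each reward.
debiased = [r - length_bias(n) for r, n in zip(raw_rewards, lengths)]

raw_corr = pearson(lengths, raw_rewards)
debiased_corr = pearson(lengths, debiased)
print(f"length-reward correlation, raw:      {raw_corr:.3f}")
print(f"length-reward correlation, debiased: {debiased_corr:.3f}")
```

In this toy setup the correlation between length and reward should drop sharply after the correction, which is the kind of "more balanced length-reward distribution" the summary refers to. A post-hoc subtraction like this is the simplest possible choice; it says nothing about how the bias is folded back into reward-model training in the paper.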
- arXiv
- Direct Preference Optimization
- FiMi-RM
- Kangwen Zhao
- reinforcement learning from human feedback
AI-generated summary · Google Gemini · from 1 source.