Brief · PulseAugur

TOOL · arXiv cs.AI · 7h

A Regret Minimization Framework on Preference Learning in Large Language Models

Researchers have introduced Regret-based Preference Optimization (RePO), a framework for training large language models from human feedback. RePO recasts preference learning as regret minimization rather than reward maximization, modeling human preferences in terms of anticipated outcomes and counterfactual comparisons. In experiments on mathematical reasoning and human preference datasets, the authors report improved performance and closer alignment with human preferences.

IMPACT Introduces a novel training methodology that could lead to more human-aligned and performant LLMs on complex reasoning tasks.

Large Language Models
Reinforcement learning from human feedback (RLHF)
Reinforcement learning with verifiable rewards (RLVR)
Regret-based Preference Optimization (RePO)