Researchers have developed two novel methods, BAL-A and BMP-A, to efficiently poison the preference datasets used in offline Reinforcement Learning from Human Feedback (RLHF) pipelines such as Direct Preference Optimization (DPO). Both attacks exploit the fact that flipping a preference label shifts the DPO gradient by a parameter-independent vector, which recasts poisoning as a structured binary sparse approximation problem: BAL-A solves it via lattice embedding, while BMP-A adapts binary matching pursuit. Experiments on synthetic data and the Stanford Human Preferences dataset demonstrate the attacks' effectiveness and show how dataset geometry influences their success.
Summary written by gemini-2.5-flash-lite from 3 sources.
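The mechanism behind both attacks can be made concrete. If flipping the label of example i shifts the aggregate DPO gradient by a fixed vector d_i that does not depend on the model parameters, then steering training toward an attacker-chosen gradient direction g under a flip budget reduces to picking a sparse binary subset S with the sum of d_i over S approximating g. The sketch below illustrates a greedy binary matching-pursuit selector in that spirit; the function name, the precomputed shift matrix D, the scoring rule, and the stopping condition are illustrative assumptions, not the paper's actual BMP-A (and BAL-A's lattice embedding is not shown).

```python
import numpy as np

def greedy_binary_pursuit(D, g, budget):
    """Illustrative greedy selector for a binary sparse approximation.

    D      : (n, d) array; row i is the hypothesized gradient shift d_i
             induced by flipping preference label i.
    g      : (d,) attacker's target gradient shift.
    budget : maximum number of labels to flip (sparsity constraint).

    Unlike ordinary matching pursuit, coefficients are binary: a label
    is either flipped (1) or left alone (0), never partially weighted.
    """
    residual = np.asarray(g, dtype=float).copy()
    flipped = []
    half_norms = 0.5 * np.sum(D * D, axis=1)  # 0.5 * ||d_i||^2, precomputed
    for _ in range(budget):
        # Adding d_i changes ||residual||^2 by ||d_i||^2 - 2<residual, d_i>,
        # so the best flip maximizes <residual, d_i> - 0.5 * ||d_i||^2.
        scores = D @ residual - half_norms
        scores[flipped] = -np.inf             # each label flips at most once
        i = int(np.argmax(scores))
        if scores[i] <= 0:                    # no remaining flip shrinks the residual
            break
        flipped.append(i)
        residual -= D[i]
    return flipped

# Toy usage: a target that is exactly the sum of three shift vectors is
# typically (though not provably) recovered by the greedy selector.
rng = np.random.default_rng(0)
D = rng.normal(size=(200, 16))
g = D[[3, 41, 99]].sum(axis=0)
print(sorted(greedy_binary_pursuit(D, g, budget=5)))
```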
IMPACT Highlights potential vulnerabilities in RLHF preference data, necessitating robust data validation and security measures for deployed models.
RANK_REASON Academic paper detailing novel attack methods on RLHF pipelines.