
Label-flip attacks can efficiently poison RLHF preference data, research finds

Researchers have developed two methods, BAL-A and BMP-A, for efficiently poisoning the preference datasets used in offline Reinforcement Learning from Human Feedback (RLHF) pipelines such as Direct Preference Optimization (DPO). The attacks exploit the fact that flipping a preference label produces a parameter-independent shift in the DPO gradient, which turns the poisoning problem into a structured binary sparse approximation problem: BAL-A solves it via lattice embedding, while BMP-A adapts binary matching pursuit. Experiments on synthetic data and the Stanford Human Preferences dataset demonstrate the attacks' effectiveness and show how dataset geometry influences their success.
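For intuition, here is a minimal sketch of why the shift is parameter-independent, assuming the log-linear DPO setting mentioned in the abstracts (the notation θ, β, Δφ_i is ours, and any reference-policy term is absorbed into the margin):

```latex
\[
\ell_i(\theta) = -\log \sigma(m_i),
\qquad
m_i = \beta\,\theta^{\top}\Delta\phi_i,
\qquad
\Delta\phi_i = \phi(x_i, y_w) - \phi(x_i, y_l),
\]
\[
\nabla_\theta \ell_i = -\beta\,\sigma(-m_i)\,\Delta\phi_i .
\]
\[
\text{Flipping the label negates } \Delta\phi_i:
\qquad
\nabla_\theta \ell_i^{\mathrm{flip}} = \beta\,\sigma(m_i)\,\Delta\phi_i,
\]
\[
\nabla_\theta \ell_i^{\mathrm{flip}} - \nabla_\theta \ell_i
  = \beta\bigl(\sigma(m_i) + \sigma(-m_i)\bigr)\,\Delta\phi_i
  = \beta\,\Delta\phi_i .
\]
```

Since σ(m) + σ(−m) = 1, each flip contributes a fixed, data-determined vector βΔφ_i to the total gradient regardless of θ, which is what lets an attacker plan flips before training starts.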
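Under that reading, choosing which labels to flip becomes a binary sparse approximation problem: pick at most k pairs whose fixed shifts sum to something close to a target change in the gradient. The sketch below shows the generic greedy matching-pursuit idea on synthetic data; it is our illustration, not the paper's BMP-A or BAL-A, and the function and variable names are hypothetical.

```python
import numpy as np

def greedy_flip_selection(delta_phi, target, k):
    """Pick at most k preference pairs to flip so that the sum of their fixed
    gradient shifts (rows of delta_phi, i.e. beta * Delta-phi_i) approximates
    `target`, the attacker's desired change in the total DPO gradient.

    Illustrative matching-pursuit-style sketch only, not the paper's method.
    """
    residual = np.asarray(target, dtype=float).copy()
    norms_sq = (delta_phi ** 2).sum(axis=1)    # ||shift_i||^2, precomputed
    flipped = []
    for _ in range(k):
        # Adding shift_i changes ||residual||^2 by ||shift_i||^2 - 2<shift_i, residual>,
        # so score_i > 0 iff flipping pair i strictly shrinks the residual.
        scores = 2.0 * (delta_phi @ residual) - norms_sq
        if flipped:
            scores[flipped] = -np.inf          # binary constraint: flip each pair at most once
        i = int(np.argmax(scores))
        if scores[i] <= 0:                     # no remaining flip helps; stop early
            break
        flipped.append(i)
        residual -= delta_phi[i]               # matching-pursuit residual update
    return flipped

# Toy usage: 200 synthetic pairs in 16 dimensions, budget of 10 flips.
rng = np.random.default_rng(0)
shifts = rng.normal(size=(200, 16))
goal = shifts[:25].sum(axis=0)                 # a target that 25 specific flips would achieve
print(greedy_flip_selection(shifts, goal, k=10))
```

BAL-A's lattice embedding would replace this greedy loop with a structured search; the abstracts do not give enough detail to reconstruct it here.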

Summary written by gemini-2.5-flash-lite from 2 sources.

IMPACT Highlights potential vulnerabilities in RLHF training data, necessitating robust data validation and security measures for deployed models.

RANK_REASON Academic paper detailing novel attack methods on RLHF pipelines.

Read on arXiv stat.ML →

COVERAGE [2]

  1. arXiv cs.LG TIER_1 · Manav Pandey ·

    LLMs Know They're Wrong and Agree Anyway: The Shared Sycophancy-Lying Circuit

    arXiv:2604.19117v2 Announce Type: replace Abstract: When a language model agrees with a user's false belief, is it failing to detect the error, or noticing and agreeing anyway? We show the latter. Across twelve open-weight models from five labs, spanning small to frontier scale, …

  2. arXiv stat.ML TIER_1 · Chenye Yang, Weiyu Xu, Lifeng Lai ·

    Efficient Preference Poisoning Attack on Offline RLHF

    arXiv:2605.02495v1 Announce Type: cross Abstract: Offline Reinforcement Learning from Human Feedback (RLHF) pipelines such as Direct Preference Optimization (DPO) train on a pre-collected preference dataset, which makes them vulnerable to preference poisoning attack. We study lab…
