Brief · PulseAugur

TOOL · arXiv cs.LG English (EN) · 8h

GREAT: Generalizable Backdoor Attacks in RLHF via Emotion-Aware Trigger Synthesis

Researchers have introduced GREAT, a framework for crafting generalizable backdoor attacks against models trained with Reinforcement Learning from Human Feedback (RLHF). The method synthesizes emotion-aware triggers and targets harmful response generation for users whose prompts express anger. The framework combines a trigger identification pipeline that operates in the model's latent embedding space with a dataset of over 5,000 angry triggers curated using GPT-4.

IMPACT Highlights potential vulnerabilities in RLHF systems, necessitating improved safety and defense mechanisms.

GPT-4
RLHF
Subrat Kishore Dutta
GREAT