Playing Devil's Advocate: Off-the-Shelf Persona Vectors Rival Targeted Steering for Sycophancy
Researchers have found that off-the-shelf persona vectors, originally built for general role-playing, can reduce sycophancy in language models. Steering a model toward doubt or scrutiny with these vectors significantly reduces its agreement with incorrect user statements, rivaling specialized sycophancy-mitigation techniques. Notably, the approach preserves model accuracy when users are correct, and the results suggest that sycophancy is better understood as a persona-level trait than as a single steerable direction.
AI IMPACT: Offers a novel, off-the-shelf method to reduce AI sycophancy, potentially improving user trust and AI reliability.
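
The summary does not say how the persona vectors are applied. A common form of activation steering adds a scaled direction to a model's hidden state, and the toy sketch below illustrates only that general idea. It is not the researchers' confirmed method: the names `steer`, `projection`, and `skeptic_vector`, the 4-dimensional hidden state, and the strength `alpha` are all hypothetical.

```python
import math


def steer(hidden, persona_vector, alpha):
    """Shift one hidden state along a unit-normalized persona direction.

    Assumption: steering is modeled as hidden + alpha * unit(persona_vector).
    """
    norm = math.sqrt(sum(x * x for x in persona_vector))
    if norm == 0:
        raise ValueError("persona_vector must be non-zero")
    unit = [x / norm for x in persona_vector]
    return [h + alpha * u for h, u in zip(hidden, unit)]


def projection(hidden, persona_vector):
    """Scalar component of the hidden state along the persona direction."""
    norm = math.sqrt(sum(x * x for x in persona_vector))
    return sum(h * x for h, x in zip(hidden, persona_vector)) / norm


# Toy values; a real model would have thousands of dimensions per token.
hidden = [0.5, -1.0, 2.0, 0.0]
skeptic_vector = [3.0, 0.0, 4.0, 0.0]  # stand-in for a "doubt" persona vector

steered = steer(hidden, skeptic_vector, alpha=2.0)
before = projection(hidden, skeptic_vector)
after = projection(steered, skeptic_vector)
print(f"projection before: {before:.2f}, after: {after:.2f}")
```

In this sketch the projection onto the persona direction rises by exactly `alpha`, while components orthogonal to that direction are unchanged. In practice the layer to intervene on and the steering strength would need to be tuned per model.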