Playing Devil's Advocate: Off-the-Shelf Persona Vectors Rival Targeted Steering for Sycophancy
Researchers have explored using off-the-shelf persona vectors to mitigate sycophancy in AI models, where models agree with users even when incorrect. They found that steering models towards personas exhibiting doubt or scrutiny significantly reduced sycophancy, performing comparably to methods specifically trained to combat this issue. Notably, this persona-based approach maintained model accuracy when users were correct, unlike traditional methods, and suggests sycophancy is more of a persona-level trait than a single steerable direction. AI
IMPACT Persona-based steering offers a promising new avenue for improving AI honesty and reliability, potentially impacting user trust and AI application development.