When Roleplaying, Do Models Believe What They Say?
A new research paper explores whether large language models internalize beliefs when role-playing different personas. The study found that while models can adopt personas and alter their statements, this role-playing has a limited impact on their underlying internal representations of truth. This contrasts with models trained on harmful advice, which show a greater shift in their internal representations and a tendency to defend false claims. AI
IMPACT Investigates the distinction between model output manipulation and internal belief shifts, crucial for understanding AI safety and alignment.