Brief · PulseAugur

RESEARCH · arXiv cs.CL · 3d · [2 sources]

When Roleplaying, Do Models Believe What They Say?

A new research paper explores whether large language models internalize beliefs when role-playing different personas. The study finds that while models can adopt personas and alter their statements, this role-playing has a limited impact on their underlying internal representations of truth. This contrasts with models trained on harmful advice, which show a greater shift in their internal representations and a tendency to defend false claims.

IMPACT Separates changes in what a model says from shifts in its internal representations of truth, a distinction that matters for AI safety and alignment.

Qwen 2.5
Llama 3.3
Benjamin Sturgeon
Qwen 3 8B
Aristotle
Llama 3.3 70B
Qwen 2.5 14B