Modeling Pathology-Like Behavioral Patterns in Language Models Through Behavioral Fine-Tuning
Researchers have developed a new framework to fine-tune language models, inducing specific behavioral patterns like depression and paranoia. This process modifies the models' policies, leading to stable, context-general shifts in their generative distributions, such as assigning higher probabilities to negative and threat-related interpretations. The study demonstrates that these induced behavioral profiles are partially specific, with different training objectives leading to distinct response tendencies, suggesting that structured behavioral training can shape emergent representational structures in LLMs.
AI IMPACT: This research highlights the potential for controlled behavioral manipulation in LLMs, raising questions about their use as cognitive models and the safety implications of inducing specific behavioral biases.
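
To make the idea of a shift in a model's generative distribution concrete, the sketch below compares how much probability a base model and a fine-tuned model assign to a threat-related reading of an ambiguous prompt. It is an illustration only, not the study's method: the candidate labels, the logit values, and the helper `negative_mass` are all hypothetical and chosen for readability.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def negative_mass(logits, labels, negative_labels):
    """Total probability assigned to labels marked as negative."""
    probs = softmax(logits)
    return sum(p for p, label in zip(probs, labels) if label in negative_labels)


# Hypothetical candidate interpretations of one ambiguous prompt.
labels = ["benign", "neutral", "threat"]
negative = {"threat"}

# Made-up logits for illustration; real values would come from model outputs.
base_logits = [2.0, 1.0, 0.0]
tuned_logits = [0.0, 1.0, 2.0]

base_mass = negative_mass(base_logits, labels, negative)
tuned_mass = negative_mass(tuned_logits, labels, negative)
shift = tuned_mass - base_mass

print(f"base:  {base_mass:.3f}")
print(f"tuned: {tuned_mass:.3f}")
print(f"shift: {shift:+.3f}")
```

A positive `shift` on one prompt says little by itself. A context-general effect of the kind described above would show up as a consistent positive shift across many unrelated prompts.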