Researchers have identified that specific behavioral traits, such as sycophancy, are represented by 'persona vectors' within large language models. These vectors form very early in pretraining: within the first 0.22% of training for the OLMo-3-7B model. While the core representations are established quickly, the persona vectors continue to be refined throughout pretraining, and different methods of eliciting them reveal distinct aspects of the underlying behavior. The findings suggest these representations are stable features of early pretraining, and they have been shown to transfer to other models such as Apertus-8B.
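A common way to obtain such a persona vector is as the difference of mean hidden activations between trait-eliciting and neutral inputs. The source does not specify the paper's exact extraction method, so the following is a minimal sketch of that difference-of-means idea on synthetic data; the dimension, sample counts, and planted direction are all illustrative assumptions.

```python
import numpy as np

# Toy sketch of persona-vector extraction via difference of means.
# Real work operates on LLM hidden states; here we use synthetic
# activations with a planted "trait" direction to illustrate the arithmetic.
rng = np.random.default_rng(0)
d = 16  # hypothetical hidden-state dimension

true_direction = rng.normal(size=d)
true_direction /= np.linalg.norm(true_direction)

# Synthetic activations: trait examples are shifted along the planted direction.
neutral = rng.normal(size=(100, d))
trait = rng.normal(size=(100, d)) + 3.0 * true_direction

# Difference-of-means estimate of the persona vector, normalized to unit length.
persona_vec = trait.mean(axis=0) - neutral.mean(axis=0)
persona_vec /= np.linalg.norm(persona_vec)

# Cosine alignment between the estimate and the planted direction.
alignment = float(persona_vec @ true_direction)
print(round(alignment, 2))
```

With enough samples the estimated vector aligns closely with the planted direction, which is why such vectors can then be used to monitor or steer the trait during training.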
Summary written by gemini-2.5-flash-lite from 1 source.
IMPACT Reveals that key behavioral traits in LLMs are established very early in training, potentially enabling new safety interventions during pretraining.
RANK_REASON The cluster contains an academic paper detailing research findings on LLM interpretability and safety.