Researchers have identified that specific behavioral traits, such as sycophancy, are represented by 'persona vectors' within large language models. These vectors form very early in pretraining: within the first 0.22% of training for the OLMo-3-7B model. While the core representations are established quickly, the persona vectors continue to be refined throughout pretraining, and different methods of eliciting them reveal distinct aspects of the underlying behavior. The findings suggest these representations are stable features of early pretraining, and they have been shown to transfer to other models such as Apertus-8B.
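A common way to obtain such a persona vector is as the difference of mean hidden activations between trait-eliciting and neutral inputs. The source does not specify the paper's exact extraction method, so the following is a minimal sketch of that difference-of-means idea on synthetic data; the dimension, sample counts, and planted direction are all illustrative assumptions.

```python
import numpy as np

# Toy sketch of persona-vector extraction via difference of means.
# Real work operates on LLM hidden states; here we use synthetic
# activations with a planted "trait" direction to illustrate the arithmetic.
rng = np.random.default_rng(0)
d = 16  # hypothetical hidden-state dimension

true_direction = rng.normal(size=d)
true_direction /= np.linalg.norm(true_direction)

# Synthetic activations: trait examples are shifted along the planted direction.
neutral = rng.normal(size=(100, d))
trait = rng.normal(size=(100, d)) + 3.0 * true_direction

# Difference-of-means estimate of the persona vector, normalized to unit length.
persona_vec = trait.mean(axis=0) - neutral.mean(axis=0)
persona_vec /= np.linalg.norm(persona_vec)

# Cosine alignment between the estimate and the planted direction.
alignment = float(persona_vec @ true_direction)
print(round(alignment, 2))
```

With enough samples the estimated vector aligns closely with the planted direction, which is why such vectors can then be used to monitor or steer the trait during training.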
Summary written by gemini-2.5-flash-lite from 1 source.
IMPACT Reveals that key behavioral traits in LLMs are established very early in training, potentially enabling new safety interventions during pretraining.
RANK_REASON The cluster contains an academic paper detailing research findings on LLM interpretability and safety.