Researchers have analyzed the development of harmfulness representations within the OLMo 3 7B model during its training process. They identified distinct but related linear activation directions for various harmfulness subcategories, observing that these directions evolve and stabilize over time. The study found that in-distribution evaluations can be misleading, emphasizing the need for out-of-distribution testing, and demonstrated that late-stage training directions can effectively steer the model's behavior. AI
IMPACT Reveals insights into how harmful concepts are represented and evolve during LLM training, potentially informing future safety research.
RANK_REASON Technical report detailing methodology and findings on model training dynamics and harmfulness representations. [lever_c_demoted from research: ic=1 ai=1.0]
- Alpaca
- BeaverTails
- Bryan Maruyama
- Daniele Pace
- Hannes Whittingham
- LessWrong
- Lorenzo Pacchiardi
- MARS 4.0
- Mikhail Mironov
- OLMo 3 7B
AI-generated summary · Google Gemini · from 1 sources. How we write summaries →