Brief · PulseAugur

TOOL · LessWrong (AI tag) English(EN) · 9h

Harmfulness Directions in OLMo

Researchers have analyzed the development of harmfulness representations within the OLMo 3 7B model during its training process. They identified distinct but related linear activation directions for various harmfulness subcategories, observing that these directions evolve and stabilize over time. The study found that in-distribution evaluations can be misleading, emphasizing the need for out-of-distribution testing, and demonstrated that late-stage training directions can effectively steer the model's behavior. AI

IMPACT Reveals insights into how harmful concepts are represented and evolve during LLM training, potentially informing future safety research.

Alpaca
LessWrong
BeaverTails
OLMo 3 7B
Bryan Maruyama
Mikhail Mironov
Daniele Pace
MARS 4.0
Lorenzo Pacchiardi
Hannes Whittingham