PulseAugur

OLMo 3 7B training reveals structured harmfulness directions

Researchers have analyzed how harmfulness representations develop within the OLMo 3 7B model during training. They identified distinct but related linear activation directions for various harmfulness subcategories and observed that these directions evolve and stabilize over time. The study found that in-distribution evaluations can be misleading, which underlines the need for out-of-distribution testing, and demonstrated that directions from late-stage training can effectively steer the model's behavior.
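As a rough illustration of what a "linear activation direction" and "steering" can mean in this setting, the sketch below uses a difference-of-means direction on synthetic activations. The summary does not say which extraction method the study used, so the difference-of-means choice, the hidden size, the synthetic data, and every function name here are assumptions for illustration only, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64  # hypothetical hidden size, not OLMo's real width


def mean_diff_direction(harmful, harmless):
    """Unit vector pointing from the harmless mean to the harmful mean."""
    v = harmful.mean(axis=0) - harmless.mean(axis=0)
    return v / np.linalg.norm(v)


def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


def steer(acts, direction, alpha):
    """Shift every activation by alpha along the given direction."""
    return acts + alpha * direction


# Synthetic stand-ins for activations (assumption): two harmful
# subcategories share a common component plus a category-specific offset.
shared = rng.normal(size=d)
harmless = rng.normal(size=(200, d))
cat_a = rng.normal(size=(200, d)) + 3.0 * shared + rng.normal(size=d)
cat_b = rng.normal(size=(200, d)) + 3.0 * shared + rng.normal(size=d)

dir_a = mean_diff_direction(cat_a, harmless)
dir_b = mean_diff_direction(cat_b, harmless)

# "Distinct but related": high, but not perfect, cosine similarity.
similarity = cosine(dir_a, dir_b)

# Steering: the mean projection onto dir_a moves by alpha.
steered = steer(harmless, dir_a, alpha=4.0)
shift = float((steered @ dir_a).mean() - (harmless @ dir_a).mean())

print(f"cosine(dir_a, dir_b) = {similarity:.3f}")
print(f"mean projection shift = {shift:.3f}")
```

In a real model the arrays would be residual-stream activations collected at a chosen layer and checkpoint, and the steering vector would be added during the forward pass; both steps are outside the scope of this sketch.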

IMPACT Shows how representations of harmful concepts form and change during LLM training, which may inform future safety research.

RANK_REASON Technical report detailing methodology and findings on model training dynamics and harmfulness representations.

AI-generated summary · Google Gemini · from 1 source.

COVERAGE [1]

  1. LessWrong (AI tag) TIER_1 English(EN) · Daniele Pace

    Harmfulness Directions in OLMo

    [Image: pca_centroids_animation.gif]

    Introduction

    This work was conducted as part of the MARS 4.0 program, supervised by Loren…