Constitutional Value Potentials: reading and steering internal priority margins in language models
Constitutional Value Potentials (CVP) is a new method for reading and steering the internal priorities of language models. For each value, CVP learns a scalar potential from the model's hidden state that indicates the model's internal pressure to preserve that value. These potentials make it possible to identify priority margins, which are key to understanding how a model handles conflicts between values. The system reportedly predicts conflict violations with high accuracy and generalizes across model scales, suggesting that these priorities can be read directly from the model's activation space rather than only inferred from output behavior.
AI IMPACT: Enables deeper understanding of, and control over, LLM value alignment, potentially improving safety and reliability.
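
To make the idea of per-value potentials and priority margins concrete, here is a minimal Python sketch. It is an illustration only, not the researchers' implementation: it assumes each potential is a linear probe on a single hidden-state vector and that a priority margin is the difference between two potentials. The summary specifies neither choice. The function names, the value labels, and the toy numbers are all hypothetical.

```python
import numpy as np

def value_potentials(hidden_state, probe_weights, probe_bias):
    """Return one scalar potential per value.

    Assumption: each potential is a linear readout of the hidden state.
    probe_weights has shape (num_values, hidden_dim).
    """
    return probe_weights @ hidden_state + probe_bias

def priority_margin(potentials, value_a, value_b):
    """Assumed definition: margin = potential of value_a minus potential of value_b.

    A positive margin means value_a is favoured over value_b.
    """
    return float(potentials[value_a] - potentials[value_b])

def steer_margin(hidden_state, probe_weights, value_a, value_b, delta):
    """Shift the hidden state so the (linear) margin changes by exactly delta.

    Moves along w_a - w_b, the direction in which the margin changes fastest.
    """
    direction = probe_weights[value_a] - probe_weights[value_b]
    return hidden_state + delta * direction / (direction @ direction)

# Toy data: a 4-dimensional hidden state and two hypothetical values.
hidden_state = np.array([1.0, 0.0, 2.0, -1.0])
probe_weights = np.array([
    [0.5, 0.0, 1.0, 0.0],  # hypothetical value 0, e.g. "honesty"
    [0.0, 1.0, 0.5, 1.0],  # hypothetical value 1, e.g. "helpfulness"
])
probe_bias = np.zeros(2)

potentials = value_potentials(hidden_state, probe_weights, probe_bias)
margin = priority_margin(potentials, 0, 1)

# Steer the hidden state so that value 1 becomes favoured instead.
steered_state = steer_margin(hidden_state, probe_weights, 0, 1, delta=-3.0)
steered_margin = priority_margin(
    value_potentials(steered_state, probe_weights, probe_bias), 0, 1
)
```

In this toy setup the margin starts positive (value 0 is favoured) and turns negative after steering. In a real model the probes would be trained on hidden states collected from the model, and the steered state would be written back into the forward pass. Both steps are outside the scope of this sketch.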