Brief · PulseAugur

TOOL · arXiv cs.AI · English (EN) · 8h

Constitutional Value Potentials: reading and steering internal priority margins in language models

Researchers have developed a method called Constitutional Value Potentials (CVP) to read and steer the internal priorities of language models. For each value, CVP learns a scalar potential from the model's hidden state that indicates the model's internal pressure to preserve that value. These potentials make it possible to identify priority margins, which are key to understanding how a model handles conflicts between values. The authors report that the method predicts conflict violations with high accuracy and generalizes across model scales, suggesting that these priorities are accessible in the model's activation space rather than only through output behavior.
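
To make the idea concrete, here is a minimal sketch of reading a per-value potential from a hidden state, comparing two potentials, and nudging the hidden state to shift that comparison. It is not the paper's implementation. It assumes a linear probe per value, treats the priority margin as the difference between two potentials, and steers by adding a scaled probe direction. The value names, weights, and function names are all hypothetical and chosen only for illustration.

```python
import numpy as np

# Hypothetical value labels; the brief does not list the paper's actual values.
VALUES = ["honesty", "helpfulness"]


def value_potentials(hidden_state, probe_weights, probe_bias):
    """Return one scalar potential per value (assumed linear probe)."""
    return probe_weights @ hidden_state + probe_bias


def priority_margin(potentials, value_a, value_b):
    """Assumed definition: margin = potential of value_a minus potential of value_b."""
    return float(potentials[value_a] - potentials[value_b])


def steer(hidden_state, probe_weights, value_a, value_b, alpha):
    """Shift the hidden state along the direction that raises the a-over-b margin."""
    direction = probe_weights[value_a] - probe_weights[value_b]
    return hidden_state + alpha * direction


# Toy 3-dimensional hidden state and hand-picked probe parameters (not learned).
hidden = np.array([1.0, -2.0, 0.5])
W = np.array([[0.5, 0.0, 1.0],
              [0.0, -0.25, 0.0]])
b = np.zeros(2)

potentials = value_potentials(hidden, W, b)
margin_before = priority_margin(potentials, 0, 1)

steered = steer(hidden, W, 0, 1, alpha=1.0)
margin_after = priority_margin(value_potentials(steered, W, b), 0, 1)

print(f"{VALUES[0]} over {VALUES[1]}: {margin_before} -> {margin_after}")
```

With a linear probe, steering by `alpha` changes the margin by `alpha` times the squared norm of the direction vector, which is why the margin grows here. A real system would learn the probe from labeled activations and might use a nonlinear readout.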

IMPACT Could enable closer inspection and control of how LLMs prioritize values, potentially improving safety and reliability.

Hugging Face
arXiv
DagsHub
Qwen2.5
Constitutional Value Potentials