Brief · PulseAugur

TOOL · arXiv cs.CL · English (EN) · 5d

When Models Refuse: Political Steerability and Feature Richness as Measures of Ideological Depth

Researchers have introduced "ideological depth," a metric for the internal political representations of large language models. It combines two measurements: steerability, meaning how well a model follows instructions to adopt a political stance, and feature richness, meaning the richness of the model's internal political features as analyzed with sparse autoencoders. In experiments on open-weight LLMs, the more steerable models activated significantly more political features, while other models responded with more refusals. This suggests that such refusals can stem from capability deficits rather than from fixed safety rules.

IMPACT Introduces a new framework for understanding and potentially improving LLM behavior on sensitive topics.
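
To make the two measurements in the summary above concrete, here is a minimal Python sketch. It is not the paper's implementation: the function names, the ReLU encoder form, the activation threshold, and the idea of a pre-labeled set of "political" feature indices are all assumptions made for illustration.

```python
from typing import Iterable, List, Sequence


def sae_encode(activation: Sequence[float],
               weights: Sequence[Sequence[float]],
               bias: Sequence[float]) -> List[float]:
    """Encode one model activation with a toy sparse autoencoder.

    Assumes the common encoder form ReLU(W x + b); a real SAE may differ.
    """
    return [
        max(0.0, sum(w * x for w, x in zip(row, activation)) + b)
        for row, b in zip(weights, bias)
    ]


def count_active_features(prompt_activations: Iterable[Sequence[float]],
                          weights: Sequence[Sequence[float]],
                          bias: Sequence[float],
                          political_ids: Iterable[int],
                          threshold: float = 0.0) -> int:
    """Count hypothetical 'political' features that fire on at least one prompt."""
    ids = set(political_ids)
    fired = set()
    for activation in prompt_activations:
        features = sae_encode(activation, weights, bias)
        fired.update(i for i in ids if features[i] > threshold)
    return len(fired)


def steerability(followed: Sequence[bool]) -> float:
    """Fraction of instructed prompts where the model adopted the requested stance.

    How 'followed' is judged (and how refusals are detected) is left open here.
    """
    return sum(followed) / len(followed) if followed else 0.0


# Toy data: 2-dimensional activations and a 3-feature SAE (all values invented).
weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
bias = [0.0, 0.0, 0.0]
activations = [[1.0, 0.0], [0.5, 0.0]]

richness = count_active_features(activations, weights, bias, political_ids=[0, 1])
steer = steerability([True, False, True, True])
print(richness, steer)
```

In this toy setup, a model that follows most stance instructions and lights up many labeled features would count as "deeper" than one that refuses and activates few; the actual scoring used by the researchers may be defined differently.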

Large language models
sparse autoencoders
Shariar Kabir