New metric measures LLM ideological depth and refusal causes

By PulseAugur Editorial · [1 source] · 2026-06-03 04:00

Researchers have introduced a metric called "ideological depth" to characterize the internal political representations of large language models. It combines two measurements: how reliably a model follows political instructions (its steerability) and how rich its internal political features are, as analyzed with sparse autoencoders. In experiments on open-weight LLMs, the more steerable models activated significantly more political features, while other models instead refused more often, suggesting that capability deficits, rather than fixed safety rules, can cause such refusals.

IMPACT Introduces a new framework for understanding and potentially improving LLM behavior on sensitive topics.

RANK_REASON Academic paper introducing a new metric and experimental findings.

paper
safety

AI-generated summary · Google Gemini · from 1 source.

COVERAGE [1]

arXiv cs.CL TIER_1 English(EN) · Shariar Kabir · 2026-06-03 04:00

When Models Refuse: Political Steerability and Feature Richness as Measures of Ideological Depth

arXiv:2508.21448v3 · Abstract: Large language models (LLMs) sometimes refuse to follow benign instructions, such as declining to argue a political position or adopt a stated persona, and such refusals are commonly read as safety guardrails at work. We ask whe…
