Latent-space Attacks for Refusal Evasion in Language Models
Researchers have developed PsychoSafe, a framework to improve how large language models refuse harmful requests by using psychologically informed communication strategies. The approach reframes refusals as supportive interactions, with the aim of improving referrals to external resources and psychological grounding. A separate study introduces Latent-space Attacks for Refusal Evasion, which analyzes how LLM safety mechanisms can be bypassed by manipulating internal model representations to suppress refusal behavior.
AI IMPACT: Developments in LLM refusal strategies and evasion techniques highlight ongoing challenges in AI safety and alignment.