Brief · PulseAugur

RESEARCH · arXiv cs.AI · 2d · [3 sources]

Latent-space Attacks for Refusal Evasion in Language Models

Researchers have developed PsychoSafe, a framework that improves how large language models refuse harmful requests by applying psychologically informed communication strategies. The approach reframes refusals as supportive interactions, improving referral to external resources and psychological grounding. A separate study, Latent-space Attacks for Refusal Evasion, analyzes how LLM safety mechanisms can be bypassed by manipulating internal model representations to suppress refusal behavior.

IMPACT Developments in LLM refusal strategies and evasion techniques highlight ongoing challenges in AI safety and alignment.

Giorgio Piras
language models
Controlled Latent-space Evasion
Large Language Models
Latent-space Attacks for Refusal Evasion
PsychoSafe
Qwen 3.5 27B