Researchers report that the persona of an instruction-tuned chat model plays a central role in its refusal behavior. Analyzing Qwen2.5-7B-Instruct and Llama-3.1-8B-Instruct, they found that a compliant persona acts as a gate on refusal: amplifying a compliant-persona direction sharply reduced refusal rates, most notably in Llama-3.1-8B-Instruct, where the rate fell from 97% to 2%. Refusal can be partially restored in later layers, but it is ultimately gated downstream of its initial computation, which suggests that studying refusal in isolation misses its dependence on the model's persona.
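The summary does not say how the persona direction was amplified. As a rough illustration only, a common way to amplify a direction in a model's activation space is to add a scaled unit vector to a hidden state. The sketch below assumes that approach; the function names, the toy vectors, and the strength `alpha` are hypothetical and are not taken from the paper.

```python
import math


def _norm(vec):
    """Euclidean length of a vector given as a list of floats."""
    return math.sqrt(sum(v * v for v in vec))


def projection(hidden, direction):
    """Scalar projection of `hidden` onto the unit vector along `direction`."""
    n = _norm(direction)
    if n == 0.0:
        raise ValueError("direction must be non-zero")
    return sum(h * d for h, d in zip(hidden, direction)) / n


def amplify_direction(hidden, direction, alpha):
    """Return hidden + alpha * unit(direction).

    Assumption: 'amplifying' a direction means shifting the hidden state
    along it by a fixed amount alpha. The paper may use a different method.
    """
    if len(hidden) != len(direction):
        raise ValueError("hidden and direction must have the same length")
    n = _norm(direction)
    if n == 0.0:
        raise ValueError("direction must be non-zero")
    return [h + alpha * (d / n) for h, d in zip(hidden, direction)]


# Toy 3-dimensional example (hypothetical values, not real model activations).
hidden = [0.5, -1.0, 2.0]
persona_direction = [3.0, 0.0, 4.0]
steered = amplify_direction(hidden, persona_direction, alpha=2.0)

# The component along the direction grows by exactly alpha;
# components orthogonal to it are left unchanged.
before = projection(hidden, persona_direction)
after = projection(steered, persona_direction)
```

In a real model this shift would be applied to the activations of one or more layers during the forward pass, and the refusal rate would be measured before and after on a fixed set of prompts.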
IMPACT Understanding how persona influences model refusal is key to developing more controllable and predictable AI systems.
RANK_REASON The cluster contains an academic paper detailing new research findings on the behavior of AI models.
AI-generated summary · Google Gemini · from 1 source.