Off-Distribution Voices: Fanfiction Subgenres as Universal Vernacular Jailbreaks for Aligned LLMs
Researchers have developed a novel jailbreaking technique for aligned large language models by leveraging fanfiction subgenres. This method uses passages from twelve different Archive of Our Own (AO3) subgenres to embed harmful content within creative writing scenarios, bypassing traditional prompt-based defenses. The attack significantly increases the success rate of eliciting harmful responses, demonstrating that safety training has under-covered certain natural language registers. Additionally, a proposed four-turn extension, SAGA-A4, further enhances the attack's effectiveness.
AI IMPACT: This research highlights a new vulnerability in LLM safety training, suggesting that current alignment methods may not adequately cover diverse natural language registers, potentially impacting future safety development.