Researchers have developed a novel jailbreaking technique for aligned large language models by leveraging fanfiction subgenres. This method uses passages from twelve different Archive of Our Own (AO3) subgenres to embed harmful content within creative writing scenarios, bypassing traditional prompt-based defenses. The attack significantly increases the success rate of eliciting harmful responses, demonstrating that safety training has under-covered certain natural language registers. Additionally, a proposed four-turn extension, SAGA-A4, further enhances the attack's effectiveness. AI
IMPACT This research highlights a new vulnerability in LLM safety training, suggesting that current alignment methods may not adequately cover diverse natural language registers, potentially impacting future safety development.
RANK_REASON Academic paper detailing a new method for jailbreaking LLMs. [lever_c_demoted from research: ic=1 ai=1.0]
AI-generated summary · Google Gemini · from 1 sources. How we write summaries →