PulseAugur
EN
LIVE 10:18:12

Fanfiction subgenres used as universal jailbreaks for aligned LLMs

Researchers have developed a novel jailbreaking technique for aligned large language models by leveraging fanfiction subgenres. This method uses passages from twelve different Archive of Our Own (AO3) subgenres to embed harmful content within creative writing scenarios, bypassing traditional prompt-based defenses. The attack significantly increases the success rate of eliciting harmful responses, demonstrating that safety training has under-covered certain natural language registers. Additionally, a proposed four-turn extension, SAGA-A4, further enhances the attack's effectiveness. AI

IMPACT This research highlights a new vulnerability in LLM safety training, suggesting that current alignment methods may not adequately cover diverse natural language registers, potentially impacting future safety development.

RANK_REASON Academic paper detailing a new method for jailbreaking LLMs. [lever_c_demoted from research: ic=1 ai=1.0]

Read on arXiv cs.CL →

AI-generated summary · Google Gemini · from 1 sources. How we write summaries →

COVERAGE [1]

  1. arXiv cs.CL TIER_1 English(EN) · Zhongze Luo, Ruihe Shi, Zhenshuai Yin, Haoyue Liu, Weixuan Wan, Xiaoying Tang ·

    Off-Distribution Voices: Fanfiction Subgenres as Universal Vernacular Jailbreaks for Aligned LLMs

    arXiv:2606.04483v1 Announce Type: new Abstract: Existing jailbreaks against aligned LLMs are discrete artifacts whose surface forms are easy to fingerprint and patch. We argue that the real failure mode is not any specific prompt, but an entire register of natural human writing t…