How does the new abliteration tool Apostate compare with others? - Abliterlitics
A new tool called Apostate has been developed to "abliterate" safety training in large language models, and benchmarks compare it against the existing tools Heretic and Huihui. Heretic performed slightly better, removing refusals in 100% of cases with minimal parameter changes, while Apostate and Huihui followed closely at 98%. The analysis also found that the tools identify different "refusal directions" within the Qwen 2.5 7B model, which suggests that safety training does not have a single point of failure.
AI IMPACT New tools for modifying LLM safety training are emerging, suggesting that there are multiple pathways to bypass safety measures.
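One way to make the claim about different "refusal directions" concrete is to compare direction vectors pairwise with cosine similarity: values near 1 mean two tools found essentially the same direction, while values near 0 mean they found unrelated ones. The sketch below is illustrative only. The summary does not say how the original analysis measured the difference, so the choice of cosine similarity is an assumption, and the vectors and the names `tool_a`, `tool_b`, and `tool_c` are made-up placeholders, not values extracted from Apostate, Heretic, Huihui, or Qwen 2.5 7B.

```python
import math
from itertools import combinations


def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pairwise_similarities(directions):
    """Map each unordered pair of names to the cosine similarity of their vectors."""
    return {
        (name_a, name_b): cosine_similarity(directions[name_a], directions[name_b])
        for name_a, name_b in combinations(directions, 2)
    }


# Hypothetical toy vectors (assumption): real directions would have the
# dimensionality of the model's hidden state and come from each tool's output.
toy_directions = {
    "tool_a": [1.0, 0.0, 0.0, 0.0],
    "tool_b": [0.0, 1.0, 0.0, 0.0],
    "tool_c": [1.0, 1.0, 0.0, 0.0],
}

if __name__ == "__main__":
    for pair, sim in pairwise_similarities(toy_directions).items():
        print(pair, round(sim, 3))
```

In this toy data, `tool_a` and `tool_b` are orthogonal while `tool_c` partially overlaps with both, which is the kind of pattern that would be consistent with several distinct directions rather than one shared one.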