SAGE: Retain-Aware Post-Hoc Sanitization of Final Unlearning Vector
Researchers have developed SAGE (Spectral Activation-GEometry Sanitization), a post-hoc method for helping large language models (LLMs) keep their capabilities after unlearning. Existing unlearning techniques often degrade performance on the data the model is supposed to keep (the retain set). SAGE addresses this by analyzing the activation geometry of the retained data and using it to correct the final unlearning update vector, without rerunning the unlearning pipeline. According to the researchers, this approach consistently eases the trade-off between forgetting and retention across a range of unlearning methods and model scales.
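The summary does not spell out SAGE's algorithm. To illustrate the general idea of correcting an update vector with retain-data activation geometry, here is a minimal sketch under stated assumptions. It treats a single linear layer, takes the unlearning update as ΔW = W_unlearned − W_original, and projects ΔW away from the top-k principal directions of retain-set activations. Because the layer's output change on a retain input x is ΔW·x, this projection leaves retained outputs nearly unchanged. The function names `sanitize_update` and `retain_drift`, the parameter `k`, and the projection rule are all hypothetical. This is a generic null-space projection, not the authors' method.

```python
import numpy as np


def sanitize_update(delta_w: np.ndarray, retain_acts: np.ndarray, k: int) -> np.ndarray:
    """Hypothetical sketch (not SAGE itself): remove the part of an update that
    acts on the top-k principal directions of retain-set input activations.

    delta_w:     (d_out, d_in) unlearning update for one linear layer
    retain_acts: (n, d_in) input activations collected on retain data
    """
    # Right singular vectors of the activation matrix are its principal input directions.
    _, _, vt = np.linalg.svd(retain_acts, full_matrices=False)
    v_k = vt[:k].T  # (d_in, k) orthonormal basis of the dominant retain subspace
    # Orthogonal projector onto the complement of that subspace.
    projector = np.eye(delta_w.shape[1]) - v_k @ v_k.T
    return delta_w @ projector


def retain_drift(delta_w: np.ndarray, retain_acts: np.ndarray) -> float:
    """Frobenius norm of the layer-output change on retain inputs, ||X dW^T||_F."""
    return float(np.linalg.norm(retain_acts @ delta_w.T))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d_in, d_out, n, rank = 16, 8, 200, 4
    # Synthetic retain activations confined to a rank-4 subspace (assumption for the demo).
    retain_acts = rng.standard_normal((n, rank)) @ rng.standard_normal((rank, d_in))
    # Random stand-in for W_unlearned - W_original.
    delta_w = rng.standard_normal((d_out, d_in))
    clean = sanitize_update(delta_w, retain_acts, k=rank)
    print("drift before:", retain_drift(delta_w, retain_acts))
    print("drift after: ", retain_drift(clean, retain_acts))
```

In this toy setup the retain activations lie entirely in a rank-4 subspace. The sanitized update therefore changes retained outputs only up to floating-point error, while it still acts on every other input direction. Real activations are not exactly low-rank, so the choice of `k` would trade retention against how much of the forgetting update survives.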
IMPACT: Strengthens LLM unlearning techniques, potentially yielding models that keep more of their performance after data removal.