RippleBench: Capturing Ripple Effects Using Existing Knowledge Repositories
Researchers have developed RippleBench-Maker, an automated pipeline that identifies and quantifies the ripple effects of targeted interventions on language models. The system draws on existing knowledge repositories, such as Wikipedia, to generate questions at varying semantic distances from a source concept. Applied to eight unlearning methods on models such as Llama3-8B-Instruct, it showed that accuracy drops are largest near the target concept and shrink as semantic distance increases. Notably, the propagation profiles of these ripple effects were consistent across base models, suggesting that they are a property of the unlearning method itself rather than of the model being modified.
AI IMPACT: Provides a standardized method to measure and compare the unintended consequences of AI model modifications, which is crucial for safety and reliability.
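
As a rough illustration of the measurement described in the summary, the sketch below groups question results by semantic distance from the target concept and reports the accuracy drop per group. It is not the RippleBench-Maker implementation: the function name `accuracy_drop_by_distance`, the integer distance buckets, the record layout, and the toy data are all assumptions made for this example.

```python
from collections import defaultdict


def accuracy_drop_by_distance(records):
    """Return {distance: accuracy_before - accuracy_after}.

    `records` is an iterable of (distance, correct_before, correct_after)
    tuples, one per question. This layout is hypothetical, not the
    benchmark's actual data format.
    """
    # Per distance bucket: [question count, correct before, correct after]
    totals = defaultdict(lambda: [0, 0, 0])
    for distance, correct_before, correct_after in records:
        bucket = totals[distance]
        bucket[0] += 1
        bucket[1] += int(correct_before)
        bucket[2] += int(correct_after)

    # Accuracy drop = (correct before - correct after) / number of questions
    return {
        distance: (before - after) / count
        for distance, (count, before, after) in sorted(totals.items())
    }


# Toy data, invented for illustration only (not results from the paper).
toy_records = [
    (0, True, False), (0, True, False),   # questions about the target concept
    (1, True, True), (1, True, False),    # closely related concepts
    (2, True, True), (2, True, True),     # distant concepts
]

for distance, drop in accuracy_drop_by_distance(toy_records).items():
    print(f"distance {distance}: accuracy drop {drop:.2f}")
```

Plotting these per-distance drops for each unlearning method would give one propagation profile per method, which is the kind of curve the summary says stays consistent across base models.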