Protecting the Trace: A Principled Black-Box Approach Against Distillation Attacks
Researchers have developed a new method called TraceGuard to protect proprietary AI models from distillation attacks. This approach treats antidistillation as a Stackelberg game, providing a theoretical foundation for poisoning reasoning traces to hinder student model learning. TraceGuard is an efficient, black-box technique that poisons sentences crucial for the teacher model's reasoning, aiming to safeguard intellectual privacy and AI safety without significantly degrading the teacher model's performance.
AI IMPACT Provides a theoretical framework and practical method to protect proprietary AI models from intellectual property theft via distillation.