Stealthy World Model Manipulation via Data Poisoning
Researchers have introduced SWAAP, a two-stage framework for manipulating the learned world models of AI agents. The attack poisons fine-tuning trajectories so that the corrupted world model degrades the agent's planning and adaptation. SWAAP aims to induce low-return behavior while remaining stealthy. Evaluations on continuous-control tasks show significant performance degradation with only minimal alteration of the clean data, pointing to a practical vulnerability in world-model adaptation pipelines.
AI IMPACT: Highlights a potential vulnerability in AI agents that use world models and motivates new robustness methods for training data and learned dynamics.