Brief · PulseAugur

TOOL · arXiv cs.LG

Stealthy World Model Manipulation via Data Poisoning

Researchers have introduced SWAAP, a two-stage framework for manipulating the learned world models of AI agents. The attack poisons the trajectories used for fine-tuning in order to corrupt the agent's planning and adaptation. SWAAP aims to induce low-return behavior while remaining hard to detect. In evaluations on continuous-control tasks, it degrades performance significantly while altering the clean data only minimally, which points to a practical vulnerability in world-model adaptation pipelines.

IMPACT Highlights a potential vulnerability in AI agents that use world models and the need for methods that make training data and learned dynamics more robust.

arXiv
machine learning
SWAAP