Semantic-Preserving Prompt Hijacking: A Black-Box Adversarial Attack on Auto-Prompt Optimization
Researchers have developed a black-box adversarial attack called Adaptive Greedy Local Search, designed to hijack the auto-prompt optimization modules used with large language models. The technique makes subtle alterations to the user's input that shift the meaning of the model's output while keeping the altered input semantically close to the original text. Experiments on several LLMs indicate that, under comparable semantic-similarity constraints, the method achieves its attack goals more often than existing approaches.
IMPACT: Highlights a vulnerability in LLM auto-prompt optimization features that could undermine model security and trustworthiness.
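
To make the idea of a similarity-constrained search concrete, the following is a minimal sketch of a generic greedy local search over word substitutions. It is not the paper's Adaptive Greedy Local Search, whose details are not given in this summary. The names `token_similarity`, `greedy_local_search`, `candidates`, and `score`, along with the threshold and the toy weights, are all hypothetical. A real attack would presumably replace the toy similarity with an embedding-based measure and the toy score with queries to the target system.

```python
from typing import Callable, Dict, List, Tuple


def token_similarity(a: List[str], b: List[str]) -> float:
    """Toy stand-in for a semantic similarity model (assumption):
    the fraction of token positions left unchanged."""
    if len(a) != len(b) or not a:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / len(a)


def greedy_local_search(
    tokens: List[str],
    candidates: Dict[str, List[str]],
    score: Callable[[List[str]], float],
    min_similarity: float,
    max_steps: int = 10,
) -> Tuple[List[str], float]:
    """Repeatedly apply the single substitution that most improves the
    black-box score, rejecting any edit that drops similarity to the
    original input below min_similarity."""
    original = list(tokens)
    current = list(tokens)
    best = score(current)
    for _ in range(max_steps):
        best_move = None
        for i, tok in enumerate(current):
            # Substitutes are looked up by the ORIGINAL word at position i.
            for sub in candidates.get(original[i], []):
                if sub == tok:
                    continue
                trial = current[:i] + [sub] + current[i + 1:]
                if token_similarity(original, trial) < min_similarity:
                    continue  # violates the similarity constraint
                s = score(trial)  # one black-box query
                if s > best:
                    best, best_move = s, trial
        if best_move is None:
            break  # local optimum: no allowed edit improves the score
        current = best_move
    return current, best


# Hypothetical attack objective: reward words that push toward a target meaning.
weights = {"critique": 2.0, "harshly": 1.5, "memo": 0.5}
prompt = "summarize the quarterly report briefly".split()
substitutes = {
    "summarize": ["critique"],
    "report": ["memo"],
    "briefly": ["harshly"],
}
adversarial, final_score = greedy_local_search(
    prompt,
    substitutes,
    score=lambda toks: sum(weights.get(t, 0.0) for t in toks),
    min_similarity=0.6,
)
print(" ".join(adversarial), final_score)
```

In this toy setup the similarity threshold of 0.6 allows at most two of the five words to change, so the search stops after the two highest-scoring substitutions even though a third candidate remains. A stricter threshold trades attack strength for stealth.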