The prompt injection attacks that worry me most aren't exploiting safety training. They're exploiting general-purpose training.
A security researcher observed that the most effective prompt injection attacks on AI models exploit their general-purpose training, rather than specific safety alignment. These attacks leverage the model's inherent helpfulness and conversational coherence to trick it into acting against user intent by reframing the situation. The researcher suggests that improving alignment might not effectively counter these threats, as the vulnerability lies in the core training that makes models conversational and helpful.
IMPACT: Suggests a shift in AI security focus from alignment to core training methods to counter prompt injection.