AI role confusion enables 60% success rate for prompt injection attacks

By PulseAugur Editorial · [1 sources] · 2026-06-01 04:00

Researchers have identified prompt injection in large language models as a consequence of "role confusion," where models mistake injected text for legitimate input due to its perceived origin rather than its labeled role. This confusion allows malicious commands hidden within seemingly innocuous text to hijack AI agents. The study introduces "role probes" to measure this phenomenon and demonstrates a "CoT Forgery" attack that achieves a 60% success rate by fabricating reasoning, highlighting that the model's perception of the speaker's role directly predicts attack vulnerability. AI

IMPACT Identifies a fundamental vulnerability in LLM role perception, potentially impacting agent security and requiring new defense mechanisms.

RANK_REASON Academic paper detailing a new attack vector and mechanism for LLMs. [lever_c_demoted from research: ic=1 ai=1.0]

Read on arXiv cs.AI →

paper
safety

AI-generated summary · Google Gemini · from 1 sources. How we write summaries →

AI role confusion enables 60% success rate for prompt injection attacks

COVERAGE [1]

arXiv cs.AI TIER_1 English(EN) · Charles Ye, Jasmine Cui, Dylan Hadfield-Menell · 2026-06-01 04:00

Prompt Injection as Role Confusion

arXiv:2603.12277v5 Announce Type: replace-cross Abstract: LLMs see the world as a single stream of text, partitioned into roles like or . We trace prompt injection to role confusion: models perceive the source of text from how it sounds, not its labeled role. A command hidden in …

COVERAGE [1]

Prompt Injection as Role Confusion

RELATED ENTITIES

RELATED TOPICS