LLMs fail to reliably self-report adversarial prefill attacks, study finds

By PulseAugur Editorial · [1 source] · 2026-06-22 17:56

A new study published on arXiv investigates whether large language models (LLMs) can self-report when their output was elicited by an adversarial prefill attack. Across ten open-weight instruction-tuned LLMs, none reliably recognized its compromised outputs, and models incorrectly claimed intent on prefilled responses approximately 27.3% of the time. The study also tested three LoRA finetuning methods (SFT, GRPO, DPO); these widened the intention-probe gap but, counterintuitively, increased the success rate of adversarial prefill attacks on most models.

IMPACT Highlights risks in LLM self-reporting reliability, potentially impacting safety evaluations and adversarial defense strategies.

RANK_REASON Academic paper detailing research findings on LLM safety and introspection.

paper
safety

AI-generated summary · Google Gemini · from 1 source.

COVERAGE [1]

arXiv cs.CL TIER_1 English(EN) · Taegyoon Kim · 2026-06-22 17:56

Can LLMs Reliably Self-Report Adversarial Prefills, and How?

Prior work shows that large language models (LLMs) exhibit introspective capability on benign tasks. We extend the question to safety contexts and examine how reliably a model can recognize that its own prior response was elicited by an adversarial prefill attack. Across ten open…
