PulseAugur

LLMs fail to reliably self-report adversarial prefill attacks, study finds

A new study posted to arXiv investigates whether large language models (LLMs) can self-report when their responses were elicited by adversarial prefill attacks. Across ten open-weight, instruction-tuned LLMs, none reliably recognized its compromised outputs; models incorrectly claimed intent on prefilled responses approximately 27.3% of the time. The study also evaluated three LoRA fine-tuning methods (SFT, GRPO, DPO). These widened the intention-probe gap but, counterintuitively, increased the success rate of adversarial prefill attacks on most models.

IMPACT Highlights risks in LLM self-reporting reliability, potentially impacting safety evaluations and adversarial defense strategies.

RANK_REASON Academic paper detailing research findings on LLM safety and introspection.

Read on arXiv cs.CL →

AI-generated summary · Google Gemini · from 1 source.

COVERAGE [1]

  1. arXiv cs.CL TIER_1 English (EN) · Taegyoon Kim

    Can LLMs Reliably Self-Report Adversarial Prefills, and How?

    Prior work shows that large language models (LLMs) exhibit introspective capability on benign tasks. We extend the question to safety contexts and examine how reliably a model can recognize that its own prior response was elicited by an adversarial prefill attack. Across ten open…