A new study published on arXiv investigates whether large language models (LLMs) can self-report when their outputs have been manipulated by adversarial prefill attacks. Across ten open-weight, instruction-tuned LLMs, none reliably recognized its compromised outputs: models incorrectly claimed to have intended prefilled responses approximately 27.3% of the time. The study also tested three LoRA finetuning methods (SFT, GRPO, DPO); these widened the intention-probe gap but, counterintuitively, increased the success rate of adversarial prefill attacks on most models.
AI IMPACT Highlights risks in the reliability of LLM self-reporting, with potential consequences for safety evaluations and adversarial defense strategies.
RANK_REASON Academic paper detailing research findings on LLM safety and introspection.
- alphaXiv
- arXiv
- DagsHub
- Direct Preference Optimization
- Gotit.pub
- GRPO
- Hugging Face
- large language models
- LoRA
- ScienceCast
- supervised fine-tuning
AI-generated summary · Google Gemini · from 1 source.