Researchers have developed a new defense mechanism called Tail-risk Intrinsic Geometric Smoothing (TIGS) to protect large language models (LLMs) from backdoor attacks. TIGS operates at inference time, requires neither model updates nor external data, and works by identifying and disrupting malicious attention patterns. Separately, a new attack framework named BadStyle has been introduced, which uses natural style triggers to create stealthy poisoned samples for LLMs. BadStyle aims to overcome the limitations of previous attacks by keeping poisoned samples natural, stabilizing payload injection, and operating under a realistic threat model.
Impact: New defense and attack methods highlight ongoing security challenges for LLMs, potentially affecting deployment strategies and reinforcing the need for robust security evaluations.
Ranking rationale: The cluster contains two academic papers detailing new methods for attacking and defending large language models against backdoor threats.
AI-generated summary · Google Gemini · From 2 sources.