Researchers have developed a new defense mechanism called Tail-risk Intrinsic Geometric Smoothing (TIGS) to protect large language models from backdoor attacks. TIGS operates at inference time, without requiring model updates or external data, by identifying and disrupting malicious attention patterns. Separately, a new attack framework named BadStyle has been introduced, which uses natural style triggers to craft stealthy poisoned samples for LLMs. BadStyle aims to overcome limitations of previous attacks by ensuring trigger naturalness, stabilizing payload injection, and operating under a realistic threat model.
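The summary describes TIGS only at a high level (an inference-time intervention on anomalous attention). As a rough, generic illustration of that idea, not the paper's actual procedure, the NumPy sketch below caps unusually large attention weights in a single attention row and renormalizes; the percentile threshold and clipping rule are assumptions made purely for illustration.

```python
import numpy as np

def smooth_attention_row(attn_row: np.ndarray, pct: float = 90.0) -> np.ndarray:
    """Generic sketch of inference-time attention smoothing.

    Caps outlier attention weights (the "tail") at a percentile threshold and
    renormalizes the row back into a probability distribution. The real TIGS
    scoring and smoothing rules are not specified in the summary; this is an
    illustrative stand-in, not the published method.
    """
    cap = np.percentile(attn_row, pct)      # assumed tail-risk threshold
    clipped = np.minimum(attn_row, cap)     # suppress suspicious spikes
    return clipped / clipped.sum()          # restore a valid distribution

# Toy example: one attention row with a suspicious spike on token index 3,
# the kind of pattern a backdoor trigger token might induce.
row = np.array([0.02, 0.03, 0.05, 0.85, 0.05])
print(smooth_attention_row(row))
```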
IMPACT New defense and attack methods highlight ongoing security challenges for LLMs, with potential implications for deployment strategies and underscoring the need for robust security evaluations.
RANK_REASON The cluster contains two academic papers detailing a new backdoor attack on large language models and a new defense against such attacks.