PulseAugur

Stealthy Backdoor Attacks against LLMs Based on Natural Style Triggers

Researchers have developed a new defense mechanism called Tail-risk Intrinsic Geometric Smoothing (TIGS) to protect large language models from backdoor attacks. TIGS operates during inference without requiring model updates or external data, identifying and disrupting malicious attention patterns. Separately, a new attack framework named BadStyle has been introduced, which uses natural style triggers to create stealthy poisoned samples for LLMs. BadStyle aims to overcome limitations of previous attacks by ensuring naturalness, stabilizing payload injection, and operating under a realistic threat model.

Impact: New defense and attack methods highlight ongoing security challenges for LLMs, potentially impacting deployment strategies and the need for robust security evaluations.

Ranking rationale: The cluster contains two academic papers detailing new methods for attacking and defending large language models against backdoor threats.

Read on arXiv cs.CL →

AI-generated summary · Google Gemini · from 2 sources.

Sources [2]

  1. arXiv cs.AI TIER_1 English(EN) · Kaisheng Fan, Weizhe Zhang, Yishu Gao, Tegawendé F. Bissyandé, Xunzhu Tang

    Defusing the Trigger: Plug-and-Play Defense for Backdoored LLMs via Tail-Risk Intrinsic Geometric Smoothing

    arXiv:2604.24162v1 (announce type: cross). Abstract: Defending against backdoor attacks in large language models remains a critical practical challenge. Existing defenses mitigate these threats but typically incur high preparation costs and degrade utility via offline purification, …

  2. arXiv cs.CL TIER_1 English(EN) · Ting Liu

    Stealthy Backdoor Attacks against LLMs Based on Natural Style Triggers

    The growing application of large language models (LLMs) in safety-critical domains has raised urgent concerns about their security. Many recent studies have demonstrated the feasibility of backdoor attacks against LLMs. However, existing methods suffer from three key shortcomings…