New research explores 'junking' LLMs via natural backdoors

By PulseAugur Editorial · [2 sources] · 2026-05-06 16:47

Researchers have formalized the 'junking problem': finding naturally occurring token sequences that trigger harmful outputs from an LLM without relying on explicit adversarial prompts. The study uses a greedy random-search method to discover these 'natural backdoors.' Although the problem is harder than traditional jailbreaking, the proposed strategy achieved a high success rate, suggesting that these backdoors are present and readily recoverable.
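For readers unfamiliar with the term, greedy random search can be illustrated in a few lines. The sketch below is a generic version of the technique on a toy objective, not the authors' implementation: the source does not describe the paper's scoring function, mutation rule, or stopping criterion, so `greedy_random_search`, `toy_score`, and `HIDDEN` are hypothetical names chosen for illustration, and no language model is involved.

```python
import random
from typing import Callable, List, Sequence, Tuple


def greedy_random_search(
    score: Callable[[Sequence[int]], float],
    vocab_size: int,
    length: int,
    steps: int,
    seed: int = 0,
) -> Tuple[List[int], float]:
    """Generic greedy random search over fixed-length token-id sequences.

    Assumption: one random position is resampled per step, and the
    candidate is kept only if its score is not worse than the current best.
    """
    rng = random.Random(seed)
    # Start from a uniformly random sequence of token ids.
    best = [rng.randrange(vocab_size) for _ in range(length)]
    best_score = score(best)
    for _ in range(steps):
        candidate = list(best)
        pos = rng.randrange(length)                # pick one position
        candidate[pos] = rng.randrange(vocab_size)  # resample that token
        cand_score = score(candidate)
        if cand_score >= best_score:               # greedy acceptance
            best, best_score = candidate, cand_score
    return best, best_score


# Toy stand-in objective (hypothetical): count positions that match a
# hidden target sequence. A real study would score model behavior instead.
HIDDEN = [3, 1, 4, 1, 5]


def toy_score(tokens: Sequence[int]) -> float:
    return float(sum(1 for a, b in zip(tokens, HIDDEN) if a == b))


best_tokens, best_value = greedy_random_search(
    toy_score, vocab_size=8, length=5, steps=2000
)
```

Because a candidate is never accepted when it scores lower, the best score is non-decreasing over the run; the trade-off is that a purely greedy search can stall on objectives with local optima.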

IMPACT Identifies a new class of LLM vulnerabilities relevant to safety and alignment research.

RANK_REASON Academic paper detailing a new method for identifying vulnerabilities in LLMs.



AI-generated summary · Google Gemini · from 2 sources.

COVERAGE [2]

arXiv cs.LG TIER_1 English(EN) · Marco Rando, Samuel Vaiter · 2026-05-07 04:00

On the Hardness of Junking LLMs

arXiv:2605.05116v1. Abstract: Large language models (LLMs) are known to be vulnerable to jailbreak attacks, which typically rely on carefully designed prompts containing explicit semantic structure. These attacks generally operate by fixing an adversarial instru…

arXiv cs.LG TIER_1 English(EN) · Samuel Vaiter · 2026-05-06 16:47

On the Hardness of Junking LLMs

Large language models (LLMs) are known to be vulnerable to jailbreak attacks, which typically rely on carefully designed prompts containing explicit semantic structure. These attacks generally operate by fixing an adversarial instruction and optimizing small adversarial component…
