PulseAugur
LIVE 12:28:18
research · [1 source] ·
0
research

Anthropic fellowship researchers find backdoor attacks can poison AI classifiers

Researchers have investigated how to implant backdoors into constitutional classifiers by poisoning their fine-tuning datasets. They discovered that a small, fixed number of poisoned examples can be sufficient to create a backdoor, irrespective of the overall training set size. While such poisoning typically reduces the classifier's robustness, this effect can be minimized by augmenting some training data with prompt injections or mutated trigger phrases, making the backdoor harder for red-teamers to detect. AI

Summary written by gemini-2.5-flash-lite from 1 source. How we write summaries →

IMPACT New research demonstrates a subtle method for compromising AI safety classifiers, potentially impacting red-teaming effectiveness.

RANK_REASON Academic paper detailing a new method for poisoning AI model training data.

Read on LessWrong (AI tag) →

Anthropic fellowship researchers find backdoor attacks can poison AI classifiers

COVERAGE [1]

  1. LessWrong (AI tag) TIER_1 · Chase Bowers ·

    Poisoning Fine-tuning Datasets of Constitutional Classifiers

    <p><span>The primary contributors to this work are Chase Bowers</span><span class="math-tex"></span><span>, Faizan Ali</span><span class="math-tex"></span><span>, John Hughes</span><span class="math-tex"></span><span>, Jerry Wei</span><span class="math-tex"></span><span>, and Fab…