Brief · PulseAugur

TOOL · LessWrong (AI tag) English (EN) · 4h

Defeating Introspection Adapters (and Why Threat Models Matter)

Researchers have developed an attack that bypasses Introspection Adapters (IA), a technique designed to detect malicious fine-tunes in large language models. The attack applies a simple transformation to the model's weights that relocates the basis the IA relies on for calibration, which defeats detection without altering the model's observable behavior. The result highlights a difference in threat models: the original IA authors assumed a trusted training pipeline, while the attack's authors consider a scenario in which the final model weights are untrusted.
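The core idea, a weight transformation that changes the internal basis while leaving outputs untouched, can be illustrated with a toy network. The sketch below is not the attack from the post, whose exact transformation is not given in this brief. It assumes a tiny two-layer ReLU network and uses a permutation of hidden units as the basis change; all names (`permute_hidden`, `PROBE`, `probe_reading`) are hypothetical, and `PROBE` merely stands in for any detector calibrated against the original hidden basis.

```python
# Toy illustration (assumption: a 2-layer ReLU net, not a real LLM or a real IA).

def relu(v):
    return [max(0.0, x) for x in v]

def matvec(m, v):
    # Multiply matrix m (list of rows) by vector v.
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def hidden(w_in, x):
    # Hidden activations: the internal state a detector might read.
    return relu(matvec(w_in, x))

def forward(w_in, w_out, x):
    # Observable behavior: the network's output.
    return matvec(w_out, hidden(w_in, x))

def permute_hidden(w_in, w_out, perm):
    # Reorder hidden units: new unit i is old unit perm[i].
    # Rows of w_in and columns of w_out move together, so outputs are unchanged.
    new_in = [w_in[p] for p in perm]
    new_out = [[row[p] for p in perm] for row in w_out]
    return new_in, new_out

W_IN = [[1.0, -2.0], [0.5, 1.0], [-1.0, 3.0]]
W_OUT = [[2.0, -1.0, 0.5]]
PERM = [2, 0, 1]

# Hypothetical detector direction, calibrated to read hidden unit 0
# of the ORIGINAL weights.
PROBE = [1.0, 0.0, 0.0]

def probe_reading(w_in, x):
    return sum(p * h for p, h in zip(PROBE, hidden(w_in, x)))

X = [1.0, 1.0]
P_IN, P_OUT = permute_hidden(W_IN, W_OUT, PERM)

same_output = forward(W_IN, W_OUT, X) == forward(P_IN, P_OUT, X)
same_probe = probe_reading(W_IN, X) == probe_reading(P_IN, X)
```

Here `same_output` is true while `same_probe` is false: the two weight sets compute the same function, but a reader fixed to the original basis now sees a different internal signal. A defender who controls the training pipeline never faces this, since no one transforms the weights after calibration; a defender handed untrusted final weights does.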

IMPACT This attack undermines Introspection Adapters as a way to detect malicious LLM fine-tunes when the final model weights are untrusted, so that threat model calls for more robust safety mechanisms.

Anthropic
Introspection Adapters
Keshav Shenoy
EvilCorp
Llama 405B