Looking for backdoors in Jane Street LLMs
A participant in Jane Street's LLM backdoor challenge shared their experience attempting to uncover hidden triggers in fine-tuned models. Initially, prompting strategies proved unsuccessful in revealing the backdoors. The challenge involved both a smaller, locally runnable Qwen2.5-7B-Instruct model and larger DeepSeek-V3 Mixture-of-Experts models accessed via API, with the latter proving particularly difficult to analyze.
IMPACT: Details a novel approach to identifying vulnerabilities in large language models, potentially informing future AI security research.