A challenge hosted by Jane Street to find hidden backdoors in large language models has revealed insights into model vulnerabilities. The author identified some of the backdoors using white-box methods, after initial attempts with activation-based and prompting-based approaches proved unsuccessful. The challenge involved four models: a fine-tuned Qwen2.5-7B-Instruct and three large DeepSeek-V3 Mixture-of-Experts models, with the larger models accessible via an API.
AI IMPACT Highlights potential security risks in LLMs and the ongoing research into detecting and mitigating such vulnerabilities.
RANK_REASON The item details a challenge focused on identifying vulnerabilities (backdoors) in LLMs, which falls under AI safety research.
AI-generated summary · Google Gemini · from 1 source.