A new research paper challenges the assumption that "thinking tokens" in reasoning models inherently improve safety. The study found that whether models such as GPT-OSS, Qwen, Olmo, and Phi refuse or comply is highly predictable from the very first token, before any visible deliberation occurs. The research suggests that the "thinking" process behaves more like prefix completion, with outcomes rarely changing after the initial stages, and that current safety interventions may inadvertently suppress genuine deliberation by promoting over-refusal.
IMPACT Challenges the assumption that AI deliberation improves safety, suggesting a need for new methods to induce genuine safety considerations.
RANK_REASON The cluster contains a research paper published on arXiv discussing findings about AI model safety.
- alphaXiv
- arXiv
- CatalyzeX
- DagsHub
- Gotit.pub
- GPT-OSS
- Hugging Face
- IArxiv
- Olmo
- Phi LLM
- Qwen
- ScienceCast
AI-generated summary · Google Gemini · from 3 sources.