Thinking-Based Non-Thinking: Solving the Reward Hacking Problem in Training Hybrid Reasoning Models via Reinforcement Learning
Researchers have developed a new method called Thinking-Based Non-Thinking (TNT) to address reward hacking in hybrid reasoning models. This approach aims to optimize computational efficiency by enabling models to decide when to engage in complex reasoning and when to provide a direct answer. TNT reportedly reduces token usage by approximately 50% while improving accuracy on mathematical benchmarks, achieving a better trade-off between performance and efficiency than existing methods. AI
IMPACT This method could lead to more efficient and accurate reasoning models, reducing computational costs for complex tasks.