EST-PRM: Stress-Testing Process Reward Models Before They Become Load-Bearing
Researchers have developed EST-PRM, a framework for stress-testing the process reward models (PRMs) used in language model training. It challenges an assumption that PRM-based training implicitly relies on: that a PRM's scores remain stable when the reasoning steps are altered but the final answer is preserved. By applying answer-preserving transformations such as step inflation and step reordering, EST-PRM exposes vulnerabilities in PRMs, showing that their scores can be inflated or can lose sensitivity to correctness. Evaluations on several benchmark datasets revealed significant differences in how PRMs, including Math-Shepherd and Qwen2.5-Math-PRM, respond to these perturbations, underscoring the need for more robust reward modeling.
AI Impact: Reveals critical vulnerabilities in AI reward models, potentially affecting future LLM training methods and safety evaluations.
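To make the two transformations concrete, here is a minimal Python sketch. It assumes a PRM can be treated as a function that maps a list of step strings to one score per step, and that a solution's score is the minimum step score (a common convention, not necessarily the one EST-PRM uses). The names `inflate_steps`, `reorder_steps`, `score_shift`, and the toy `length_biased_scorer` are hypothetical illustrations, not part of EST-PRM, Math-Shepherd, or Qwen2.5-Math-PRM.

```python
import random
from typing import Callable, List, Sequence

# Hypothetical PRM interface (an assumption, not EST-PRM's API):
# given the reasoning steps of one solution, return one score per step.
StepScorer = Callable[[Sequence[str]], List[float]]


def inflate_steps(steps: Sequence[str], k: int = 2,
                  filler: str = "Restating the known facts.") -> List[str]:
    """Step inflation: insert k redundant steps before the final step,
    so the final answer is unchanged."""
    if not steps:
        return []
    return list(steps[:-1]) + [filler] * k + [steps[-1]]


def reorder_steps(steps: Sequence[str], seed: int = 0) -> List[str]:
    """Reordering: shuffle the intermediate steps, keeping the final step last."""
    body = list(steps[:-1])
    random.Random(seed).shuffle(body)
    return body + list(steps[-1:])


def score_shift(scorer: StepScorer, steps: Sequence[str],
                transform: Callable[[Sequence[str]], List[str]]) -> float:
    """Change in the min-aggregated score after an answer-preserving transform.
    A positive shift after pure inflation suggests the scorer rewards length
    rather than correctness."""
    before = min(scorer(steps))
    after = min(scorer(transform(steps)))
    return after - before


def length_biased_scorer(steps: Sequence[str]) -> List[float]:
    """Toy stand-in for a PRM that (wrongly) rewards longer traces."""
    s = min(1.0, 0.6 + 0.1 * len(steps))
    return [s] * len(steps)


if __name__ == "__main__":
    solution = ["Let x = 3.", "Then 2x = 6.", "Answer: 6"]
    print(round(score_shift(length_biased_scorer, solution, inflate_steps), 2))  # 0.1
    print(score_shift(length_biased_scorer, solution, reorder_steps))  # 0.0
```

With the toy scorer, the solution's score rises after inflation even though no new reasoning was added, while reordering leaves it unchanged. In a real evaluation the scorer would be an actual PRM, and the shifts would be compared across many solutions and models.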