PulseAugur
research · [1 source] ·

New DDRL framework reduces noise in test-time reinforcement learning for math reasoning

Researchers report that test-time reinforcement learning (TTRL) for math reasoning is vulnerable to noise from ambiguous pseudo-labels. Responses with medium consistency form an ambiguity region that is a primary source of reward noise, and group-relative advantage estimation can amplify that noise further. To address this, they propose Debiased and Denoised test-time Reinforcement Learning (DDRL), which uses frequency-based sampling to exclude ambiguous samples and a debiased advantage estimation to remove optimization bias.

Summary written by gemini-2.5-flash-lite from 1 source.

IMPACT Introduces a method that makes test-time reinforcement learning for LLM math reasoning more robust by mitigating noise amplification from ambiguous pseudo-labels.

RANK_REASON This is a research paper detailing a new framework for improving test-time reinforcement learning in math reasoning.

COVERAGE [1]

  1. arXiv cs.CL TIER_1 · Ran He ·

    Understanding and Mitigating Spurious Signal Amplification in Test-Time Reinforcement Learning for Math Reasoning

    Test-time reinforcement learning (TTRL) always adapts models at inference time via pseudo-labeling, leaving it vulnerable to spurious optimization signals from label noise. Through an empirical study, we observe that responses with medium consistency form an ambiguity region and …