Brief

last 24h

[2/2] 222 sources

Multi-source AI news clustered, deduplicated, and scored 0–100 across authority, cluster strength, headline signal, and time decay.

TOOL · arXiv cs.AI English(EN) · 14h

Hide to Guide: Learning via Semantic Masking

Researchers have developed Semantic Masked Expert Policy Optimization (SMEPO), a technique for improving reinforcement learning in language models. SMEPO targets a failure mode in which models learn to copy expert traces rather than reason for themselves: it semantically masks crucial information within those traces, forcing the model to reconstruct the missing elements while still following the expert's overall problem-solving structure. SMEPO has demonstrated accuracy improvements and significant reductions in training time across several domains, including math and coding.

IMPACT This method could lead to more efficient training of AI models for complex reasoning tasks, reducing computational costs and improving performance.
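
To make the masking idea concrete, here is a toy sketch. The summary does not say how SMEPO decides which information is crucial, so this example assumes, purely for illustration, that numeric values are the crucial parts. The function name `mask_numbers` and the `[MASK]` placeholder are hypothetical and are not taken from the paper.

```python
import re

MASK = "[MASK]"  # hypothetical placeholder token, not from the paper


def mask_numbers(trace: str, mask: str = MASK) -> str:
    """Hide every integer or decimal in an expert trace.

    Assumption: numbers stand in for the "crucial information";
    the surrounding wording (the solution structure) is left intact.
    """
    return re.sub(r"\d+(?:\.\d+)?", mask, trace)


expert_trace = "First compute 12 * 3 = 36, then add 4 to get 40."
print(mask_numbers(expert_trace))
```

The masked trace still shows the order of steps, but the model would have to work out the hidden values itself instead of copying them.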
RESEARCH · arXiv cs.LG English(EN) · 5d · [2 sources]

Clipping Bottleneck: Stabilizing RLVR via Stochastic Recovery of Near-Boundary Signals

Researchers have identified a key bottleneck in Reinforcement Learning from Verifiable Rewards (RLVR) that hinders the optimization of LLM reasoning. The study traces it to the rigid clipping decisions of standard hard-clipping methods, which discard valuable signals from tokens just outside the clipping threshold. To address this, the authors propose Near-boundary Stochastic Rescue (NSR), a simple modification that stochastically retains these slightly out-of-bound tokens, improving training stability and performance across various model sizes and architectures.

IMPACT Improves training stability and performance for LLM reasoning tasks, potentially enabling more robust and capable models.
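
The contrast between hard clipping and stochastic retention can be sketched as follows. This is a minimal illustration that assumes a PPO-style ratio clip. The summary does not give NSR's actual rule, so the `margin` and `keep_prob` parameters and all function names are hypothetical.

```python
import random


def overshoot(ratio: float, advantage: float, eps: float = 0.2) -> float:
    """Distance by which the probability ratio exceeds the active clip bound.

    With a positive advantage the upper bound 1 + eps is active; with a
    negative advantage the lower bound 1 - eps is active. Returns 0.0 when
    the token is inside the bound or the advantage is zero.
    """
    if advantage > 0:
        return max(0.0, ratio - (1.0 + eps))
    if advantage < 0:
        return max(0.0, (1.0 - eps) - ratio)
    return 0.0


def is_hard_clipped(ratio: float, advantage: float, eps: float = 0.2) -> bool:
    """Standard hard clipping: any overshoot drops the token's gradient."""
    return overshoot(ratio, advantage, eps) > 0.0


def rescue_keeps(ratio, advantage, eps=0.2, margin=0.05, keep_prob=0.5, rng=random):
    """Assumed rescue rule: keep in-bound tokens, and keep tokens that
    overshoot by at most `margin` with probability `keep_prob`."""
    over = overshoot(ratio, advantage, eps)
    if over == 0.0:
        return True   # inside the bound: always kept
    if over > margin:
        return False  # far outside the bound: still discarded
    return rng.random() < keep_prob  # near the boundary: kept at random


# A token slightly past the upper bound (ratio 1.22 against 1 + eps = 1.2).
print(is_hard_clipped(1.22, advantage=1.0))
print(rescue_keeps(1.22, advantage=1.0, keep_prob=1.0))
```

Under hard clipping the token at ratio 1.22 contributes nothing, while the rescue rule gives it a chance to contribute. A token far outside the bound is discarded in both cases.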
