Researchers have introduced Selective Eligibility Traces (S-trace), a method for improving the reasoning of large language models trained under the Reinforcement Learning with Verifiable Rewards (RLVR) framework. It targets a limitation of critic-free algorithms such as Group Relative Policy Optimization (GRPO), which assign credit uniformly. S-trace selectively masks low-entropy tokens, giving finer-grained credit assignment and more efficient learning; the work reports superior performance and efficiency on Qwen3 models.
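As a rough illustration of the contrast described above, the following sketch compares uniform credit assignment with an entropy-based mask. It is a toy under stated assumptions, not the paper's algorithm: the function names, the entropy threshold of 0.5 nats, and the example distributions are all hypothetical, and the eligibility-trace mechanics of S-trace are not described in this summary and are not modeled here.

```python
import math
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style baseline: standardize each reward against its sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def masked_token_credit(advantage, token_dists, entropy_threshold):
    """Hypothetical selective rule: pass the sequence-level advantage only to
    tokens whose entropy reaches the threshold; low-entropy tokens get zero."""
    return [
        advantage if token_entropy(d) >= entropy_threshold else 0.0
        for d in token_dists
    ]


# Toy group of two sampled responses with verifiable rewards (1 = correct).
rewards = [1.0, 0.0]
advantages = group_relative_advantages(rewards)

# Toy next-token distributions for the first response:
# a confident token, an uncertain token, and a fully deterministic token.
dists = [[0.98, 0.01, 0.01], [0.4, 0.3, 0.3], [1.0, 0.0, 0.0]]

# Uniform credit: every token in the response receives the same advantage.
uniform = [advantages[0]] * len(dists)

# Selective credit: only the uncertain (high-entropy) token keeps its advantage.
selective = masked_token_credit(advantages[0], dists, entropy_threshold=0.5)
```

In this toy setup only the middle token, the one position where the model was genuinely uncertain, receives a learning signal, while the uniform variant spreads the same advantage over all three tokens. The threshold is an illustrative choice; how S-trace actually selects tokens is left to the original paper.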
Impact: Introduces a more efficient method for training LLMs, potentially improving their reasoning and reducing computational costs.
Ranking rationale: Academic paper introducing a novel method for improving LLM reasoning.
- Group Relative Policy Optimization
- Group Sequence Policy Optimization
- Large Language Models
- Qwen3-1.7B
- Qwen3-4B
- Qwen3-8B
- RLVR
- Selective Eligibility Traces
- GRPO
AI-generated summary · Google Gemini · from 1 source.