PulseAugur

JD.com's AGPO enhances LLM reasoning and search ads with asymmetric policy optimization

Researchers have introduced Asymmetric Group Policy Optimization (AGPO), a reinforcement learning technique designed to improve the reasoning capabilities of large language models. AGPO aims to prevent the narrowing of reasoning patterns often seen in current methods by suppressing incorrect paths and focusing on rare, correct ones. According to the authors, experiments on mathematical benchmarks show that AGPO achieves state-of-the-art accuracy and improves performance at scale. The method has also been applied to search ads relevance at JD, where it reportedly produced significant gains in downstream models.
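To make the idea of "suppressing incorrect paths and focusing on rare, correct ones" concrete, here is a minimal toy sketch. It is not the AGPO objective from the paper, whose details this summary does not give. The function name `asymmetric_group_advantages` and the parameters `neg_scale` and `rarity_boost` are hypothetical, chosen only for illustration. The sketch assumes a group of answers sampled for one prompt, each with a verifiable 0/1 reward.

```python
from typing import List


def asymmetric_group_advantages(
    rewards: List[float],
    neg_scale: float = 1.0,     # hypothetical knob: how strongly to penalize incorrect samples
    rarity_boost: float = 1.0,  # hypothetical knob: how strongly to reward correct samples
) -> List[float]:
    """Toy per-sample weights for one group of answers to the same prompt.

    rewards: 1.0 for a verified-correct answer, 0.0 otherwise.
    Illustrative assumption only, not the paper's formula.
    """
    n = len(rewards)
    if n == 0:
        return []
    # Fraction of the group that was verified correct.
    p_correct = sum(rewards) / n
    advantages = []
    for r in rewards:
        if r > 0.5:
            # Correct sample: the weight grows as correct answers get rarer.
            advantages.append(rarity_boost * (1.0 - p_correct))
        else:
            # Incorrect sample: negative weight, scaled separately.
            advantages.append(-neg_scale * p_correct)
    return advantages


if __name__ == "__main__":
    # One correct answer out of four samples.
    print(asymmetric_group_advantages([1.0, 0.0, 0.0, 0.0], neg_scale=2.0))
```

With both scales set to 1.0, each weight equals the reward minus the group mean, the usual symmetric mean-centered baseline. Setting the two scales differently is what makes this toy version asymmetric.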

Impact: This new optimization technique could enhance LLM reasoning accuracy and efficiency, potentially improving applications in areas like search relevance.

Ranking rationale: This is a research paper detailing a new method for improving LLM reasoning.

Read on arXiv cs.AI →

AI-generated summary · Google Gemini · from 1 source. How we write summaries →

Sources [1]

  1. arXiv cs.AI · Tier 1 · English (EN) · Yang Xu, Kun Yao, Yiming Deng, Zheng Fang, Kai Ming Ting, Ming Pang

    AGPO: Asymmetric Group Policy Optimization for Verifiable Reasoning and Search Ads Relevance at JD

    arXiv:2605.05826v1 Announce Type: new Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has demonstrated notable success in enhancing the reasoning performance of large language models (LLMs). However, recent studies reveal that while current RLVR methods improve sa…