
New self-play methods improve LLMs without human data

By the PulseAugur editorial team · [2 sources] · 2026-05-08 16:35

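Two new research papers introduce self-play algorithms for fine-tuning large language models without human supervision. The first, TPAW, takes a team-based approach in which the model both competes and collaborates with its own historical checkpoints, adaptively weighting responses and players to improve stability and efficiency. The second, SPEAR, targets online federated fine-tuning with real-time feedback. It uses advantage-weighted refinement and confidence-weighted unlikelihood to train on contrastive pairs derived from partial feedback, which makes it efficient enough to run on edge devices.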
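To make the SPEAR terminology concrete, here is a minimal sketch of how an advantage weight and a confidence-weighted unlikelihood term could be combined on a single contrastive pair. It uses only the generic textbook forms (an exponentiated, clipped advantage weight and the standard unlikelihood term `-log(1 - p)`); it is not the paper's actual objective. The function names, the `beta` temperature, the clipping value, and the way the two terms are summed are all assumptions made for illustration.

```python
import math

def advantage_weight(advantage, beta=1.0, max_weight=20.0):
    """Generic advantage weight exp(advantage / beta), clipped for stability.
    (Assumed form; beta and max_weight are hypothetical hyperparameters.)"""
    return min(math.exp(advantage / beta), max_weight)

def unlikelihood(token_probs, eps=1e-8):
    """Generic unlikelihood term: -sum(log(1 - p)) over the tokens of a
    rejected response. Penalizes probability mass placed on those tokens."""
    return -sum(math.log(max(1.0 - p, eps)) for p in token_probs)

def pair_loss(chosen_probs, rejected_probs, advantage, confidence, beta=1.0, eps=1e-8):
    """Loss on one contrastive pair (illustrative combination, not SPEAR's).

    chosen_probs / rejected_probs: per-token model probabilities.
    advantage: how much better the chosen response scored than a baseline.
    confidence: value in [0, 1] expressing trust in the negative label.
    """
    # Negative log-likelihood of the preferred response.
    nll = -sum(math.log(max(p, eps)) for p in chosen_probs)
    # Up-weight the preferred response by its advantage; scale the penalty
    # on the rejected response by how confident the feedback is.
    return advantage_weight(advantage, beta) * nll + confidence * unlikelihood(rejected_probs)

if __name__ == "__main__":
    # Toy numbers: one pair with two-token responses.
    print(pair_loss([0.6, 0.7], [0.4, 0.2], advantage=0.5, confidence=0.8))
```

In this sketch, a confidence of 0 switches the unlikelihood term off entirely, so a pair whose negative label is untrusted contributes only the weighted likelihood term.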

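Impact: These self-play methods could reduce the reliance on costly human annotation for LLM alignment, potentially accelerating model development and deployment.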

Ranking rationale: Two academic papers propose new methods for fine-tuning LLMs using self-play techniques.

Read on arXiv cs.LG →

AI-generated summary · Google Gemini · from 2 sources. How we write summaries →

Sources [2]

arXiv cs.CL TIER_1 English(EN) · Jing Li · 2026-05-11 03:17

Team-Based Self-Play With Dual Adaptive Weighting for Fine-Tuning LLMs

While recent self-training approaches have reduced reliance on human-labeled data for aligning LLMs, they still face critical limitations: (i) sensitivity to synthetic data quality, leading to instability and bias amplification in iterative training; (ii) ineffective optimization…

arXiv cs.LG TIER_1 English(EN) · Christopher G. Brinton · 2026-05-08 16:35

Self-Play Enhancement via Advantage-Weighted Refinement in Online Federated LLM Fine-Tuning with Real-Time Feedback

Recent works have advanced feedback-based learning systems, whereby a foundation model is able to intake incoming feedback (e.g., a user) to self-improve, creating a self-loop system of training. However, existing works are limited in needing to consider an offline setup to allow…
