Don't Let a Few Network Failures Slow the Entire AllReduce

New OptCC algorithm minimizes the AllReduce slowdown caused by network failures

By the PulseAugur editorial team · [1 source] · 2026-06-02 04:00

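Researchers have developed OptCC, a new algorithm designed to make AllReduce operations more efficient in large-scale GPU clusters, particularly when network failures occur. The algorithm approaches the theoretical lower bound on completion time, significantly reducing the performance degradation typically seen with existing fault-tolerant methods. Experiments show that OptCC maintains near-optimal performance even when network problems cause substantial bandwidth loss, outperforming current state-of-the-art methods.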
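To see why a small bandwidth loss can slow an entire collective, here is a minimal sketch of the textbook bandwidth-only cost model for a ring AllReduce. It is not OptCC and is not taken from the paper; the function names, the one-link-per-node ring topology, and the numbers are all assumptions chosen for illustration.

```python
# Illustrative only: a standard ring AllReduce cost model, NOT the OptCC algorithm.
# Assumptions (hypothetical): n nodes in a ring, one directed link per node,
# latency ignored, every step sends one chunk of size S/n over every link at once.

def ring_allreduce_time(size_bytes: float, link_bandwidths: list[float]) -> float:
    """Bandwidth-only completion time (seconds) of a ring AllReduce.

    link_bandwidths[i] is the bandwidth (bytes/s) of the link from node i
    to node (i + 1) % n.
    """
    n = len(link_bandwidths)
    if n < 2:
        raise ValueError("a ring needs at least two nodes")
    chunk = size_bytes / n
    steps = 2 * (n - 1)  # (n - 1) reduce-scatter steps + (n - 1) all-gather steps
    # All links are busy in every step, so each step lasts as long as the slowest link.
    return steps * chunk / min(link_bandwidths)


def aggregate_bandwidth_loss(healthy: list[float], degraded: list[float]) -> float:
    """Fraction of total link bandwidth lost between two configurations."""
    return 1.0 - sum(degraded) / sum(healthy)


if __name__ == "__main__":
    size = 8e9                      # 8 GB payload (hypothetical)
    healthy = [50e9] * 8            # eight links at 50 GB/s each (hypothetical)
    degraded = [25e9] + [50e9] * 7  # one link drops to half its bandwidth

    t_ok = ring_allreduce_time(size, healthy)
    t_bad = ring_allreduce_time(size, degraded)
    print(f"healthy:  {t_ok:.3f} s")
    print(f"degraded: {t_bad:.3f} s ({t_bad / t_ok:.2f}x slower)")
    print(f"aggregate bandwidth lost: {aggregate_bandwidth_loss(healthy, degraded):.2%}")
```

In this simplified model, halving one link out of eight removes only 6.25% of the aggregate bandwidth yet doubles the completion time, because every step waits on the slowest link. That gap between bandwidth lost and time lost is the kind of degradation the summary says OptCC is designed to reduce.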

Impact: Reduces training-job interruptions in large-scale AI model training infrastructure and improves efficiency.

Ranking rationale: Academic paper detailing a new algorithm for distributed computing.


AI-generated summary · Google Gemini · from 1 source.

Sources [1]

arXiv cs.LG · Peiqing Chen, Jiedong Jiang, Nengneng Yu, Yuefeng Wang, Sixian Xiong, Wei Wang, Zaoxing Liu · 2026-06-02 04:00

Don't Let a Few Network Failures Slow the Entire AllReduce

arXiv:2606.01680v1 Announce Type: cross Abstract: Network failures are among the most frequent hardware faults in large-scale GPU clusters and a leading cause of training-job interruptions. Modern collective communication libraries such as NCCL mitigate network failures by rerout…
