HetCCL: Enabling Collective Communication For Mixed-Vendor Heterogeneous Clusters
Researchers have developed HetCCL, a framework for improving collective communication efficiency in the heterogeneous computing clusters used to train large language models. It addresses the limitations of existing systems by enabling efficient peer-to-peer transport across different vendors' hardware, which reduces overhead and eliminates host-device memory copy costs. HetCCL's border-communicator mechanism and hierarchical topology abstraction allow vendor-independent reduction operations and optimized data transfer, which the researchers report yields significant bandwidth improvements and faster end-to-end training times.
AI IMPACT: Enables more efficient and cost-effective training of large language models on diverse hardware setups.
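
To make the hierarchical idea concrete, here is a minimal sketch of a two-level sum all-reduce, simulated in plain Python. It is not HetCCL's API or its actual algorithm: the function name `hierarchical_all_reduce`, the vendor labels, and the notion that one "border" device per vendor group carries the cross-vendor exchange are assumptions made for illustration, based only on the summary's mention of a border communicator and a hierarchical topology.

```python
from typing import Dict, List


def hierarchical_all_reduce(
    groups: Dict[str, List[List[float]]],
) -> Dict[str, List[List[float]]]:
    """Simulate a two-level sum all-reduce over vendor groups (hypothetical sketch).

    `groups` maps a vendor label to the list of per-device buffers in that group.
    All buffers are assumed to have the same length.
    """
    # Step 1: reduce inside each vendor group. In a real system this would be
    # handled by that vendor's own collective library.
    partial = {
        vendor: [sum(column) for column in zip(*buffers)]
        for vendor, buffers in groups.items()
    }

    # Step 2: combine the per-group partial sums across vendors. This stands in
    # for the exchange between the assumed "border" devices of each group.
    total = [sum(column) for column in zip(*partial.values())]

    # Step 3: hand the global result back to every device in every group.
    return {
        vendor: [list(total) for _ in buffers]
        for vendor, buffers in groups.items()
    }


# Two hypothetical vendor groups with 2 and 3 devices respectively.
cluster = {
    "vendor_a": [[1.0, 2.0], [3.0, 4.0]],
    "vendor_b": [[10.0, 20.0], [30.0, 40.0], [50.0, 60.0]],
}
result = hierarchical_all_reduce(cluster)
```

In this toy setup only one partial buffer per group crosses the vendor boundary, rather than one per device, which is the general motivation for hierarchical collectives.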