Brief · PulseAugur

TOOL · arXiv cs.AI · English (EN) · 6h

Resource-aware Computation-Communication Overlap for multi-GPU ML Workloads

Researchers have developed a method that makes multi-GPU machine learning training more efficient by overlapping its computation and communication phases. The technique uses shared-memory allocation to limit computation kernel residency, which leaves enough on-chip resources for communication kernels to run concurrently. Combined with assigning higher priority to communication streams, the approach reduces total execution time by up to 25.5 percent across various NVIDIA and AMD GPUs without modifying vendor libraries.
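
To make the residency idea concrete, here is a minimal Python sketch of the arithmetic involved. It is not the authors' implementation: the function names, the per-SM figures, and the two-resource occupancy model (thread slots and shared memory only) are assumptions chosen for illustration, and real GPUs also account for registers, allocation granularity, and per-block limits. The sketch shows how requesting more shared memory per block can cap the number of resident compute blocks, which in turn frees thread slots that a communication kernel could occupy.

```python
# Illustrative occupancy arithmetic only; all constants below are assumed
# values for a hypothetical GPU, not figures taken from the paper.

def resident_blocks(smem_per_sm, smem_per_block, threads_per_sm, threads_per_block):
    """Blocks of one kernel that fit on an SM under a simplified model
    that considers only thread slots and shared memory."""
    by_threads = threads_per_sm // threads_per_block
    if smem_per_block == 0:
        return by_threads
    return min(by_threads, smem_per_sm // smem_per_block)


def smem_request_for_cap(smem_per_sm, max_blocks):
    """Smallest per-block shared-memory request (bytes) for which
    max_blocks + 1 blocks no longer fit on one SM."""
    return smem_per_sm // (max_blocks + 1) + 1


SMEM_PER_SM = 164 * 1024   # assumed shared memory per SM, in bytes
THREADS_PER_SM = 2048      # assumed thread slots per SM
THREADS_PER_BLOCK = 256    # assumed compute-kernel block size

# Without any shared-memory request, compute blocks fill every thread slot.
baseline = resident_blocks(SMEM_PER_SM, 0, THREADS_PER_SM, THREADS_PER_BLOCK)

# Inflate the request so that at most 6 compute blocks are resident.
request = smem_request_for_cap(SMEM_PER_SM, 6)
capped = resident_blocks(SMEM_PER_SM, request, THREADS_PER_SM, THREADS_PER_BLOCK)

# Thread slots left over for a concurrently running communication kernel.
free_threads = THREADS_PER_SM - capped * THREADS_PER_BLOCK

print(baseline, request, capped, free_threads)
```

Under these assumed numbers, the compute kernel drops from filling every thread slot to leaving a quarter of them free. The stream-priority half of the approach is not modeled here; it would be configured separately through the GPU runtime.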

IMPACT: Improves the efficiency of distributed ML training, potentially reducing costs and accelerating research cycles.

NVIDIA
AMD
A100
MI250X