ASTRA-sim 3.0: Next-Level Distributed Machine Learning Simulations via High-Fidelity GPU and Infrastructure Modeling
Researchers have released ASTRA-sim 3.0, an updated open-source simulator designed for distributed machine learning. The new version enhances simulation fidelity by modeling GPU execution and infrastructure at a fine-grained, cache-line level. It also introduces InfraGraph, a standardized representation for network infrastructure, enabling more detailed design space exploration for collective algorithms and hardware architectures.
IMPACT: Enables more accurate simulation of distributed ML workloads, potentially accelerating the design of efficient AI infrastructure and algorithms.