ASTRA-sim and Chakra:
Enabling Software-Hardware Co-Design Exploration for Distributed Machine Learning Platforms

Organizer

Synergy Lab, Georgia Institute of Technology

Acknowledgments: Chakra and ASTRA-sim are a community effort with technical insights and code contributions from Meta, Intel, AMD, NVIDIA, HPE, Keysight, as well as several academic institutions.

Overview

In this tutorial, we will educate the research community about the challenges in the emerging domain of distributed machine learning, demonstrate the capabilities of Chakra Execution Trace and ASTRA-sim with examples and discuss ongoing development efforts.

NEW – In this tutorial, we will (i) introduce details about the Chakra Execution Traces, (ii) running custom collective communications via MSCCL-IR in ASTRA-sim, (iii) and modeling LLM training/inference using ASTRA-sim.

Date/Location

Nov 3, 2024, 1–5 pm CST
AT&T Hotel and Conference Center Info

Description

As Artificial Intelligence (AI) models are scaling at an unprecedented rate, Machine Learning (ML) execution heavily relies on Distributed ML over customized neural accelerator (e.g., GPU or TPU)-based High-Performance Computing (HPC) platforms connected via high-speed interconnects (e.g., NVLinks). Examples today include NVIDIA’s HGX, Google’s Cloud TPU, and Meta’s Research Supercluster. Distributed Deep Neural Network (DNN) execution involves a complex interplay between the DNN model architecture, parallelization strategy, scheduling strategy, collective communication algorithm, network topology, remote memory accesses, and the accelerator endpoint. As innovation in AI/ML models continues to grow at an accelerated rate, there is a need for a comprehensive methodology to understand and navigate this complex intertwined co-design space to (i) architect future platforms, (ii) develop novel parallelism schemes to support efficient training of future DNN models, and (iii) develop novel fabrics for AI systems. As an ongoing collaboration between Georgia Tech and several companies, we have been jointly developing (1) a comprehensive methodology to capture arbitrary distributed ML workloads, named Chakra Execution Trace and (ii) a detailed cycle-accurate distributed AI simulator called ASTRA-sim.

Chakra Execution Trace (Chakra ET) is a community-driven effort in MLCommons to standardize the representation of distributed ML workloads. The standardization effort via Chakra ET would harmonize the utilization of multiple upstream applications (e.g., trace profiler or trace generator) and distinct downstream tasks (e.g., simulators or replay). Chakra ET captures arbitrary distributed ML workloads by leveraging a directed acyclic graph representation of compute, communication, and remote memory nodes.

ASTRA-sim models the co-design space of distributed ML described above and schedules the compute-communication interactions over plug-and-play computation, network, and remote memory simulators. It enables a systematic study of bottleneck detection and futuristic system evaluation at the software and hardware levels for scaling distributed ML. ASTRA-sim leverages the Chakra format to describe arbitrary distributed ML workloads. It uses a Google TPU-like simulator as its computation model and provides a suite of network models (analytical network, Garnet, and ns-3) for the choice of simulation speed and fidelity.

Target Audience

Any researcher with the interest in full-stack, large-scale AI/ML simulation.

Organizers

	Tushar Krishna (Georgia Tech) Email Website LinkedIn Tushar Krishna is an Associate Professor in the School of Electrical and Computer Engineering at Georgia Tech. He is currently also a Visiting Associate Professor at the Department of Electrical Engineering and Computer Science at MIT. He has a Ph.D. in Electrical Engineering and Computer Science from MIT (2014), a M.S.E in Electrical Engineering from Princeton University (2009), and a B.Tech in Electrical Engineering from the Indian Institute of Technology (IIT) Delhi (2007). Before joining Georgia Tech in 2015, Dr. Krishna spent a year as a researcher at the VSSAD group at Intel, Massachusetts. Dr. Krishna's research spans computer architecture, interconnection networks, networks-on- chip (NoC), and AI/ML accelerator systems, with a focus on optimizing data movement in modern computing platforms. His research is funded via multiple awards from NSF, DARPA, IARPA, SRC (including JUMP2.0), Department of Energy, Intel, Google, Meta/Facebook, Qualcomm and TSMC.
	William Won (Will) (Georgia Tech) Email Website LinkedIn William Won is a Ph.D. student in the College of Computing at Georgia Tech. He received M.S. in Computer Science at Georgia Tech in 2022, and B.S. in Computer Science and Enginnering at Seoul National University in 2019. His research interests include training and inference of distributed machine learning, simulation of distributed machine learning workloads, collective communication optimizations, and machine learning algorithms.
	Joongun Park (Georgia Tech) Email Website LinkedIn Joongun Park is a Postdoctoral Researcher at Georgia Tech, focusing on hardware and system security, high-performance computing, and optimizing distributed machine learning systems. He is currently conducting applied research to develop secure and efficient design strategies for large-scale systems, with practical experience in both real-world and simulated environments. Joongun has collaborated with Intel on developing next-generation supercomputer architectures for an IARPA project. He holds a Ph.D. (2023), M.S. (2018), and B.S. (2016) in Computer Science from KAIST.
	Taekyung Heo (NVIDIA) Website LinkedIn Taekyung Heo is a Senior HPC Middleware Developer at NVIDIA, where he specializes in distributed machine learning systems, performance modeling, and memory systems. His work emphasizes bridging the gap between theory and practical implementation through hardware-software co-design, focusing on scaling AI workloads across thousands of GPUs. He is one of the main developers of Chakra. Taekyung holds a Ph.D. in Computer Science from KAIST (2022), an M.Sc. in Computer Science from KAIST (2016), and a B.Sc. in Computer Engineering from Sungkyunkwan University (2014).
	Vinay Ramakrishnaiah (AMD) Email LinkedIn Vinay Ramakrishnaiah is a Senior Member of Technical Staff at AMD working in the area of hardware-software codesign. His research interests include Artificial Intelligence (AI) at scale, scale-out networks, GPU schedulers, and developing and evaluating applications for emerging hardware architectures. Prior to joining AMD, Vinay was a staff scientist and principal investigator at the Los Alamos National Laboratory where he worked on multiple exascale computing projects, signal processing optimizations for satellites, and led the co-design summer school. Vinay obtained his Ph.D. in Electrical and Computer Engineering from the University of Wyoming with a research focus in applications of high-performance computing to antenna signal processing.

Schedule

Time (CST)	Agenda	Presenter	Resources
	[ Introduction ]
1:00 pm	Introduction to Distributed ML	Tushar (GT)	Slide Video
1:30 pm	Overview: ASTRA-sim and Chakra	Tushar (GT)	Slide Video

	[ Chakra ]
1:40 pm	Chakra Execution Trace	Taekyung (NVIDIA)	Slide Video

	[ ASTRA-sim ]
2:00 pm	Workload Layer	Taekyung (NVIDIA)	Slide Video
2:20 pm	System Layer	Will (GT/AMD)	Slide Video
2:40 pm	Network Layer	Will (GT/AMD)	Slide Video

3:00 pm	[ Coffee Break ]

	[ Demo ]
3:30 pm	Chakra + ASTRA-sim Demo	Joongun (GT)	Slide Video
3:50 pm	(Supplementary) NS-3 Demo	Jinsun (GT)	Slide Video

	[ ASTRA-sim Updates ]
4:10 pm	ASTRA-sim New Features	Vinay (AMD), Will (GT/AMD)	Slide Video
4:40 pm	ASTRA-sim Wiki and Validation	Will (GT/AMD)	Slide Video

	[ Closing ]
4:50 pm	Closing Remarks	Tushar (GT)	Slide Video

Additional Resources

Installation

MLCommons Chakra Working Group

Website

ASTRA-sim2.0 Paper

William Won, Taekyung Heo, Saeed Rashidi, Srinivas Sridharan, Sudarshan Srinivasan, and Tushar Krishna, “ASTRA-sim2.0: Modeling Hierarchical Networks and Disaggregated Systems for Large-model Training at Scale,” In Proc. of the IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS ‘23), 2023.

paper

Chakra Paper

Srinivas Sridharan, Taekyung Heo, Louis Feng, Zhaodong Wang, Matt Bergeron, Wenyin Fu, Shengbao Zheng, Brian Coutinho, Saeed Rashidi, Changhai Man, and Tushar Krishna, “Chakra: Advancing Performance Benchmarking and Co-design using Standardized Execution Traces,” In arXiv:2305.14516 [cs.LG], 2023.

paper

CollectiveAPI Paper

Jinsun Yoo, William Won, Meghan Cowan, Nan Jiang, Benjamin Klenk, Srinivas Sridharan†, and Tushar Krishna, “Towards a Standardized Representation for Deep Learning Collective Algorithms,” In Proc. of the 2024 IEEE Symposium on High-Performance Interconnects (HOTI ‘24), 2024.

paper

Repositories

ASTRA-sim Chakra Analytical Network