[Tutorial at MICRO 2024]
ASTRA-sim and Chakra: Enabling Software-Hardware Co-Design Exploration for Distributed Machine Learning Platforms
We’re running an ASTRA-sim/Chakra tutorial at MICRO 2024.
Nov 3, 2024, 1-5 pm (CST)
AT&T Hotel and Conference Center, Austin, TX.
Overview
ASTRA-sim is a distributed machine learning system simulator. It enables systematic study of the challenges in modern deep learning systems, so you can explore bottlenecks and develop efficient methodologies for large DNN models across diverse future platforms. Through ASTRA-sim’s APIs, you can plug in any network, compute, or memory simulator backend.
[Figure: visual summary of the ASTRA-sim simulator]
Contact Us
- For any questions about using ASTRA-sim, you can email the ASTRA-sim User Mailing List: astrasim-users@googlegroups.com
- To join the mailing list, please fill out this form.
Useful Links
Documentation
- For information on how to use ASTRA-sim, please visit our Wiki.
- For the Chakra MLCommons working group, please visit here.
GitHub Repositories
- ASTRA-sim
- Chakra
- Network Simulators
- Compute Simulators
  - Roofline, SCALE-sim, and more in progress
- Memory Simulators
Papers
A full description of the tool and its strengths can be found in the following papers.
@INPROCEEDINGS{won2023astrasim2,
author={Won, William and Heo, Taekyung and Rashidi, Saeed and Sridharan, Srinivas and Srinivasan, Sudarshan and Krishna, Tushar},
booktitle={2023 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)},
title={ASTRA-sim2.0: Modeling Hierarchical Networks and Disaggregated Systems for Large-model Training at Scale},
year={2023},
volume={},
number={},
pages={283-294},
keywords={Training;Semiconductor device modeling;Analytical models;Network topology;Systems modeling;Throughput;Data models;Distributed training;High-performance training;Multi-dimensional network;Disaggregated memory system},
doi={10.1109/ISPASS57527.2023.00035}}
@INPROCEEDINGS{rashidi2020astrasim,
author={Rashidi, Saeed and Sridharan, Srinivas and Srinivasan, Sudarshan and Krishna, Tushar},
booktitle={2020 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)},
title={ASTRA-SIM: Enabling SW/HW Co-Design Exploration for Distributed DL Training Platforms},
year={2020},
volume={},
number={},
pages={81-92},
keywords={Training;Technological innovation;Navigation;Network topology;Software algorithms;Software;Scheduling;Distributed training;Collective communication;Training parallelism;High performance training systems},
doi={10.1109/ISPASS48437.2020.00018}}
Tutorials
Maintainers
- Tushar Krishna (Georgia Tech)
- Saeed Rashidi (Hewlett Packard)
- Srinivas Sridharan (NVIDIA)
- William Won (Georgia Tech)
- Taekyung Heo (NVIDIA)
- Jinsun Yoo (Georgia Tech)
- Joongun Park (Georgia Tech)
- Changhai Man (Georgia Tech)
- Divya Kiran Kadiyala (Georgia Tech)