Overview

ASTRA-sim is a distributed machine learning system simulator. It enables systematic study of the challenges in modern deep learning systems, so that you can explore bottlenecks and develop efficient methodologies for large DNN models across diverse future platforms. Through ASTRA-sim’s APIs, you can plug in any network, compute, or memory simulator backend.

Below is a concise visual summary of our simulator:

Contact Us

Useful Links

Documentation

  • For information on how to use ASTRA-sim, please visit our Wiki.
  • For information on the MLCommons Chakra working group, please visit here.

GitHub Repositories

Papers

A full description of the tool and its strengths can be found in these papers.

@INPROCEEDINGS{won2023astrasim2,
  author={Won, William and Heo, Taekyung and Rashidi, Saeed and Sridharan, Srinivas and Srinivasan, Sudarshan and Krishna, Tushar},
  booktitle={2023 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)},
  title={ASTRA-sim2.0: Modeling Hierarchical Networks and Disaggregated Systems for Large-model Training at Scale},
  year={2023},
  pages={283-294},
  keywords={Training;Semiconductor device modeling;Analytical models;Network topology;Systems modeling;Throughput;Data models;Distributed training;High-performance training;Multi-dimensional network;Disaggregated memory system},
  doi={10.1109/ISPASS57527.2023.00035}
}

@INPROCEEDINGS{rashidi2020astrasim,
  author={Rashidi, Saeed and Sridharan, Srinivas and Srinivasan, Sudarshan and Krishna, Tushar},
  booktitle={2020 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)},
  title={ASTRA-SIM: Enabling SW/HW Co-Design Exploration for Distributed DL Training Platforms},
  year={2020},
  pages={81-92},
  keywords={Training;Technological innovation;Navigation;Network topology;Software algorithms;Software;Scheduling;Distributed training;Collective communication;Training parallelism;High performance training systems},
  doi={10.1109/ISPASS48437.2020.00018}
}

Tutorials

Maintainers

  • Tushar Krishna (Georgia Tech)
  • Saeed Rashidi (Hewlett Packard)
  • Srinivas Sridharan (NVIDIA)
  • William Won (Georgia Tech)
  • Taekyung Heo (NVIDIA)
  • Jinsun Yoo (Georgia Tech)
  • Joongun Park (Georgia Tech)
  • Changhai Man (Georgia Tech)
  • Divya Kiran Kadiyala (Georgia Tech)