This is the ASTRA-sim distributed Deep Learning Training simulator, developed in collaboration between Georgia Tech, Facebook and Intel.
An overview is presented here:
A full description of the tool and its strengths can be found in the paper below:
Saeed Rashidi, Srinivas Sridharan, Sudarshan Srinivasan, and Tushar Krishna, "ASTRA-SIM: Enabling SW/HW Co-Design Exploration for Distributed DL Training Platforms," in Proc. of the IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), Apr 2020.
BibTeX
@inproceedings{astrasim,
  author    = {Saeed Rashidi and
               Srinivas Sridharan and
               Sudarshan Srinivasan and
               Tushar Krishna},
  title     = {{ASTRA-SIM: Enabling SW/HW Co-Design Exploration for Distributed DL Training Platforms}},
  booktitle = {{IEEE} International Symposium on Performance Analysis of Systems
               and Software, {ISPASS} 2020, Boston, MA, USA, August 22-26, 2020},
  publisher = {{IEEE}},
  year      = {2020},
}
# Clone the repository
$ git clone https://github.com/astra-sim/astra-sim.git
# Build the repository
# - You will be asked which network backend to download: gem5 or ns3. It will be cloned into the network folder.
# - The SCALE-Sim compute model will be cloned into the compute folder.
$ cd astra-sim
$ ./build.sh
1. Enter gem5 when prompted by build.sh. This clones the https://github.com/georgia-tech-synergy-lab/gem5_astra repository into the astra-sim/network folder.
2. Go to astra-sim/network/gem5_astra.
3. Run "./my_scripts/build_Garnet_standalone.sh" to build the Garnet network simulator.
4. Go to the astra_runs/ directory.
5. Run "./sampleDLRM_a2a.sh runName", where runName is an arbitrary name you choose for this run. This runs a toy DLRM model over a physical all-to-all topology.
6. After the simulation finishes, go to the ../astra_results/runName-a2a directory. The detailed and EndToEnd csv files for this run are stored there (times in the csv files are in microseconds).
NOTE: You can also run "./sampleDLRM_torus.sh runName" in step 5 instead. This runs a toy DLRM model over a physical 3D torus.
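The steps above can be chained in a small shell helper. This is a minimal sketch, not part of the repository: `ASTRA_ROOT`, `sample_script_for`, and `run_sample` are hypothetical names, and it assumes astra_runs/ sits directly under network/gem5_astra. The script names and paths are the ones listed above.

```shell
#!/usr/bin/env bash
# Hypothetical helper; only the paths and script names come from the steps above.

# Map a topology keyword to the sample script for that topology.
sample_script_for() {
  case "${1:-}" in
    a2a)   echo "sampleDLRM_a2a.sh" ;;
    torus) echo "sampleDLRM_torus.sh" ;;
    *)     echo "unknown topology: ${1:-}" >&2; return 1 ;;
  esac
}

# Build Garnet, then launch one sample run (steps 2 to 5).
# Usage: run_sample <runName> [a2a|torus]
run_sample() {
  local run_name="${1:-}" topology="${2:-a2a}" script
  [ -n "$run_name" ] || { echo "usage: run_sample <runName> [a2a|torus]" >&2; return 1; }
  script="$(sample_script_for "$topology")" || return 1
  (
    # Subshell so the caller's working directory is left unchanged.
    cd "${ASTRA_ROOT:-astra-sim}/network/gem5_astra" || exit 1
    ./my_scripts/build_Garnet_standalone.sh || exit 1
    # Assumption: astra_runs/ is a subdirectory of gem5_astra.
    cd astra_runs || exit 1
    "./$script" "$run_name"
  )
}
```

For example, after build.sh has completed, `run_sample myRun torus` would rebuild Garnet and start the 3D torus sample under the run name myRun.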
NOTE: The delays reported on screen at the end of the simulation are in cycles, while the delays in the csv files are in microseconds.
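To compare an on-screen cycle count with a csv value, divide the cycle count by the simulated clock frequency. The sketch below shows only that arithmetic. `cycles_to_us` is a hypothetical helper, and the clock frequency in MHz is an assumption you must take from your own configuration, since it is not stated here.

```shell
# cycles_to_us <cycles> <clock_mhz>
# A clock of N MHz ticks N times per microsecond, so:
#   microseconds = cycles / clock_mhz
cycles_to_us() {
  awk -v c="${1:-}" -v f="${2:-}" \
    'BEGIN { if (f <= 0) exit 1; printf "%.3f\n", c / f }' \
    || { echo "clock_mhz must be a positive number" >&2; return 1; }
}
```

For example, `cycles_to_us 2000 1000` prints 2.000, that is, 2000 cycles at an assumed 1 GHz clock.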
Coming Soon!
- Workload: astra-sim/workload/workload_inputs/
  - see workload_inputs/README.md
  - see workload_generator/README.md
- System: astra-sim/system/system_inputs/
  - see system_inputs/README.md
- Network: astra-sim/network/gem5_astra/network_inputs/
  - see network_inputs/README.md
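As a quick orientation aid, the following sketch checks which of the input READMEs named above are present in a checkout. `list_input_readmes` is a hypothetical helper whose argument is the path to the cloned astra-sim directory. The workload_generator README is left out because its full path is not given above.

```shell
# list_input_readmes <astra-sim root>
# Prints "ok <path>" or "missing <path>" for each documented input README.
# Returns non-zero if any README is missing.
list_input_readmes() {
  local root="${1:-astra-sim}" rel status=0
  for rel in \
    workload/workload_inputs/README.md \
    system/system_inputs/README.md \
    network/gem5_astra/network_inputs/README.md
  do
    if [ -f "$root/$rel" ]; then
      echo "ok $root/$rel"
    else
      echo "missing $root/$rel"
      status=1
    fi
  done
  return "$status"
}
```

A missing network README usually just means the gem5 backend has not been cloned yet, since network/gem5_astra only exists after build.sh has run with gem5 selected.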
Please email Saeed Rashidi (saeed.rashidi@gatech.edu), Srinivas Sridharan (ssrinivas@fb.com), or Tushar Krishna (tushar@ece.gatech.edu) if you have any questions.
- Saeed Rashidi (Georgia Tech)
- Srinivas Sridharan (Facebook)
- Jiayi Huang (University of California, Santa Barbara)
- Apurve Chawde (Georgia Tech)
- Santosh Kumar Elangoven (Georgia Tech)
- William Won (Georgia Tech)
- Tushar Krishna (Georgia Tech)