This is the graduation project of Haiquan Wang.
Machine learning is attracting more and more attention in fields such as computer vision and natural language processing. However, training large models on large datasets is time- and resource-consuming, which makes distributed machine learning a necessary choice. Besides the native distributed strategies built into popular frameworks such as PyTorch, there are optimized distributed strategies built on top of these frameworks, such as Horovod and BytePS. However, they are mainly designed for hard synchronization (bulk synchronous parallel, BSP), in which every worker has to wait for all the others to finish communication in every iteration. Much computation is wasted on waiting, especially under poor network conditions. Therefore, some users may want to relax the synchronization to gain speedup; for example, under stale synchronous parallel (SSP), a worker is not blocked as long as it is not ahead of the slowest worker by more than a specified number of steps. In addition, some researchers also want to relax the synchronization to study how different models behave in large distributed environments.
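To make the SSP rule concrete, here is a minimal sketch of the blocking condition described above; the function name and signature are illustrative and are not taken from this project's code:

```python
# Minimal illustration of the SSP blocking rule (illustrative only, not this
# project's actual API): a worker may run ahead of the slowest worker by at
# most `staleness` iterations before it has to wait.

def ssp_should_block(my_step: int, slowest_step: int, staleness: int) -> bool:
    """Return True if this worker must wait before its next iteration."""
    return my_step - slowest_step > staleness

# With staleness = 3, a worker at step 10 keeps running while the slowest
# worker is at step 7, but it blocks once the gap exceeds 3 steps.
assert ssp_should_block(10, 7, 3) is False
assert ssp_should_block(11, 7, 3) is True
```

Setting the staleness bound to 0 recovers BSP, while an unbounded staleness corresponds to fully asynchronous training (ASP).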
Motivated by the above, I designed a flexible framework in which users can implement various synchronization strategies while modifying very little code. On top of this framework, I also ran a number of experiments to study how DNN models respond to different synchronization strategies, such as ASP and SSP, and tried to find principles that suggest which synchronization strategy is best for a given DNN model.
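As a rough illustration of the kind of pluggable interface such a framework could expose, the sketch below shows how different synchronization strategies might be expressed as small classes; all names here are hypothetical and are not taken from this repository:

```python
# Hypothetical sketch of a pluggable synchronization strategy interface.
# Class and method names are illustrative; they are not this project's API.

class SyncStrategy:
    """Decides, each iteration, whether a worker must wait for the others."""

    def should_block(self, my_step: int, all_steps: list[int]) -> bool:
        raise NotImplementedError


class BSP(SyncStrategy):
    """Bulk synchronous parallel: wait whenever any other worker is behind."""

    def should_block(self, my_step, all_steps):
        return my_step > min(all_steps)


class ASP(SyncStrategy):
    """Asynchronous parallel: never wait."""

    def should_block(self, my_step, all_steps):
        return False
```

Under an interface like this, the SSP rule sketched earlier becomes just another subclass with a staleness bound, which is the sense in which a new strategy should require only a small amount of code.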
Getting Started

Please build PyTorch from source.
./run.sh

There is a simple launch command in ./run.sh. Users can configure the worker hosts and server hosts in ./config/host. For example, a server hostfile containing

localhost
localhost

will launch two servers on localhost.