feature: hybrid mode with ring allreduce #170

myungjin · 2022-07-07T23:15:42Z

This PR implements a hybrid mode that combines distributed learning and
federated learning. The implementation is based on PR #131.

The current implementation has a bug that can cause a deadlock when
trainers arrive late. The fix will be handled in a separate PR because
it requires substantial changes in the backend, channel, channel
manager, etc.


GaoxiangLuo · 2022-07-08T14:04:57Z

lib/python/flame/mode/hybrid/trainer.py

+        #
+        # non-committers send a dummy message so that the aggregator won't
+        # be blocked.
+        # TODO: figure out a way not to send a dummy message


Isn't line 105 already a way to send the dummy message?

It's better if there is a way to skip a dummy message.
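To make the trade-off in this thread concrete, here is a minimal sketch of the dummy-message pattern described in the code comment above. It is not the Flame implementation: the queue-based channel, the function names `trainer_step` and `aggregate`, the use of `None` as the dummy payload, and the element-wise averaging are all assumptions for illustration. The only idea taken from the diff is that non-committers still send something so that an aggregator waiting on every trainer is not blocked.

```python
import queue


def trainer_step(trainer_id, is_committer, weights, to_aggregator):
    """Send this round's message to the aggregator (hypothetical helper)."""
    if is_committer:
        # The committer uploads the real weights.
        to_aggregator.put((trainer_id, weights))
    else:
        # Non-committers send a dummy payload (None here, by assumption)
        # so the aggregator's per-trainer receive loop can finish.
        to_aggregator.put((trainer_id, None))


def aggregate(to_aggregator, num_trainers):
    """Wait for one message per trainer and average the real payloads."""
    received = []
    for _ in range(num_trainers):
        # Blocks until a message arrives; without the dummy messages this
        # loop would wait on non-committers until the timeout fires.
        _, payload = to_aggregator.get(timeout=1)
        if payload is not None:
            received.append(payload)
    if not received:
        return None
    # Element-wise mean over the committed weight vectors.
    return [sum(values) / len(received) for values in zip(*received)]


channel = queue.Queue()
trainer_step("t1", True, [1.0, 3.0], channel)
trainer_step("t2", False, [9.0, 9.0], channel)  # payload is not sent
trainer_step("t3", True, [3.0, 5.0], channel)
averaged = aggregate(channel, num_trainers=3)
```

Skipping the dummy message, as the TODO suggests, would require the aggregator to know in advance how many committers to expect instead of counting all trainers.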

GaoxiangLuo

lgtm. The implementation is consistent with the previous PR.

When I was testing it:

If a trainer of us/org is already running with the aggregator, a new trainer of us/XXX that joins causes a deadlock.
However, a new trainer of eu/XXX that joins does not cause a deadlock if there are no existing eu/XXX trainers.

In short, a new trainer hangs only if there is an existing trainer in the same realm with which it could form a ring. The new trainer is blocked at the stage of fetching weights from the aggregator. Hence, the problem may be related to how the aggregator-trainer channel and the trainer-trainer channel are distinguished. This will be resolved in a separate PR.
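For readers unfamiliar with the ring that same-realm trainers form, the following is a generic, single-process sketch of the textbook ring allreduce algorithm. It is not the code in this PR: the function name `ring_allreduce`, the list-based "workers", and the chunking scheme are assumptions for illustration. It also shows why every ring member must take part in every step, which is relevant when a trainer joins late.

```python
def ring_allreduce(vectors):
    """Sum equal-length vectors with a simulated ring allreduce.

    vectors[i] is the local data of worker i. Returns one list per
    worker; every worker ends up with the element-wise sum.
    """
    n = len(vectors)
    size = len(vectors[0])
    # Split the index range into n contiguous chunks.
    bounds = [(k * size // n, (k + 1) * size // n) for k in range(n)]
    data = [list(v) for v in vectors]

    # Phase 1 (scatter-reduce): in step s, worker i sends chunk (i - s) % n
    # to worker (i + 1) % n, which adds it to its own copy.
    for step in range(n - 1):
        # Snapshot outgoing chunks so all sends in a step are simultaneous.
        outgoing = []
        for i in range(n):
            k = (i - step) % n
            lo, hi = bounds[k]
            outgoing.append((k, data[i][lo:hi]))
        for i, (k, chunk) in enumerate(outgoing):
            dst = (i + 1) % n
            lo, _ = bounds[k]
            for j, value in enumerate(chunk):
                data[dst][lo + j] += value

    # Worker i now holds the fully reduced chunk (i + 1) % n.
    # Phase 2 (allgather): in step s, worker i forwards chunk
    # (i + 1 - s) % n, and the receiver overwrites its copy.
    for step in range(n - 1):
        outgoing = []
        for i in range(n):
            k = (i + 1 - step) % n
            lo, hi = bounds[k]
            outgoing.append((k, data[i][lo:hi]))
        for i, (k, chunk) in enumerate(outgoing):
            dst = (i + 1) % n
            lo, hi = bounds[k]
            data[dst][lo:hi] = chunk

    return data


reduced = ring_allreduce([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
```

In a real deployment each send is a blocking network call, so a member that is counted in the ring but is not yet sending or receiving stalls all of its neighbors.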

Hybrid mode of combining distributed learning and federated learning is implemented. This implementation is based on PR cisco-open#131. The current implementation has a bug that can cause a deadlock when trainers arrive late. Addressing the issue will be handled as a separate PR as it requires substantial changes in backend, channel, channel manager, etc.
Co-authored-by: Gaoxiang Luo <luo00042@umn.edu>

myungjin requested a review from GaoxiangLuo July 7, 2022 23:15

myungjin mentioned this pull request Jul 7, 2022

feature: hybrid mode with ring allreduce #131

Closed

GaoxiangLuo reviewed Jul 8, 2022

GaoxiangLuo approved these changes Jul 8, 2022

myungjin merged commit 53594f9 into cisco-open:main Jul 8, 2022

myungjin deleted the ring_reduce_hybrid branch July 8, 2022 16:13
