feature: hybrid mode with ring allreduce by myungjin · Pull Request #170 · cisco-open/flame · GitHub

feature: hybrid mode with ring allreduce #170

Merged
myungjin merged 1 commit into cisco-open:main from ring_reduce_hybrid
Jul 8, 2022

Conversation

myungjin
Contributor
@myungjin myungjin commented Jul 7, 2022

This PR implements a hybrid mode that combines distributed learning
and federated learning. The implementation is based on PR #131.

The current implementation has a bug that can cause a deadlock when
trainers arrive late. The fix will be handled in a separate PR because
it requires quite a few changes to the backend, channel, channel
manager, etc.
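
For readers unfamiliar with the technique named in the title, here is a minimal single-process simulation of ring allreduce (sum). It is only a sketch of the general algorithm and is not taken from this PR or from Flame's code. The function name `ring_allreduce` and the list-of-lists representation of per-trainer weights are assumptions made for illustration.

```python
from typing import List


def ring_allreduce(worker_data: List[List[float]]) -> List[List[float]]:
    """Simulate a sum ring allreduce over n workers holding equal-length vectors.

    Hypothetical helper for illustration; not part of Flame.
    """
    n = len(worker_data)
    length = len(worker_data[0])
    # Chunk c covers indices bounds[c] .. bounds[c + 1] - 1 (may be empty).
    bounds = [(i * length) // n for i in range(n + 1)]
    data = [list(v) for v in worker_data]  # work on copies

    def chunk(c: int) -> range:
        return range(bounds[c], bounds[c + 1])

    # Phase 1: reduce-scatter. In step s, worker r sends chunk (r - s) % n
    # to its right neighbour, which adds it to its own copy.
    for s in range(n - 1):
        # Collect all sends first so the exchange behaves as simultaneous.
        sends = []
        for r in range(n):
            c = (r - s) % n
            sends.append((c, [data[r][i] for i in chunk(c)]))
        for r, (c, payload) in enumerate(sends):
            dst = (r + 1) % n
            for i, v in zip(chunk(c), payload):
                data[dst][i] += v

    # Worker r now holds the fully reduced chunk (r + 1) % n.
    # Phase 2: allgather. In step s, worker r forwards chunk (r + 1 - s) % n
    # to its right neighbour, which overwrites its own copy.
    for s in range(n - 1):
        sends = []
        for r in range(n):
            c = (r + 1 - s) % n
            sends.append((c, [data[r][i] for i in chunk(c)]))
        for r, (c, payload) in enumerate(sends):
            dst = (r + 1) % n
            for i, v in zip(chunk(c), payload):
                data[dst][i] = v

    return data


if __name__ == "__main__":
    weights = [[1, 2, 3, 4], [10, 20, 30, 40], [100, 200, 300, 400]]
    print(ring_allreduce(weights))
```

Every worker ends up with the same element-wise sum after 2(n - 1) exchange steps, and each step only involves a worker and its ring neighbour. That neighbour-to-neighbour dependency is also why a ring is sensitive to members joining late.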

#
# non-committers send a dummy message so that the aggregator won't
# be blocked.
# TODO: figure out a way not to send a dummy message
Collaborator

Isn't line 105 already a way to send the dummy message?

Contributor Author

It would be better to have a way to skip the dummy message altogether.
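
To make the trade-off in this thread concrete, here is a rough sketch of the dummy-message pattern described in the code comment above. It assumes an aggregator that blocks until it has received one message per trainer. The names `aggregate_round` and `inbox`, and the use of `None` as the dummy payload, are hypothetical and do not reflect Flame's actual API.

```python
import queue
from typing import List, Optional


def aggregate_round(inbox: "queue.Queue", num_trainers: int) -> Optional[List[float]]:
    """Wait for one message per trainer, then average the real updates.

    Hypothetical sketch: a message is a (sender, weights) tuple, and
    weights is None when the sender is a non-committer (dummy message).
    """
    updates = []
    for _ in range(num_trainers):
        _sender, weights = inbox.get()  # blocks until a message arrives
        if weights is None:
            continue  # dummy message: only unblocks the aggregator
        updates.append(weights)

    if not updates:
        return None
    # Element-wise mean over the updates that carried real weights.
    return [sum(col) / len(updates) for col in zip(*updates)]


if __name__ == "__main__":
    inbox = queue.Queue()
    inbox.put(("trainer-a", [1.0, 2.0]))  # committer sends the ring's result
    inbox.put(("trainer-b", None))        # non-committers send dummies
    inbox.put(("trainer-c", None))
    print(aggregate_round(inbox, num_trainers=3))
```

In this sketch the dummies exist only because the aggregator counts messages per trainer. Skipping them would require the aggregator to know in advance how many committers to wait for, which is the open question in the TODO.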

Collaborator
@GaoxiangLuo GaoxiangLuo left a comment

LGTM. The implementation is consistent with the previous PR.

When I was testing it:

  1. If a trainer of us/org is already running with the aggregator, a new trainer of us/XXX joining causes a deadlock.
  2. However, a new trainer of eu/XXX joining does not cause a deadlock if there are no existing eu/XXX trainers.

In short, a new trainer hangs only if there is an existing trainer of the same realm with which it could form a ring. The new trainer is blocked at the stage of fetching weights from the aggregator, so the problem may be related to how the aggregator-trainer channel and the trainer-trainer channel are distinguished. This will be resolved in a separate PR.

@myungjin myungjin merged commit 53594f9 into cisco-open:main Jul 8, 2022
@myungjin myungjin deleted the ring_reduce_hybrid branch July 8, 2022 16:13
dhruvsgarg pushed a commit to dhruvsgarg/flame that referenced this pull request Oct 18, 2024
Co-authored-by: Gaoxiang Luo <luo00042@umn.edu>