8000 fix: zero weight upload from trainer in hybrid mode by jaemin-shin · Pull Request #389 · cisco-open/flame · GitHub
[go: up one dir, main page]
More Web Proxy on the site http://driver.im/
Skip to content

fix: zero weight upload from trainer in hybrid mode #389

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Apr 3, 2023

Conversation

jaemin-shin
Copy link
Collaborator

remove _update_weights() call from _upload_weights() function before calculating the delta weights, as such a function call makes self.prev_weights and self.weights equal, resulting in delta_weights to be all zero

Description

Please provide a meaningful description of what this change will do, or is for. Bonus points for including links to
related issues, other PRs, or technical references.

Note that by not including a description, you are asking reviewers to do extra work to understand the context of this
change, which may lead to your PR taking much longer to review, or result in it not being reviewed at all.

Type of Change

  • Bug Fix
  • New Feature
  • Breaking Change
  • Refactor
  • Documentation
  • Other (please describe)

Checklist

  • I have read the contributing guidelines
  • Existing issues have been referenced (where applicable)
  • I have verified this change is not present in other open pull requests
  • Functionality is documented
  • All code style checks pass
  • New code contribution is covered by automated tests
  • All new and existing tests pass

@jaemin-shin jaemin-shin requested a review from myungjin April 3, 2023 20:20
@@ -109,8 +109,6 @@ def _upload_weights(self, tag: str) -> None:
# one aggregator is sufficient
end = channel.one_end()

self._update_weights()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a nice catch! Thanks!
But this needs some additional change: I think it should be moved to the statement "if not self.can_ring_allreduce():".
If all reduce was executed, this self._update_weights() is problematic because it was already called.
If all reduce was not executed, removing this part means that prev_weights was not updated. To handle this case, I think self._update_weights() needs to be inside the if-statement.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, you are correct! Should have catched this before I make a PR. Just updated the PR!

remove _update_weights() call from _upload_weights() function before
calculating the delta weights, as such a function call makes
self.prev_weights and self.weights equal, resulting in delta_weights to
be all zero
@jaemin-shin jaemin-shin force-pushed the hybrid-zerodelta-fix branch from 171bfde to 8a57690 Compare April 3, 2023 21:41
Copy link
Contributor
@myungjin myungjin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@jaemin-shin jaemin-shin merged commit c71ffc0 into cisco-open:main Apr 3, 2023
dhruvsgarg pushed a commit to dhruvsgarg/flame that referenced this pull request Oct 18, 2024
remove _update_weights() call from _upload_weights() function before
calculating the delta weights, as such a function call makes
self.prev_weights and self.weights equal, resulting in delta_weights to
be all zero
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants
0