Description
Describe the bug
When using the Megatron backend, the distributed optimizer combined with overlapped parameter gather (overlap_param_gather) causes training divergence for some algorithms. Because of this, we currently block users from running with overlap_param_gather + distributed optimizer. We need to fix this bug to get maximum performance out of the Megatron backend; an illustrative sketch of the offending configuration follows.
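For reference, here is a minimal sketch of the configuration pair in question. It assumes the DistributedDataParallelConfig and OptimizerConfig dataclasses from recent megatron.core releases; the exact class and field names may differ in the version this repo pins, so treat this as illustrative rather than an exact reproduction.

```python
# Illustrative sketch only (not the exact repro): config classes and field names are
# assumptions based on recent megatron.core releases and may differ from the
# version pinned by this repo.
from megatron.core.distributed import DistributedDataParallelConfig
from megatron.core.optimizer import OptimizerConfig

# Problematic combination: sharded (distributed) optimizer state plus parameter
# all-gather overlapped with the forward pass.
ddp_config = DistributedDataParallelConfig(
    use_distributed_optimizer=True,  # shard optimizer state across data-parallel ranks
    overlap_grad_reduce=True,        # overlap gradient reduce-scatter with the backward pass
    overlap_param_gather=True,       # overlap param all-gather with the forward pass (diverges)
)

# Conservative configuration currently enforced: keep the distributed optimizer
# but disable the overlapped parameter gather.
safe_ddp_config = DistributedDataParallelConfig(
    use_distributed_optimizer=True,
    overlap_grad_reduce=True,
    overlap_param_gather=False,
)

# Matching optimizer config (distributed Adam path).
optim_config = OptimizerConfig(
    optimizer="adam",
    lr=1e-5,
    use_distributed_optimizer=True,
)
```

The second config reflects the current guard rail: the distributed optimizer stays on for its memory savings while overlap_param_gather is forced off until the divergence is understood and fixed.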
Steps/Code to reproduce bug
Please list minimal steps or a code snippet for us to be able to reproduce the bug.
A helpful guide on how to craft a minimal bug report: http://matthewrocklin.com/blog/work/2018/02/28/minimal-bug-reports.
Expected behavior
A clear and concise description of what you expected to happen.
Environment overview (please complete the following information)
- Environment location: [Bare-metal, Docker, Cloud (specify cloud provider - AWS, Azure, GCP, Colab)]
- Method of install: [pip install or from source]. Please specify exact commands you used to install.
- If method of install is [Docker], provide the docker pull and docker run commands used
Environment details
If an NVIDIA Docker image is used, you don't need to specify these.
Otherwise, please provide:
- OS version
- PyTorch version
- Python version
Additional context
Add any other context about the problem here.
Example: GPU model