Support MoE models in FSDP2 · Issue #413 · NVIDIA/NeMo-RL · GitHub

Support MoE models in FSDP2 #413

Open
yuki-666 opened this issue May 19, 2025 · 0 comments
Labels: enhancement (New feature or request), new model

Comments

@yuki-666 (Collaborator)

Currently, FSDP2 runs well for general (non-MoE) models, but not for MoE models (e.g., Qwen3-30B-A3B, DeepSeek-V2-Lite).

  1. For Qwen3-30B-A3B, training is noticeably slower than for Qwen3-32B, especially during the refit process or when using the hf-tp-plan with DTensor TP > 1 (a tensor-count sketch follows the traceback below).

  2. For DeepSeek-V2-Lite, it fails with the following error on model.layers.0.self_attn.rotary_emb.cos_cached: v.shape=torch.Size([2048, 64]) does not match self.reference_model_buffers[k].shape=torch.Size([163840, 64]):

File "/workspace/nemo_rl/models/policy/dtensor_policy_worker.py", line 649, in get_reference_policy_logprobs
  with self.use_reference_model():
       ^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/yukih/.local/share/uv/python/cpython-3.12.9-linux-x86_64-gnu/lib/python3.12/contextlib.py", line 137, in __enter__
  return next(self.gen)
         ^^^^^^^^^^^^^^
File "/workspacenemo_rl/models/policy/dtensor_policy_worker.py", line 626, in use_reference_model
  val.copy_(self.reference_model_buffers[k])
RuntimeError: The size of tensor a (2048) must match the size of tensor b (163840) at non-singleton dimension 0
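
The RuntimeError above happens in use_reference_model when restoring snapshotted buffers with val.copy_(...): cos_cached/sin_cached are rotary-embedding caches whose first dimension tracks sequence length (163840 is DeepSeek-V2-Lite's max_position_embeddings), so the live model's cache can have a different shape than the snapshot taken for the reference model, and Tensor.copy_ requires matching shapes. A minimal sketch of one possible workaround (the helper name and the shape-mismatch handling are my assumptions, not NeMo-RL's actual code) is to re-register a buffer whose shape has drifted instead of copying in place:

```python
import torch

def restore_buffers(model: torch.nn.Module,
                    saved_buffers: dict[str, torch.Tensor]) -> None:
    """Restore a snapshot of buffers onto `model` (workaround sketch).

    Rotary caches such as `cos_cached`/`sin_cached` can change length
    between snapshot and restore, so fall back to re-registering the
    buffer when the shapes no longer agree.
    """
    live_buffers = dict(model.named_buffers())
    for name, saved in saved_buffers.items():
        live = live_buffers[name]
        if live.shape == saved.shape:
            live.copy_(saved)  # fast path: shapes match, copy in place
        else:
            # Shape drifted (e.g. the rotary cache was rebuilt for a
            # different max sequence length): swap the whole buffer on
            # the owning submodule instead of copying element-wise.
            owner_path, _, buf_name = name.rpartition(".")
            owner = model.get_submodule(owner_path)
            owner.register_buffer(
                buf_name, saved.clone().to(live.device), persistent=False
            )
```

An alternative would be to skip non-persistent caches entirely when snapshotting the reference model, since modeling code typically recomputes them on demand; which option is safe depends on how DeepSeek-V2-Lite's rotary implementation manages these buffers.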
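
As for the slowdown in (1), one plausible contributor (an assumption, not something profiled here) is the sheer number of parameter tensors an MoE checkpoint carries: any refit path that gathers and transfers weights tensor-by-tensor pays a per-tensor collective cost, and routed experts multiply the tensor count. A back-of-the-envelope comparison in Python, with layer/expert counts taken from the public HF configs (worth double-checking):

```python
# MLP weight tensors only; attention/embedding tensor counts are
# comparable in both models. Counts below are assumptions from the
# public HF configs of the two checkpoints.
DENSE_LAYERS = 64   # Qwen3-32B
MOE_LAYERS = 48     # Qwen3-30B-A3B
NUM_EXPERTS = 128   # routed experts per MoE layer
MLP_WEIGHTS = 3     # gate_proj, up_proj, down_proj

dense = DENSE_LAYERS * MLP_WEIGHTS                  # 192 tensors
moe = MOE_LAYERS * (NUM_EXPERTS * MLP_WEIGHTS + 1)  # 18480 tensors (+1 router per layer)
print(f"Qwen3-32B MLP tensors:     {dense}")
print(f"Qwen3-30B-A3B MLP tensors: {moe}  (~{moe // dense}x more)")
```

If refit or the hf-tp-plan redistribution iterates per tensor, the MoE model would issue roughly two orders of magnitude more collectives per refit, which would be consistent with the observed gap; batching an entire layer's expert weights into a single transfer would be one possible mitigation.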