Currently, general FSDP2 runs well for all non-MoE models, but not for MoE models (e.g., Qwen3-30B-A3B, DeepSeek-V2-Lite):

- For Qwen3-30B-A3B, it is noticeably slower than Qwen3-32B, especially during the refit process or when using the hf-tp-plan with dtensor tp > 1.
- For DeepSeek-V2-Lite, it fails with the following error on `model.layers.0.self_attn.rotary_emb.cos_cached`, reporting `v.shape=torch.Size([2048, 64])` and `self.reference_model_buffers[k].shape=torch.Size([163840, 64])`:
File "/workspace/nemo_rl/models/policy/dtensor_policy_worker.py", line 649, in get_reference_policy_logprobs
with self.use_reference_model():
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/yukih/.local/share/uv/python/cpython-3.12.9-linux-x86_64-gnu/lib/python3.12/contextlib.py", line 137, in __enter__
return next(self.gen)
^^^^^^^^^^^^^^
File "/workspacenemo_rl/models/policy/dtensor_policy_worker.py", line 626, in use_reference_model
val.copy_(self.reference_model_buffers[k])
RuntimeError: The size of tensor a (2048) must match the size of tensor b (163840) at non-singleton dimension 0
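Note that 163840 matches DeepSeek-V2-Lite's `max_position_embeddings`, while 2048 is presumably the sequence length the live model's rotary cache was last built for. Since `cos_cached`/`sin_cached` are non-persistent caches that the HF rotary modules resize on demand, the policy model and the saved reference-buffer snapshot can legitimately hold different shapes. Below is a minimal sketch of one possible workaround, assuming `use_reference_model` restores buffers by iterating `named_buffers()` as the traceback suggests; the standalone signature and the cache-skip heuristic are my assumptions, not the actual NeMo-RL code:

```python
from contextlib import contextmanager

import torch

@contextmanager
def use_reference_model(model: torch.nn.Module, reference_buffers: dict):
    """Temporarily swap the model's buffers for the saved reference copies.

    Hypothetical sketch mirroring use_reference_model in
    dtensor_policy_worker.py; shape-mismatched rotary caches are
    skipped rather than copied.
    """
    # Save the current (policy) buffers so they can be restored on exit.
    saved = {k: v.detach().clone() for k, v in model.named_buffers()}
    try:
        with torch.no_grad():
            for k, val in model.named_buffers():
                ref = reference_buffers.get(k)
                if ref is None:
                    continue
                if ref.shape != val.shape:
                    # Non-persistent rotary caches are sized by the longest
                    # sequence seen so far and are recomputed on demand, so a
                    # mismatch here is expected and safe to skip (assumption).
                    if k.endswith(("cos_cached", "sin_cached")):
                        continue
                    raise RuntimeError(f"Unexpected buffer shape mismatch: {k}")
                val.copy_(ref)
        yield
    finally:
        with torch.no_grad():
            for k, val in model.named_buffers():
                val.copy_(saved[k])
```

If skipping turns out to be unsafe for some module, recomputing the cache for the current maximum sequence length (or slicing the larger saved copy down to the live shape, since rotary caches are indexed by position from 0) would be a more conservative fix.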