Training a model with tied weights with FSDP1 on multiple GPUs doesn't raise an error · Issue #264 · NVIDIA-NeMo/RL
Closed
@yfw

Description


For example, the following trains without raising an error:

uv run examples/run_grpo_math.py --config=examples/configs/grpo_math_1B.yaml grpo.val_at_start=False checkpointing.enabled=False logger.wandb_enabled=False cluster.gpus_per_node=2 policy.model_name=Qwen/Qwen2.5-1.5B policy.dtensor_cfg.enabled=False

Changing cluster.gpus_per_node to 1 raises the error as expected:

uv run examples/run_grpo_math.py --config=examples/configs/grpo_math_1B.yaml grpo.val_at_start=False checkpointing.enabled=False logger.wandb_enabled=False cluster.gpus_per_node=1 policy.model_name=Qwen/Qwen2.5-1.5B policy.dtensor_cfg.enabled=False

This seems to be because find_tied_parameters doesn't work correctly on FSDP-wrapped models (presumably because FSDP1 flattens the parameters, so a check based on shared parameter objects no longer sees the tie). From my testing, transformers.modeling_utils._get_tied_weight_keys seems to work correctly.
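
Since the fix appears to hinge on how tied weights are detected, here is a minimal sketch of detecting them from the model definition rather than from parameter identity. Only transformers.modeling_utils._get_tied_weight_keys and the Qwen/Qwen2.5-1.5B model name come from this issue; the check_tied_weights helper, the ValueError, and running the check before FSDP wrapping are illustrative assumptions, not the repository's actual code.

```python
# Sketch only: read the tied-weight keys declared by the HF model class
# instead of comparing live parameter objects, which FSDP1's flat parameters
# can hide on multi-GPU runs.
from transformers import AutoModelForCausalLM
from transformers.modeling_utils import _get_tied_weight_keys


def check_tied_weights(model) -> None:
    """Raise if the (unwrapped) Hugging Face model declares tied weight keys."""
    tied_keys = _get_tied_weight_keys(model)  # e.g. ["lm_head.weight"] for Qwen2.5
    if tied_keys:
        raise ValueError(
            f"Model declares tied weights {tied_keys}; this path does not support them."
        )


if __name__ == "__main__":
    # Qwen/Qwen2.5-1.5B ties its input embeddings and lm_head, so this raises.
    model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-1.5B")
    check_tied_weights(model)
```

Running the detection on the unwrapped model also sidesteps the FSDP wrapping question entirely, because the tied keys come from the model class's metadata rather than from which tensors currently share storage.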

Metadata

Labels: bug (Something isn't working)
