NCCL error when using non-colocated generation and set_model_state_dict apis · Issue #564 · NVIDIA-NeMo/RL · GitHub
NCCL error when using non-colocated generation and set_model_state_dict apis #564
Open
@parthchadha

Description

Describe the bug

Error:

raise RuntimeError(f"NCCL error: {error_str}")
RuntimeError: NCCL error: unhandled system error (run with NCCL_DEBUG=INFO for details)

The error goes away when the environment variable NCCL_SHM_DISABLE=1 is set (this disables NCCL's shared-memory transport).

Steps/Code to reproduce bug

uv run pytest tests/unit/models/generation/test_vllm_generation.py::test_vllm_refit_non_collocated_handles_update
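To compare the failing and working configurations, the reproduction command can be run with NCCL debug logging on and the NCCL_SHM_DISABLE workaround toggled. The following is a minimal sketch, not part of the repository: the helper names (build_env, build_command, run_repro) are hypothetical, and it assumes uv and the test file are available in the current checkout.

```python
import os
import subprocess

# Test node id copied from the reproduction step in this issue.
TEST_ID = (
    "tests/unit/models/generation/test_vllm_generation.py"
    "::test_vllm_refit_non_collocated_handles_update"
)


def build_env(disable_shm, base=None):
    """Return a copy of the environment with NCCL debug logging enabled.

    disable_shm=True applies the NCCL_SHM_DISABLE=1 workaround from this issue;
    disable_shm=False removes the variable so the failing configuration is used.
    """
    env = dict(os.environ if base is None else base)
    env["NCCL_DEBUG"] = "INFO"  # suggested by the NCCL error message itself
    if disable_shm:
        env["NCCL_SHM_DISABLE"] = "1"
    else:
        env.pop("NCCL_SHM_DISABLE", None)
    return env


def build_command():
    """Command line from the issue's reproduction step."""
    return ["uv", "run", "pytest", TEST_ID]


def run_repro(disable_shm):
    """Run the test once and return its exit code (hypothetical helper)."""
    result = subprocess.run(build_command(), env=build_env(disable_shm))
    return result.returncode
```

Calling run_repro(disable_shm=False) should exercise the failing path with NCCL_DEBUG=INFO output, while run_repro(disable_shm=True) applies the workaround, so the two logs can be compared side by side.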

Expected behavior

The test completes without an NCCL error, and without needing to set NCCL_SHM_DISABLE=1.

Metadata

Assignees

No one assigned

Labels

bug (Something isn't working)
