Open
Description
Describe the bug
Errror:
raise RuntimeError(f"NCCL error: {error_str}")
RuntimeError: NCCL error: unhandled system error (run with NCCL_DEBUG=INFO for details)
The error goes away with NCCL_SHM_DISABLE=1
.
Steps/Code to reproduce bug
uv run pytest tests/unit/models/generation/test_vllm_generation.py::test_vllm_refit_non_collocated_handles_update
Expected behavior
A clear and concise description of what you expected to happen.
Environment overview (please complete the following information)
- Environment location: [Bare-metal, Docker, Cloud(specify cloud provider - AWS, Azure, GCP, Collab)]
- Method of install: [pip install or from source]. Please specify exact commands you used to install.
- If method of install is [Docker], provide
docker pull
&docker run
commands used
Environment details
If NVIDIA docker image is used you don't need to specify these.
Otherwise, please provide:
- OS version
- PyTorch version
- Python version
Additional context
Add any other context about the problem here.
Example: GPU model