Description
What happened + What you expected to happen
I want to run a vLLM workload where I use placement groups to create and schedule the vLLM task on a GPU node, with the placement_group_capture_child_tasks flag set so that the task's child tasks run in the same GPU placement group. However, I noticed that the placement group does not set CUDA_VISIBLE_DEVICES correctly: Ray leaves the variable empty instead of mapping it to the appropriate physical device id.
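For comparison, here is a minimal sketch of the behavior I expected. A plain task that declares num_gpus=1 outside of any placement group normally ends up with CUDA_VISIBLE_DEVICES pointing at the reserved physical device (same single-GPU node as in the repro below):

import os
import ray

ray.init(num_gpus=1, num_cpus=1)

@ray.remote(num_gpus=1)
def g():
    # Outside a placement group, Ray maps the reserved GPU into CUDA_VISIBLE_DEVICES.
    print(f"CUDA_VISIBLE_DEVICES: {os.environ.get('CUDA_VISIBLE_DEVICES')}")  # e.g. "0"
    print(f"ray.get_gpu_ids(): {ray.get_gpu_ids()}")  # the GPU ids Ray assigned to this worker

ray.get(g.remote())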
Versions / Dependencies
ray==2.46.0
Reproduction script
Repro code:
import os
from ray.util.placement_group import placement_group
from ray.util.scheduling_strategies import PlacementGroupSchedulingStrategy
import ray

ray.init(num_gpus=1, num_cpus=1)

@ray.remote(
    scheduling_strategy=PlacementGroupSchedulingStrategy(
        # Reserve a single {GPU: 1, CPU: 1} bundle and capture child tasks into it.
        placement_group=placement_group(
            [{"GPU": 1, "CPU": 1}] * 1,
            strategy="STRICT_PACK",
        ),
        placement_group_capture_child_tasks=True,
    ),
)
def f():
    print(f"CUDA_VISIBLE_DEVICES in environment: {os.environ['CUDA_VISIBLE_DEVICES']}")
    print(f"CUDA_VISIBLE_DEVICES is empty: {os.environ.get('CUDA_VISIBLE_DEVICES') == ''}")

ray.get(f.remote())
Output from code:
(marin) cychou@sphinx3:/nlp/scr/cychou/marin$ python experiments/sched_strategy.py
2025-06-07 20:13:21,052 INFO worker.py:1879 -- Started a local Ray instance. View the dashboard at http://127.0.0.1:8265
(f pid=481782) CUDA_VISIBLE_DEVICES in environment:
(f pid=481782) CUDA_VISIBLE_DEVICES is empty: True
nvidia-smi output:
(marin) cychou@sphinx3:/nlp/scr/cychou/marin$ nvidia-smi
Sat Jun 7 20:14:24 2025
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.54.03              Driver Version: 535.54.03    CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100-SXM4-80GB          On  | 00000000:44:00.0 Off |                    0 |
| N/A   24C    P0              60W / 400W |      4MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes:                                                                             |
|  GPU   GI   CI        PID   Type   Process name                             GPU Memory |
|        ID   ID                                                                  Usage  |
|=======================================================================================|
|  No running processes found                                                            |
+---------------------------------------------------------------------------------------+
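As an additional check (not part of the original repro), the placement group itself can be inspected to confirm that the GPU bundle really is reserved even though the task sees an empty CUDA_VISIBLE_DEVICES. A small diagnostic sketch using Ray's placement_group_table helper (this reports bundle reservation only, not the environment-variable mapping):

import ray
from ray.util.placement_group import placement_group, placement_group_table

ray.init(num_gpus=1, num_cpus=1)

pg = placement_group([{"GPU": 1, "CPU": 1}], strategy="STRICT_PACK")
ray.get(pg.ready())               # block until the GPU/CPU bundle is actually reserved
print(placement_group_table(pg))  # shows bundle state and the node it was placed on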
Issue Severity
I have to work around it by calling del os.environ['CUDA_VISIBLE_DEVICES'] before running the task (see the sketch below).
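For reference, a minimal sketch of that workaround. It relies on the fact that an empty CUDA_VISIBLE_DEVICES hides every GPU from CUDA, so removing the variable makes the node's GPUs visible again; the placement-group setup is the same as in the repro above:

import os
import ray

@ray.remote
def f():
    # Workaround: Ray set CUDA_VISIBLE_DEVICES to an empty string, which hides all
    # GPUs from CUDA. Deleting the variable lets the workload see the node's GPUs.
    if os.environ.get("CUDA_VISIBLE_DEVICES") == "":
        del os.environ["CUDA_VISIBLE_DEVICES"]
    # ... run the vLLM workload here ...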