Description
What happened + What you expected to happen
- This template works on NVIDIA A10 GPUs on AWS (g5.xlarge instances): https://github.com/ray-project/ray/blob/master/python/ray/autoscaler/aws/example-full.yaml
- Changing a single line in this configuration (the instance type, from g5.xlarge to p3.2xlarge) causes the following command to fail (see the config excerpt below):
`python -c "import torch; print(torch.cuda.is_available())"`
(on g5.xlarge instances it runs fine).
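For reference, a minimal sketch of the edited section of the config; the node type name is an assumption based on the upstream example, and only InstanceType was changed:

```yaml
# Sketch of the modified part of example-full.yaml (node type name assumed).
available_node_types:
  ray.worker.default:
    node_config:
      InstanceType: p3.2xlarge          # was g5.xlarge; this is the only edit
      ImageId: ami-041855406987a648b    # NVIDIA base AMI used for both instance types
```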
The base AMI is the NVIDIA AMI (ami-041855406987a648b).
SSHing into the node directly and running `python -c "import torch; print(torch.cuda.is_available())"`
works fine, which suggests the problem is isolated to the Ray Docker image (see the comparison below).
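For concreteness, this is roughly the comparison; the container name is a placeholder and the exact docker invocation depends on how the autoscaler started the container:

```bash
# On the p3.2xlarge host, outside Docker (works):
nvidia-smi
python -c "import torch; print(torch.cuda.is_available())"

# Inside the Ray container (fails); <ray_container> is a placeholder name.
docker exec -it <ray_container> nvidia-smi
docker exec -it <ray_container> python -c "import torch; print(torch.cuda.is_available())"
```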
So far all of these fail:
rayproject/ray-ml:latest-gpu
rayproject/ray:nightly-py310-cu121
rayproject/ray-ml:latest-py39-cu118
Any ideas for debugging this? Thank you!!
Versions / Dependencies
rayproject/ray-ml:latest-py39-cu118
rayproject/ray:nightly-py310-cu121
rayproject/ray-ml:latest-gpu
p3.2xlarge and g5.xlarge instances on AWS with https://github.com/ray-project/ray/blob/master/python/ray/autoscaler/aws/example-full.yaml.
Reproduction script
# example-full.yaml is https://github.com/ray-project/ray/blob/master/python/ray/autoscaler/aws/example-full.yaml with the instance type changed from g5.xlarge to p3.2xlarge
ray up example-full.yaml
# then, on the head node:
python -c "import torch; print(torch.cuda.is_available())"
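Expanded a bit, the reproduction looks roughly like this (the sed pattern assumes the config contains a literal `InstanceType: g5.xlarge` line; otherwise edit it by hand):

```bash
# Fetch the upstream example autoscaler config
wget https://raw.githubusercontent.com/ray-project/ray/master/python/ray/autoscaler/aws/example-full.yaml

# Switch the instance type from g5.xlarge to p3.2xlarge (the only change)
sed -i 's/InstanceType: g5.xlarge/InstanceType: p3.2xlarge/' example-full.yaml

# Launch the cluster and run the CUDA check inside the Ray container on the head node
ray up -y example-full.yaml
ray exec example-full.yaml 'python -c "import torch; print(torch.cuda.is_available())"'
```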
Issue Severity
High: It blocks me from completing my task.