Description
What happened + What you expected to happen
- This template works on NVIDIA A10 GPUs on AWS (g5.xlarge instances): https://github.com/ray-project/ray/blob/master/python/ray/autoscaler/aws/example-full.yaml
- Changing a single line in this configuration (the instance type, from g5.xlarge to p3.2xlarge) causes the following command to fail (see the config excerpt below):
`python -c "import torch; print(torch.cuda.is_available())"`
(on g5.xlarge instances it runs fine).
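For reference, a minimal sketch of the edited section of the config; the node type name is an assumption based on the upstream example, and only InstanceType was changed:

```yaml
# Sketch of the modified part of example-full.yaml (node type name assumed).
available_node_types:
  ray.worker.default:
    node_config:
      InstanceType: p3.2xlarge          # was g5.xlarge; this is the only edit
      ImageId: ami-041855406987a648b    # NVIDIA base AMI used for both instance types
```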
The base AMI is the NVIDIA AMI (ami-041855406987a648b).
SSHing into the node directly and running `python -c "import torch; print(torch.cuda.is_available())"`
works fine, which suggests the problem is isolated to the Ray Docker image (see the comparison below).
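For concreteness, this is roughly the comparison; the container name is a placeholder and the exact docker invocation depends on how the autoscaler started the container:

```bash
# On the p3.2xlarge host, outside Docker (works):
nvidia-smi
python -c "import torch; print(torch.cuda.is_available())"

# Inside the Ray container (fails); <ray_container> is a placeholder name.
docker exec -it <ray_container> nvidia-smi
docker exec -it <ray_container> python -c "import torch; print(torch.cuda.is_available())"
```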
So far all of these fail:
rayproject/ray-ml:latest-gpu
rayproject/ray:nightly-py310-cu121
rayproject/ray-ml:latest-py39-cu118
Any ideas for debugging this? Thank you!!
Versions / Dependencies
rayproject/ray-ml:latest-py39-cu118
rayproject/ray:nightly-py310-cu121
rayproject/ray-ml:latest-gpu
p3.2xlarge and g5.xlarge instances on AWS with https://github.com/ray-project/ray/blob/master/python/ray/autoscaler/aws/example-full.yaml.
Reproduction script
# example-full.yaml is https://github.com/ray-project/ray/blob/master/python/ray/autoscaler/aws/example-full.yaml with the instance type changed from g5.xlarge to p3.2xlarge
ray up example-full.yaml
# then, on the head node:
python -c "import torch; print(torch.cuda.is_available())"
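Expanded a bit, the reproduction looks roughly like this (the sed pattern assumes the config contains a literal `InstanceType: g5.xlarge` line; otherwise edit it by hand):

```bash
# Fetch the upstream example autoscaler config
wget https://raw.githubusercontent.com/ray-project/ray/master/python/ray/autoscaler/aws/example-full.yaml

# Switch the instance type from g5.xlarge to p3.2xlarge (the only change)
sed -i 's/InstanceType: g5.xlarge/InstanceType: p3.2xlarge/' example-full.yaml

# Launch the cluster and run the CUDA check inside the Ray container on the head node
ray up -y example-full.yaml
ray exec example-full.yaml 'python -c "import torch; print(torch.cuda.is_available())"'
```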
Issue Severity
High: It blocks me from completing my task.