[Ray Core] Ray nightly GPU docker image broken on NVIDIA V100 GPUs on AWS · Issue #43565 · ray-project/ray · GitHub
[Ray Core] Ray nightly GPU docker image broken on NVIDIA V100 GPUs on AWS #43565
Open
@jaanphare

Description

What happened + What you expected to happen

  1. This template works on NVIDIA A10 GPUs on AWS (g5.xlarge instances): https://github.com/ray-project/ray/blob/master/python/ray/autoscaler/aws/example-full.yaml
  2. Changing one line in this configuration (the instance type, from g5.xlarge to p3.2xlarge) causes this command to fail: python -c "import torch; print(torch.cuda.is_available())". On g5.xlarge instances it runs fine.

The base AMI is the NVIDIA AMI (ami-041855406987a648b).

SSHing into the node directly and running python -c "import torch; print(torch.cuda.is_available())" on the host works fine, which suggests the problem is isolated to the Ray Docker image.
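To compare the host and the container in more detail than a single True/False, a small diagnostic along these lines could be run in both places. This is only a sketch: the function name `collect_gpu_diagnostics` is made up for illustration, and it assumes `nvidia-smi` may or may not be on the PATH and that `torch` may or may not import. For reference, the V100 has CUDA compute capability 7.0 and the A10 has 8.6, so the driver version, the CUDA version PyTorch was built with, and the compiled arch list are the fields worth comparing.

```python
import importlib
import shutil
import subprocess


def collect_gpu_diagnostics():
    """Gather driver-level and PyTorch-level CUDA details into a plain dict.

    Hypothetical helper for illustration; not part of Ray or PyTorch.
    """
    info = {"nvidia_smi": None, "torch": None}

    # Driver-level view: is the GPU visible at all in this environment?
    if shutil.which("nvidia-smi"):
        result = subprocess.run(
            ["nvidia-smi", "--query-gpu=name,driver_version",
             "--format=csv,noheader"],
            capture_output=True,
            text=True,
        )
        info["nvidia_smi"] = result.stdout.strip() or result.stderr.strip()

    # Framework-level view: what was this PyTorch build compiled for?
    try:
        torch = importlib.import_module("torch")
    except ImportError:
        return info  # torch is not installed here; report driver info only

    details = {
        "version": torch.__version__,
        "built_with_cuda": torch.version.cuda,
        "cuda_available": torch.cuda.is_available(),
        "arch_list": torch.cuda.get_arch_list(),
    }
    if details["cuda_available"]:
        details["device_capability"] = torch.cuda.get_device_capability(0)
    info["torch"] = details
    return info


if __name__ == "__main__":
    for key, value in collect_gpu_diagnostics().items():
        print(f"{key}: {value}")
```

Running it once on the host and once inside the container, then diffing the two outputs, may show whether the container sees the driver at all and whether its PyTorch build differs from the host's.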

So far all of these fail:

rayproject/ray-ml:latest-gpu
rayproject/ray:nightly-py310-cu121
rayproject/ray-ml:latest-py39-cu118
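One way to separate the image from the cluster launcher might be to run the same check in each image directly with Docker on the p3.2xlarge host, bypassing `ray up`. The sketch below only builds and prints the commands. It assumes the NVIDIA Container Toolkit is installed on the host (so that `--gpus all` works) and that `python` is on the PATH inside the images; `docker_check_command` is a hypothetical helper name.

```python
import shlex

# The three images reported as failing in this issue.
IMAGES = [
    "rayproject/ray-ml:latest-gpu",
    "rayproject/ray:nightly-py310-cu121",
    "rayproject/ray-ml:latest-py39-cu118",
]

# The same one-line check used in the reproduction.
CHECK = "import torch; print(torch.cuda.is_available())"


def docker_check_command(image):
    """Return the argv list that runs the CUDA check inside `image`."""
    return ["docker", "run", "--rm", "--gpus", "all", image,
            "python", "-c", CHECK]


if __name__ == "__main__":
    # Print shell-quoted commands so they can be pasted into the host shell.
    for image in IMAGES:
        print(shlex.join(docker_check_command(image)))
```

If these commands print True while the container started by the autoscaler prints False, the difference would more likely be in how the container is launched than in the image itself.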

Any ideas for debugging this? Thank you!!

Versions / Dependencies

rayproject/ray-ml:latest-py39-cu118
rayproject/ray:nightly-py310-cu121
rayproject/ray-ml:latest-gpu

p3.2xlarge and g5.xlarge instances on AWS with https://github.com/ray-project/ray/blob/master/python/ray/autoscaler/aws/example-full.yaml.

Reproduction script

ray up https://github.com/ray-project/ray/blob/master/python/ray/autoscaler/aws/example-full.yaml
python -c "import torch; print(torch.cuda.is_available())"

Issue Severity

High: It blocks me from completing my task.

Labels

P3: Issue moderate in impact or severity
bug: Something that is supposed to be working; but isn't
core: Issues that should be addressed in Ray Core
core-clusters: For launching and managing Ray clusters/jobs/kubernetes
pending-cleanup: This issue is pending cleanup. It will be removed in 2 weeks after being assigned.
usability
