[docker][Clusters][autoscaler][local] Can't connect to cluster when using docker with ray cluster launcher #16961
Open
@phesse001

Description


What is the problem?

Hello all! I've just been trying out Ray for the past week, so forgive me if this is a question and not a bug, but I am trying to use the Ray cluster launcher to launch a cluster on my school's compute cluster, which I have access to. The trouble (I believe) is that I am a non-root user.

I can launch the cluster on multiple nodes without Docker and have verified that this works using some of the test Python files provided in the documentation. However, on my school's machines I am a non-root user; I am part of the docker group on paper, but in practice I can only run docker commands if I prepend sudo. Since `docker <command>` gets invoked multiple times by the launcher when setting up the cluster with Docker, I added `alias docker='sudo docker'` to my .bashrc, hoping that would work. However, the launcher can't get past step 5/7, "Initializing command runner".
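(For context, the launcher output below shows it runs plain `docker ...` commands over SSH, so it seems to expect the login user to reach the Docker daemon without sudo. For reference, a sketch of the standard docker-group setup from the Docker post-installation docs:)

    # Standard non-root Docker setup (from the Docker post-install docs):
    sudo groupadd docker               # no-op if the group already exists
    sudo usermod -aG docker $USER      # add the current user to the group
    # Group changes only take effect in new sessions: log out and back in,
    # or start a subshell with the new group active:
    newgrp docker
    docker info                        # should now work without sudo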

Reproduction (REQUIRED)

Here is the .yaml file I use when I call `ray up docker-cluster.yaml`:

cluster_name: docker-cluster

docker:
    image: rayproject/ray:latest-cpu
    container_name: "ray_cluster_container"

provider:
    type: local
    head_ip: 10.6.7.3
    worker_ips: [10.6.7.1, 10.6.7.5, 10.6.7.8, 10.6.7.7]
    
auth:
    ssh_user: phesse001

# Typically, min_workers == max_workers == len(worker_ips).
min_workers: 4
max_workers: 4
upscaling_speed: 1.0
idle_timeout_minutes: 5

cluster_synced_files: []

file_mounts_sync_continuously: False

rsync_exclude:
    - "**/.git"
    - "**/.git/**"

rsync_filter:
    - ".gitignore"

initialization_commands: []

setup_commands: []

head_setup_commands: []

worker_setup_commands: []

head_start_ray_commands:
    - ray stop
    - ulimit -c unlimited && ray start --head --port=6379 --autoscaling-config=~/ray_bootstrap_config.yaml

worker_start_ray_commands:
    - ray stop
    - ray start --address=$RAY_HEAD_IP:6379
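(For reference, the standard launcher workflow with a config like this; `-y` just skips the confirmation prompts:)

    ray up -y docker-cluster.yaml      # create or update the cluster
    ray attach docker-cluster.yaml     # open a shell on the head node
    ray down -y docker-cluster.yaml    # tear the cluster down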

When I run `ray up -vvvvvv docker-cluster.yaml` with this config, it gets to step 5; here is the output from that point on (I can give the full output if needed):

[5/7] Initalizing command runner
    Running `command -v docker || echo 'NoExist'`
      Full command is `ssh -tt -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_2bce452b53/8b7e00c36c/%C -o ControlPersist=10s -o ConnectTimeout=120s phesse001@10.6.7.3 bash --login -c -i 'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (command -v docker || echo '"'"'NoExist'"'"')'`
Shared connection to 10.6.7.3 closed.
    Running `docker pull rayproject/ray:latest-cpu`
      Full command is `ssh -tt -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_2bce452b53/8b7e00c36c/%C -o ControlPersist=10s -o ConnectTimeout=120s phesse001@10.6.7.3 bash --login -c -i 'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (docker pull rayproject/ray:latest-cpu)'`
/usr/people/defaults/linuxpaths_bash: line 106: module: command not found
[sudo] password for phesse001: 
latest-cpu: Pulling from rayproject/ray
Digest: sha256:e1132e3518d508b0ecc1910cdd61aaf975d66e2fc64d4677fc0d1fbcf5e1122d
Status: Image is up to date for rayproject/ray:latest-cpu
docker.io/rayproject/ray:latest-cpu
Shared connection to 10.6.7.3 closed.
    Running `docker inspect -f '{{.State.Running}}' ray_cluster_container || true`
      Full command is `ssh -tt -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_2bce452b53/8b7e00c36c/%C -o ControlPersist=10s -o ConnectTimeout=120s phesse001@10.6.7.3 bash --login -c -i 'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (docker inspect -f '"'"'{{.State.Running}}'"'"' ray_cluster_container || true)'`
Shared connection to 10.6.7.3 closed.
    Running `docker inspect -f '{{json .Config.Env}}' rayproject/ray:latest-cpu`
      Full command is `ssh -tt -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_2bce452b53/8b7e00c36c/%C -o ControlPersist=10s -o ConnectTimeout=120s phesse001@10.6.7.3 bash --login -c -i 'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (docker inspect -f '"'"'{{json .Config.Env}}'"'"' rayproject/ray:latest-cpu)'`
Shared connection to 10.6.7.3 closed.
2021-07-08 15:12:20,844	INFO node_provider.py:103 -- ClusterState: Writing cluster state: ['10.6.7.1', '10.6.7.5', '10.6.7.8', '10.6.7.7', '10.6.7.3']
  New status: update-failed
  !!!
  {'message': 'SSH command failed.'}
  SSH command failed.
  !!!
  
  Failed to setup head node.

After I SSH into one of the nodes and run the command it failed on, I get:

Got permission denied while trying to connect to the Docker daemon socket at unix:///var/run/docker.sock: Get http://%2Fvar%2Frun%2Fdocker.sock/v1.40/containers/ray_cluster_container/json: dial unix /var/run/docker.sock: connect: permission denied

despite the command sourcing the .bashrc that aliases `docker` to `sudo docker`.
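(To figure out whether docker-group membership is actually in effect for the SSH session, these standard checks on the node should distinguish a missing group from a stale session:)

    id -nG                        # is `docker` among my effective groups?
    getent group docker           # am I listed in the docker group at all?
    ls -l /var/run/docker.sock    # the daemon socket is root:docker, mode 660, by default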

I cannot verify this in a clean environment because I am working on my school's machines, which are shared between hundreds of users.

  • I have verified my script runs in a clean environment and reproduces the issue.
  • I have verified the issue also occurs with the latest wheels.

Again, this probably isn't a bug, but there wasn't any documentation for my specific scenario of starting a Docker cluster as a non-root user, so I was hoping I could clear that up here. Thanks!
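One workaround I may try, since the alias doesn't seem to reach the launcher's commands: a small wrapper script ahead of the real binary in PATH, which applies even where aliases don't (untested sketch; it would also need passwordless sudo for docker, otherwise the launcher's non-interactive steps would hang at a password prompt):

    #!/usr/bin/env bash
    # ~/bin/docker -- hypothetical wrapper; ~/bin must come before the real
    # docker in PATH (e.g. export PATH="$HOME/bin:$PATH" in ~/.bashrc).
    # -n: fail instead of prompting, since the launcher gives sudo no TTY.
    exec sudo -n /usr/bin/docker "$@"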


Labels

  • P2: Important issue, but not time-critical
  • bug: Something that is supposed to be working; but isn't
  • infra: autoscaler, ray client, kuberay, related issues
  • pending-cleanup: This issue is pending cleanup. It will be removed in 2 weeks after being assigned.
