Description
What is the problem?
Hello all! I've only been trying out Ray for the past week, so forgive me if this is a question rather than a bug. I am trying to use the Ray cluster launcher to start a cluster on my school's machines, which I have access to. The trouble (I believe) is that I am a non-root user.
I can launch the cluster on multiple nodes without Docker, and I have verified that it works using some of the test Python files provided in the documentation. However, on my school's machines I am a non-root user. I am part of the docker group, but I can only run docker commands if I prepend sudo to them. Since docker <command> gets invoked multiple times by the launcher when setting up the cluster with Docker, I added the line alias docker=sudo docker to my .bashrc, hoping that would work. However, the launcher can't get past step 5/7 (Initializing command runner).
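For anyone trying to reproduce this, here is a minimal sketch of how one might check whether the current login can reach the Docker daemon without sudo. On a default Docker install, membership in the docker group is normally enough, and a group change usually only takes effect at the next login. The helper name has_group is hypothetical, and the socket path is the one from the error message further down; treat this as a diagnostic sketch rather than anything the launcher itself runs.

```shell
#!/usr/bin/env bash
# Sketch: check whether this login session can use Docker without sudo.

# Return 0 if group $1 appears in the space-separated group list $2.
# (has_group is a hypothetical helper name used only for this example.)
has_group() {
  local wanted="$1" groups="$2" g
  for g in $groups; do
    if [ "$g" = "$wanted" ]; then
      return 0
    fi
  done
  return 1
}

socket="/var/run/docker.sock"   # path assumed from the permission-denied error

if has_group docker "$(id -nG)"; then
  echo "this session is in the docker group"
else
  echo "this session is NOT in the docker group (a fresh login may be needed after being added)"
fi

# Show the socket's owner, group, and mode, if the socket exists on this machine.
if [ -e "$socket" ]; then
  ls -l "$socket"
fi

# If sudo remains necessary, note that an alias value containing a space
# has to be quoted in .bashrc, for example:
#   alias docker='sudo docker'
```

If the first check reports that the session is not in the docker group, that would explain why sudo is still needed even though the account was added to the group.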
Reproduction (REQUIRED)
Here is the .yaml file I use when I call ray up docker-cluster.yaml:
cluster_name: docker-cluster
docker:
    image: rayproject/ray:latest-cpu
    container_name: "ray_cluster_container"
provider:
    type: local
    head_ip: 10.6.7.3
    worker_ips: [10.6.7.1, 10.6.7.5, 10.6.7.8, 10.6.7.7]
auth:
    ssh_user: phesse001
min_workers: 4
max_workers: 4
upscaling_speed: 1.0
idle_timeout_minutes: 5
cluster_synced_files: []
file_mounts_sync_continuously: False
rsync_exclude:
    - "**/.git"
    - "**/.git/**"
rsync_filter:
    - ".gitignore"
initialization_commands: []
setup_commands: []
head_setup_commands: []
worker_setup_commands: []
head_start_ray_commands:
    - ray stop
    - ulimit -c unlimited && ray start --head --port=6379 --autoscaling-config=~/ray_bootstrap_config.yaml
worker_start_ray_commands:
    - ray stop
    - ray start --address=$RAY_HEAD_IP:6379
# Typically, min_workers == max_workers == len(worker_ips).
When I run ray up -vvvvvv docker-cluster.yaml, it gets to step 5. Here is the output from step 5 onward (I can give the full output if needed):
[5/7] Initalizing command runner
Running `command -v docker || echo 'NoExist'`
Full command is `ssh -tt -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_2bce452b53/8b7e00c36c/%C -o ControlPersist=10s -o ConnectTimeout=120s phesse001@10.6.7.3 bash --login -c -i 'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (command -v docker || echo '"'"'NoExist'"'"')'`
Shared connection to 10.6.7.3 closed.
Running `docker pull rayproject/ray:latest-cpu`
Full command is `ssh -tt -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_2bce452b53/8b7e00c36c/%C -o ControlPersist=10s -o ConnectTimeout=120s phesse001@10.6.7.3 bash --login -c -i 'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (docker pull rayproject/ray:latest-cpu)'`
/usr/people/defaults/linuxpaths_bash: line 106: module: command not found
[sudo] password for phesse001:
latest-cpu: Pulling from rayproject/ray
Digest: sha256:e1132e3518d508b0ecc1910cdd61aaf975d66e2fc64d4677fc0d1fbcf5e1122d
Status: Image is up to date for rayproject/ray:latest-cpu
docker.io/rayproject/ray:latest-cpu
Shared connection to 10.6.7.3 closed.
Running `docker inspect -f '{{.State.Running}}' ray_cluster_container || true`
Full command is `ssh -tt -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_2bce452b53/8b7e00c36c/%C -o ControlPersist=10s -o ConnectTimeout=120s phesse001@10.6.7.3 bash --login -c -i 'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (docker inspect -f '"'"'{{.State.Running}}'"'"' ray_cluster_container || true)'`
Shared connection to 10.6.7.3 closed.
Running `docker inspect -f '{{json .Config.Env}}' rayproject/ray:latest-cpu`
Full command is `ssh -tt -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_2bce452b53/8b7e00c36c/%C -o ControlPersist=10s -o ConnectTimeout=120s phesse001@10.6.7.3 bash --login -c -i 'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (docker inspect -f '"'"'{{json .Config.Env}}'"'"' rayproject/ray:latest-cpu)'`
Shared connection to 10.6.7.3 closed.
2021-07-08 15:12:20,844 INFO node_provider.py:103 -- ClusterState: Writing cluster state: ['10.6.7.1', '10.6.7.5', '10.6.7.8', '10.6.7.7', '10.6.7.3']
New status: update-failed
!!!
{'message': 'SSH command failed.'}
SSH command failed.
!!!
Failed to setup head node.
After I ssh into one of the nodes and run the command that appears to have failed, I get
Got permission denied while trying to connect to the Docker daemon socket at unix:///var/run/docker.sock: Get http://%2Fvar%2Frun%2Fdocker.sock/v1.40/containers/ray_cluster_container/json: dial unix /var/run/docker.sock: connect: permission denied
despite the command sourcing my .bashrc, which sets the alias of docker to sudo docker.
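One general point that may or may not be relevant here: bash only expands aliases in interactive shells by default. The launcher commands in the log use bash --login -c -i, and the [sudo] prompt during docker pull suggests the alias was applied there, so this is not a confirmed cause. Any docker call that ends up in a non-interactive shell would not see the alias, though, whereas a wrapper script on PATH works in both cases. The sketch below contrasts the two, using a hypothetical stand-in command fakedocker (so it needs neither Docker nor sudo) and echo in place of sudo docker.

```shell
#!/usr/bin/env bash
# Sketch: alias vs. PATH wrapper when the caller is a non-interactive shell.
# "fakedocker" is a hypothetical stand-in for docker; echo stands in for "sudo docker".

workdir="$(mktemp -d)"

# An rc file that defines the alias, the way a .bashrc would.
cat > "$workdir/rc" <<'EOF'
alias fakedocker='echo via alias'
EOF

# A wrapper script that can be placed on PATH instead of using an alias.
mkdir -p "$workdir/bin"
cat > "$workdir/bin/fakedocker" <<'EOF'
#!/usr/bin/env bash
exec echo via shim "$@"
EOF
chmod +x "$workdir/bin/fakedocker"

# A script that sources the rc file and then calls the command.
# Running it with "bash script" makes the shell non-interactive.
cat > "$workdir/call.sh" <<EOF
source "$workdir/rc"
fakedocker pull
EOF

# Alias only: the alias is defined but not expanded, so the command is not found.
alias_result="$(bash "$workdir/call.sh" 2>/dev/null || echo 'not expanded')"

# Wrapper on PATH: the command resolves through ordinary PATH lookup.
shim_result="$(PATH="$workdir/bin:$PATH" bash "$workdir/call.sh")"

echo "alias: $alias_result"
echo "shim:  $shim_result"

rm -rf "$workdir"
```

A real wrapper would call sudo docker "$@" instead of echo, and it would still prompt for a password unless sudo is configured otherwise, so this only addresses how the command is resolved, not the permission problem itself.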
I cannot verify this in a clean environment because I am working from within my school's machines, which are shared among hundreds of users.
- I have verified my script runs in a clean environment and reproduces the issue.
- I have verified the issue also occurs with the latest wheels.
Again, this probably isn't a bug, but I couldn't find any documentation for my specific scenario of starting a Docker cluster as a non-root user, so I was hoping to clear that up here. Thanks!