Description
What is the problem?
I can't get a cluster to scale up after launching it and using the existing nodes. The file sync to the worker nodes fails.
# ray --version
ray, version 1.0.1.post1
# python --version
Python 3.7.7
Reproduction (REQUIRED)
- start a cluster with 2 nodes
- scale to more, say, 4
- when the head tries to push files via the rsync command runner, some of the files on the worker node are owned by root instead of ubuntu
- this results in an rsync error
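The ownership mismatch described in the steps above can be checked directly on a worker node. The following is a minimal sketch, not part of Ray: the helper name `find_files_not_owned_by` is hypothetical, and it assumes the check runs on the worker itself with read access to the directory tree.

```python
import os
from typing import List


def find_files_not_owned_by(root_dir: str, expected_uid: int) -> List[str]:
    """Return root_dir and every path below it whose owner uid differs from expected_uid.

    Hypothetical diagnostic helper, not a Ray API.
    """
    mismatched = []

    # The top-level directory matters too: rsync creates its temporary
    # files inside the destination directory, so that directory must be writable.
    if os.lstat(root_dir).st_uid != expected_uid:
        mismatched.append(root_dir)

    for dirpath, dirnames, filenames in os.walk(root_dir):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            # lstat inspects a symlink itself rather than its target.
            if os.lstat(path).st_uid != expected_uid:
                mismatched.append(path)

    return sorted(mismatched)
```

On a worker this could be called as `find_files_not_owned_by("/tmp/ray_tmp_mount", pwd.getpwnam("ubuntu").pw_uid)` after `import pwd`; any path it returns is owned by a user other than ubuntu.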
# rsync command with -vvv output on
rsync --rsh "ssh -i ~/ray_bootstrap_key.pem -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_63a9f0ea7b/f786baef9d/%C -o ControlPersist=10s -o ConnectTimeout=120s" -avz --omit-dir-times --exclude **/.git --exclude **/.git/** --filter "dir-merge,- .gitignore" /project/ ubuntu@172.31.6.63:/tmp/ray_tmp_mount/anyscale-dev-prod-aedb54c828929615/project/ -vvv
# ...... output last few lines below.....
recv_files(nodes.py)
rsync: mkstemp "/tmp/ray_tmp_mount/anyscale-dev-prod-aedb54c828929615/nodes.py.S7lMrv" failed: Permission denied (13)
chunk[0] of size 700 at 0 offset=0
chunk[1] of size 700 at 700 offset=700
chunk[2] of size 700 at 1400 offset=1400
chunk[3] of size 700 at 2100 offset=2100
chunk[4] of size 700 at 2800 offset=2800
chunk[5] of size 60 at 3500 offset=3500
got file_sum
recv_files(src/pipeline.py)
rsync: mkstemp "/tmp/ray_tmp_mount/anyscale-dev-prod-aedb54c828929615/pipeline.py.hDutCb" failed: Permission denied (13)
chunk[0] of size 700 at 0 offset=0
rsync: connection unexpectedly closed (318054 bytes received so far) [sender]
[sender] _exit_cleanup(code=12, file=io.c, line=235): entered
rsync error: error in rsync protocol data stream (code 12) at io.c(235) [sender=3.1.2]
[sender] _exit_cleanup(code=12, file=io.c, line=235): about to call exit(12)
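To summarize which destination directories were not writable, the `mkstemp ... Permission denied (13)` lines in output like the above can be collected. This is a minimal sketch: the function name `permission_denied_dirs` is hypothetical, and the regular expression only assumes the line format seen in this log.

```python
import posixpath
import re
from typing import List

# Matches lines of the form seen in the log above:
#   rsync: mkstemp "/some/dir/file.XXXXXX" failed: Permission denied (13)
_MKSTEMP_DENIED = re.compile(
    r'^rsync: mkstemp "(?P<path>[^"]+)" failed: Permission denied \(13\)$'
)


def permission_denied_dirs(rsync_output: str) -> List[str]:
    """Return the sorted, de-duplicated directories in which mkstemp was denied."""
    dirs = set()
    for line in rsync_output.splitlines():
        match = _MKSTEMP_DENIED.match(line.strip())
        if match:
            # The temporary file lives in the directory rsync could not write to.
            dirs.add(posixpath.dirname(match.group("path")))
    return sorted(dirs)
```

For the two failures in this log, both temporary files sit directly under `/tmp/ray_tmp_mount/anyscale-dev-prod-aedb54c828929615`, so that single directory is reported.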
YAML to reproduce (minus the actual code repo, which I can't share, but shouldn't matter):
min_workers: 2
max_workers: 2
docker:
    image: anyscale/ray-ml:latest-gpu
    container_name: ray_container
    pull_before_run: False
head_node:
    InstanceType: p2.xlarge
    IamInstanceProfile:
        Arn: '<ARN HERE>'
worker_nodes:
    InstanceType: p2.xlarge
    IamInstanceProfile:
        Arn: '<ARN HERE>'
rsync_exclude:
    - "**/.git"
    - "**/.git/**"
rsync_filter:
    - ".gitignore"
setup_commands:
    - >-
      git clone git@bitbucket.org/org/project.git /project || true;
target_utilization_fraction: 0.8
head_start_ray_commands:
    - >-
      ray stop;
      ulimit -n 65536; ray start --head --port=6379 --object-manager-port=8076 --autoscaling-config=~/ray_bootstrap_config.yaml;
worker_start_ray_commands:
    - >-
      ray stop;
      ulimit -n 65536; ray start --address=$RAY_HEAD_IP:6379 --object-manager-port=8076;
provider:
    type: aws
    region: ap-southeast-2
    cache_stopped_nodes: true
auth:
    ssh_user: ubuntu
metadata:
    anyscale:
        working_dir: /project
- I have verified my script runs in a clean environment and reproduces the issue.
- I have verified the issue also occurs with the latest wheels.