Permissions on rsync'd files are incorrect on worker nodes, which makes it impossible to update workers

What is the problem?

After launching a cluster and using its existing nodes, I can't get it to scale up: the file sync to the workers fails.

# ray --version
ray, version 1.0.1.post1
# python --version
Python 3.7.7

Reproduction (REQUIRED)

Start a cluster with 2 nodes.
Scale it up to more, say 4.
When the head node pushes files through the rsync command runner, some of the files on the worker node are owned by root instead of ubuntu.
This causes rsync to fail with the error below.
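
To confirm the ownership mismatch on a worker, something like the following could be run there. This is only a sketch: `find_foreign_owned` is a hypothetical helper name, not part of Ray, and it simply walks a directory tree and reports every path whose owner differs from the expected uid.

```python
import os


def find_foreign_owned(root, expected_uid):
    """Return root and every path beneath it that is not owned by expected_uid."""
    mismatched = []
    # Check the top-level directory itself, since os.walk only yields its children.
    if os.lstat(root).st_uid != expected_uid:
        mismatched.append(root)
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            # lstat so that symlinks are reported by their own owner, not the target's.
            if os.lstat(path).st_uid != expected_uid:
                mismatched.append(path)
    return mismatched
```

On a worker, calling `find_foreign_owned("/tmp/ray_tmp_mount", pwd.getpwnam("ubuntu").pw_uid)` (with `import pwd`) should list the root-owned paths, assuming the mount root from the rsync command below and the `ssh_user` from the YAML.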

# rsync command with -vvv output enabled
rsync --rsh "ssh -i ~/ray_bootstrap_key.pem -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_63a9f0ea7b/f786baef9d/%C -o ControlPersist=10s -o ConnectTimeout=120s" -avz --omit-dir-times --exclude **/.git --exclude **/.git/** --filter "dir-merge,- .gitignore" /project/ ubuntu@172.31.6.63:/tmp/ray_tmp_mount/anyscale-dev-prod-aedb54c828929615/project/ -vvv

# ... last few lines of output below ...
recv_files(nodes.py)
rsync: mkstemp "/tmp/ray_tmp_mount/anyscale-dev-prod-aedb54c828929615/nodes.py.S7lMrv" failed: Permission denied (13)
chunk[0] of size 700 at 0 offset=0
chunk[1] of size 700 at 700 offset=700
chunk[2] of size 700 at 1400 offset=1400
chunk[3] of size 700 at 2100 offset=2100
chunk[4] of size 700 at 2800 offset=2800
chunk[5] of size 60 at 3500 offset=3500
got file_sum
recv_files(src/pipeline.py)
rsync: mkstemp "/tmp/ray_tmp_mount/anyscale-dev-prod-aedb54c828929615/pipeline.py.hDutCb" failed: Permission denied (13)
chunk[0] of size 700 at 0 offset=0
rsync: connection unexpectedly closed (318054 bytes received so far) [sender]
[sender] _exit_cleanup(code=12, file=io.c, line=235): entered
rsync error: error in rsync protocol data stream (code 12) at io.c(235) [sender=3.1.2]
[sender] _exit_cleanup(code=12, file=io.c, line=235): about to call exit(12)
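
For longer logs, the directories that rejected writes can be pulled out of the rsync output mechanically. The sketch below assumes the message format shown above and relies on rsync creating its temporary file in the destination file's directory (its default when `--temp-dir` is not given); `denied_directories` is a hypothetical helper name.

```python
import posixpath
import re

# Matches lines such as:
#   rsync: mkstemp "/some/dir/file.py.AbCdEf" failed: Permission denied (13)
PERMISSION_DENIED = re.compile(
    r'^rsync: \w+ "(?P<path>[^"]+)" failed: Permission denied \(13\)'
)


def denied_directories(rsync_output):
    """Return the sorted, de-duplicated remote directories where rsync was denied."""
    dirs = set()
    for line in rsync_output.splitlines():
        match = PERMISSION_DENIED.match(line.strip())
        if match:
            # The temp file sits next to the destination file, so its parent
            # directory is the one the ssh user could not write to.
            dirs.add(posixpath.dirname(match.group("path")))
    return sorted(dirs)
```

Applied to the output above, this would report the single directory that holds both failed temp files, which is the one whose ownership needs checking.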

YAML to reproduce (minus the actual code repo, which I can't share, but shouldn't matter):

min_workers: 2
max_workers: 2

docker:
    image: anyscale/ray-ml:latest-gpu
    container_name: ray_container
    pull_before_run: False

head_node:
    InstanceType: p2.xlarge
    IamInstanceProfile:
        Arn: '<ARN HERE>'
worker_nodes:
    InstanceType: p2.xlarge
    IamInstanceProfile:
        Arn: '<ARN HERE>'

rsync_exclude:
    - "**/.git"
    - "**/.git/**"
rsync_filter:
    - ".gitignore"

setup_commands:
    - >-
      git clone git@bitbucket.org:org/project.git /project || true;

target_utilization_fraction: 0.8

head_start_ray_commands:
    - >-
      ray stop;
      ulimit -n 65536; ray start --head --port=6379 --object-manager-port=8076 --autoscaling-config=~/ray_bootstrap_config.yaml;

worker_start_ray_commands:
    - >-
      ray stop;
      ulimit -n 65536; ray start --address=$RAY_HEAD_IP:6379 --object-manager-port=8076;

provider:
    type: aws
    region: ap-southeast-2
    cache_stopped_nodes: true

auth:
    ssh_user: ubuntu

metadata:
    anyscale:
        working_dir: /project

I have verified my script runs in a clean environment and reproduces the issue.
I have verified the issue also occurs with the latest wheels.
