permissions on rsync'd files are incorrect on worker nodes, results in inability to update workers #12630
Open
@worldveil

Description

What is the problem?

I can't get a cluster to scale up after launching it and reusing existing nodes: the file sync to the new workers fails.

# ray --version
ray, version 1.0.1.post1
# python --version
Python 3.7.7

Reproduction (REQUIRED)

  1. Start a cluster with 2 nodes.
  2. Scale it up to more, say 4.
  3. When the head tries to push files via the rsync command runner, some of the files on the worker node are owned by root instead of ubuntu.
  4. This results in an rsync error (command and output below).
#  rsync command with -vvv output on
rsync --rsh "ssh -i ~/ray_bootstrap_key.pem -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_63a9f0ea7b/f786baef9d/%C -o ControlPersist=10s -o ConnectTimeout=120s" -avz --omit-dir-times --exclude **/.git --exclude **/.git/** --filter "dir-merge,- .gitignore" /project/ ubuntu@172.31.6.63:/tmp/ray_tmp_mount/anyscale-dev-prod-aedb54c828929615/project/ -vvv

# ...... last few lines of output below .....
recv_files(nodes.py)
rsync: mkstemp "/tmp/ray_tmp_mount/anyscale-dev-prod-aedb54c828929615/nodes.py.S7lMrv" failed: Permission denied (13)
chunk[0] of size 700 at 0 offset=0
chunk[1] of size 700 at 700 offset=700
chunk[2] of size 700 at 1400 offset=1400
chunk[3] of size 700 at 2100 offset=2100
chunk[4] of size 700 at 2800 offset=2800
chunk[5] of size 60 at 3500 offset=3500
got file_sum
recv_files(src/pipeline.py)
rsync: mkstemp "/tmp/ray_tmp_mount/anyscale-dev-prod-aedb54c828929615/pipeline.py.hDutCb" failed: Permission denied (13)
chunk[0] of size 700 at 0 offset=0
rsync: connection unexpectedly closed (318054 bytes received so far) [sender]
[sender] _exit_cleanup(code=12, file=io.c, line=235): entered
rsync error: error in rsync protocol data stream (code 12) at io.c(235) [sender=3.1.2]
[sender] _exit_cleanup(code=12, file=io.c, line=235): about to call exit(12)
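
The mkstemp "Permission denied (13)" errors point at ownership of the rsync target directory on the worker. As a quick way to confirm (this check is not part of the log above; the key path and worker IP are copied from the rsync command), listing the mount directory over SSH shows which files ended up owned by root:

# run from the head node; same key and worker IP as in the rsync command above
ssh -i ~/ray_bootstrap_key.pem -o StrictHostKeyChecking=no ubuntu@172.31.6.63 \
    "ls -la /tmp/ray_tmp_mount/anyscale-dev-prod-aedb54c828929615/"
# entries owned by root:root instead of ubuntu:ubuntu would explain why rsync,
# connecting as ubuntu, cannot create its temporary files there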

YAML to reproduce (minus the actual code repo, which I can't share, but that shouldn't matter):

min_workers: 2
max_workers: 2

docker:
    image: anyscale/ray-ml:latest-gpu
    container_name: ray_container
    pull_before_run: False

head_node:
    InstanceType: p2.xlarge
    IamInstanceProfile:
        Arn: '<ARN HERE>'
worker_nodes:
    InstanceType: p2.xlarge
    IamInstanceProfile:
        Arn: '<ARN HERE>'

rsync_exclude:
    - "**/.git"
    - "**/.git/**"
rsync_filter:
    - ".gitignore"

setup_commands:
    - >-
      git clone git@bitbucket.org:org/project.git /project || true;

target_utilization_fraction: 0.8

head_start_ray_commands:
    - >-
      ray stop;
      ulimit -n 65536; ray start --head --port=6379 --object-manager-port=8076 --autoscaling-config=~/ray_bootstrap_config.yaml;

worker_start_ray_commands:
    - >-
      ray stop;
      ulimit -n 65536; ray start --address=$RAY_HEAD_IP:6379 --object-manager-port=8076;

provider:
    type: aws
    region: ap-southeast-2
    cache_stopped_nodes: true

auth:
    ssh_user: ubuntu

metadata:
    anyscale:
        working_dir: /project
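
A possible workaround (an untested assumption on my side): since the rsync target /tmp/ray_tmp_mount lives on the worker host outside the container, chown'ing it to the SSH user before the head pushes files might avoid the root-owned files, e.g. by adding an initialization command to the YAML above:

# hypothetical addition to the cluster YAML above; not verified
initialization_commands:
    - sudo mkdir -p /tmp/ray_tmp_mount && sudo chown -R ubuntu:ubuntu /tmp/ray_tmp_mount
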
  • I have verified my script runs in a clean environment and reproduces the issue.
  • I have verified the issue also occurs with the latest wheels.

Labels

P2 - Important issue, but not time-critical
bug - Something that is supposed to be working; but isn't
infra - autoscaler, ray client, kuberay, related issues
pending-cleanup - This issue is pending cleanup. It will be removed in 2 weeks after being assigned.
