ray.init() can sometimes hang with a limited range specified for --worker-port-list #40497
Open
@amohar2

Description

What happened + What you expected to happen

I am using Ray on a system that has a limited number of ports available, so the ports used by all Ray components must be specified up front.
I sometimes see random hangs on ray.init() calls and cannot figure out what causes them.

My guess is that since Ray starts all workers concurrently, they all compete for a single pool of ports and may somehow end up in a deadlock. However, this theory may not be completely correct: based on this PR I tried setting RAY_worker_maximum_startup_concurrency to 1 so that workers start up sequentially and don't compete for ports, but I can still see the hang from time to time.
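
For reference, here is a minimal sketch of how I applied that setting, assuming the standard RAY_-prefixed environment-variable override is picked up by the raylet at ray start time (the port numbers are the ones from my reproduction script below):

import os
import subprocess

# Assumption: RAY_worker_maximum_startup_concurrency is read from the
# environment of the `ray start` process, so it must be set before the
# head node is started.
env = os.environ.copy()
env["RAY_worker_maximum_startup_concurrency"] = "1"  # start workers one at a time

subprocess.run(
    "ray stop --force; ray start --head --port 13041 "
    "--worker-port-list 13048,13049 --num-cpus 1",
    shell=True,
    env=env,
)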
Another possible cause is that a specified port does not become available again immediately (for example, when a previous worker is not terminated properly), blocking the subsequent ray.init() call on unavailable ports. I noticed that in worker_pool.cc here there is a loop that checks for available ports only once and returns an error if there are none; shouldn't that check be retried several times, in case the ports become available later?
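
To illustrate, here is a minimal sketch (in Python, not Ray's actual C++ implementation) of the retry behavior I am suggesting; get_free_ports is a hypothetical stand-in for the raylet's free-port lookup:

import time

def pick_free_port(get_free_ports, retries=5, delay_s=0.5):
    # get_free_ports is a hypothetical stand-in for the raylet's lookup of
    # unused ports from the configured --worker-port-list.
    for _ in range(retries):
        ports = get_free_ports()
        if ports:
            return ports[0]
        # A port from a recently killed worker may still be in TIME_WAIT;
        # back off briefly and check again instead of failing immediately.
        time.sleep(delay_s)
    raise RuntimeError(
        "No available ports. Please specify a wider port range using "
        "--min-worker-port and --max-worker-port."
    )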
Also, ideally ray.init() should not hang; at minimum it should error out so that the entire program doesn't freeze.
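
As a stopgap on the application side, something like the following sketch can at least turn the hang into an error (assumption: ray.init() is safe to call from a worker thread; note that an abandoned, still-hung thread may keep the process alive at interpreter exit, so the caller may need os._exit):

from concurrent.futures import ThreadPoolExecutor, TimeoutError

import ray

def init_with_timeout(timeout_s=60, **kwargs):
    # Run ray.init() on a worker thread and give up after a deadline
    # instead of freezing the whole program. The 60 s default is arbitrary.
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(ray.init, **kwargs)
    try:
        return future.result(timeout=timeout_s)
    except TimeoutError:
        raise RuntimeError(f"ray.init() did not return within {timeout_s}s")
    finally:
        # Don't block on a possibly hung init thread.
        pool.shutdown(wait=False)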

I have also tried providing a wider range of worker ports; while this mitigates the issue, it is not a viable solution on my system.

I have a script that can reproduce the issue, although the hit rate is very low. The script mimics what I do in my real-time environment:

  1. starts a Ray cluster through the CLI, with 1 Ray worker and a worker port list of only two ports, since Ray needs an extra port during ray.init() for the driver
  2. calls ray.init() --> this can hang randomly
  3. submits a remote function and gathers the result
  4. shuts down the cluster through the ray stop CLI

On the rare occasions that the hang happens, I see the following messages in the driver/worker logs:

Driver:
[2023-10-19 16:03:26,009 I 24773 24773] core_worker_process.cc:107: Constructing CoreWorkerProcess. pid: 24773
[2023-10-19 16:03:26,011 I 24773 24773] io_service_pool.cc:35: IOServicePool is running with 1 io_service.

Worker:
[2023-10-19 16:03:26,395 I 25497 25497] core_worker_process.cc:107: Constructing CoreWorkerProcess. pid: 25497
[2023-10-19 16:03:26,397 I 25497 25497] io_service_pool.cc:35: IOServicePool is running with 1 io_service.
[2023-10-19 16:03:26,398 E 25497 25497] core_worker.cc:203: Failed to register worker 9ef8d266e2574623dcc2c7750543397e8d74ef9d327b2716602d743c to Raylet. Invalid: Invalid: No available ports. Please specify a wider port range using --min-worker-port and --max-worker-port.
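
Given that "No available ports" error, a quick diagnostic I can add before each ray start is to verify that the configured worker ports can actually be bound, to rule out lingering sockets (e.g. in TIME_WAIT) from a previous, improperly terminated run. A minimal sketch:

import socket

def port_is_free(port, host="127.0.0.1"):
    # Try to bind the port; deliberately no SO_REUSEADDR, so a socket
    # lingering in TIME_WAIT from a previous run makes this return False.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            s.bind((host, port))
            return True
        except OSError:
            return False

worker_ports = [13048, 13049]  # the list passed to --worker-port-list
busy = [p for p in worker_ports if not port_is_free(p)]
if busy:
    print(f"Worker ports still in use from a previous run: {busy}")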

Versions / Dependencies

ray[client] == 2.7.1

Reproduction script

import os
import subprocess
import sys

import numpy as np

import ray

def np_mat(t):
    # Allocate a t x t random matrix and return its size in bytes.
    x = np.random.rand(t, t)
    return x.nbytes

# Start a single-worker head node with a worker port list of only two ports
# (one for the worker, one for the driver created by ray.init()).
startup_cmd = (
    "ray stop --force; "
    "ray start --head --port 13041 --worker-port-list 13048,13049 "
    "--ray-client-server-port 13055 --node-manager-port 13053 "
    "--object-manager-port 13054 --runtime-env-agent-port 13056 "
    "--num-cpus 1 --include-dashboard False --disable-usage-stats "
    "--temp-dir=/dev/shm/ray --object-store-memory=50000000000"
)

# Make sure the subprocesses use the same Python environment as this script.
ENV = os.environ.copy()
python_bin_dir = "/".join(sys.executable.split("/")[:-1])
ENV["PATH"] = f"{python_bin_dir}:{ENV['PATH']}"

for i in range(1000):
    process = subprocess.run(startup_cmd, capture_output=True, shell=True, env=ENV)
    if process.returncode:
        print(process.stderr)
        print(process.stdout)
    print(f"{i+1}: Ray start completed")
    if not ray.is_initialized():
        print(f"{i+1}: Ray Driver not started")
        ray.init(  # <-- this call can hang randomly
            address="127.0.0.1:13041",
            _temp_dir="/dev/shm/ray",
        )
    print(f"{i+1}: Ray Driver started")
    remote_func = ray.remote(np_mat)
    # remote_func = remote_func.options(num_cpus=1)
    future = remote_func.remote(t=(30000 + np.random.randint(1, 100)))
    result = ray.get(future)
    print(f"{i+1}: Ray remote job completed, array size {result / 2**30} GB")
    ray.shutdown()
    shutdown_cmd = "ray stop --force"
    process = subprocess.run(shutdown_cmd, capture_output=True, shell=True, env=ENV)
    print(f"{i+1}: Ray shutdown completed")

Issue Severity

High: It blocks me from completing my task.

Labels

P3 (Issue moderate in impact or severity) · bug (Something that is supposed to be working, but isn't) · core (Issues that should be addressed in Ray Core) · pending-cleanup (This issue is pending cleanup; it will be removed 2 weeks after being assigned)
