Description
What happened + What you expected to happen
In my system, there are specific requirements to use a handful of pre-determined ports for Ray components. Therefore, I rely on min/max-worker-port arguments to provide a narrow range of ports for "ray start".
Here is the command I use to start Ray:
"ray stop; ray stop --force; ray start --head --port 8786 --min-worker-port 8787 --max-worker-port 8788 --ray-client-server-port 8789 --node-manager-port 8790 --object-manager-port 8791 --num-cpus 2 --include-dashboard True --disable-usage-stats;"
The "ray start" command completes successfully.
However, I see the following issues when running parallel ray.remote() jobs. Each job is a separate model.train() call, where model is one of the scikit-learn machine learning models:
- Providing a narrow range for worker ports results in significant slowdowns. Widening the range from 2 ports to 3 ports gives me a 3x speedup. I confirmed this by keeping num-cpus=2 fixed while increasing the worker port range from 2 allowed ports to 3 allowed ports.
- There are a lot of worker logs showing that worker initialization failed because the provided port is in use (I confirmed the provided ports are open before starting Ray). Here is the message I get in many of the generated logs:
"[2023-03-27 17:14:07,762 E 6879 6879] core_worker.cc:191: Failed to register worker 95603b7eb07be394877abe58e28321552579e9267ed4050acaf0e086 to Raylet. Invalid: Invalid: No available ports. Please specify a wider port range using --min-worker-port and --max-worker-port."
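For reference, a port-availability check of the kind mentioned above can be run before "ray start". This is only a minimal sketch using the Python standard library, not the exact check used for this report: the helper names port_is_free and busy_ports are hypothetical, the port list is copied from the command above, and binding on 127.0.0.1 is an assumption (a port could still be taken on another interface).

```python
import socket


def port_is_free(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if a TCP socket can currently bind to (host, port)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            # Typically EADDRINUSE: another process already holds the port.
            return False
    return True


def busy_ports(ports, host: str = "127.0.0.1"):
    """Return the subset of `ports` that cannot be bound right now."""
    return [p for p in ports if not port_is_free(p, host)]


# Ports passed to "ray start" in this report (head port, worker range,
# client server, node manager, object manager).
RAY_PORTS = [8786, 8787, 8788, 8789, 8790, 8791]

if __name__ == "__main__":
    print("ports already in use:", busy_ports(RAY_PORTS))
```

An empty list printed before "ray start" means none of the listed ports was held by another process at that moment; it says nothing about what happens once Ray begins spawning workers.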
Additional questions based on the issues above:
- Is there any way to avoid the slowdown while keeping a tight range for worker ports?
- From the logs, it seems that Ray doesn't respect the hard cap implied by the min/max-worker-port range and num-cpus, and actually starts more workers. For example, in my case I want to start exactly 2 workers, with a 2-port range for workers and num-cpus=2. Is this expected?
- What is the relation between num-cpus and the number of workers? For example, I have tested that I can successfully start a Ray cluster with num-cpus=5 while keeping min/max-worker-port range to only 2 ports.
- Overall, I couldn't find any clear documentation on how many ports Ray uses. Although I found this document on setting the ports for all the components in Ray, I am still not sure whether these are the only ports that Ray uses. On my system, I am only allowed to use a narrow set of ports, so I need to know exactly how many ports are needed for the Ray head and worker nodes.
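To make the port budget in these questions concrete, the sketch below lists the ports that the "ray start" command in this report pins explicitly, and the size of the worker port range. It is an illustration only: the helper names pinned_ports and worker_port_count are hypothetical, the range is counted inclusively (matching the "2-port range" wording above), and the result covers only flags present on the command line, not any other ports Ray may open.

```python
import shlex

# The "ray start" invocation from this report (without the "ray stop" prefix).
RAY_START = (
    "ray start --head --port 8786 --min-worker-port 8787 --max-worker-port 8788 "
    "--ray-client-server-port 8789 --node-manager-port 8790 "
    "--object-manager-port 8791 --num-cpus 2 --include-dashboard True "
    "--disable-usage-stats"
)


def pinned_ports(command: str) -> dict:
    """Map every flag ending in 'port' to the integer value that follows it."""
    tokens = shlex.split(command)
    ports = {}
    for flag, value in zip(tokens, tokens[1:]):
        if flag.startswith("--") and flag.endswith("port") and value.isdigit():
            ports[flag] = int(value)
    return ports


def worker_port_count(ports: dict) -> int:
    """Number of ports in the worker range, counting both endpoints."""
    return ports["--max-worker-port"] - ports["--min-worker-port"] + 1


ports = pinned_ports(RAY_START)
for flag, port in ports.items():
    print(f"{flag}: {port}")
print("worker ports available:", worker_port_count(ports))
```

With the command above, the worker range holds 2 ports while num-cpus is also 2, which is the configuration the questions refer to.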
Versions / Dependencies
Ray 2.3.0
Reproduction script
Working on it
Issue Severity
High: It blocks me from completing my task.