Description
What happened + What you expected to happen
I expect Step 6 in the "Reproduction script" section to succeed, but it fails. In contrast, when I use kubectl exec -it ${HEAD_POD} -- bash to log into the head Pod and then run python3 test_detached_actor_2.py there, the command succeeds.
Versions / Dependencies
See the "Reproduction script" section.
Reproduction script
Prepare Python scripts
- Create a directory my_working_dir, and put the following two Python scripts into this directory.
test_detached_actor_1.py
import ray

ray.init(address='ray://127.0.0.1:10001', namespace="detached_actor_ns")

@ray.remote
class TestCounter:
    def __init__(self):
        self.value = 0

    def increment(self):
        self.value += 1
        return self.value

tc = TestCounter.options(name="testCounter", lifetime="detached", max_restarts=-1).remote()
val1 = ray.get(tc.increment.remote())
val2 = ray.get(tc.increment.remote())
print(f"val1: {val1}, val2: {val2}")

assert(val1 == 1)
assert(val2 == 2)
test_detached_actor_2.py
import ray

# Try to connect to Ray cluster.
print("Try to connect to Ray cluster.")
ray.init(address='ray://127.0.0.1:10001', namespace="detached_actor_ns")

# Get TestCounter actor
print("Get TestCounter actor.")
tc = ray.get_actor("testCounter")

print("Try to call remote function 'increment'.")
val = ray.get(tc.increment.remote())
print(f"val: {val}")

# The returned value should be 1 rather than 3 (test_detached_actor_1.py already
# incremented the counter twice). Ray will relaunch all registered actors when
# the Ray cluster restarts, but the internal state of the actor will not be restored.
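The closing comment in test_detached_actor_2.py describes a restart that re-creates the actor without its previous state. As a rough illustration only, the sketch below models that behaviour in plain Python with no Ray involved; `simulate_restart` is a hypothetical helper, not a Ray API, and it simply assumes that a restart amounts to running the constructor again.

```python
class TestCounter:
    """Plain-Python stand-in for the TestCounter actor (no Ray involved)."""

    def __init__(self):
        self.value = 0

    def increment(self):
        self.value += 1
        return self.value


def simulate_restart(_old_instance):
    """Hypothetical model of a restart: build a fresh instance, drop old state."""
    return TestCounter()


tc = TestCounter()
before = [tc.increment(), tc.increment()]  # what test_detached_actor_1.py does
tc = simulate_restart(tc)                  # stands in for the GCS/head restart
after = tc.increment()                     # what test_detached_actor_2.py does
print(before, after)  # prints: [1, 2] 1
```

Under this assumption, the call after the restart returns 1, not 3.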
Reproduction script
# Step 1: Create a Kind cluster
kind create cluster
# Step 2: Install a static Ray cluster with a Redis for fault tolerance
kubectl apply -f https://raw.githubusercontent.com/ray-project/ray/master/doc/source/cluster/kubernetes/configs/static-ray-cluster.with-fault-tolerance.yaml
# Step 3: Forward port 8265 for both Dashboard and Ray job submission
kubectl port-forward service/service-ray-cluster 8265:8265
# Step 4: Install a detached actor in the Ray namespace "detached_actor_ns"
ray job submit --working-dir my_working_dir --address http://localhost:8265 -- python test_detached_actor_1.py
# Step 5: Kill the GCS server process in the head Pod, and wait for the head container to restart.
kubectl exec -it ${HEAD_POD} -- pkill gcs_server
# Step 6: The detached actor needs tens of seconds to restore after the head Pod restarts.
ray job submit --working-dir my_working_dir --address http://localhost:8265 -- python test_detached_actor_2.py
Step 6 does not work
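Since Step 6 notes that the detached actor needs tens of seconds to come back, one way to rule out a timing problem is to poll for the actor instead of looking it up once. The following is a minimal sketch, not a confirmed fix: `retry_until_ready` is a hypothetical helper, the timeout and interval defaults are arbitrary, and it assumes the lookup fails with `ValueError` (which `ray.get_actor` raises when the named actor is not found). The report does not say which error Step 6 actually produces.

```python
import time


def retry_until_ready(fetch, timeout_s=120.0, interval_s=5.0, retry_on=(ValueError,)):
    """Call fetch() until it succeeds or roughly timeout_s seconds have passed.

    fetch is any zero-argument callable. Exceptions listed in retry_on
    trigger another attempt; any other exception propagates immediately.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            return fetch()
        except retry_on:
            # Give up (re-raising the last error) if another sleep would
            # take us past the deadline.
            if time.monotonic() + interval_s > deadline:
                raise
            time.sleep(interval_s)


# Hypothetical usage inside test_detached_actor_2.py, after ray.init(...):
#   tc = retry_until_ready(lambda: ray.get_actor("testCounter"))
```

If the job still fails after the timeout while the same script succeeds from inside the head Pod, the problem is unlikely to be the actor's recovery time alone.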
Issue Severity
None