[Jobs] Ray job submission cannot schedule the Job supervisor after the head Pod recovers

What happened + What you expected to happen

I expect Step 6 in the "Reproduction script" section to succeed, but it does not. As an additional check, I logged into the head Pod with kubectl exec -it ${HEAD_POD} -- bash and ran python3 test_detached_actor_2.py there. That command succeeds.

Versions / Dependencies

See the "Reproduction script" section.

Reproduction script

Prepare Python scripts

Create a directory my_working_dir, and put the following two Python scripts into this directory.

test_detached_actor_1.py

import ray

ray.init(address='ray://127.0.0.1:10001', namespace="detached_actor_ns")

@ray.remote
class TestCounter:
    def __init__(self):
        self.value = 0

    def increment(self):
        self.value += 1
        return self.value

tc = TestCounter.options(name="testCounter", lifetime="detached", max_restarts=-1).remote()
val1 = ray.get(tc.increment.remote())
val2 = ray.get(tc.increment.remote())
print(f"val1: {val1}, val2: {val2}")

assert val1 == 1
assert val2 == 2

test_detached_actor_2.py

import ray

# Try to connect to Ray cluster.
print("Try to connect to Ray cluster.")
ray.init(address='ray://127.0.0.1:10001', namespace="detached_actor_ns")

# Get TestCounter actor
print("Get TestCounter actor.")
tc = ray.get_actor("testCounter")

print("Try to call remote function 'increment'.")
val = ray.get(tc.increment.remote())
print(f"val: {val}")
# The actual value should be 1 rather than 3. Ray relaunches all registered actors when
# the Ray cluster restarts, but the internal state of the actor is not restored.
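The comment above can be illustrated with a plain-Python sketch that does not use Ray at all. It assumes that restarting an actor amounts to running its constructor again, so any in-memory state is lost. The restart() helper is hypothetical and exists only to model that behavior; it is not a Ray API.

```python
class TestCounter:
    """Same counter as in test_detached_actor_1.py, without the Ray decorator."""

    def __init__(self):
        self.value = 0

    def increment(self):
        self.value += 1
        return self.value


def restart(actor):
    # Hypothetical stand-in for an actor restart: a fresh instance is built,
    # so __init__ runs again and the previous in-memory state is discarded.
    return type(actor)()


tc = TestCounter()
tc.increment()
tc.increment()  # value is now 2, as after test_detached_actor_1.py

tc = restart(tc)  # models the actor being relaunched after the GCS restart
val = tc.increment()
print(f"val: {val}")  # val: 1
```

Under this assumption, the first call after a restart returns 1 no matter how many increments happened before it.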

Reproduction script

# Step 1: Create a Kind cluster
kind create cluster

# Step 2: Install a static Ray cluster with a Redis for fault tolerance
kubectl apply -f https://raw.githubusercontent.com/ray-project/ray/master/doc/source/cluster/kubernetes/configs/static-ray-cluster.with-fault-tolerance.yaml

# Step 3: Forward port 8265 for both Dashboard and Ray job submission
kubectl port-forward service/service-ray-cluster 8265:8265

# Step 4: Create a detached actor in the Ray namespace "detached_actor_ns"
ray job submit --working-dir my_working_dir --address http://localhost:8265 -- python test_detached_actor_1.py

# Step 5: Kill the GCS server process in the head Pod, and wait for the head container to restart.
kubectl exec -it ${HEAD_POD} -- pkill gcs_server

# Step 6: Submit the second job. The detached actor needs tens of seconds to recover after the head Pod restarts.
ray job submit --working-dir my_working_dir --address http://localhost:8265 -- python test_detached_actor_2.py
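Because Step 6 notes that the detached actor needs tens of seconds to recover, a script such as test_detached_actor_2.py could poll for the actor instead of looking it up once. The following is a minimal sketch under stated assumptions: retry_until is a hypothetical helper, not part of Ray, and the timeout and interval values are arbitrary. It relies only on the documented behavior that ray.get_actor raises ValueError when the named actor does not exist. It covers only the wait for the actor and does not address the Job supervisor scheduling failure reported in this issue.

```python
import time


def retry_until(fn, timeout_s=60.0, interval_s=2.0, retry_on=(ValueError,)):
    """Call fn() repeatedly until it succeeds or timeout_s elapses.

    Exceptions listed in retry_on trigger another attempt; once the deadline
    has passed, the last such exception is re-raised to the caller.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            return fn()
        except retry_on:
            if time.monotonic() >= deadline:
                raise
            time.sleep(interval_s)


# Intended use inside test_detached_actor_2.py (requires a running Ray cluster):
#   tc = retry_until(lambda: ray.get_actor("testCounter"))


if __name__ == "__main__":
    # Self-contained demo with a fake lookup that fails twice before succeeding.
    attempts = {"n": 0}

    def fake_lookup():
        attempts["n"] += 1
        if attempts["n"] < 3:
            raise ValueError("actor not registered yet")
        return "actor-handle"

    print(retry_until(fake_lookup, timeout_s=5.0, interval_s=0.01))
```

Using time.monotonic for the deadline keeps the loop unaffected by wall-clock adjustments while the head container restarts.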

Step 6 does not work

Issue Severity

None
