[Jobs] ray job submission cannot schedule Job supervisor after the head Pod restores · Issue #32167 · ray-project/ray
[Jobs] ray job submission cannot schedule Job supervisor after the head Pod restores #32167
Open
@kevin85421

Description

What happened + What you expected to happen

I expect Step 6 in the "Reproduction script" section to succeed, but it does not. In addition, when I log into the head Pod with kubectl exec -it ${HEAD_POD} -- bash and run python3 test_detached_actor_2.py there, the command succeeds.

Versions / Dependencies

See the "Reproduction script" section.

Reproduction script

Prepare Python scripts

  • Create a directory my_working_dir, and put the following two Python scripts into this directory.

    test_detached_actor_1.py
    import ray
    
    ray.init(address='ray://127.0.0.1:10001', namespace="detached_actor_ns")
    
    @ray.remote
    class TestCounter:
        def __init__(self):
            self.value = 0
    
        def increment(self):
            self.value += 1
            return self.value
    
    tc = TestCounter.options(name="testCounter", lifetime="detached", max_restarts=-1).remote()
    val1 = ray.get(tc.increment.remote())
    val2 = ray.get(tc.increment.remote())
    print(f"val1: {val1}, val2: {val2}")
    
    assert val1 == 1
    assert val2 == 2

    test_detached_actor_2.py
    import ray
    
    # Try to connect to Ray cluster.
    print("Try to connect to Ray cluster.")
    ray.init(address='ray://127.0.0.1:10001', namespace="detached_actor_ns")
    
    # Get TestCounter actor
    print("Get TestCounter actor.")
    tc = ray.get_actor("testCounter")
    
    print("Try to call remote function 'increment'.")
    val = ray.get(tc.increment.remote())
    print(f"val: {val}")
    # The actual value should be 1 rather than 3. Ray relaunches all registered actors when
    # the Ray cluster restarts, but the internal state of the actor is not restored.
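    The second script calls ray.get_actor immediately, while Step 6 notes that the detached actor needs tens of seconds to be restored. To rule out timing as the cause, the lookup could be wrapped in a retry loop. The following is only a sketch: retry_until_ready is a hypothetical helper (not part of Ray), and it assumes ray.get_actor raises ValueError while the named actor is not yet available.

```python
import time


def retry_until_ready(fn, timeout_s=60.0, interval_s=2.0, retry_on=(ValueError,)):
    """Call fn() until it stops raising one of retry_on, or the timeout expires.

    Hypothetical helper for illustration; not part of the Ray API.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            return fn()
        except retry_on:
            if time.monotonic() >= deadline:
                raise  # give up and surface the last error
            time.sleep(interval_s)


# Possible usage inside test_detached_actor_2.py, after ray.init(...):
#   tc = retry_until_ready(lambda: ray.get_actor("testCounter"))
```

    If the job still fails with such a retry in place, the delay in restoring the actor is unlikely to be the problem.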

Reproduction script

# Step 1: Create a Kind cluster
kind create cluster

# Step 2: Install a static Ray cluster with a Redis for fault tolerance
kubectl apply -f https://raw.githubusercontent.com/ray-project/ray/master/doc/source/cluster/kubernetes/configs/static-ray-cluster.with-fault-tolerance.yaml

# Step 3: Forward port 8265 for both Dashboard and Ray job submission
kubectl port-forward service/service-ray-cluster 8265:8265

# Step 4: Create a detached actor in the Ray namespace "detached_actor_ns"
ray job submit --working-dir my_working_dir --address http://localhost:8265 -- python test_detached_actor_1.py

# Step 5: Kill the GCS server process in the head Pod, and wait for the head container to restart.
kubectl exec -it ${HEAD_POD} -- pkill gcs_server

# Step 6: Submit a second job that calls the detached actor. The actor needs tens of seconds to be restored after the head Pod restarts.
ray job submit --working-dir my_working_dir --address http://localhost:8265 -- python test_detached_actor_2.py

Step 6 does not work

[Screenshot: Screen Shot 2023-02-01 at 7 40 20 AM]
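Since the failure is only captured in a screenshot, it may help to record the job status and logs as text. The sketch below submits the Step 6 job through the Ray Jobs Python SDK instead of the CLI and polls until the job reaches a terminal state. wait_for_job and submit_step6 are hypothetical helper names, and the address and timeouts are assumptions taken from the port-forward in Step 3.

```python
import time

# Terminal job states, compared by name.
TERMINAL = {"SUCCEEDED", "FAILED", "STOPPED"}


def wait_for_job(client, job_id, timeout_s=120.0, interval_s=2.0):
    """Poll client.get_job_status(job_id); return the last status name seen."""
    deadline = time.monotonic() + timeout_s
    while True:
        status = client.get_job_status(job_id)
        # Accept either an enum member or a plain string.
        name = str(getattr(status, "value", status))
        if name in TERMINAL or time.monotonic() >= deadline:
            return name
        time.sleep(interval_s)


def submit_step6(address="http://127.0.0.1:8265"):
    """Submit test_detached_actor_2.py and return (status, logs)."""
    # Imported lazily so wait_for_job stays usable without Ray installed.
    from ray.job_submission import JobSubmissionClient

    client = JobSubmissionClient(address)
    job_id = client.submit_job(
        entrypoint="python test_detached_actor_2.py",
        runtime_env={"working_dir": "my_working_dir"},
    )
    status = wait_for_job(client, job_id)
    return status, client.get_job_logs(job_id)
```

If the job stays in PENDING until the timeout, that would be consistent with the Job supervisor not being scheduled; the returned logs can then be pasted into the issue as text.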

Issue Severity

None

Metadata

Labels

P1 (issue that should be fixed within a few weeks), bug (something that is supposed to be working, but isn't), jobs