[Core] When pinning object, transient error on RPC `PubsubLongPolling` causes job stuck · Issue #54081 · ray-project/ray
Open
@qts0312

What happened + What you expected to happen

When running Ray in our cluster, we observed a bug that caused a job to hang until it timed out. While an object was pinned by holding a reference to it, the PubsubLongPolling RPC encountered a transient network failure. The job then hung on ray.get unexpectedly.

Here is the concrete call site of the RPC that encountered the transient error: call_site.log

While analyzing the raylet log, we noticed an abnormal warning that may be related to this bug.

[2025-06-25 04:59:40,567 W 501 501] (raylet) pull_manager.cc:501: Object neither in memory nor external storage 00ffffffffffffffffffffffffffffffffffffff1200000001e1f505
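To check whether the same warning shows up in other raylet logs, a small scanner along these lines may help. It is only a sketch: the regular expression is inferred from the single log line quoted above, and the helper name `find_missing_objects` is hypothetical, not part of Ray.

```python
import re

# Layout inferred from the one warning line quoted in this issue (an assumption);
# other Ray versions may format raylet log lines differently.
MISSING_OBJECT_PATTERN = re.compile(
    r"\[(?P<ts>[\d\-]+ [\d:,]+) (?P<level>\w) (?P<pid>\d+) (?P<tid>\d+)\] "
    r"\(raylet\) pull_manager\.cc:(?P<line>\d+): "
    r"Object neither in memory nor external storage (?P<object_id>[0-9a-f]+)"
)


def find_missing_objects(log_text):
    """Return the object IDs named in 'neither in memory nor external storage' warnings."""
    return [m.group("object_id") for m in MISSING_OBJECT_PATTERN.finditer(log_text)]
```

Feeding it the contents of a raylet log file returns the object IDs the pull manager could not locate, which can then be compared against the ID of the pinned object.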

Versions / Dependencies

Ray 3.0.0.dev, KubeRay 1.3.0

Reproduction script

Start a RayCluster using KubeRay, then run the following script with the Ray job submission SDK.

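import numpy as np
import ray


ray.init()

# Small object whose reference is kept, so it stays pinned.
obj = np.ones(200 * 1024, dtype=np.uint8)
x_id = ray.put(obj)

# Put larger objects whose references are dropped immediately.
for _ in range(10):
    ray.put(np.zeros(10 * 1024 * 1024))
# In the failing case, the job hangs on this ray.get.
assert (ray.get(x_id) == obj).all()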

A transient network failure can be reproduced with a gRPC interceptor.
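The fault-injection idea can be illustrated without gRPC. The following is a dependency-free sketch, not a real gRPC interceptor and not Ray code: `FaultInjector`, `TransientRpcError`, `call_with_retry`, and `long_poll` are all hypothetical names. It fails the first N calls to a chosen method and passes every other call through, which is the behavior an interceptor would need to reproduce the transient error.

```python
class TransientRpcError(Exception):
    """Stands in for a transient network failure (hypothetical)."""


class FaultInjector:
    """Fails the first `failures` calls to `method_name`, then passes calls through."""

    def __init__(self, method_name, failures):
        self.method_name = method_name
        self.remaining = failures
        self.injected = 0

    def intercept(self, method_name, call, *args):
        # Only the targeted method is affected; other methods are forwarded untouched.
        if method_name == self.method_name and self.remaining > 0:
            self.remaining -= 1
            self.injected += 1
            raise TransientRpcError(f"injected failure on {method_name}")
        return call(*args)


def call_with_retry(injector, method_name, call, *args, max_attempts=3):
    """Caller that retries on transient errors, re-raising after the last attempt."""
    for attempt in range(1, max_attempts + 1):
        try:
            return injector.intercept(method_name, call, *args)
        except TransientRpcError:
            if attempt == max_attempts:
                raise


def long_poll(subscriber_id):
    # Placeholder for the real long-polling RPC body (assumption).
    return f"messages for {subscriber_id}"
```

With `FaultInjector("PubsubLongPolling", failures=1)`, a caller that retries recovers on the second attempt, while a caller limited to a single attempt sees the error and never receives the reply.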

Issue Severity

Medium: It is a significant difficulty but I can work around it.

Labels

P0 (issues that should be fixed in short order), bug (something that is supposed to be working, but isn't), core (issues that should be addressed in Ray Core), stability
