Description
What happened + What you expected to happen
When running Ray in our cluster, we observed a bug that caused a job to hang until it timed out. While an object was pinned by keeping a reference to it, the PubsubLongPolling RPC encountered a transient network failure. The job then unexpectedly got stuck on ray.get.
Here is the concrete call site of the RPC that encountered the transient error: call_site.log
While analyzing the raylet log, we noticed the following abnormal warning, which may be related to this bug.
[2025-06-25 04:59:40,567 W 501 501] (raylet) pull_manager.cc:501: Object neither in memory nor external storage 00ffffffffffffffffffffffffffffffffffffff1200000001e1f505
Versions / Dependencies
Ray 3.0.0.dev, KubeRay 1.3.0
Reproduction script
Start a RayCluster using KubeRay. Then run the following script with the ray job submit SDK.
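import numpy as np
import ray

ray.init()
# Put a small object and keep its reference so that it stays pinned.
obj = np.ones(200 * 1024, dtype=np.uint8)
x_id = ray.put(obj)
# Put ten larger objects (80 MiB of float64 zeros each) without keeping references.
for _ in range(10):
    ray.put(np.zeros(10 * 1024 * 1024))
# The job gets stuck here when the bug is triggered.
assert (ray.get(x_id) == obj).all()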
The transient network failure can be reproduced with a gRPC interceptor.
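For readers unfamiliar with the idea, here is a minimal sketch of interceptor-style failure injection. It is not the interceptor we used and it does not use the gRPC or Ray APIs. The class TransientFailureInjector, its intercept method, and fake_long_poll are hypothetical names for illustration only. It only shows the pattern: fail a chosen RPC method a fixed number of times, then let calls through.

```python
class TransientFailureInjector:
    """Hypothetical stand-in for an RPC interceptor; not a gRPC or Ray API."""

    def __init__(self, target_method, failures=1):
        self.target_method = target_method
        self.remaining = failures

    def intercept(self, method, continuation, request):
        # Fail the targeted method a fixed number of times, then pass through.
        if method == self.target_method and self.remaining > 0:
            self.remaining -= 1
            raise ConnectionError(f"injected transient failure for {method}")
        return continuation(request)


def fake_long_poll(request):
    # Placeholder handler standing in for the real RPC (assumption).
    return f"reply to {request}"


injector = TransientFailureInjector("PubsubLongPolling", failures=1)

results = []
for attempt in range(2):
    try:
        results.append(
            injector.intercept("PubsubLongPolling", fake_long_poll, "poll-1")
        )
    except ConnectionError as exc:
        results.append(f"error: {exc}")
```

In this sketch the first call fails with the injected error and the second call succeeds, which mimics a transient failure that a correct client should recover from.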
Issue Severity
Medium: It is a significant difficulty but I can work around it.