[Core] When pinning an object, a transient error on RPC PubsubLongPolling causes the job to get stuck

What happened + What you expected to happen

When running Ray in our cluster, we observed a bug that caused a job to get stuck until it timed out. While an object was pinned by keeping a reference to it, the PubsubLongPolling RPC encountered a transient network failure. After that, the job unexpectedly hung on ray.get.

Here is the concrete call site of the RPC that encountered the transient error: call_site.log

While analyzing the raylet log, we noticed the following abnormal warning, which may be related to this bug.

[2025-06-25 04:59:40,567 W 501 501] (raylet) pull_manager.cc:501: Object neither in memory nor external storage 00ffffffffffffffffffffffffffffffffffffff1200000001e1f505

Versions / Dependencies

Ray 3.0.0.dev, Kuberay 1.3.0

Reproduction script

Start a RayCluster using Kuberay. Then submit the following script with the Ray job submission SDK.

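import numpy as np
import ray


ray.init()

# Put a 200 KiB object and keep its reference so that it stays pinned.
obj = np.ones(200 * 1024, dtype=np.uint8)
x_id = ray.put(obj)

# Put ten larger objects without keeping their references.
for _ in range(10):
    ray.put(np.zeros(10 * 1024 * 1024))

# With the bug, this ray.get call gets stuck.
assert (ray.get(x_id) == obj).all()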

The transient network failure can be reproduced with a gRPC interceptor.
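For readers unfamiliar with this kind of fault injection, the idea can be sketched in plain Python. This is only an illustration of the pattern (fail the first call to one named RPC, then let later calls through). It is not the interceptor used for this report and not a Ray or gRPC API; `FailFirstN`, `TransientRpcError`, and `fake_long_poll` are hypothetical names introduced here.

```python
import threading


class TransientRpcError(Exception):
    """Stand-in for a transient transport error (hypothetical, not a Ray/gRPC type)."""


class FailFirstN:
    """Fail the first `n` calls to one named method, then pass calls through."""

    def __init__(self, method_name, n):
        self.method_name = method_name
        self._remaining = n
        self._lock = threading.Lock()

    def intercept(self, method_name, call, *args, **kwargs):
        # Decide under the lock whether this call should be failed.
        with self._lock:
            inject = method_name == self.method_name and self._remaining > 0
            if inject:
                self._remaining -= 1
        if inject:
            raise TransientRpcError(f"injected failure for {method_name}")
        return call(*args, **kwargs)


def fake_long_poll():
    # Hypothetical handler standing in for the real RPC.
    return "reply"


injector = FailFirstN("PubsubLongPolling", n=1)
results = []
for _ in range(2):
    try:
        results.append(injector.intercept("PubsubLongPolling", fake_long_poll))
    except TransientRpcError:
        results.append("failed")
```

In this sketch the first call fails and the second succeeds, which mimics a one-off network error that a caller is expected to recover from by retrying.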

Issue Severity

Medium: It is a significant difficulty but I can work around it.
