Ray.wait causes node to hang if there are too many object ids · Issue #6403 · ray-project/ray · GitHub
Ray.wait causes node to hang if there are too many object ids #6403
Open
@JovanCe

Description

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 16.04.6 LTS
  • Ray installed from (source or binary): binary
  • Ray version: 0.7.6
  • Python version: 3.7.3
  • Exact command to reproduce:

Describe the problem

I have a fixed number of workers, each producing a couple hundred thousand results. I pool all the object IDs into a single list of roughly 10,000,000 items. Calling ray.wait on this list makes the node hang. Passing a timeout doesn't help; it hangs either way. A minimal example is below.

Source code / logs

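import time
import ray

ray.init()

# Each call to test.remote() returns 1,000,000 separate object IDs.
@ray.remote(num_return_vals=1000000)
def test():
    time.sleep(60)
    return list(range(1000000))

# Collect 10 x 1,000,000 = 10,000,000 object IDs in one flat list.
results = []
for i in range(10):
    results.extend(test.remote())

# This call hangs the node, even though a timeout is set.
ray.wait(results, timeout=20)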

2019-12-09 17:27:45,501 WARNING worker.py:1619 -- The node with client ID 8bf2d3fbdbe7ae98544e0222f85c3cdb6f5f6f11 has been marked dead because the monitor has missed too many heartbeats from it.

(pid=raylet) F1209 17:28:00.863570 16313 node_manager.cc:487]  Check failed: client_id != gcs_client_->client_table().GetLocalClientId() Exiting because this node manager has mistakenly been marked dead by the monitor.
(pid=raylet) *** Check failure stack trace: ***
(pid=raylet)     @           0x6f8d1a  google::LogMessage::Fail()
(pid=raylet)     @           0x6fa103  google::LogMessage::SendToLog()
(pid=raylet)     @           0x6f8a42  google::LogMessage::Flush()
(pid=raylet)     @           0x6f8c31  google::LogMessage::~LogMessage()
(pid=raylet)     @           0x52b112  ray::RayLog::~RayLog()
(pid=raylet)     @           0x466482  ray::raylet::NodeManager::ClientRemoved()
(pid=raylet)     @           0x4b76ee  ray::gcs::ClientTable::HandleNotification()
(pid=raylet)     @           0x4d304b  _ZNSt17_Function_handlerIFvPN3ray3gcs14RedisGcsClientERKNS0_8ClientIDERKSt6vectorINS0_3rpc11GcsNodeInfoESaIS9_EEEZZNS1_11ClientTable7ConnectERKS9_ENKUlS3_RKNS0_8UniqueIDESH_E_clES3_SK_SH_EUlS3_SK_SD_E_E9_M_invokeERKSt9_Any_dataS3_S6_SD_
(pid=raylet)     @           0x4d2706  _ZNSt17_Function_handlerIFvPN3ray3gcs14RedisGcsClientERKNS0_8ClientIDENS0_3rpc13GcsChangeModeERKSt6vectorINS7_11GcsNodeInfoESaISA_EEEZNS1_3LogIS4_SA_E9SubscribeERKNS0_5JobIDES6_RKSt8functionIFvS3_S6_SE_EERKSL_IFvS3_EEEUlS3_S6_S8_SE_E_E9_M_invokeERKSt9_Any_dataS3_S6_S8_SE_
(pid=raylet)     @           0x4b5673  _ZZN3ray3gcs3LogINS_8ClientIDENS_3rpc11GcsNodeInfoEE9SubscribeERKNS_5JobIDERKS2_RKSt8functionIFvPNS0_14RedisGcsClientESA_NS3_13GcsChangeModeERKSt6vectorIS4_SaIS4_EEEERKSB_IFvSD_EEENKUlRKNS0_13CallbackReplyEE_clESU_
(pid=raylet)     @           0x4dacb9  ray::gcs::GlobalRedisCallback()
(pid=raylet)     @           0x4df9cb  redisProcessCallbacks
(pid=raylet)     @           0x4de726  RedisAsioClient::handle_read()
(pid=raylet)     @           0x4dd958  boost::asio::detail::reactive_null_buffers_op<>::do_complete()
(pid=raylet)     @           0x425bcd  boost::asio::detail::scheduler::run()
(pid=raylet)     @           0x40fb1d  main
(pid=raylet)     @     0x7f03ca300830  __libc_start_main
(pid=raylet)     @           0x4207e1  (unknown)

After this the raylet dies.

Any ideas?
Is there any other way to receive results in the order they're ready aside from this?
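
One workaround that might be worth trying is to never hand the full list to ray.wait at once, and instead poll it in smaller chunks, round-robin. The sketch below is only an illustration of that idea and has not been verified to avoid the hang on Ray 0.7.6. The names iter_ready_in_chunks, wait_fn and chunk_size are hypothetical. wait_fn is assumed to behave like ray.wait, returning a (ready, not_ready) pair, and is passed in as a parameter so the chunking logic can be run without a cluster. Results arrive only in approximate readiness order, since chunks are polled in turn.

```python
from collections import deque
from typing import Callable, Iterator, List, Sequence, Tuple, TypeVar

T = TypeVar("T")

# Assumption: wait_fn behaves like ray.wait, i.e. it takes a list of pending
# IDs and returns a (ready, not_ready) pair of lists.
WaitFn = Callable[[List[T]], Tuple[List[T], List[T]]]


def iter_ready_in_chunks(
    object_ids: Sequence[T],
    wait_fn: WaitFn,
    chunk_size: int = 10000,
) -> Iterator[List[T]]:
    """Yield batches of ready IDs, passing at most chunk_size IDs to wait_fn per call."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")

    # Split the full list into chunks and poll them round-robin.
    chunks = deque(
        list(object_ids[start:start + chunk_size])
        for start in range(0, len(object_ids), chunk_size)
    )
    while chunks:
        pending = chunks.popleft()
        ready, not_ready = wait_fn(pending)
        if ready:
            yield ready
        if not_ready:
            # Re-queue whatever is still pending so it is polled again later.
            chunks.append(not_ready)


# Hypothetical usage with Ray (not executed here):
#
#   def wait_fn(ids):
#       return ray.wait(ids, num_returns=len(ids), timeout=1.0)
#
#   for ready in iter_ready_in_chunks(results, wait_fn):
#       values = ray.get(ready)
```

The chunk size and timeout here are arbitrary placeholders. With many chunks and nothing ready, a full pass takes roughly the number of chunks times the timeout, so both would need tuning for a real workload.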

Metadata

Assignees

No one assigned

Labels

  • P2: Important issue, but not time-critical
  • bug: Something that is supposed to be working; but isn't
  • community-backlog
  • pending-cleanup: This issue is pending cleanup. It will be removed in 2 weeks after being assigned.
  • stability
