Description
System information
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 16.04.6 LTS
- Ray installed from (source or binary): binary
- Ray version: 0.7.6
- Python version: 3.7.3
- Exact command to reproduce:
Describe the problem
I have a fixed number of workers that each produce a couple hundred thousand results. I pool all of the object IDs into a single list of roughly 10,000,000 items. If I call ray.wait on this list, the node hangs. Passing a timeout doesn't help; it hangs either way. Below is a minimal example.
Source code / logs
import time
import ray

ray.init()

@ray.remote(num_return_vals=1000000)
def test():
    time.sleep(60)
    return list(range(1000000))

results = []
for i in range(10):
    results.extend(test.remote())

ray.wait(results, timeout=20)
2019-12-09 17:27:45,501 WARNING worker.py:1619 -- The node with client ID 8bf2d3fbdbe7ae98544e0222f85c3cdb6f5f6f11 has been marked dead because the monitor has missed too many heartbeats from it.
(pid=raylet) F1209 17:28:00.863570 16313 node_manager.cc:487] Check failed: client_id != gcs_client_->client_table().GetLocalClientId() Exiting because this node manager has mistakenly been marked dead by the monitor.
(pid=raylet) *** Check failure stack trace: ***
(pid=raylet) @ 0x6f8d1a google::LogMessage::Fail()
(pid=raylet) @ 0x6fa103 google::LogMessage::SendToLog()
(pid=raylet) @ 0x6f8a42 google::LogMessage::Flush()
(pid=raylet) @ 0x6f8c31 google::LogMessage::~LogMessage()
(pid=raylet) @ 0x52b112 ray::RayLog::~RayLog()
(pid=raylet) @ 0x466482 ray::raylet::NodeManager::ClientRemoved()
(pid=raylet) @ 0x4b76ee ray::gcs::ClientTable::HandleNotification()
(pid=raylet) @ 0x4d304b _ZNSt17_Function_handlerIFvPN3ray3gcs14RedisGcsClientERKNS0_8ClientIDERKSt6vectorINS0_3rpc11GcsNodeInfoESaIS9_EEEZZNS1_11ClientTable7ConnectERKS9_ENKUlS3_RKNS0_8UniqueIDESH_E_clES3_SK_SH_EUlS3_SK_SD_E_E9_M_invokeERKSt9_Any_dataS3_S6_SD_
(pid=raylet) @ 0x4d2706 _ZNSt17_Function_handlerIFvPN3ray3gcs14RedisGcsClientERKNS0_8ClientIDENS0_3rpc13GcsChangeModeERKSt6vectorINS7_11GcsNodeInfoESaISA_EEEZNS1_3LogIS4_SA_E9SubscribeERKNS0_5JobIDES6_RKSt8functionIFvS3_S6_SE_EERKSL_IFvS3_EEEUlS3_S6_S8_SE_E_E9_M_invokeERKSt9_Any_dataS3_S6_S8_SE_
(pid=raylet) @ 0x4b5673 _ZZN3ray3gcs3LogINS_8ClientIDENS_3rpc11GcsNodeInfoEE9SubscribeERKNS_5JobIDERKS2_RKSt8functionIFvPNS0_14RedisGcsClientESA_NS3_13GcsChangeModeERKSt6vectorIS4_SaIS4_EEEERKSB_IFvSD_EEENKUlRKNS0_13CallbackReplyEE_clESU_
(pid=raylet) @ 0x4dacb9 ray::gcs::GlobalRedisCallback()
(pid=raylet) @ 0x4df9cb redisProcessCallbacks
(pid=raylet) @ 0x4de726 RedisAsioClient::handle_read()
(pid=raylet) @ 0x4dd958 boost::asio::detail::reactive_null_buffers_op<>::do_complete()
(pid=raylet) @ 0x425bcd boost::asio::detail::scheduler::run()
(pid=raylet) @ 0x40fb1d main
(pid=raylet) @ 0x7f03ca300830 __libc_start_main
(pid=raylet) @ 0x4207e1 (unknown)
After this the raylet dies.
Any ideas?
Is there any other way to receive results in the order they become ready, aside from calling ray.wait on the full list like this?
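For reference, this is the consumption pattern I am after: a minimal sketch that assumes ray.wait is called repeatedly on the remaining IDs, pulling one ready result at a time (`results` is the list built in the example above).

# Sketch of the intended consumption pattern (assumes `results` from the
# example above): repeatedly wait for one object at a time and process it
# in whatever order it becomes ready.
remaining = list(results)
while remaining:
    ready, remaining = ray.wait(remaining, num_returns=1)
    value = ray.get(ready[0])
    # ... process `value` here ...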