Description
What happened + What you expected to happen
We expected parallel training to succeed, but it constantly fails with the following error.
Error message:
Traceback (most recent call last):
File "bytegnn/examples/bgtrain.py", line 471, in <module>
trainer.run(train_func, config=train_config)
File "/home/ruoyun.huang/anaconda3/envs/py38/lib/python3.8/site-packages/ray/train/trainer.py", line 334, in run
for intermediate_result in iterator:
File "/home/ruoyun.huang/anaconda3/envs/py38/lib/python3.8/site-packages/ray/train/trainer.py", line 716, in __next__
next_results = self._run_with_error_handling(self._fetch_next_result)
File "/home/ruoyun.huang/anaconda3/envs/py38/lib/python3.8/site-packages/ray/train/trainer.py", line 687, in _run_with_error_handling
return func()
File "/home/ruoyun.huang/anaconda3/envs/py38/lib/python3.8/site-packages/ray/train/trainer.py", line 742, in _fetch_next_result
results = self._backend_executor.get_next_results()
File "/home/ruoyun.huang/anaconda3/envs/py38/lib/python3.8/site-packages/ray/train/utils.py", line 173, in <lambda>
return lambda *args, **kwargs: ray.get(actor_method.remote(*args, **kwargs))
File "/home/ruoyun.huang/anaconda3/envs/py38/lib/python3.8/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
return func(*args, **kwargs)
File "/home/ruoyun.huang/anaconda3/envs/py38/lib/python3.8/site-packages/ray/worker.py", line 1809, in get
raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(RuntimeError): ray::BackendExecutor.get_next_results() (pid=1368173, ip=10.231.243.221, repr=<ray.train.backend.BackendExecutor object at 0x7f7432025220>)
File "/home/ruoyun.huang/anaconda3/envs/py38/lib/python3.8/site-packages/ray/train/backend.py", line 483, in get_next_results
raise RuntimeError(
RuntimeError: Some workers returned results while others didn't. Make sure that `train.report()` and `train.save_checkpoint()` are called the same number of times on all workers.
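For reference, the error refers to the Ray Train 1.x constraint that every worker running `train_func` must call `train.report()` and `train.save_checkpoint()` the same number of times. A minimal sketch of a training function that satisfies this constraint (the `num_epochs` config key and the stand-in loss value are hypothetical, not taken from our script):

```python
from ray import train

def train_func(config):
    # Every worker loops over the same, fixed number of epochs, so each one
    # calls train.report() and train.save_checkpoint() exactly
    # config["num_epochs"] times. If any worker loops a different number of
    # times (or exits early), BackendExecutor raises the error above.
    for epoch in range(config["num_epochs"]):
        loss = 1.0 / (epoch + 1)  # stand-in for the real per-epoch training step
        train.report(loss=loss, epoch=epoch)
        train.save_checkpoint(epoch=epoch, loss=loss)
```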
Versions / Dependencies
ray==1.12.1
Reproduction script
n/a
We are using Ray Train for parallel training and constantly run into this error once we scale up. We also observed that it is less likely to trigger when there are exactly 20 trainers running in parallel (testing was done on a single-machine setup).
I understand that this issue has been mitigated in more recent releases, but we have been using the Ray Train 1.x interface and it will take time to migrate to the 2.0 interface. If there is no 1.x release that fixes this issue, is there any workaround we can apply to prevent it from happening?
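In case it helps, here is a rough sketch (not our actual script) of the kind of workaround we are considering on the 1.x `Trainer` API: agree on a common step count across workers before the training loop, so every worker ends up calling `train.report()` the same number of times even if data shards are uneven. The shard lengths, the `num_epochs` key, and the worker count are made up for illustration, and it assumes the `"torch"` backend (PyTorch installed) so that `torch.distributed` collectives are available inside `train_func`.

```python
import torch
import torch.distributed as dist
from ray import train
from ray.train import Trainer

def train_func(config):
    # Stand-in for a per-worker data shard of uneven length.
    shard_len = 100 + train.world_rank()
    # The "torch" backend initializes torch.distributed, so an all_reduce
    # with MIN gives a step count shared by all workers.
    steps = torch.tensor([shard_len])
    dist.all_reduce(steps, op=dist.ReduceOp.MIN)
    steps = int(steps.item())
    for epoch in range(config["num_epochs"]):
        for _ in range(steps):
            pass  # one training step on one batch (omitted)
        # Exactly one report per epoch on every worker.
        train.report(epoch=epoch, steps_per_epoch=steps)

trainer = Trainer(backend="torch", num_workers=4)
trainer.start()
trainer.run(train_func, config={"num_epochs": 2})
trainer.shutdown()
```

Would something along these lines be a reasonable way to avoid the mismatch on 1.12, or is there a recommended approach?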
Thanks.
Issue Severity
High: It blocks me from completing my task.