Description
What happened + What you expected to happen
We expected parallel training to succeed, but it constantly fails with the following error.
Error message:
Traceback (most recent call last):
File "bytegnn/examples/bgtrain.py", line 471, in <module>
trainer.run(train_func, config=train_config)
File "/home/ruoyun.huang/anaconda3/envs/py38/lib/python3.8/site-packages/ray/train/trainer.py", line 334, in run
for intermediate_result in iterator:
File "/home/ruoyun.huang/anaconda3/envs/py38/lib/python3.8/site-packages/ray/train/trainer.py", line 716, in __next__
next_results = self._run_with_error_handling(self._fetch_next_result)
File "/home/ruoyun.huang/anaconda3/envs/py38/lib/python3.8/site-packages/ray/train/trainer.py", line 687, in _run_with_error_handling
return func()
File "/home/ruoyun.huang/anaconda3/envs/py38/lib/python3.8/site-packages/ray/train/trainer.py", line 742, in _fetch_next_result
results = self._backend_executor.get_next_results()
File "/home/ruoyun.huang/anaconda3/envs/py38/lib/python3.8/site-packages/ray/train/utils.py", line 173, in <lambda>
return lambda *args, **kwargs: ray.get(actor_method.remote(*args, **kwargs))
File "/home/ruoyun.huang/anaconda3/envs/py38/lib/python3.8/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
return func(*args, **kwargs)
File "/home/ruoyun.huang/anaconda3/envs/py38/lib/python3.8/site-packages/ray/worker.py", line 1809, in get
raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(RuntimeError): ray::BackendExecutor.get_next_results() (pid=1368173, ip=10.231.243.221, repr=<ray.train.backend.BackendExecutor object at 0x7f7432025220>)
File "/home/ruoyun.huang/anaconda3/envs/py38/lib/python3.8/site-packages/ray/train/backend.py", line 483, in get_next_results
raise RuntimeError(
RuntimeError: Some workers returned results while others didn't. Make sure that `train.report()` and `train.save_checkpoint()` are called the same number of times on all workers.
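For reference, the error refers to the Ray Train 1.x constraint that every worker running `train_func` must call `train.report()` and `train.save_checkpoint()` the same number of times. A minimal sketch of a training function that satisfies this constraint (the `num_epochs` config key and the stand-in loss value are hypothetical, not taken from our script):

```python
from ray import train

def train_func(config):
    # Every worker loops over the same, fixed number of epochs, so each one
    # calls train.report() and train.save_checkpoint() exactly
    # config["num_epochs"] times. If any worker loops a different number of
    # times (or exits early), BackendExecutor raises the error above.
    for epoch in range(config["num_epochs"]):
        loss = 1.0 / (epoch + 1)  # stand-in for the real per-epoch training step
        train.report(loss=loss, epoch=epoch)
        train.save_checkpoint(epoch=epoch, loss=loss)
```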
Versions / Dependencies
ray==1.12.1
Reproduction script
n/a
We are using Ray Train for parallel training and constantly run into this error once we scale up. We also observed that it is less likely to trigger when there are exactly 20 trainers running in parallel (testing was done on a single-machine setup).
I understand that this issue has been mitigated in more recent releases, but we have been using the Ray Train 1.x interface and it will take time to migrate to the 2.0 interface. If there is no 1.x release that fixes this issue, is there any workaround we can apply to prevent it from happening?
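In case it helps, here is a rough sketch (not our actual script) of the kind of workaround we are considering on the 1.x `Trainer` API: agree on a common step count across workers before the training loop, so every worker ends up calling `train.report()` the same number of times even if data shards are uneven. The shard lengths, the `num_epochs` key, and the worker count are made up for illustration, and it assumes the `"torch"` backend (PyTorch installed) so that `torch.distributed` collectives are available inside `train_func`.

```python
import torch
import torch.distributed as dist
from ray import train
from ray.train import Trainer

def train_func(config):
    # Stand-in for a per-worker data shard of uneven length.
    shard_len = 100 + train.world_rank()
    # The "torch" backend initializes torch.distributed, so an all_reduce
    # with MIN gives a step count shared by all workers.
    steps = torch.tensor([shard_len])
    dist.all_reduce(steps, op=dist.ReduceOp.MIN)
    steps = int(steps.item())
    for epoch in range(config["num_epochs"]):
        for _ in range(steps):
            pass  # one training step on one batch (omitted)
        # Exactly one report per epoch on every worker.
        train.report(epoch=epoch, steps_per_epoch=steps)

trainer = Trainer(backend="torch", num_workers=4)
trainer.start()
trainer.run(train_func, config={"num_epochs": 2})
trainer.shutdown()
```

Would something along these lines be a reasonable way to avoid the mismatch on 1.12, or is there a recommended approach?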
Thanks.
Issue Severity
High: It blocks me from completing my task.