[Core|RayTrain] RuntimeError: Some workers returned results while others didn't · Issue #30545 · ray-project/ray · GitHub
[Core|RayTrain] RuntimeError: Some workers returned results while others didn't #30545
Open
@HuangLED

Description

What happened + What you expected to happen

I expected parallel training to succeed, but it consistently fails with the following error.

Error message:

Traceback (most recent call last):
  File "bytegnn/examples/bgtrain.py", line 471, in <module>
    trainer.run(train_func, config=train_config)
  File "/home/ruoyun.huang/anaconda3/envs/py38/lib/python3.8/site-packages/ray/train/trainer.py", line 334, in run
    for intermediate_result in iterator:
  File "/home/ruoyun.huang/anaconda3/envs/py38/lib/python3.8/site-packages/ray/train/trainer.py", line 716, in __next__
    next_results = self._run_with_error_handling(self._fetch_next_result)
  File "/home/ruoyun.huang/anaconda3/envs/py38/lib/python3.8/site-packages/ray/train/trainer.py", line 687, in _run_with_error_handling
    return func()
  File "/home/ruoyun.huang/anaconda3/envs/py38/lib/python3.8/site-packages/ray/train/trainer.py", line 742, in _fetch_next_result
    results = self._backend_executor.get_next_results()
  File "/home/ruoyun.huang/anaconda3/envs/py38/lib/python3.8/site-packages/ray/train/utils.py", line 173, in <lambda>
    return lambda *args, **kwargs: ray.get(actor_method.remote(*args, **kwargs))
  File "/home/ruoyun.huang/anaconda3/envs/py38/lib/python3.8/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "/home/ruoyun.huang/anaconda3/envs/py38/lib/python3.8/site-packages/ray/worker.py", line 1809, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(RuntimeError): ray::BackendExecutor.get_next_results() (pid=1368173, ip=10.231.243.221, repr=<ray.train.backend.BackendExecutor object at 0x7f7432025220>)
  File "/home/ruoyun.huang/anaconda3/envs/py38/lib/python3.8/site-packages/ray/train/backend.py", line 483, in get_next_results
    raise RuntimeError(
RuntimeError: Some workers returned results while others didn't. Make sure that `train.report()` and `train.save_checkpoint()` are called the same number of times on all workers.

Versions / Dependencies

ray==1.12.1

Reproduction script

n/a

We are using RayTrain for parallel training and consistently run into this error once we scale up. We also observed that it is less likely to trigger when there are exactly 20 trainers in parallel (testing was done on a single-machine setup).

I understand that this issue has been mitigated in more recent releases, but we have been using the RayTrain 1.x interface and migrating to the 2.0 interface will take time. If no 1.x version includes the fix, is there any workaround we can use to prevent this from happening?
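
For reference, the constraint in the error message (the same number of `train.report()` calls on every worker) can be sketched as below. This is only an illustration under an assumption that has not been confirmed for our job: that the mismatch comes from workers holding data shards of different sizes. The helpers `equal_step_count`, `run_worker`, and `fake_report` are hypothetical names; `fake_report` stands in for `train.report` so the sketch runs without Ray.

```python
from typing import Callable, Dict, List, Sequence


def equal_step_count(shard_sizes: Sequence[int], batch_size: int) -> int:
    """Return the number of full batches that every worker can run.

    Uses the smallest shard so that no worker runs (and reports) more
    steps than any other worker.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return min(shard_sizes) // batch_size


def run_worker(
    rank: int,
    shard: List[int],
    batch_size: int,
    num_steps: int,
    report: Callable[..., None],
) -> None:
    """Hypothetical per-worker loop: exactly one report call per step."""
    for step in range(num_steps):
        batch = shard[step * batch_size:(step + 1) * batch_size]
        loss = sum(batch) / len(batch)  # placeholder for a real training step
        # In a Ray Train 1.x training function this would be train.report(...).
        report(rank=rank, step=step, loss=loss)


# Simulate three workers whose shards have different sizes (assumed scenario).
shards = [list(range(10)), list(range(7)), list(range(9))]
batch_size = 2
num_steps = equal_step_count([len(s) for s in shards], batch_size)

calls: Dict[int, int] = {rank: 0 for rank in range(len(shards))}


def fake_report(**metrics) -> None:
    """Stand-in for train.report that counts calls per worker rank."""
    calls[metrics["rank"]] += 1


for rank, shard in enumerate(shards):
    run_worker(rank, shard, batch_size, num_steps, fake_report)
```

With a fixed step count derived from the smallest shard, each simulated worker reports the same number of times. Whether this applies to our failure depends on whether uneven shards are actually the cause.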

Thanks.

@matt

Issue Severity

High: It blocks me from completing my task.

Metadata

Labels

P2: Important issue, but not time-critical
bug: Something that is supposed to be working; but isn't
pending-cleanup: This issue is pending cleanup. It will be removed in 2 weeks after being assigned.
train: Ray Train Related Issue
