[Tune] failure when using more than one GPU #32760
Open
@fehtemam

Description

What happened + What you expected to happen

  1. The bug occurs when switching from one GPU to anything with more than one GPU:
ValueError                                Traceback (most recent call last)
File ~/Code/Gits/demand-forecasting/venv/lib/python3.10/site-packages/ray/tune/execution/trial_runner.py:948, in TrialRunner._wait_and_handle_event(self, next_trial)
    947 if event.type == _ExecutorEventType.TRAINING_RESULT:
--> 948     self._on_training_result(
    949         trial, result[_ExecutorEvent.KEY_FUTURE_RESULT]
    950     )
    951 else:

File ~/Code/Gits/demand-forecasting/venv/lib/python3.10/site-packages/ray/tune/execution/trial_runner.py:1073, in TrialRunner._on_training_result(self, trial, result)
   1072 with warn_if_slow("process_trial_result"):
-> 1073     self._process_trial_results(trial, result)

File ~/Code/Gits/demand-forecasting/venv/lib/python3.10/site-packages/ray/tune/execution/trial_runner.py:1156, in TrialRunner._process_trial_results(self, trial, results)
   1155 with warn_if_slow("process_trial_result"):
-> 1156     decision = self._process_trial_result(trial, result)
   1157 if decision is None:
   1158     # If we didn't get a decision, this means a
   1159     # non-training future (e.g. a save) was scheduled.
   1160     # We do not allow processing more results then.

File ~/Code/Gits/demand-forecasting/venv/lib/python3.10/site-packages/ray/tune/execution/trial_runner.py:1193, in TrialRunner._process_trial_result(self, trial, result)
   1192 flat_result = flatten_dict(result)
-> 1193 self._validate_result_metrics(flat_result)
   1195 if self._stopper(trial.trial_id, result) or trial.should_stop(flat_result):
...
    280     experiment_checkpoint_dir = ray.get(
    281         self._remote_tuner.get_experiment_checkpoint_dir.remote()
    282     )

TuneError: The Ray Tune run failed. Please inspect the previous error messages for a cause. After fixing the issue, you can restart the run from scratch or continue this run. To continue this run, you can use `tuner = Tuner.restore("/home/ubuntu/ray_results/train_tune_2023-02-22_23-38-08")`.
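
For reference, resuming the failed experiment with the restore call suggested by the error looks like this on Ray 2.2 (a minimal sketch; the path is the one printed above):

from ray import tune

# Restore the interrupted Tune experiment from its results directory and resume it.
tuner = tune.Tuner.restore("/home/ubuntu/ray_results/train_tune_2023-02-22_23-38-08")
results = tuner.fit()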

This happens when using the Nixtla library neuralforecast, which runs Tune under the hood for its hyperparameter tuning.
This was run on a g3.8xlarge EC2 instance on Ubuntu 22.04, but it is reproducible on other EC2 instance types with more than one GPU; a rough sketch of the equivalent direct Ray Tune resource request is shown after the list below.

  2. Expected behavior is no error, as is the case with only one GPU.
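
For context, requesting more than one GPU per trial directly in Ray Tune looks roughly like the following minimal sketch (Ray 2.2 API). train_fn is a hypothetical placeholder, not the trainable that neuralforecast constructs internally:

from ray import tune
from ray.air import session

def train_fn(config):
    # Hypothetical placeholder trainable; neuralforecast builds the real one itself.
    session.report({"loss": 0.0})

# Reserve 2 GPUs for every trial, mirroring AutoNHITS(..., gpus=2).
tuner = tune.Tuner(
    tune.with_resources(train_fn, {"cpu": 4, "gpu": 2}),
    param_space={"lr": tune.loguniform(1e-4, 1e-1)},
)
results = tuner.fit()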

Versions / Dependencies

OS: Ubuntu 22.04
Python: 3.10.10
neuralforecast==1.4.0
numpy==1.23.5
pandas==1.5.2
ray==2.2.0

Reproduction script

Here is a reproducible example:

from neuralforecast import NeuralForecast
from neuralforecast.auto import AutoNHITS
import numpy as np
import pandas as pd
import ray

# Build a dummy daily series (2018-01-01 through 2022-01-02, 1463 points) of uniform noise.
df_t = pd.DataFrame(columns=['unique_id', 'ds', 'y'])
df_t['ds'] = pd.date_range('2018-01-01', '2022-01-02', freq='D')
df_t['unique_id'] = 'series_1'
rng = np.random.default_rng(seed=6)
df_t['y'] = rng.uniform(low=0, high=1, size=1463)

# AutoNHITS runs its hyperparameter search through Ray Tune; the failure occurs with gpus=2.
hrz = 365
models = [
    AutoNHITS(h=hrz, gpus=2)
]

nforecast = NeuralForecast(models=models, freq='D')
nforecast.fit(df=df_t)
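
For comparison (the expected behavior above), the run completes without error when the model requests a single GPU; only the model definition changes:

models = [
    AutoNHITS(h=hrz, gpus=1)
]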

Issue Severity

High: It blocks me from completing my task.

Metadata

Assignees

No one assigned

    Labels

    P2 (Important issue, but not time-critical), bug (Something that is supposed to be working, but isn't), pending-cleanup (This issue is pending cleanup; it will be removed in 2 weeks after being assigned), tune (Tune-related issues)
