### Description
What happened + What you expected to happen
- The bug occurs when switching from a single GPU to any configuration with more than one GPU:
```
ValueError                                Traceback (most recent call last)
File ~/Code/Gits/demand-forecasting/venv/lib/python3.10/site-packages/ray/tune/execution/trial_runner.py:948, in TrialRunner._wait_and_handle_event(self, next_trial)
    947 if event.type == _ExecutorEventType.TRAINING_RESULT:
--> 948     self._on_training_result(
    949         trial, result[_ExecutorEvent.KEY_FUTURE_RESULT]
    950     )
    951 else:
File ~/Code/Gits/demand-forecasting/venv/lib/python3.10/site-packages/ray/tune/execution/trial_runner.py:1073, in TrialRunner._on_training_result(self, trial, result)
   1072 with warn_if_slow("process_trial_result"):
-> 1073     self._process_trial_results(trial, result)
File ~/Code/Gits/demand-forecasting/venv/lib/python3.10/site-packages/ray/tune/execution/trial_runner.py:1156, in TrialRunner._process_trial_results(self, trial, results)
   1155 with warn_if_slow("process_trial_result"):
-> 1156     decision = self._process_trial_result(trial, result)
   1157 if decision is None:
   1158     # If we didn't get a decision, this means a
   1159     # non-training future (e.g. a save) was scheduled.
   1160     # We do not allow processing more results then.
File ~/Code/Gits/demand-forecasting/venv/lib/python3.10/site-packages/ray/tune/execution/trial_runner.py:1193, in TrialRunner._process_trial_result(self, trial, result)
   1192 flat_result = flatten_dict(result)
-> 1193 self._validate_result_metrics(flat_result)
   1195 if self._stopper(trial.trial_id, result) or trial.should_stop(flat_result):
...
    280 experiment_checkpoint_dir = ray.get(
    281     self._remote_tuner.get_experiment_checkpoint_dir.remote()
    282 )
TuneError: The Ray Tune run failed. Please inspect the previous error messages for a cause. After fixing the issue, you can restart the run from scratch or continue this run. To continue this run, you can use `tuner = Tuner.restore("/home/ubuntu/ray_results/train_tune_2023-02-22_23-38-08")`.
```
This happens when using Nixtla's `neuralforecast` library, which calls Ray Tune for its hyperparameter tuning.
This was run on a g3.8xlarge EC2 instance on Ubuntu 22.04, but it is reproducible on other EC2 instance types with more than one GPU.
- The expected behavior is no error, as is the case with a single GPU.
### Versions / Dependencies
OS: Ubuntu 22.04
Python: 3.10.10
```
neuralforecast==1.4.0
numpy==1.23.5
pandas==1.5.2
ray==2.2.0
```
### Reproduction script
Here is a reproducible example:

```python
from neuralforecast import NeuralForecast
from neuralforecast.auto import AutoNHITS
import numpy as np
import pandas as pd

# Build a synthetic daily series (1463 observations).
dates = pd.date_range('2018-01-01', '2022-01-02', freq='D')
rng = np.random.default_rng(seed=6)
df_t = pd.DataFrame({
    'unique_id': 'series_1',
    'ds': dates,
    'y': rng.uniform(low=0, high=1, size=len(dates)),
})

hrz = 365
models = [
    AutoNHITS(h=hrz, gpus=2)  # gpus=2 fails; gpus=1 works
]
nforecast = NeuralForecast(models=models, freq='D')
nforecast.fit(df=df_t)
```
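As a sanity check on the synthetic data above (not part of the repro itself), the chosen date range yields exactly 1463 daily observations, which is where the series length comes from:

```python
import pandas as pd

# The repro's date range: 2018-01-01 through 2022-01-02, daily frequency.
# 2018-2021 contribute 1461 days (incl. the 2020 leap day), plus Jan 1-2 of 2022.
dates = pd.date_range('2018-01-01', '2022-01-02', freq='D')
print(len(dates))  # 1463
```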
### Issue Severity
High: It blocks me from completing my task.