[Tune] failure when using more than one GPU #32760
Open
@fehtemam

Description

What happened + What you expected to happen

  1. The bug occurs when switching from one GPU to anything with more than one GPU:
ValueError                                Traceback (most recent call last)
File ~/Code/Gits/demand-forecasting/venv/lib/python3.10/site-packages/ray/tune/execution/trial_runner.py:948, in TrialRunner._wait_and_handle_event(self, next_trial)
    947 if event.type == _ExecutorEventType.TRAINING_RESULT:
--> 948     self._on_training_result(
    949         trial, result[_ExecutorEvent.KEY_FUTURE_RESULT]
    950     )
    951 else:

File ~/Code/Gits/demand-forecasting/venv/lib/python3.10/site-packages/ray/tune/execution/trial_runner.py:1073, in TrialRunner._on_training_result(self, trial, result)
   1072 with warn_if_slow("process_trial_result"):
-> 1073     self._process_trial_results(trial, result)

File ~/Code/Gits/demand-forecasting/venv/lib/python3.10/site-packages/ray/tune/execution/trial_runner.py:1156, in TrialRunner._process_trial_results(self, trial, results)
   1155 with warn_if_slow("process_trial_result"):
-> 1156     decision = self._process_trial_result(trial, result)
   1157 if decision is None:
   1158     # If we didn't get a decision, this means a
   1159     # non-training future (e.g. a save) was scheduled.
   1160     # We do not allow processing more results then.

File ~/Code/Gits/demand-forecasting/venv/lib/python3.10/site-packages/ray/tune/execution/trial_runner.py:1193, in TrialRunner._process_trial_result(self, trial, result)
   1192 flat_result = flatten_dict(result)
-> 1193 self._validate_result_metrics(flat_result)
   1195 if self._stopper(trial.trial_id, result) or trial.should_stop(flat_result):
...
    280     experiment_checkpoint_dir = ray.get(
    281         self._remote_tuner.get_experiment_checkpoint_dir.remote()
    282     )

TuneError: The Ray Tune run failed. Please inspect the previous error messages for a cause. After fixing the issue, you can restart the run from scratch or continue this run. To continue this run, you can use `tuner = Tuner.restore("/home/ubuntu/ray_results/train_tune_2023-02-22_23-38-08")`.
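
For reference, resuming the failed experiment with the restore call suggested by the error looks like this on Ray 2.2 (a minimal sketch; the path is the one printed above):

from ray import tune

# Restore the interrupted Tune experiment from its results directory and resume it.
tuner = tune.Tuner.restore("/home/ubuntu/ray_results/train_tune_2023-02-22_23-38-08")
results = tuner.fit()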

This happens when using the Nixtla library neuralforecast, which runs Tune under the hood for its hyperparameter tuning.
This was run on a g3.8xlarge EC2 instance on Ubuntu 22.04, but it is reproducible on other EC2 instance types with more than one GPU; a rough sketch of the equivalent direct Ray Tune resource request is shown after the list below.

  2. Expected behavior is no error, as is the case with only one GPU.
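
For context, requesting more than one GPU per trial directly in Ray Tune looks roughly like the following minimal sketch (Ray 2.2 API). train_fn is a hypothetical placeholder, not the trainable that neuralforecast constructs internally:

from ray import tune
from ray.air import session

def train_fn(config):
    # Hypothetical placeholder trainable; neuralforecast builds the real one itself.
    session.report({"loss": 0.0})

# Reserve 2 GPUs for every trial, mirroring AutoNHITS(..., gpus=2).
tuner = tune.Tuner(
    tune.with_resources(train_fn, {"cpu": 4, "gpu": 2}),
    param_space={"lr": tune.loguniform(1e-4, 1e-1)},
)
results = tuner.fit()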

Versions / Dependencies

OS: Ubuntu 22.04
Python: 3.10.10
neuralforecast==1.4.0
numpy==1.23.5
pandas==1.5.2
ray==2.2.0

Reproduction script

Here is a reproducible example:

from neuralforecast import NeuralForecast
from neuralforecast.auto import AutoNHITS
import numpy as np
import pandas as pd
import ray

# Build a dummy daily series (2018-01-01 through 2022-01-02, 1463 points) of uniform noise.
df_t = pd.DataFrame(columns=['unique_id', 'ds', 'y'])
df_t['ds'] = pd.date_range('2018-01-01', '2022-01-02', freq='D')
df_t['unique_id'] = 'series_1'
rng = np.random.default_rng(seed=6)
df_t['y'] = rng.uniform(low=0, high=1, size=1463)

# AutoNHITS runs its hyperparameter search through Ray Tune; the failure occurs with gpus=2.
hrz = 365
models = [
    AutoNHITS(h=hrz, gpus=2)
]

nforecast = NeuralForecast(models=models, freq='D')
nforecast.fit(df=df_t)
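
For comparison (the expected behavior above), the run completes without error when the model requests a single GPU; only the model definition changes:

models = [
    AutoNHITS(h=hrz, gpus=1)
]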

Issue Severity

High: It blocks me from completing my task.

Metadata

Assignees

No one assigned

    Labels

    P2 (Important issue, but not time-critical), bug (Something that is supposed to be working, but isn't), pending-cleanup (This issue is pending cleanup; it will be removed in 2 weeks after being assigned), tune (Tune-related issues)
