[Q]: UsageError: Unable to attach to run ... · Issue #9948 · wandb/wandb · GitHub

[Q]: UsageError: Unable to attach to run ... #9948


Open
Sascha-Roe opened this issue Jun 1, 2025 · 2 comments
Labels
ty:question type of issue is a question

Comments

@Sascha-Roe
Sascha-Roe commented Jun 1, 2025

Hey everyone,

I have trained a TFT model using WandB, which worked just fine. But when I try to predict using the trained model, I get this error:

WandbAttachFailedError: Failed to attach because the run does not belong to the current service process, or because the service process is busy (unlikely)
UsageError: Unable to attach to run g43vyyi7

Has anyone encountered a similar error or knows how to fix this?
I am using wandb 0.19.11 with Python 3.12.10.

A small example of how I try to make predictions:

from darts.models import TFTModel

model_best = TFTModel.load_from_checkpoint(work_dir=work_dir, model_name=model_name, best=True)

I then prepare my data and make the predictions using:

pred_series = model_best.predict(n=pred_size,
                        series=ts_ttest_temp[pred_idxs[0]:pred_idxs[1]],
                        future_covariates= tcox_test_future[pred_idxs[0]:pred_idxs[3]],
                        past_covariates=tcov_test[pred_idxs[0]:pred_idxs[1]],
                        num_samples=1,   
                        n_jobs=-1)

The entire traceback looks as follows:

---------------------------------------------------------------------------
WandbAttachFailedError                    Traceback (most recent call last)
File /srv/jupyterhub/lib/python3.12/site-packages/wandb/sdk/wandb_init.py:1186, in _attach(attach_id, run_id, run)
   1185 try:
-> 1186     attach_settings = service.inform_attach(attach_id=attach_id)
   1187 except Exception as e:

File /srv/jupyterhub/lib/python3.12/site-packages/wandb/sdk/lib/service_connection.py:182, in ServiceConnection.inform_attach(self, attach_id)
    181 except TimeoutError:
--> 182     raise WandbAttachFailedError(
    183         "Failed to attach because the run does not belong to"
    184         " the current service process, or because the service"
    185         " process is busy (unlikely)."
    186     ) from None

WandbAttachFailedError: Failed to attach because the run does not belong to the current service process, or because the service process is busy (unlikely).

The above exception was the direct cause of the following exception:

UsageError                                Traceback (most recent call last)
Cell In[19], line 28
     25 print('pred_idx: ',pred_idxs)
     26 #print(tcov_test.start_time())
     27 #print(tcov_test.end_time())
---> 28 pred_t = evalTFT.pred_multi(model_best, pred_size, pred_idxs, ts_ttest_temp, tcov_test, tcov_test_future)
     29 print("PREDICTED!")
     30 print(pred_t.start_time().weekday())

File ~/Documents/Code/Giaco/Evaluation/evalTFThelper.py:63, in pred_multi(model, pred_size, pred_idxs, ts_ttest_temp, tcov_test, tcox_test_future)
     62 def pred_multi(model, pred_size, pred_idxs, ts_ttest_temp, tcov_test, tcox_test_future):
---> 63     pred_series = model.predict(n=pred_size,
     64                             series=ts_ttest_temp[pred_idxs[0]:pred_idxs[1]],
     65                             future_covariates= tcox_test_future[pred_idxs[0]:pred_idxs[3]],
     66                             past_covariates=tcov_test[pred_idxs[0]:pred_idxs[1]],
     67                             num_samples=1,   
     68                             n_jobs=-1)
     69     return pred_series

File /srv/jupyterhub/lib/python3.12/site-packages/darts/utils/torch.py:80, in random_method.<locals>.decorator(self, *args, **kwargs)
     78 with fork_rng():
     79     manual_seed(self._random_instance.randint(0, high=MAX_TORCH_SEED_VALUE))
---> 80     return decorated(self, *args, **kwargs)

File /srv/jupyterhub/lib/python3.12/site-packages/darts/models/forecasting/torch_forecasting_model.py:1530, in TorchForecastingModel.predict(self, n, series, past_covariates, future_covariates, trainer, batch_size, verbose, n_jobs, roll_size, num_samples, dataloader_kwargs, mc_dropout, predict_likelihood_parameters, show_warnings)
   1511 super().predict(
   1512     n,
   1513     series,
   (...)   1518     show_warnings=show_warnings,
   1519 )
   1521 dataset = self._build_inference_dataset(
   1522     target=series,
   1523     n=n,
   (...)   1527     bounds=None,
   1528 )
-> 1530 predictions = self.predict_from_dataset(
   1531     n,
   1532     dataset,
   1533     trainer=trainer,
   1534     verbose=verbose,
   1535     batch_size=batch_size,
   1536     n_jobs=n_jobs,
   1537     roll_size=roll_size,
   1538     num_samples=num_samples,
   1539     dataloader_kwargs=dataloader_kwargs,
   1540     mc_dropout=mc_dropout,
   1541     predict_likelihood_parameters=predict_likelihood_parameters,
   1542 )
   1544 return predictions[0] if called_with_single_series else predictions

File /srv/jupyterhub/lib/python3.12/site-packages/darts/utils/torch.py:80, in random_method.<locals>.decorator(self, *args, **kwargs)
     78 with fork_rng():
     79     manual_seed(self._random_instance.randint(0, high=MAX_TORCH_SEED_VALUE))
---> 80     return decorated(self, *args, **kwargs)

File /srv/jupyterhub/lib/python3.12/site-packages/darts/models/forecasting/torch_forecasting_model.py:1679, in TorchForecastingModel.predict_from_dataset(self, n, input_series_dataset, trainer, batch_size, verbose, n_jobs, roll_size, num_samples, dataloader_kwargs, mc_dropout, predict_likelihood_parameters)
   1674 self.trainer = self._setup_trainer(
   1675     trainer=trainer, model=self.model, verbose=verbose, epochs=self.n_epochs
   1676 )
   1678 # prediction output comes as nested list: list of predicted `TimeSeries` for each batch.
-> 1679 predictions = self.trainer.predict(model=self.model, dataloaders=pred_loader)
   1680 # flatten and return
   1681 return [ts for batch in predictions for ts in batch]

File /srv/jupyterhub/lib/python3.12/site-packages/pytorch_lightning/trainer/trainer.py:887, in Trainer.predict(self, model, dataloaders, datamodule, return_predictions, ckpt_path)
    885 self.state.status = TrainerStatus.RUNNING
    886 self.predicting = True
--> 887 return call._call_and_handle_interrupt(
    888     self, self._predict_impl, model, dataloaders, datamodule, return_predictions, ckpt_path
    889 )

File /srv/jupyterhub/lib/python3.12/site-packages/pytorch_lightning/trainer/call.py:48, in _call_and_handle_interrupt(trainer, trainer_fn, *args, **kwargs)
     46     if trainer.strategy.launcher is not None:
     47         return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
---> 48     return trainer_fn(*args, **kwargs)
     50 except _TunerExitException:
     51     _call_teardown_hook(trainer)

File /srv/jupyterhub/lib/python3.12/site-packages/pytorch_lightning/trainer/trainer.py:928, in Trainer._predict_impl(self, model, dataloaders, datamodule, return_predictions, ckpt_path)
    924     download_model_from_registry(ckpt_path, self)
    925 ckpt_path = self._checkpoint_connector._select_ckpt_path(
    926     self.state.fn, ckpt_path, model_provided=model_provided, model_connected=self.lightning_module is not None
    927 )
--> 928 results = self._run(model, ckpt_path=ckpt_path)
    930 assert self.state.stopped
    931 self.predicting = False

File /srv/jupyterhub/lib/python3.12/site-packages/pytorch_lightning/trainer/trainer.py:974, in Trainer._run(self, model, ckpt_path)
    971 log.debug(f"{self.__class__.__name__}: preparing data")
    972 self._data_connector.prepare_data()
--> 974 call._call_setup_hook(self)  # allow user to set up LightningModule in accelerator environment
    975 log.debug(f"{self.__class__.__name__}: configuring model")
    976 call._call_configure_model(self)

File /srv/jupyterhub/lib/python3.12/site-packages/pytorch_lightning/trainer/call.py:101, in _call_setup_hook(trainer)
     99 # Trigger lazy creation of experiment in loggers so loggers have their metadata available
    100 for logger in loggers:
--> 101     if hasattr(logger, "experiment"):
    102         _ = logger.experiment
    104 trainer.strategy.barrier("pre_setup")

File /srv/jupyterhub/lib/python3.12/site-packages/lightning_fabric/loggers/logger.py:118, in rank_zero_experiment.<locals>.experiment(self)
    116 if rank_zero_only.rank > 0:
    117     return _DummyExperiment()
--> 118 return fn(self)

File /srv/jupyterhub/lib/python3.12/site-packages/pytorch_lightning/loggers/wandb.py:404, in WandbLogger.experiment(self)
    401     self._experiment = wandb.run
    402 elif attach_id is not None and hasattr(wandb, "_attach"):
    403     # attach to wandb process referenced
--> 404     self._experiment = wandb._attach(attach_id)
    405 else:
    406     # create new wandb process
    407     self._experiment = wandb.init(**self._wandb_init)

File /srv/jupyterhub/lib/python3.12/site-packages/wandb/sdk/wandb_init.py:1188, in _attach(attach_id, run_id, run)
   1186     attach_settings = service.inform_attach(attach_id=attach_id)
   1187 except Exception as e:
-> 1188     raise UsageError(f"Unable to attach to run {attach_id}") from e
   1190 settings: Settings = copy.copy(_wl._settings)
   1192 settings.update_from_dict(
   1193     {
   1194         "run_id": attach_id,
   (...)   1197     }
   1198 )

UsageError: Unable to attach to run g43vyyi7
@Sascha-Roe Sascha-Roe added the ty:question type of issue is a question label Jun 1, 2025

Thomas Drayton commented:
Hi @Sascha-Roe,

Thanks for reaching out, and for the detail you've provided on this issue.

Based on the traceback, it looks like our service is trying to re-attach to run g43vyyi7 but can't, because the run it is attempting to connect to doesn't match the current workspace/run ID.

Would you mind sharing:

  • How the original training run was configured? A link to the run in your workspace would be great, i.e. the exact run ID and workspace (project/entity) you're trying to attach to.
  • The run ID you are passing to your prediction script.
  • The PyTorch Lightning and Darts versions you used.
  • A minimal working example of your prediction script, if possible.
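In the meantime, one thing you could try — this is only a sketch, and it assumes nothing in your prediction pipeline needs a live wandb run — is to disable wandb in the prediction process before it is first imported, so the restored WandbLogger never attempts to attach to the finished training run:

```python
import os

# Hedged workaround sketch: set this before wandb is imported or initialized
# anywhere in the prediction process. In "disabled" mode, wandb.init() returns
# a no-op run, so no attach to the old service process is attempted.
os.environ["WANDB_MODE"] = "disabled"

print(os.environ["WANDB_MODE"])  # → disabled
```

Alternatively, since the traceback shows that `TorchForecastingModel.predict` accepts a `trainer` argument, passing a freshly constructed `pytorch_lightning.Trainer(logger=False)` via `model_best.predict(..., trainer=...)` should also skip the logger re-attach, though I haven't verified that against your exact setup.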

Thanks in advance!

Best,
Thomas

@Sascha-Roe

Hey Thomas,

I am using Darts 0.35.0 and PyTorch Lightning 2.5.1.post0.
The workspace is private, but the run that I'm trying to attach to is visible in my workspace.
In my prediction script I load the model using:

model_best = model.load_from_checkpoint(work_dir=work_dir, model_name=model_name, best=True)

where

work_dir = "./models/first_runs/"
model_name = "warm-waterfall-26"

The warm-waterfall-26 model is actually located in ./models/first_runs/
A minimal example for the prediction can be found in the original post.

The training is configured as follows:

First, all arguments are read in; then the training is started using:

run_name = wandb_go(args)
SAVE = '/models/first_runs/' + run_name +'.pth.tar'
model = define_model(args, run_name)

model.fit(ts_ttrain_list,
          future_covariates=[tcov_train_future] * num_knoten,
          past_covariates=[tcov_train] * num_knoten,
          verbose=True,
          val_series=ts_ttest_list,
          val_future_covariates=[tcov_test_future] * num_knoten,
          val_past_covariates=[tcov_test] * num_knoten
          )
wandb.finish()

def wandb_go(args):
    '''Start wandb session with parameters'''
    wandb.init(project=args.project_name, entity="MY_ENTITY", sync_tensorboard=True, config=args)
    name = wandb.run.name
    print("Name of run for wandb: ", name)
    return name

def define_model(args, model_name):
    wandb_logger = WandbLogger() 
    lr_monitor = LearningRateMonitor(logging_interval='step')
    n_categories = 70  # how many nodes exist
    embedding_size = 70  # embed the categorical variable into a numeric vector of this size
    categorical_embedding_sizes = {"Knoten": (n_categories, embedding_size)}

    model = TFTModel(input_chunk_length=args.back_window,
                output_chunk_length=args.horizon,
                hidden_size=args.hidden,
                lstm_layers=args.lstm_layers,
                num_attention_heads=args.att_heads,
                full_attention=args.full_att,
                dropout=args.dropout,
                batch_size=args.batch_size,
                n_epochs=args.epochs,
                likelihood=args.likelihood, 
                loss_fn=args.loss,
                lr_scheduler_cls=args.decay_lr_class,
                lr_scheduler_kwargs={"gamma":0.1},
                random_state=args.rand, 
                force_reset=True,
                log_tensorboard=True,
                save_checkpoints=True,
                model_name=model_name,
                categorical_embedding_sizes=categorical_embedding_sizes,
                work_dir = "./models/first_runs",
                pl_trainer_kwargs={
                    "accelerator": "gpu",
                    "devices": -1, 
                    "logger":[wandb_logger],
                    "callbacks":[lr_monitor]
                }) 
    return model

I hope this helps.

Thank you for your assistance.
