Resuming grid sweep does not launch missing runs #1787

ExpectationMax · 2021-02-01T21:19:35Z

Describe the bug
After removing some runs from a completed grid sweep and resuming the sweep, no new runs are passed to the agents and the sweep returns to the completed state.

This contradicts the documentation, which states that the missing configurations are launched when the sweep is resumed: https://docs.wandb.ai/sweeps/faq#rerun-grid-search

To Reproduce
Steps to reproduce the behavior:

Create a sweep, for example:

method: grid
parameters:
  batch_size:
    values:
    - 20
program: train.py

Launch an agent for the sweep
Delete the run once it has completed
Resume the sweep
Launch a new agent to run the missing configuration
The wandb agent gives the following output:

wandb: Starting wandb agent :sleuth_or_spy:
2021-02-01 21:32:51,562 - wandb.wandb_agent - INFO - Running runs: []
2021-02-01 21:32:51,915 - wandb.wandb_agent - INFO - Agent received command: exit
2021-02-01 21:32:51,915 - wandb.wandb_agent - INFO - Received exit command. Killing runs and quitting.
wandb: Terminating and syncing runs. Press ctrl-c to kill.

The state of the sweep goes back to completed.

Expected behavior
After a run has been deleted from the sweep, the agent should run the missing configuration again.

ariG23498 · 2021-02-04T07:46:03Z

Hey @ExpectationMax
I tried reproducing the issue and these are my observations 👇

Code

I tried a very basic example to demonstrate resuming a grid sweep.
train.py

import wandb

# Set up your default hyperparameters before wandb.init
# so they get properly set in the sweep
hyperparameter_defaults = {
    'epoch':2
}

# Pass your defaults to wandb.init
wandb.init(config=hyperparameter_defaults)
config = wandb.config

# Log metrics inside your training loop
metrics = {'custom_metric': config["epoch"]/2}
wandb.log(metrics)

sweeps.yaml

program: train.py
method: grid
metric:
  name: custom_metric
  goal: minimize
parameters:
  epoch:
    values:
      - 2
      - 4

Steps

With both the train.py and sweeps.yaml at hand, I ran the sweep with the command $ wandb sweep sweeps.yaml
This gives me a command to run the sweep agent.
After I run the agent for the sweep my dashboard looks like this

Here we can see that both the runs have been logged with epoch values 2 and 4.
After the agent terminates, I do a ctrl-c in the terminal to close the sweep.
I go into the dashboard and delete the run that I want my agent to resume.

Here one can see that the first run has been deleted.
After the run has been deleted, I go into the control-page of the sweep and click on the resume button.
With the resume button clicked, I run the same sweep agent again from the terminal.
After the process terminates this is how my dashboard looks

I hope this detailed walkthrough helps you resolve the issue. Feel free to write back if it does not.

jvlmdr · 2022-06-14T07:28:44Z

When following these instructions, I did not see a "Resume" button on the sweep control page. There was only Pause / Unpause. I tried pausing and unpausing, but the agents still refused to restart the deleted jobs.

exalate-issue-sync · 2022-06-14T18:53:44Z

WandB Internal User commented:
ariG23498 commented:
(Verbatim repost of ariG23498's comment of 2021-02-04 above.)

MBakirWB · 2022-06-23T23:35:55Z

@jvlmdr, can you please provide a link to your workspace for review? Thanks.

MBakirWB · 2022-06-29T18:28:08Z

@jvlmdr, since we have not heard back from you, we are going to close this request. If you would like to re-open the conversation, please let us know!

jpcbertoldo · 2022-07-21T17:59:59Z

Question: if some runs fail or are deleted, is the (grid) sweep supposed to go through them again and relaunch?

MBakirWB · 2022-07-22T07:40:43Z

@jpcbertoldo, it depends on the search strategy. With grid search, yes: the agent will create a run for the exact same run configuration, since a grid search is an exhaustive search of your config. Bayes and random search sample from a distribution, so there is some probability of running the same config again, but for most configs that control a neural network the chance of reaching the same config is low, especially if your config defines a continuous distribution such as normal or uniform.
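
A minimal sketch of what "exhaustive" implies for a resumed grid sweep, assuming the configs of the runs still attached to the sweep are known. This is not W&B's scheduler code; `grid_configs` and `missing_configs` are hypothetical helper names, and the parameters mirror the sweeps.yaml posted earlier in this thread.

```python
from itertools import product


def grid_configs(parameters):
    """Enumerate every combination of a grid sweep's `values` lists."""
    names = sorted(parameters)
    value_lists = [parameters[name]["values"] for name in names]
    return [dict(zip(names, combo)) for combo in product(*value_lists)]


def missing_configs(parameters, existing_run_configs):
    """Return the grid combinations that no remaining run covers."""
    names = sorted(parameters)
    # Key each remaining run by its values for the swept parameters only.
    seen = {tuple(cfg.get(name) for name in names) for cfg in existing_run_configs}
    return [
        cfg
        for cfg in grid_configs(parameters)
        if tuple(cfg[name] for name in names) not in seen
    ]


# Parameters from the sweeps.yaml example (assumed here as a plain dict).
parameters = {"epoch": {"values": [2, 4]}}

# Suppose the epoch=2 run was deleted and only the epoch=4 run remains.
remaining = [{"epoch": 4}]
print(missing_configs(parameters, remaining))  # [{'epoch': 2}]
```

Under this reading, a resumed grid sweep would be expected to hand out exactly the configurations returned by `missing_configs`, and to stay complete only when that list is empty.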

jpcbertoldo · 2022-07-22T09:46:37Z

So, for the record, I think there is a bug there as well.
I launched a grid sweep and, while it was running, I deleted some runs that broke.
When it finished the last combinations, it did not go back to the missing ones.

botcs · 2023-01-18T02:37:29Z

I bumped into the same issue.

puzzler10 · 2023-05-04T03:44:02Z

yeah, this is busted. same experience

al-stev · 2023-09-18T10:44:16Z

Have tested based on original feedback (#1787 (comment)) and this is working as intended.

ExpectationMax changed the title ~~Resuming failed grid sweep does not launch missing runs~~ Resuming grid sweep does not launch missing runs Feb 1, 2021

ariG23498 closed this as completed Mar 2, 2021

ariG23498 mentioned this issue Mar 2, 2021

[Q] Mechanism for rerunning failed runs in a sweep? #1879

Open

collinmccarthy mentioned this issue Apr 22, 2022

[Feature] Resume sweeps #2692

Open

exalate-issue-sync bot reopened this Jun 14, 2022

kptkin added the c:sweeps Component: Sweeps label Mar 1, 2023

al-stev closed this as completed Sep 18, 2023

saeejithnair mentioned this issue Nov 11, 2023

[App]: Resuming a grid sweep does not always rerun deleted runs #6594

Open
