Resuming grid sweep does not launch missing runs · Issue #1787 · wandb/wandb · GitHub

Resuming grid sweep does not launch missing runs #1787


Closed
ExpectationMax opened this issue Feb 1, 2021 · 11 comments
Labels
c:sweeps Component: Sweeps


@ExpectationMax

Describe the bug
After removing some runs from a completed grid sweep and resuming the sweep, no new runs are passed to the agents and the sweep goes back into the complete state.

This is in contrast to the documentation which describes that the missing configurations will be launched when the sweep is resumed: https://docs.wandb.ai/sweeps/faq#rerun-grid-search

To Reproduce
Steps to reproduce the behavior:

  1. Create a sweep, for example:
method: grid
parameters:
  batch_size:
    values:
    - 20
program: train.py
  2. Launch an agent for the sweep.
  3. Delete the run's entry once the run has completed.
  4. Resume the sweep.
  5. Launch a new agent to run the missing configuration.
  6. The wandb agent gives the following output:
wandb: Starting wandb agent :sleuth_or_spy:
2021-02-01 21:32:51,562 - wandb.wandb_agent - INFO - Running runs: []
2021-02-01 21:32:51,915 - wandb.wandb_agent - INFO - Agent received command: exit
2021-02-01 21:32:51,915 - wandb.wandb_agent - INFO - Received exit command. Killing runs and quitting.
wandb: Terminating and syncing runs. Press ctrl-c to kill.
  7. The state of the sweep goes back to completed.

Expected behavior
After a run has been deleted from the sweep, the agent should run the missing configuration again.
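To make the expectation concrete, here is a minimal sketch in plain Python of which configurations a resumed grid sweep would be expected to launch. It only illustrates the expected behavior and is not how wandb implements sweeps; the helper names `expand_grid` and `missing_configs` are made up for this example.

```python
from itertools import product


def expand_grid(parameters):
    """Return every configuration in a grid shaped like the sweep's `parameters` block."""
    names = sorted(parameters)
    value_lists = [parameters[name]["values"] for name in names]
    return [dict(zip(names, combo)) for combo in product(*value_lists)]


def missing_configs(parameters, existing_configs):
    """Return the grid configurations that no remaining run covers."""

    def key(config):
        # Hashable, order-independent representation of a config dict.
        return tuple(sorted(config.items()))

    seen = {key(config) for config in existing_configs}
    return [config for config in expand_grid(parameters) if key(config) not in seen]


# The sweep from this report: a single grid point, batch_size = 20.
parameters = {"batch_size": {"values": [20]}}

# Once the only run is deleted, nothing covers the grid point any more.
print(missing_configs(parameters, existing_configs=[]))  # [{'batch_size': 20}]

# While the run still exists, nothing is missing.
print(missing_configs(parameters, existing_configs=[{"batch_size": 20}]))  # []
```

In the reproduction above, the first case applies: the resumed sweep would be expected to hand `batch_size: 20` to the new agent, but the agent receives an exit command instead.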

@ExpectationMax ExpectationMax changed the title from "Resuming failed grid sweep does not launch missing runs" to "Resuming grid sweep does not launch missing runs" Feb 1, 2021
@ariG23498
Contributor

Hey @ExpectationMax
I tried reproducing the issue and these are my observations 👇

Code

I used a very basic example to show how resuming a grid sweep works.
train.py

import wandb

# Set up your default hyperparameters before wandb.init
# so they get properly set in the sweep
hyperparameter_defaults = {
    'epoch': 2
}

# Pass your defaults to wandb.init
wandb.init(config=hyperparameter_defaults)
config = wandb.config

# Log metrics inside your training loop
metrics = {'custom_metric': config["epoch"]/2}
wandb.log(metrics)

sweeps.yaml

program: train.py
method: grid
metric:
  name: custom_metric
  goal: minimize
parameters:
  epoch:
    values:
      - 2
      - 4

Steps

  1. With both train.py and sweeps.yaml in place, I created the sweep with the command $ wandb sweep sweeps.yaml
  2. This prints a command for running the sweep agent.
  3. After I run the agent for the sweep, my dashboard looks like this:
    [screenshot: before_deletion]
    Here we can see that both runs have been logged, with epoch values 2 and 4.
  4. After the agent terminates, I press ctrl-c in the terminal to close the sweep.
  5. I go to the dashboard and delete the run that I want the agent to run again.
    [screenshot: after_deletion]
    Here one can see that the first run has been deleted.
  6. After the run has been deleted, I go to the control page of the sweep and click the resume button.
  7. With the sweep resumed, I run the same sweep agent command again from the terminal.
  8. After the process terminates, this is how my dashboard looks:
    [screenshot: run_the_sweep_again]

I hope this detailed walkthrough helps you resolve the issue. Feel free to write back if it does not.

@jvlmdr
jvlmdr commented Jun 14, 2022

When following these instructions, I did not see a "Resume" button on the sweep control page. There was only Pause / Unpause. I tried pausing and unpausing, but the agents still refused to restart the deleted jobs.

@exalate-issue-sync exalate-issue-sync bot reopened this Jun 14, 2022

@MBakirWB

@jvlmdr, could you please provide a link to your workspace so we can review it? Thanks.

@MBakirWB

@jvlmdr, since we have not heard back from you, we are going to close this request. If you would like to re-open the conversation, please let us know!

@jpcbertoldo

Question: if some runs fail or are deleted, is the (grid) sweep supposed to go through them again and relaunch?

@MBakirWB

@jpcbertoldo, it depends on the search strategy. With a grid search, yes: the agent will create a run for the exact same configuration, since a grid search is an exhaustive search of your config. Bayes and random search sample from a distribution, so there is some probability of running the same config again, but for most configs that control a neural network the chance of hitting the same config is low, especially if your config defines a continuous distribution such as normal or uniform.
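The difference between the strategies can be sketched in a few lines of plain Python. This is only an illustration of the sampling argument, not wandb code; the helper `grid_points`, the uniform range, and the seed are arbitrary choices made for this example.

```python
import random
from itertools import product


def grid_points(values_by_name):
    """Enumerate a grid exhaustively; the result depends only on the value lists."""
    names = sorted(values_by_name)
    combos = product(*(values_by_name[name] for name in names))
    return [dict(zip(names, combo)) for combo in combos]


# Enumerating the same grid twice yields exactly the same configurations,
# so a deleted grid point can be identified and recreated.
first_pass = grid_points({"epoch": [2, 4]})
second_pass = grid_points({"epoch": [2, 4]})
print(first_pass == second_pass)  # True

# Sampling a continuous uniform distribution is different: drawing the
# same value twice is very unlikely, so there is no fixed config to recreate.
rng = random.Random(0)  # seeded only so the sketch is repeatable
samples = [rng.uniform(1e-5, 1e-1) for _ in range(1000)]
print(len(set(samples)), "distinct values out of", len(samples))
```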

@jpcbertoldo

So, for the record, I think there is a bug there as well.
I launched a grid sweep and, while it was running, I deleted some runs that had broken.
After it finished the last combinations, it did not go back to the missing ones.
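As a rough way to check which grid points would still need a run in a case like this, here is a sketch that treats a configuration as covered only if a remaining run for it finished. Everything in it is an assumption for illustration: the helper `configs_needing_rerun` is hypothetical, the rule that only "finished" runs count is my own choice rather than documented sweep behavior, and the run objects are stand-ins with `config` and `state` attributes.

```python
from itertools import product
from types import SimpleNamespace


def configs_needing_rerun(values_by_name, runs, done_states=("finished",)):
    """Return grid configs not covered by any run whose state is in done_states."""
    names = sorted(values_by_name)
    grid = [
        dict(zip(names, combo))
        for combo in product(*(values_by_name[name] for name in names))
    ]
    # A config counts as covered only if some remaining run with a "done"
    # state used exactly these values (assumed policy, see lead-in).
    covered = {
        tuple(run.config.get(name) for name in names)
        for run in runs
        if run.state in done_states
    }
    return [cfg for cfg in grid if tuple(cfg[name] for name in names) not in covered]


# Stand-ins for run objects: one finished run and one broken run.
remaining_runs = [
    SimpleNamespace(config={"epoch": 4}, state="finished"),
    SimpleNamespace(config={"epoch": 2}, state="crashed"),
]

print(configs_needing_rerun({"epoch": [2, 4]}, remaining_runs))  # [{'epoch': 2}]
```

With the real library, objects exposing `config` and `state` could presumably be obtained from `wandb.Api().sweep("<entity>/<project>/<sweep_id>").runs`, where the path is a placeholder; that usage is not exercised in this sketch.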

@botcs
botcs commented Jan 18, 2023

I bumped into the same issue.

@kptkin kptkin added the c:sweeps Component: Sweeps label Mar 1, 2023
@puzzler10

yeah, this is busted. same experience

@al-stev al-stev closed this as completed Sep 18, 2023
@al-stev
al-stev commented Sep 18, 2023

Have tested based on original feedback (#1787 (comment)) and this is working as intended.
