[Q]: Fine-tuning best practices · Issue #9877 · wandb/wandb · GitHub

[Q]: Fine-tuning best practices #9877


Closed
echen1214 opened this issue May 19, 2025 · 2 comments
Labels
ty:question type of issue is a question

Comments

@echen1214

Ask your question

Hi all, what are some best practices for using wandb to log multiple rounds of fine-tuning? For example, let's say I have one run logged for 100 epochs. I want to spawn 3 fine-tuning runs with different hyperparameters or data that start from the last epoch of the initial run. I've used wandb.init(id=<id>, resume="must") to resume runs, but this doesn't let me create multiple fine-tuning runs.

Thanks for the help!

@echen1214 echen1214 added the ty:question label May 19, 2025
@JasonArkens17

Hi @echen1214,

Thanks for reaching out! I understand you're looking to create multiple fine-tuning runs that start from the same checkpoint of an initial run.

Based on your description, you're trying to:

  1. Complete an initial run for 100 epochs
  2. Create 3 different fine-tuning runs starting from that 100th epoch
  3. Use different hyperparameters or data for each of these new runs

The resume functionality you've been using (wandb.init(id=<id>, resume="must")) is designed to continue the exact same run, which is why it doesn't support branching into multiple new runs. What you're trying to do is actually a perfect use case for our fork feature!

The forking feature allows you to create new runs that branch off from a specific point in an existing run. This lets you explore different parameters or models without affecting the original run - exactly what you need!

Here's how you would implement it:

import wandb

# Assuming your initial run is complete:
original_run_id = "your_initial_run_id"

# Create your first fine-tuning run branching from epoch 100
fine_tune_run1 = wandb.init(
    project="your_project_name",
    fork_from=f"{original_run_id}?_step=100",  # fork from the logged W&B _step (here, the step for epoch 100)
    # Add your new hyperparameters for this specific fine-tuning run
    config={"learning_rate": 0.001, "batch_size": 32}
)

# Continue training with new hyperparameters
# ...

# Similarly for your other fine-tuning runs with different configs

Note that the forking feature is currently in private preview.

Alternative Approach
If you need an immediate solution while waiting for fork access, you could (see the sketch after this list):

  1. Save model checkpoints at epoch 100 of your initial run
  2. Start completely new runs that load this checkpoint
  3. Use W&B's Groups feature to organize these related runs

Let me know if you have any questions or if you'd like more details on either approach!

Best,
Jason

@echen1214
Copy link
Author

Hi Jason, thanks for the tip. I've implemented the alternative approach and will keep a lookout for the forking feature.
