8000 Remove offline training, refactor `train.py` and logging/checkpointing by aliberts · Pull Request #670 · huggingface/lerobot · GitHub

Remove offline training, refactor train.py and logging/checkpointing #670


Merged — 29 commits into main on Feb 11, 2025

Conversation

@aliberts (Collaborator) commented Jan 31, 2025

What this does

  • ⚠️ Removes the offline training part from the train.py script: online training will be handled by the training scripts from Port HIL SERL #644
  • As a consequence, .offline and .online are removed from TrainPipelineConfig. To set the number of training steps, simply use --steps:
python lerobot/scripts/train.py \
- --offline.steps=200000
+ --steps=200000
  • Adds wandb_utils.py and turns Logger into WandBLogger, removing responsibilities from this class so that it only manages wandb-related logic.
  • Replaces torch.save/load with safetensors.save_file/load_file for training_state serialization. We shouldn't use torch.load() for this, and in fact it breaks in torch 2.6 due to weights_only=True being the default.
/checkpoints/005000
  ├── pretrained_model
- └── training_state.pth
+ └── training_state
+     ├── optimizer_param_groups.json
+     ├── optimizer_state.safetensors
+     ├── rng_state.safetensors
+     ├── scheduler_state.json
+     └── training_step.json
  • Adds train_utils.py to handle training checkpoints logic (including training state).
  • Cleans up functions related to rng and groups them together in random_utils.py.
  • Saves the checkpoint before eval during training rather than after (safer in case eval crashes).
  • Fixes logging where displayed values would only be the last one measured instead of the average over the steps since the previous logging step.
  • Changes the policies' main forward() output format for clarity. It now returns a tuple[Tensor, dict | None] instead of just a dict, the first element being the loss:
- output_dict = policy.forward(batch)
- loss = output_dict["loss"]
+ loss, output_dict = policy.forward(batch)
loss.backward()

How it was tested

Adds the following tests:

  • tests/test_schedulers.py
  • tests/test_optimizers.py
  • tests/test_train_utils.py
  • tests/test_random_utils.py
  • tests/test_io_utils.py

How to checkout & try? (for the reviewer)

Examples:

pytest -v \
    tests/test_schedulers.py \
    tests/test_optimizers.py \
    tests/test_train_utils.py \
    tests/test_random_utils.py \
    tests/test_io_utils.py

@aliberts aliberts changed the title Update safetensors `training_state` Update training_state serialization to safetensors Jan 31, 2025
@aliberts aliberts changed the title Update training_state serialization to safetensors Refactor Logger Feb 4, 2025
@aliberts aliberts changed the title Refactor Logger Refactor train.py and logging/checkpointing Feb 8, 2025
@aliberts aliberts changed the title Refactor train.py and logging/checkpointing Remove offline training, refactor train.py and logging/checkpointing Feb 8, 2025
@aliberts aliberts added the refactor Code cleanup or restructuring without changing behavior label Feb 8, 2025
@aliberts aliberts requested a review from Cadene February 8, 2025 21:48
@aliberts aliberts marked this pull request as ready for review February 8, 2025 21:48
@Cadene (Collaborator) left a comment:
Beautiful

Could you remove all appearances of ema?
They were added by default

aliberts and others added 2 commits February 11, 2025 10:08
Co-authored-by: Remi <remi.cadene@huggingface.co>
@aliberts aliberts merged commit 90e099b into main Feb 11, 2025
7 checks passed
@aliberts aliberts deleted the user/aliberts/2025_01_31_safetensors_training_state branch February 11, 2025 09:36
aliberts added a commit that referenced this pull request Feb 12, 2025
JIy3AHKO pushed a commit to vertix/lerobot that referenced this pull request Feb 27, 2025
JIy3AHKO pushed a commit to vertix/lerobot that referenced this pull request Feb 27, 2025