Open
Description
Checkpointing in Ray Train (CheckpointConfig) currently has the following options:
num_to_keep
checkpoint_score_attribute
checkpoint_score_order
checkpoint_frequency
checkpoint_at_end
It would be highly useful to add an option to always keep the last checkpoint, in addition to the checkpoints retained by num_to_keep.
Use case
In many scenarios, it is desirable to keep the checkpoints with the best metric values. However, when training is interrupted (for example, when the only worker runs on a spot instance that gets terminated), training must be restored from the latest checkpoint, not from the best one that was saved.
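
To make the requested behavior concrete, here is a minimal sketch of the retention rule in plain Python. It is not Ray code: `CheckpointRecord`, `select_checkpoints_to_keep`, and the `keep_last` flag are hypothetical names used only for illustration, and the real implementation inside Ray Train may look quite different. The idea is to rank checkpoints by score as `num_to_keep` does today, then add the most recently saved checkpoint to the kept set.

```python
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass(frozen=True)
class CheckpointRecord:
    """Hypothetical stand-in for a saved checkpoint (not a Ray class)."""
    index: int    # save order; a higher index means a more recent checkpoint
    score: float  # value of the metric named by checkpoint_score_attribute


def select_checkpoints_to_keep(
    records: List[CheckpointRecord],
    num_to_keep: Optional[int],
    score_order: str = "max",
    keep_last: bool = True,  # hypothetical option proposed in this issue
) -> Set[int]:
    """Return the indices of the checkpoints that survive pruning."""
    if score_order not in ("max", "min"):
        raise ValueError("score_order must be 'max' or 'min'")
    if not records:
        return set()
    if num_to_keep is None:
        # No limit configured: nothing is pruned.
        return {r.index for r in records}

    # Best-scoring checkpoints first.
    ranked = sorted(records, key=lambda r: r.score, reverse=(score_order == "max"))
    kept = {r.index for r in ranked[:num_to_keep]}

    if keep_last:
        # Always retain the most recent checkpoint so an interrupted run
        # can resume from where it stopped, even if its score is poor.
        kept.add(max(r.index for r in records))
    return kept


records = [
    CheckpointRecord(index=0, score=0.71),
    CheckpointRecord(index=1, score=0.90),
    CheckpointRecord(index=2, score=0.85),
    CheckpointRecord(index=3, score=0.60),  # latest, but not among the best
]
best_only = select_checkpoints_to_keep(records, num_to_keep=2, keep_last=False)
best_plus_last = select_checkpoints_to_keep(records, num_to_keep=2, keep_last=True)
```

With `keep_last=False` the latest checkpoint (index 3) is pruned because it is not among the two best, which is the situation described above. With `keep_last=True` it is retained alongside the two best, so at most `num_to_keep + 1` checkpoints are stored.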