FileNotFoundError during checkpoint saving in nemo_model_checkpoint.py #13581
Comments
@mujhenahiata Could you share more detailed reproduction steps, including the configuration (excluding the dataset)? That would be helpful.
Following are the hardware and reproduction steps:
Trained the model using the speech_to_text_ctc_bpe.py script.
You provided only the config, not the full reproduction steps.
This usually happens when your script doesn't have permission to create the file or its parent folder. Would appreciate:
When training a speech-to-text model using NeMo and PyTorch Lightning, the training crashes during the validation phase due to a FileNotFoundError while attempting to remove an older .nemo checkpoint file.
Environment:
NeMo version: (2.2.0)
Python version: 3.10
Reproduction Steps:
nemo/utils/callbacks
File "nemo_model_checkpoint.py", line 246, in on_save_checkpoint
get_filesystem(backup_path).rm(backup_path)
@titu1994
I have just updated it. There seems to be a race condition in how DDP and the global ranks interact: it works as expected on a single GPU, but crashes on a cluster.
Should I raise a PR?
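For illustration, here is a minimal sketch of one way to make the delete tolerant of that kind of race. It is not NeMo's actual code or fix. The helper name `remove_backup` and its `global_rank` parameter are hypothetical, and it uses the standard library's `os.remove` in place of the filesystem object returned by `get_filesystem` in the traceback. The idea is that only rank 0 deletes the old backup, and a file that is already gone is treated as a no-op rather than an error.

```python
import os
from contextlib import suppress


def remove_backup(backup_path: str, global_rank: int) -> bool:
    """Delete backup_path on rank 0 only; return True if a file was removed.

    Hypothetical helper for illustration, not part of NeMo.
    """
    if global_rank != 0:
        # Other ranks skip the delete so only one process touches the file.
        return False
    with suppress(FileNotFoundError):
        # If another process already removed the file, swallow the error.
        os.remove(backup_path)
        return True
    return False
```

In a real callback the rank would presumably come from the trainer, and the other ranks may still need a barrier before they rely on the file being gone. Both points are assumptions that should be checked against the actual checkpoint code.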