Can't unshard/convert trained checkpoint sharded with torch_new sharded checkpointer

I'm trying to unshard and load some trained OLMo2 base 13B checkpoints that were saved with the torch_new sharded checkpointer in the following format:

step100/
├── config.yaml                 # Model configuration file
├── train/                      # Directory containing rank files
│   ├── rank0.pt                # Rank 0 checkpoint file
│   ├── rank1.pt                # Rank 1 checkpoint file
│   └── ...                     # Additional rank files
└── model_and_optim/            # Directory containing distributed checkpoint files
    ├── .metadata
    ├── __0_0.distcp            # Distributed checkpoint file
    ├── __0_1.distcp            # Distributed checkpoint file
    └── ...                     # Additional distcp files

I tried using scripts/unshard.py, but it appears to be incompatible with checkpoints of type torch_new: the checkpointer for that type doesn't have an unshard_checkpoint function.
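
A possible workaround, sketched under assumptions rather than confirmed against the OLMo codebase: the model_and_optim/ directory looks like a standard PyTorch distributed checkpoint (a .metadata file plus .distcp shards), so PyTorch's own dcp_to_torch_save utility (available in recent PyTorch releases) may be able to consolidate it into a single file. The helper names find_distcp_dir and unshard_to_single_file are hypothetical, and I have not verified what key layout the resulting state dict would have or whether it loads directly into the OLMo model.

```python
from pathlib import Path


def find_distcp_dir(step_dir):
    """Check that step_dir/model_and_optim looks like a distributed checkpoint.

    Returns the directory and the sorted list of .distcp shard files.
    """
    dcp_dir = Path(step_dir) / "model_and_optim"
    if not (dcp_dir / ".metadata").is_file():
        raise FileNotFoundError(f"no .metadata file in {dcp_dir}")
    shards = sorted(dcp_dir.glob("*.distcp"))
    if not shards:
        raise FileNotFoundError(f"no .distcp shards in {dcp_dir}")
    return dcp_dir, shards


def unshard_to_single_file(step_dir, out_path):
    """Hypothetical helper: consolidate the shards into one torch.save file.

    Assumption: the checkpoint was written in the standard PyTorch
    distributed checkpoint format, so the stock converter can read it.
    """
    # Imported lazily so the layout check above works without PyTorch installed.
    from torch.distributed.checkpoint.format_utils import dcp_to_torch_save

    dcp_dir, _ = find_distcp_dir(step_dir)
    # Loads the entire checkpoint (everything stored in model_and_optim/)
    # in a single process, so it needs enough host memory to hold it all.
    dcp_to_torch_save(str(dcp_dir), str(out_path))
```

Usage would be something like unshard_to_single_file("step100", "step100_unsharded.pt"). For a 13B model with optimizer state the consolidated file will be large, and the output would still need to be mapped onto whatever format the rest of the tooling expects.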
