First, thank you for this great work and for making it available to the larger robotics community!
I may be missing a detail in `RLDSDataset`, but I believe there is a bug in the data sampling for multi-GPU training jobs launched with `torchrun`.
The dataloader is initialized in the `finetune.py` script as follows:

```python
dataloader = DataLoader(
    train_dataset,
    batch_size=cfg.batch_size,
    sampler=None,
    collate_fn=collator,
    num_workers=0,  # Important: Set to 0 if using RLDS, which uses its own parallelism
)
```
With no sampler specified, the `DataLoader` will default to a `SequentialSampler`. Under `torchrun`, each GPU gets its own copy of this dataloader and iterates over the same indices.
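A quick way to confirm this would be something along these lines (just a sketch; the `input_ids` key is an assumption about what the collator returns, and it assumes the training script has already initialized `torch.distributed`):

```python
# Sanity-check sketch: run under torchrun and compare what each rank pulls
# from its dataloader on the first step.
import torch.distributed as dist

def check_first_batch(dataloader):
    batch = next(iter(dataloader))
    # "input_ids" is an assumed key in the collated batch
    digest = hash(batch["input_ids"].cpu().numpy().tobytes())
    print(f"[rank {dist.get_rank()}] first-batch digest: {digest}")
    # Identical digests across ranks would mean every GPU is training on
    # the same samples at each step.
```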
However, I do not see a seed being set during finetuning, so is the assumption that the shuffling performed inside `make_interleaved_dataset()` will inherently randomize the samples drawn on each GPU?
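For reference, if `RLDSDataset` is treated as a map-style dataset (which the `SequentialSampler` default suggests), the per-rank sharding I would have expected looks roughly like the sketch below. This is only an illustration, not a proposed patch:

```python
# Sketch only: standard per-rank sharding for a map-style dataset.
# DistributedSampler reads rank/world size from the initialized process group.
from torch.utils.data import DataLoader, DistributedSampler

sampler = DistributedSampler(
    train_dataset,
    shuffle=True,
    seed=0,  # must be identical on every rank so the shards stay disjoint
)
dataloader = DataLoader(
    train_dataset,
    batch_size=cfg.batch_size,
    sampler=sampler,
    collate_fn=collator,
    num_workers=0,  # per the RLDS note above
)
# plus sampler.set_epoch(epoch) at the start of each epoch to reshuffle
```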