First, thank you for this great work and for making it available to the larger robotics community!
I may be missing a detail in `RLDSDataset`, but I believe there is a bug in the data sampling for multi-GPU training jobs launched with `torchrun`.
The dataloader is initialized in the `finetune.py` script as follows:

```python
dataloader = DataLoader(
    train_dataset,
    batch_size=cfg.batch_size,
    sampler=None,
    collate_fn=collator,
    num_workers=0,  # Important: Set to 0 if using RLDS, which uses its own parallelism
)
```
With no sampler specified, the `DataLoader` will default to a `SequentialSampler`. Under `torchrun`, each GPU gets its own copy of this dataloader and iterates over the same indices.
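A quick way to confirm this would be something along these lines (just a sketch; the `input_ids` key is an assumption about what the collator returns, and it assumes the training script has already initialized `torch.distributed`):

```python
# Sanity-check sketch: run under torchrun and compare what each rank pulls
# from its dataloader on the first step.
import torch.distributed as dist

def check_first_batch(dataloader):
    batch = next(iter(dataloader))
    # "input_ids" is an assumed key in the collated batch
    digest = hash(batch["input_ids"].cpu().numpy().tobytes())
    print(f"[rank {dist.get_rank()}] first-batch digest: {digest}")
    # Identical digests across ranks would mean every GPU is training on
    # the same samples at each step.
```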
However, I do not see a seed being set during finetuning, so is the assumption that the shuffling performed inside `make_interleaved_dataset()` will inherently randomize the samples drawn on each GPU?
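For reference, if `RLDSDataset` is treated as a map-style dataset (which the `SequentialSampler` default suggests), the per-rank sharding I would have expected looks roughly like the sketch below. This is only an illustration, not a proposed patch:

```python
# Sketch only: standard per-rank sharding for a map-style dataset.
# DistributedSampler reads rank/world size from the initialized process group.
from torch.utils.data import DataLoader, DistributedSampler

sampler = DistributedSampler(
    train_dataset,
    shuffle=True,
    seed=0,  # must be identical on every rank so the shards stay disjoint
)
dataloader = DataLoader(
    train_dataset,
    batch_size=cfg.batch_size,
    sampler=sampler,
    collate_fn=collator,
    num_workers=0,  # per the RLDS note above
)
# plus sampler.set_epoch(epoch) at the start of each epoch to reshuffle
```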