Multi-GPU Dataloader sampling · Issue #51 · moojink/openvla-oft · GitHub
Multi-GPU Dataloader sampling #51
Open
@NigelNelson

Description

First, thank you for this great work and for making it available to the larger robotics community!

I may be missing a detail in the RLDSDataset, but I believe there is a bug in the data sampling for multi-GPU training jobs using torchrun.

The dataloader is initialized in the finetune.py script as follows:

dataloader = DataLoader(
    train_dataset,
    batch_size=cfg.batch_size,
    sampler=None,
    collate_fn=collator,
    num_workers=0,  # Important: Set to 0 if using RLDS, which uses its own parallelism
)

With no sampler specified, the DataLoader will default to a SequentialSampler. Under torchrun, each GPU process holds its own copy of this dataloader and will therefore iterate over the same indices.
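
For what it's worth, the standard PyTorch pattern for sharding a map-style dataset across torchrun ranks would be a DistributedSampler, roughly as in the sketch below. I'm not sure it applies directly here, since RLDSDataset wraps its own RLDS/tf.data pipeline (hence num_workers=0), so treat this only as an illustration of the behavior I'd expect; cfg, train_dataset, and collator are the names from the snippet above.

import torch.distributed as dist
from torch.utils.data import DataLoader, DistributedSampler

# torchrun is assumed to have initialized the process group already.
sampler = DistributedSampler(
    train_dataset,                       # map-style dataset assumed
    num_replicas=dist.get_world_size(),  # one shard per GPU process
    rank=dist.get_rank(),                # this process's shard
    shuffle=True,
    seed=42,                             # same seed on all ranks keeps shards disjoint
)
dataloader = DataLoader(
    train_dataset,
    batch_size=cfg.batch_size,
    sampler=sampler,
    collate_fn=collator,
    num_workers=0,
)
# With shuffle=True, sampler.set_epoch(epoch) should be called each epoch
# so the permutation changes while the shards stay disjoint.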

However, I do not see a seed being set during finetuning. Is the assumption that the shuffling in the make_interleaved_dataset() function will inherently randomize which indices each GPU samples?
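
In case it helps frame the question: a per-rank seed is the usual way to make each process's shuffle diverge. The helper below is only a sketch (the function name is mine, and whether tf.random.set_seed actually reaches the shuffle ops inside make_interleaved_dataset depends on how that pipeline constructs its seeds).

import random
import numpy as np
import tensorflow as tf
import torch
import torch.distributed as dist

def seed_everything_per_rank(base_seed: int = 42) -> int:
    # Offset the base seed by the torchrun rank so each GPU process
    # seeds its RNGs (and, hopefully, its RLDS shuffle) differently.
    rank = dist.get_rank() if dist.is_initialized() else 0
    seed = base_seed + rank
    random.seed(seed)
    np.random.seed(seed)
    tf.random.set_seed(seed)
    torch.manual_seed(seed)
    return seed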
