Improved Dataset Merge Script for Multiple Dataset Processing by TangGuohh · Pull Request #924 · huggingface/lerobot · GitHub

Improved Dataset Merge Script for Multiple Dataset Processing #924

Open · wants to merge 14 commits into main

Conversation

@TangGuohh commented Apr 1, 2025

This PR improves the merge.py script so it can robustly merge multiple datasets with different dimensions, tasks, and indices.

Key Improvements:

  1. Fixed Task Index Mapping: Implemented proper task_index mapping across merged datasets to maintain consistent task references (see the sketch below).

  2. Consistent Frame Indexing: Added continuous global frame indexing calculation to ensure frames are properly ordered without overlaps.

  3. Configurable Parameters: Replaced hardcoded values with configurable variables; vector dimensions are now configurable (default 18, custom values supported).

  4. Enhanced Error Reporting: Added detailed logging for failed files with specific error information to aid troubleshooting.

  5. Video Statistics: Fixed total_videos counting and metadata updates.

  6. Bilingual Documentation: Added comprehensive Chinese-English bilingual comments and function documentation to improve international usability.

This PR addresses the issue "How to merge/convert/edit datasets or train models with different datasets?" by providing robust tooling for combining datasets with different structures while maintaining data integrity.
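
To illustrate the task-index remapping and continuous global frame indexing described above, here is a minimal sketch. The helper name and the assumed data layout (a per-dataset task list plus a list of frame dicts) are illustrative only, not the actual merge.py internals:

```python
def remap_tasks_and_indices(datasets):
    """Sketch: build a global task table and continuous frame indices across datasets.

    `datasets` is assumed to be a list of dicts like
    {"tasks": ["pick cube", ...], "frames": [{"task_index": 0, ...}, ...]}.
    """
    global_tasks = {}     # task string -> new global task_index
    global_frame_idx = 0  # continuous index across all merged datasets
    merged_frames = []

    for ds in datasets:
        # Map each local task_index to a global one, deduplicating by task string.
        local_to_global = {}
        for local_idx, task in enumerate(ds["tasks"]):
            if task not in global_tasks:
                global_tasks[task] = len(global_tasks)
            local_to_global[local_idx] = global_tasks[task]

        for frame in ds["frames"]:
            frame = dict(frame)
            frame["task_index"] = local_to_global[frame["task_index"]]
            frame["index"] = global_frame_idx
            global_frame_idx += 1
            merged_frames.append(frame)

    return global_tasks, merged_frames
```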

Usage example:

```python
# Define source folders and output folder
source_folders = [
    "/path/to/dataset1/",
    "/path/to/dataset2/",
    "/path/to/dataset3/",
]

output_folder = "/path/to/merged_dataset/"

# Merge the datasets with custom parameters
merge_datasets(
    source_folders,
    output_folder,
    max_dim=32,      # Maximum dimension for observation.state and action
    default_fps=20,  # Default FPS if not specified in the datasets
)
```

Issues Addressed:
Fixes #847 "Is there a way Merge | Convert | Edit datasets function or a way how we can train model using different datasets?"
Addresses #912
Addresses #596 "How to achieve multiple tasks on the basis of LeRobot?"
Fixes #542 "Can we get MultiLeRobotDataset to support multiple LeRobotDataset with different schemas?"
Fixes #533 "How to merge multiple recorded datasets?"

TangGuohh and others added 3 commits April 1, 2025 15:06
This PR addresses issues around merging, converting, and editing datasets. The improved merge.py script provides robust functionality for combining multiple datasets with different dimensions, tasks, and indices.
Key Improvements:
1. Multi-dataset Merging: Fixed the logic for merging datasets from different sources while preserving data integrity and continuity.
2. Dimension Handling: Added dynamic dimension detection and padding so that all observation and action vectors are consistently sized. The script supports a configurable maximum dimension (default 18, can be overridden).
3. Index Consistency: Implemented continuous global frame indexing to avoid overlapping or missing indices after merging.
4. Task Mapping: Fixed task_index updates to ensure proper mapping across merged datasets with different task descriptions.
5. FPS Consistency: Added checks to ensure consistent FPS across datasets, with a configurable default value.
6. Directory Structure: Improved output directory organization using a chunk-based structure for better scalability.
7. Error Logging: Enhanced error reporting for failed files to aid debugging.
Usage Example:

```python
# Define source folders and output folder
source_folders = [
    "/path/to/dataset1/",
    "/path/to/dataset2/",
    "/path/to/dataset3/",
]

output_folder = "/path/to/merged_dataset/"

# Merge the datasets with custom parameters
merge_datasets(
    source_folders,
    output_folder,
    max_dim=32,      # Maximum dimension for observation.state and action
    default_fps=20,  # Default FPS if not specified in the datasets
)
```

Bilingual Documentation: Added comprehensive Chinese-English bilingual comments and function documentation to improve international usability.
@TangGuohh (Author)

Additionally, it can automatically pad dataset dimensions to a specified size, which is very useful for aligning data when training with multiple datasets, e.g. for pi0 (see the padding sketch below).
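
A minimal sketch of what such padding might look like, assuming each frame stores observation.state and action as NumPy vectors; the function name is illustrative, not the actual merge.py implementation:

```python
import numpy as np


def pad_vector(vec: np.ndarray, max_dim: int) -> np.ndarray:
    """Zero-pad a 1-D state/action vector up to max_dim (sketch only)."""
    if vec.shape[0] > max_dim:
        raise ValueError(f"vector of size {vec.shape[0]} exceeds max_dim={max_dim}")
    padded = np.zeros(max_dim, dtype=vec.dtype)
    padded[: vec.shape[0]] = vec
    return padded


# Example: align a 7-DoF arm state with a 32-dim target layout.
state = np.random.rand(7).astype(np.float32)
print(pad_vector(state, max_dim=32).shape)  # (32,)
```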

TangGuohh and others added 3 commits April 2, 2025 13:44
This commit fixes 8 linter warnings in the merge.py file, including:
1. Added a contextlib import and used contextlib.suppress instead of the try-except-pass pattern (example below)
2. Removed unnecessary .keys() calls, iterating dictionaries directly in the Pythonic way
3. Renamed unused loop variables with an underscore prefix (idx → _idx, dirs → _dirs, folder → _folder)
4. Combined nested if statements to improve conciseness
These changes keep the same functionality while improving code quality and readability to conform to the project's coding standards.
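
For reference, the before/after of the try-except-pass replacement looks roughly like this (a generic example, not the exact merge.py code):

```python
import contextlib
import os

# Before: try-except-pass
try:
    os.remove("/tmp/stale_cache.parquet")
except FileNotFoundError:
    pass

# After: contextlib.suppress expresses the same intent in one statement
with contextlib.suppress(FileNotFoundError):
    os.remove("/tmp/stale_cache.parquet")
```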
@TangGuohh (Author) commented Apr 2, 2025

Fixed code style issues to meet linter requirements



```python
if __name__ == "__main__":
    # Define source folders and output folder
    source_folders = ["/path/to/put_plastic_to_box/", "/path/to/put_metal_to_box/", "/path/to/Find_blue_box"]
```

Please parse the args instead of hardcoding them.

@TangGuohh (Author)

I will add argument parsing instead of hardcoding; the hardcoded values were only meant as an example.

@TangGuohh (Author)

Replaced the hardcoded values with command-line arguments.

```python
output_folder = "/data1/realman/lerobot_merged_test/"

# Set default FPS
default_fps = 20
```

In my opinion, two datasets with different FPS can't be merged into one, since generating a batch sample during training depends on the timestamps.

@TangGuohh (Author)

Indeed, only datasets with the same FPS can be merged—sorry for not mentioning this earlier.
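
A minimal sketch of the kind of FPS consistency check this implies, assuming each source dataset exposes its FPS in meta/info.json (illustrative only, not necessarily the exact merge.py logic):

```python
import json
from pathlib import Path


def check_fps_consistency(source_folders: list[str]) -> int:
    """Sketch: read fps from each dataset's meta/info.json and require them to match."""
    fps_values = set()
    for folder in source_folders:
        info_path = Path(folder) / "meta" / "info.json"
        with open(info_path) as f:
            fps_values.add(json.load(f)["fps"])
    if len(fps_values) != 1:
        raise ValueError(f"Datasets have mismatched FPS values: {sorted(fps_values)}")
    return fps_values.pop()
```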

TangGuohh and others added 5 commits April 10, 2025 12:23
Modifications:
1. Introduced the argparse module to parse command-line arguments (see the CLI sketch below)
2. Converted the previously hardcoded source folder paths, output folder path, max_dim, and FPS values into configurable command-line parameters
Important Note: When merging datasets, the FPS of every collected task must be identical to ensure data synchronization and consistency.
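
A sketch of what the resulting argparse-based CLI might look like; the flag names are assumptions for illustration, not necessarily the flags the script actually exposes:

```python
import argparse


def parse_args():
    parser = argparse.ArgumentParser(description="Merge multiple LeRobot datasets into one.")
    parser.add_argument("--source-folders", nargs="+", required=True,
                        help="Paths of the datasets to merge (all must share the same FPS).")
    parser.add_argument("--output-folder", required=True,
                        help="Path where the merged dataset will be written.")
    parser.add_argument("--max-dim", type=int, default=18,
                        help="Maximum dimension for observation.state and action vectors.")
    parser.add_argument("--default-fps", type=int, default=20,
                        help="FPS to assume when a dataset does not specify one.")
    return parser.parse_args()


# Example invocation:
# python merge.py --source-folders /path/to/ds1 /path/to/ds2 --output-folder /path/to/merged --max-dim 32
```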
@AbdElrahmanMostafaRifaat1432

What will happen if I put all the episodes of different datasets in the same folder? Why would I need this feature? Can you explain, please?

@TangGuohh (Author) commented Apr 10, 2025

> What will happen if I put all the episodes of different datasets in the same folder? Why would I need this feature? Can you explain, please?

1. To save your time, I advise you not to do this. Otherwise you will get incorrect frame_index (frame index within the MP4), index (index across the entire episode), and task_index (task index) values in the parquet files, along with an unrecalculated stats.jsonl (metadata such as mean and variance). You might even fail to load the dataset, getting stuck at the "tolerance" check.

2. Using this script also allows padding collected episodes of varying dimensionality to a specified maximum dimension, which is highly flexible for later additions like dexterous hands, mobile bases, etc.

3. This script is much faster than re-adding frames and then consolidating: it can process 100+ episodes in about 5 seconds.

Drawback: it cannot merge datasets with different FPS.

@Dongx1aoYang

Hi, I found that reading observation.image from the parquet file with pd.read_parquet converts the PILImage to a dict, which causes an error during data preprocessing. I had to make the following modification to hf_transform_to_torch in lerobot/common/datasets/utils.py to resolve the issue:

[screenshot of the modified hf_transform_to_torch function]
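
Since the screenshot is not reproduced here, the following is a rough sketch of the kind of change described, assuming the parquet stores images as dicts with a "bytes" field; it is a guess at the fix, not the exact patch from the screenshot:

```python
import io

from PIL import Image


def to_pil(item):
    """Sketch: accept either a PIL image or a {'bytes': ..., 'path': ...} dict from parquet."""
    if isinstance(item, dict) and item.get("bytes") is not None:
        return Image.open(io.BytesIO(item["bytes"]))
    return item
```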

Labels: dataset, enhancement