Improved Dataset Merge Script for Multiple Dataset Processing #924
base: main
Conversation
This PR addresses the issue regarding merging, converting, and editing datasets. The improved merge.py script provides robust functionality for combining multiple datasets with different dimensions, tasks, and indices.

Key Improvements:
1. Multi-dataset Merging: Fixed the logic for merging datasets from different sources while preserving data integrity and continuity.
2. Dimension Handling: Added dynamic dimension detection and padding to ensure all observation and action vectors are consistently sized. The script now supports configurable maximum dimensions (default is 18, but this can be overridden).
3. Index Consistency: Implemented continuous global frame indexing to avoid overlapping or gapped indices after merging.
4. Task Mapping: Fixed task_index updates to ensure proper mapping across merged datasets with different task descriptions.
5. FPS Consistency: Added checks to ensure consistent FPS across datasets, with configurable default values.
6. Directory Structure: Improved output directory organization using a chunk-based structure for better scalability.
7. Error Logging: Enhanced error reporting for failed files to aid debugging.

Usage Example:

```python
# Define source folders and output folder
source_folders = [
    "/path/to/dataset1/",
    "/path/to/dataset2/",
    "/path/to/dataset3/",
]
output_folder = "/path/to/merged_dataset/"

# Merge the datasets with custom parameters
merge_datasets(
    source_folders,
    output_folder,
    max_dim=32,      # Set maximum dimension for observation.state and action
    default_fps=20,  # Set default FPS if not specified in datasets
)
```
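To make the dimension handling concrete, here is a minimal sketch of what the padding could look like; `pad_vector` is a hypothetical helper name, not necessarily the one used in merge.py:

```python
import numpy as np

def pad_vector(vec, max_dim):
    """Zero-pad a 1-D observation.state/action vector to max_dim (hypothetical helper)."""
    vec = np.asarray(vec, dtype=np.float32)
    if vec.shape[0] > max_dim:
        raise ValueError(f"Vector of size {vec.shape[0]} exceeds max_dim={max_dim}")
    padded = np.zeros(max_dim, dtype=np.float32)
    padded[: vec.shape[0]] = vec
    return padded

# Example: an 18-dim state padded to 32 dims
state = np.ones(18, dtype=np.float32)
print(pad_vector(state, 32).shape)  # (32,)
```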
Additionally, it can automatically pad dataset dimensions to a specified size, which is very useful for data alignment when training models such as pi0 on multiple datasets.
This commit fixes 8 linter warnings in the merge.py file, including:
1. Added a contextlib import and used contextlib.suppress instead of the try-except-pass pattern
2. Removed unnecessary .keys() calls, iterating dictionaries directly in the Pythonic way
3. Renamed unused loop variables with an underscore prefix (idx → _idx, dirs → _dirs, folder → _folder)
4. Combined nested if statements to improve conciseness

These changes keep the same functionality while improving code quality and readability to conform to the project's coding standards.
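For context, the suppress and dict-iteration fixes follow standard Python idioms; a small before/after sketch (not the exact merge.py code):

```python
import contextlib
import os

# Before: try-except-pass
# try:
#     os.remove("stale.lock")
# except FileNotFoundError:
#     pass

# After: contextlib.suppress expresses the same intent in one line
with contextlib.suppress(FileNotFoundError):
    os.remove("stale.lock")

tasks = {"pick the cube": 0, "place the cube": 1}

# Before: for name in tasks.keys(): ...
# After: iterate the dict directly
for name in tasks:
    print(name, tasks[name])
```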
lerobot/scripts/merge.py
```python
if __name__ == "__main__":
    # Define source folders and output folder
    source_folders = ["/path/to/put_plastic_to_box/", "/path/to/put_metal_to_box/", "/path/to/Find_blue_box"]
```
Please parse the args instead of hardcoding them.
I will add argument parsing instead of hardcoding; the hardcoded values are only used in the example.
Completed the replacement of the hardcoded values with argparse parameters.
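For reference, a minimal sketch of what the argparse interface could look like; the exact flag names below are assumptions, not necessarily those used in merge.py:

```python
import argparse

def parse_args():
    parser = argparse.ArgumentParser(description="Merge multiple LeRobot datasets.")
    parser.add_argument("--source-folders", nargs="+", required=True,
                        help="Paths of the datasets to merge")
    parser.add_argument("--output-folder", required=True,
                        help="Where the merged dataset is written")
    parser.add_argument("--max-dim", type=int, default=18,
                        help="Target dimension for observation.state and action")
    parser.add_argument("--fps", type=int, default=20,
                        help="FPS expected to be shared by all source datasets")
    return parser.parse_args()

if __name__ == "__main__":
    args = parse_args()
    # merge_datasets(args.source_folders, args.output_folder,
    #                max_dim=args.max_dim, default_fps=args.fps)
```

A hypothetical invocation would then look like `python merge.py --source-folders /path/a /path/b --output-folder /path/out --max-dim 32 --fps 20`.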
lerobot/scripts/merge.py
```python
output_folder = "/data1/realman/lerobot_merged_test/"

# Set the default FPS
default_fps = 20
```
In my opinion, two datasets with different FPS can't be merged into one, since generating a batch sample during training depends on the timestamps.
Indeed, only datasets with the same FPS can be merged—sorry for not mentioning this earlier.
Modifications:
- Introduced the argparse module to parse command-line arguments
- Converted the previously hardcoded source folder paths, output folder path, max_dim, and FPS values into configurable command-line parameters

Important Note: When merging datasets, the FPS of each collected task must be identical to ensure data synchronization and consistency.
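To make the FPS constraint concrete, here is a hedged sketch of a pre-merge check; it assumes each dataset declares its fps in meta/info.json, and the helper name is hypothetical:

```python
import json
from pathlib import Path

def check_fps_consistency(source_folders):
    """Refuse to merge datasets whose declared FPS values differ (hypothetical helper)."""
    fps_values = {}
    for folder in source_folders:
        info_path = Path(folder) / "meta" / "info.json"  # assumed dataset layout
        with open(info_path) as f:
            fps_values[folder] = json.load(f)["fps"]
    if len(set(fps_values.values())) > 1:
        raise ValueError(f"Inconsistent FPS across datasets: {fps_values}")
    return next(iter(fps_values.values()))
```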
What will happen if I put all episodes of different datasets in the same folder? Why would I need this feature? Can you explain, please?
1. To save your time, I advise you not to do this. Otherwise, you will get incorrect frame_index (frame index within the MP4), index (global index across the dataset), and task_index values in the parquet files, along with an unrecalculated stats.jsonl (which holds metadata such as mean and variance). You might even fail to load the dataset, getting stuck at the "tolerance" step.
2. Using this script also lets you pad episodes whose state/action vectors have different lengths to a specified maximum dimension, which is highly flexible for later additions like dexterous hands, mobile bases, etc.
3. This script is much faster than re-adding frames and then consolidating: it can process 100+ episodes in just 5 seconds.

Drawback: it cannot merge datasets with different FPS.
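To illustrate point 1, a minimal sketch of how per-episode indices could be recomputed during a merge, assuming pandas and the LeRobot parquet column names; the helper is hypothetical:

```python
import pandas as pd

def reindex_episode(df, episode_index, global_offset, task_remap):
    """Rewrite one episode's indices so the merged dataset stays contiguous (sketch)."""
    df = df.copy()
    df["episode_index"] = episode_index
    df["frame_index"] = range(len(df))  # position within this episode
    df["index"] = range(global_offset, global_offset + len(df))  # global position
    df["task_index"] = df["task_index"].map(task_remap)  # old id -> merged id
    return df, global_offset + len(df)
```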
This PR enhances the dataset merging functionality by improving the merge.py script with better handling of multiple datasets with different dimensions, tasks, and indices.
Key Improvements:
Fixed Task Index Mapping: Implemented proper task_index mapping across merged datasets to maintain consistent task references (see the sketch below).
Consistent Frame Indexing: Added continuous global frame indexing calculation to ensure frames are properly ordered without overlaps.
Configurable Parameters: Replaced hardcoded values with variables; vector dimensions are now configurable (default 18, custom values supported).
Enhanced Error Reporting: Added detailed logging for failed files with specific error information to aid troubleshooting.
Video Statistics: Fixed total_videos counting and metadata updates.
Bilingual Documentation: Added comprehensive Chinese-English bilingual comments and function documentation to improve international usability.
This PR addresses the issue "How to merge/convert/edit datasets or train models with different datasets?" by providing robust tooling for combining datasets with different structures while maintaining data integrity.
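To illustrate the task_index mapping mentioned above, a hedged sketch of how a unified task table could be built; build_task_remap is a hypothetical helper, not necessarily what merge.py does internally:

```python
def build_task_remap(per_dataset_tasks):
    """Merge per-dataset task lists into one table and return, per dataset,
    a mapping from the old task_index to the new global one (sketch)."""
    merged_tasks = []  # global list of unique task descriptions
    remaps = []        # one {old_index: new_index} dict per dataset
    for tasks in per_dataset_tasks:
        remap = {}
        for old_index, description in enumerate(tasks):
            if description not in merged_tasks:
                merged_tasks.append(description)
            remap[old_index] = merged_tasks.index(description)
        remaps.append(remap)
    return merged_tasks, remaps

# Example: two datasets sharing one task description
merged, remaps = build_task_remap([["pick the cube", "place the cube"],
                                   ["place the cube", "find the blue box"]])
# merged == ["pick the cube", "place the cube", "find the blue box"]
# remaps == [{0: 0, 1: 1}, {0: 1, 1: 2}]
```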
Usage Example:

```python
# Define source folders and output folder
source_folders = [
    "/path/to/dataset1/",
    "/path/to/dataset2/",
    "/path/to/dataset3/",
]
output_folder = "/path/to/merged_dataset/"

# Merge the datasets with custom parameters
merge_datasets(
    source_folders,
    output_folder,
    max_dim=32,      # Set maximum dimension for observation.state and action
    default_fps=20,  # Set default FPS if not specified in datasets
)
```
Issues Addressed:
Fixes #847: Is there a way to Merge | Convert | Edit datasets, or a way we can train a model using different datasets?
Addresses #912
Addresses #596: How to achieve multiple tasks on the basis of LeRobot?
Fixes #542: Can we get MultiLeRobotDataset to support multiple LeRobotDataset with different schemas?
Fixes #533: How to merge multiple recorded datasets?