Improved Dataset Merge Script for Multiple Dataset Processing by TangGuohh · Pull Request #924 · huggingface/lerobot · GitHub

Improved Dataset Merge Script for Multiple Dataset Processing #924

Open · wants to merge 14 commits into main

Conversation

@TangGuohh commented Apr 1, 2025

This PR improves the merge.py script so it can robustly merge multiple datasets with different dimensions, tasks, and indices.

Key Improvements:

  1. Fixed Task Index Mapping: Implemented proper task_index mapping across merged datasets to maintain consistent task references (see the sketch below).

  2. Consistent Frame Indexing: Added continuous global frame indexing calculation to ensure frames are properly ordered without overlaps.

  3. Configurable Parameters: Replaced hardcoded values with configurable variables; vector dimensions are now configurable (default 18, custom values supported).

  4. Enhanced Error Reporting: Added detailed logging for failed files with specific error information to aid troubleshooting.

  5. Video Statistics: Fixed total_videos counting and metadata updates.

  6. Bilingual Documentation: Added comprehensive Chinese-English bilingual comments and function documentation to improve international usability.

This PR addresses the issue "How to merge/convert/edit datasets or train models with different datasets?" by providing robust tooling for combining datasets with different structures while maintaining data integrity.
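
To illustrate the task-index remapping and continuous global frame indexing described above, here is a minimal sketch. The helper name and the assumed data layout (a per-dataset task list plus a list of frame dicts) are illustrative only, not the actual merge.py internals:

```python
def remap_tasks_and_indices(datasets):
    """Sketch: build a global task table and continuous frame indices across datasets.

    `datasets` is assumed to be a list of dicts like
    {"tasks": ["pick cube", ...], "frames": [{"task_index": 0, ...}, ...]}.
    """
    global_tasks = {}     # task string -> new global task_index
    global_frame_idx = 0  # continuous index across all merged datasets
    merged_frames = []

    for ds in datasets:
        # Map each local task_index to a global one, deduplicating by task string.
        local_to_global = {}
        for local_idx, task in enumerate(ds["tasks"]):
            if task not in global_tasks:
                global_tasks[task] = len(global_tasks)
            local_to_global[local_idx] = global_tasks[task]

        for frame in ds["frames"]:
            frame = dict(frame)
            frame["task_index"] = local_to_global[frame["task_index"]]
            frame["index"] = global_frame_idx
            global_frame_idx += 1
            merged_frames.append(frame)

    return global_tasks, merged_frames
```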

Usage example:

```python
# Define source folders and output folder
source_folders = [
    "/path/to/dataset1/",
    "/path/to/dataset2/",
    "/path/to/dataset3/",
]

output_folder = "/path/to/merged_dataset/"

# Merge the datasets with custom parameters
merge_datasets(
    source_folders,
    output_folder,
    max_dim=32,      # Maximum dimension for observation.state and action
    default_fps=20,  # Default FPS if not specified in the datasets
)
```

Issues Addressed:
Fixes #847 "Is there a way Merge | Convert | Edit datasets function or a way how we can train model using different datasets?"
Addresses #912
Addresses #596 "How to achieve multiple tasks on the basis of LeRobot?"
Fixes #542 "Can we get MultiLeRobotDataset to support multiple LeRobotDataset with different schemas?"
Fixes #533 "How to merge multiple recorded datasets?"

TangGuohh and others added 3 commits April 1, 2025 15:06
This PR addresses issues around merging, converting, and editing datasets. The improved merge.py script provides robust functionality for combining multiple datasets with different dimensions, tasks, and indices.
Key Improvements:
1. Multi-dataset Merging: Fixed the logic for merging datasets from different sources while preserving data integrity and continuity.
2. Dimension Handling: Added dynamic dimension detection and padding so that all observation and action vectors are consistently sized. The script supports a configurable maximum dimension (default 18, can be overridden).
3. Index Consistency: Implemented continuous global frame indexing to avoid overlapping or missing indices after merging.
4. Task Mapping: Fixed task_index updates to ensure proper mapping across merged datasets with different task descriptions.
5. FPS Consistency: Added checks to ensure consistent FPS across datasets, with a configurable default value.
6. Directory Structure: Improved output directory organization using a chunk-based structure for better scalability.
7. Error Logging: Enhanced error reporting for failed files to aid debugging.
Usage Example:

```python
# Define source folders and output folder
source_folders = [
    "/path/to/dataset1/",
    "/path/to/dataset2/",
    "/path/to/dataset3/",
]

output_folder = "/path/to/merged_dataset/"

# Merge the datasets with custom parameters
merge_datasets(
    source_folders,
    output_folder,
    max_dim=32,      # Maximum dimension for observation.state and action
    default_fps=20,  # Default FPS if not specified in the datasets
)
```

Bilingual Documentation: Added comprehensive Chinese-English bilingual comments and function documentation to improve international usability.
@TangGuohh (Author)

Additionally, it can automatically pad dataset dimensions to a specified size, which is very useful for aligning data when training with multiple datasets, e.g. for pi0 (see the padding sketch below).
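
A minimal sketch of what such padding might look like, assuming each frame stores observation.state and action as NumPy vectors; the function name is illustrative, not the actual merge.py implementation:

```python
import numpy as np


def pad_vector(vec: np.ndarray, max_dim: int) -> np.ndarray:
    """Zero-pad a 1-D state/action vector up to max_dim (sketch only)."""
    if vec.shape[0] > max_dim:
        raise ValueError(f"vector of size {vec.shape[0]} exceeds max_dim={max_dim}")
    padded = np.zeros(max_dim, dtype=vec.dtype)
    padded[: vec.shape[0]] = vec
    return padded


# Example: align a 7-DoF arm state with a 32-dim target layout.
state = np.random.rand(7).astype(np.float32)
print(pad_vector(state, max_dim=32).shape)  # (32,)
```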

TangGuohh and others added 3 commits April 2, 2025 13:44
This commit fixes 8 linter warnings in the merge.py file, including:
1. Added a contextlib import and used contextlib.suppress instead of the try-except-pass pattern (example below)
2. Removed unnecessary .keys() calls, iterating dictionaries directly in the Pythonic way
3. Renamed unused loop variables with an underscore prefix (idx → _idx, dirs → _dirs, folder → _folder)
4. Combined nested if statements to improve conciseness
These changes keep the same functionality while improving code quality and readability to conform to the project's coding standards.
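
For reference, the before/after of the try-except-pass replacement looks roughly like this (a generic example, not the exact merge.py code):

```python
import contextlib
import os

# Before: try-except-pass
try:
    os.remove("/tmp/stale_cache.parquet")
except FileNotFoundError:
    pass

# After: contextlib.suppress expresses the same intent in one statement
with contextlib.suppress(FileNotFoundError):
    os.remove("/tmp/stale_cache.parquet")
```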
@TangGuohh (Author) commented Apr 2, 2025

Fixed code style issues to meet linter requirements



```python
if __name__ == "__main__":
    # Define source folders and output folder
    source_folders = ["/path/to/put_plastic_to_box/", "/path/to/put_metal_to_box/", "/path/to/Find_blue_box"]
```

Please parse the args instead of hardcoding them.

@TangGuohh (Author)

I will add argument parsing instead of hardcoding; the hardcoded values were only meant as an example.

@TangGuohh (Author)

Replaced the hardcoded values with command-line arguments.

```python
output_folder = "/data1/realman/lerobot_merged_test/"

# Set default FPS
default_fps = 20
```

In my opinion, two datasets with different FPS can't be merged into one, since generating a batch sample during training depends on the timestamps.

@TangGuohh (Author)

Indeed, only datasets with the same FPS can be merged—sorry for not mentioning this earlier.
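
A minimal sketch of the kind of FPS consistency check this implies, assuming each source dataset exposes its FPS in meta/info.json (illustrative only, not necessarily the exact merge.py logic):

```python
import json
from pathlib import Path


def check_fps_consistency(source_folders: list[str]) -> int:
    """Sketch: read fps from each dataset's meta/info.json and require them to match."""
    fps_values = set()
    for folder in source_folders:
        info_path = Path(folder) / "meta" / "info.json"
        with open(info_path) as f:
            fps_values.add(json.load(f)["fps"])
    if len(fps_values) != 1:
        raise ValueError(f"Datasets have mismatched FPS values: {sorted(fps_values)}")
    return fps_values.pop()
```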

TangGuohh and others added 5 commits April 10, 2025 12:23
Modifications:
1. Introduced the argparse module to parse command-line arguments (see the CLI sketch below)
2. Converted the previously hardcoded source folder paths, output folder path, max_dim, and FPS values into configurable command-line parameters
Important Note: When merging datasets, the FPS of every collected task must be identical to ensure data synchronization and consistency.
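
A sketch of what the resulting argparse-based CLI might look like; the flag names are assumptions for illustration, not necessarily the flags the script actually exposes:

```python
import argparse


def parse_args():
    parser = argparse.ArgumentParser(description="Merge multiple LeRobot datasets into one.")
    parser.add_argument("--source-folders", nargs="+", required=True,
                        help="Paths of the datasets to merge (all must share the same FPS).")
    parser.add_argument("--output-folder", required=True,
                        help="Path where the merged dataset will be written.")
    parser.add_argument("--max-dim", type=int, default=18,
                        help="Maximum dimension for observation.state and action vectors.")
    parser.add_argument("--default-fps", type=int, default=20,
                        help="FPS to assume when a dataset does not specify one.")
    return parser.parse_args()


# Example invocation:
# python merge.py --source-folders /path/to/ds1 /path/to/ds2 --output-folder /path/to/merged --max-dim 32
```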
@AbdElrahmanMostafaRifaat1432

What will happen if I put all the episodes of different datasets in the same folder? Why would I need this feature? Can you explain, please?

@TangGuohh (Author) commented Apr 10, 2025

> What will happen if I put all the episodes of different datasets in the same folder? Why would I need this feature? Can you explain, please?

1. To save your time, I advise you not to do this. Otherwise you will get incorrect frame_index (frame index within the MP4), index (index across the entire episode), and task_index (task index) values in the parquet files, along with an unrecalculated stats.jsonl (metadata such as mean and variance). You might even fail to load the dataset, getting stuck at the "tolerance" check.

2. Using this script also allows padding collected episodes of varying dimensionality to a specified maximum dimension, which is highly flexible for later additions like dexterous hands, mobile bases, etc.

3. This script is much faster than re-adding frames and then consolidating: it can process 100+ episodes in about 5 seconds.

Drawback: it cannot merge datasets with different FPS.

@Dongx1aoYang

Hi, I found that reading observation.image from the parquet file with pd.read_parquet converts the PILImage to a dict, which causes an error during data preprocessing. I had to make the following modification to hf_transform_to_torch in lerobot/common/datasets/utils.py to resolve the issue:

[screenshot of the modified hf_transform_to_torch function]
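
Since the screenshot is not reproduced here, the following is a rough sketch of the kind of change described, assuming the parquet stores images as dicts with a "bytes" field; it is a guess at the fix, not the exact patch from the screenshot:

```python
import io

from PIL import Image


def to_pil(item):
    """Sketch: accept either a PIL image or a {'bytes': ..., 'path': ...} dict from parquet."""
    if isinstance(item, dict) and item.get("bytes") is not None:
        return Image.open(io.BytesIO(item["bytes"]))
    return item
```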

Labels: dataset, enhancement