feat: add remove_episodes utility #831
Conversation
@bsprenger Awesome stuff here! The logic makes sense and seems to capture all of the aspects of the dataset that need to be modified. The test also works for me, and I tried this on a local dataset of mine and got the expected results. Two comments:
Let me know your thoughts!
(force-pushed 62ada9b to 22da733)
Hey @brysonjones, thanks for the review! I have made it a CLI tool and moved it to the scripts directory. Through the CLI you can then pass the episode indices to remove. Let me know what you think!
Great work! Worked very well for me!
This worked well for me! @bsprenger This should provide a good base for expanding other capabilities going forward. @Cadene Any thoughts from you on this?
Thanks for doing this @bsprenger! Worked great for me as well!
@brysonjones @bsprenger Really cool work. I have one comment ;) Did you measure the time it takes? What I had in mind is to not modify (reindex) the remaining episode files, and instead simply filter out the removed ones. What do you think?
(force-pushed 22da733 to 426fbc4)
@Cadene Ah, now I understand the motivation for why dataset v2 split each episode up into individual files! Indeed, it's way faster to not reindex and simply filter out the removed episode files, and it should scale well for large datasets. I modified my commit accordingly.
@liuhuanjim013 Thanks for testing! Indeed, I did not automatically update the remote. I thought it would be better to intentionally push the dataset afterwards, once you are satisfied that the removal is what you wanted. I think there exists a script to do that? Remi, do you have thoughts on this?
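For illustration, the filter-instead-of-reindex idea might look like the following minimal sketch, assuming the per-episode parquet layout of dataset v2 (paths and names here are illustrative, not the PR's exact code):

from pathlib import Path

def surviving_episode_files(root: Path, removed: set[int]) -> list[Path]:
    # Keep every per-episode parquet file except the removed ones; the
    # surviving files are left untouched, so no reindexing is needed.
    keep = []
    for f in sorted(root.glob("data/chunk-*/episode_*.parquet")):
        ep_idx = int(f.stem.split("_")[-1])  # episode_000015 -> 15
        if ep_idx not in removed:
            keep.append(f)
    return keep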
@Cadene Yeah, I think this makes sense! @bsprenger @liuhuanjim013 Perhaps this could be another flag on the script to just push to the hub, similar to how the dataset collection script has that flag?
+1 to the flag idea!
(force-pushed 426fbc4 to 59b97da)
@brysonjones @liuhuanjim013 I added some flags for pushing to the hub. Let me know what you think! I also added a few lines to update the dataset card.
@bsprenger I tried it and checked the files online (I tried to remove episodes 15-17 from a dataset of 50 episodes):
Then I tried to visualize the dataset locally:
(force-pushed 59b97da to 7fe463b)
@liuhuanjim013 Thanks for catching these! I reproduced them and fixed them.
Hub issue: instead of a full sync function, I explicitly remove the target episodes' data/video files from the hub. I tend to prefer this very explicit solution to avoid the hidden behaviour that could result from a full sync. However, this means that if your local and remote folders are out of sync right now, you may need to delete the remote and re-upload the local copy once. After that, the removal script should keep them aligned.
Visualization scripts:
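For context, this kind of explicit hub-side removal can be done with huggingface_hub's delete_file; a rough sketch (the repo id and file layout are assumptions based on the dataset v2 structure, not the PR's exact code):

from huggingface_hub import HfApi

api = HfApi()
repo_id = "user/my_dataset"  # placeholder
for ep_idx in (15, 16, 17):
    # delete the episode's data file; the chunk number may vary per dataset
    api.delete_file(
        f"data/chunk-000/episode_{ep_idx:06d}.parquet",
        repo_id=repo_id,
        repo_type="dataset",
        commit_message=f"Remove episode {ep_idx}",
    )
    # the matching files under videos/chunk-000/<camera_key>/ would be
    # removed the same way, one call per camera key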
@liuhuanjim013 did you get a chance to try this? 🙂
@bsprenger Tested on a different repo and it works! So it should be good for anyone to try the latest version!
Hey @Cadene, any thoughts on this? 🙂 Any more feedback needed to get this merged?
I tested on the old (broken) dataset and did the following to fix it:
One thing I noticed using the visualization script: when using the down arrow key to go to the next episode, I get a 500 Internal Server Error when navigating from the episode (e.g. 14) just before a removed one (e.g. 15); it tries to load the removed episode (e.g. 15) and runs into the error. The main page does not show the removed episodes (e.g. 15, 16, 17, 19).
Added a fix (#943).
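For illustration, the class of fix needed here is to navigate to the next existing episode index rather than blindly to current + 1; a sketch, not the actual #943 change:

def next_existing_episode(episode_indices: list[int], current: int) -> int | None:
    # After removals the indices are sparse (e.g. [0, ..., 14, 18, 20]),
    # so "down arrow" must jump to the next surviving index, if any.
    later = [ep for ep in episode_indices if ep > current]
    return min(later) if later else None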
I tried to delete an episode (306 from Chojins/chess_game_001_blue_stereo) and it appeared to work, but visualising the dataset now shows the following error on episode 306 (presumably the old episode 307): Internal Server Error.
I tried recording another episode to see if that would fix it, but I get the following error when trying to push to the hub: Traceback (most recent call last):
Can I repair my dataset somehow?
This commit introduces a remove_episodes function/CLI tool to remove specific episodes from a dataset; it automatically modifies all required data, video, and metadata. Optionally, the script pushes the modified dataset to the hub (True by default). The function removes the episodes safely, meaning that if a failure occurs at any point during the process, the original dataset is preserved. Additionally, the original dataset is optionally backed up in case you need to revert to it.
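A minimal sketch of the safe-removal pattern described above (build the modified dataset elsewhere first, then back up and swap; the helper below is hypothetical, not the PR's actual implementation):

import shutil
from datetime import datetime
from pathlib import Path
from typing import Callable

def safely_replace_dataset(dataset_dir: Path, build_modified: Callable[[Path], None]) -> None:
    # Build the modified dataset in a temporary location first, so a
    # failure at any point leaves the original untouched.
    tmp_dir = dataset_dir.with_name(dataset_dir.name + "_tmp")
    build_modified(tmp_dir)  # raises on failure; original is preserved
    # Back up the original (timestamped) before swapping in the new copy.
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    backup_dir = dataset_dir.with_name(f"{dataset_dir.name}_backup_{stamp}")
    shutil.move(str(dataset_dir), str(backup_dir))
    shutil.move(str(tmp_dir), str(dataset_dir))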
When episodes are removed from a LeRobotDataset, the remaining episode indices are no longer sequential, which causes indexing errors in get_episode_data(). This happens because the episode_data_index tensors are always indexed sequentially, while the episode indices can be arbitrary. This commit introduces a helper function to perform the conversion.
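The conversion such a helper has to perform is essentially a sorted-position lookup; a minimal sketch (the real helper's name and signature may differ):

def episode_position(episode_indices: list[int], episode_index: int) -> int:
    # Map an arbitrary surviving episode index (e.g. 5 in [0, 4, 5, 6])
    # to its sequential position in the episode_data_index tensors (here 2).
    return sorted(episode_indices).index(episode_index)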
(force-pushed eacb1fc to 997133c)
Hi @Chojins, I assume you are using the online visualization tool. Indeed, it will show an error, because the online tool would need the changes from this PR to be merged in order to work. If you use this PR's branch and run the visualizer locally, it should work:

python lerobot/scripts/visualize_dataset_html.py --repo-id=chojins/chess_game_001_blue_stereo

When I run this, it appears that the episode was correctly removed! However, thanks to you I did realize that there was another issue with the version tag on the hub. I updated the script yesterday to fix this going forward, but you will need to repair your dataset on the hub. Here's how you can do it:

from huggingface_hub import HfApi

hub_api = HfApi()
# we have to delete the old tag, which points to an older commit of the dataset
hub_api.delete_tag("chojins/chess_game_001_blue_stereo", tag="v2.1", repo_type="dataset")
# then we create a new tag pointing to the most recent commit (i.e. the one that deleted the episodes)
hub_api.create_tag("chojins/chess_game_001_blue_stereo", tag="v2.1", revision="main", repo_type="dataset")
Hi @bsprenger, thanks very much for the great work. I ran into a new question while trying your script. If I use the remove script to delete some episodes from the head or middle of a dataset, can I still resume collecting new episodes based on this reduced dataset (using python control_robot.py --control.type=record --control.resume=true)?
I observed that deleting episodes reduces the values of meta.info.total_episodes and meta.info.total_frames, and these two parameters affect the global index values of newly added episodes when resuming recording. This can lead to index overlaps (in parquet file names, meta.episode, the "episode_index" key, and the "index" keys) with previous episodes.
For example, if I have a dataset with 10 episodes and delete episode_index=1,2,3, the dataset's meta.episode becomes [0, 4, 5, 6, 7, 8, 9] and meta.info.total_episodes=7. When a new episode is added to this reduced dataset, its episode_index=meta.info.total_episodes=7, which overlaps with the existing episode_index=7 in the dataset. This overlap shows up in the parquet file name, meta.episode, the "episode_index" key, and the "index" keys, and leads to visualization or training failures. Is there any way to resolve this overlap? I often encounter situations where I need to clean the dataset and then supplement it with additional new data.
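For what it's worth, the collision arises because the next index is derived from a count rather than a maximum. A sketch of one possible workaround, using the numbers from the example above (not part of this PR):

existing = [0, 4, 5, 6, 7, 8, 9]        # episode indices after removal
next_episode_index = max(existing) + 1  # 10: guaranteed not to collide
colliding_index = len(existing)         # 7 == total_episodes: collides with existing 7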
What this does (✨ Feature)
This commit introduces a remove_episodes function to remove specific episodes from a dataset, automatically modifying all required data, video, and metadata. The function removes the episodes safely, meaning that if a failure occurs at any point during the process, the original dataset is preserved. Additionally, even if there is no error, the original dataset is backed up locally (timestamped) in case you need to revert to it.
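For illustration only, an invocation might look like the line below (the script path and flag names are assumptions, not necessarily the PR's exact interface; check the script's --help):

python lerobot/scripts/remove_episodes.py --repo-id=lerobot/aloha_mobile_cabinet --episodes 15 16 17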
How it was tested
- Added test_remove_episodes in tests/test_datasets.py.
- Tested on the lerobot/aloha_mobile_cabinet dataset; manual inspection of the files and dataset seemed correct.
How to checkout & try? (for the reviewer)