feat: add `remove_episodes` utility by bsprenger · Pull Request #831 · huggingface/lerobot

feat: add remove_episodes utility #831


Open · bsprenger wants to merge 2 commits into main from feat/remove-episodes
Conversation

bsprenger
Contributor

What this does (✨ Feature)

This commit introduces a remove_episodes function that removes specific episodes from a dataset and automatically updates all required data, video, and metadata.

The function removes episodes safely: if a failure occurs at any point during the process, the original dataset is preserved. Additionally, even when there is no error, the original dataset is backed up locally (timestamped) in case you need to revert to it.
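
As a rough sketch, usage could look like this (the import path for remove_episodes and the exact signature here are illustrative; see the diff for the real ones):

from lerobot.common.datasets.lerobot_dataset import LeRobotDataset
from lerobot.common.datasets.utils import remove_episodes  # illustrative import path

dataset = LeRobotDataset("lerobot/aloha_mobile_cabinet")

# Drop episodes 2 and 5. Data, video, and metadata are all updated, and the
# original dataset is backed up locally (timestamped) before modification.
dataset = remove_episodes(dataset, episode_indices=[2, 5])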

How it was tested

  • Added test_remove_episodes in tests/test_datasets.py.
  • Tested manually on the lerobot/aloha_mobile_cabinet dataset; manual inspection of the files and the dataset looked correct.

How to checkout & try? (for the reviewer)

pytest -sx tests/test_datasets.py::test_remove_episodes

@brysonjones

@bsprenger Awesome stuff here!

The logic here makes sense and seems to capture all the aspects of the dataset that need to be modified.

The test also works for me, and I tried this on a local dataset of mine and got the expected results.

Two comments:

  • I think this would strongly benefit from being a CLI tool that can be called similarly to the dataset visualization tools in lerobot/scripts. This is what I was working on in my branch.
  • The backup directory that's generated doesn't appear to be ephemeral; I believe this cleanup should happen automatically (or the backup should get dumped into a hidden folder like .tmp).
    • Alternatively, the CLI tool could take an additional flag to explicitly keep the backup, with the default being not to keep it.

Let me know your thoughts!

@bsprenger force-pushed the feat/remove-episodes branch 3 times, most recently from 62ada9b to 22da733 on March 9, 2025 17:17
@bsprenger
Contributor Author

Hey @brysonjones thanks for the review!

I have made it a CLI tool and moved it to scripts.
For the backup, I added an optional argument backup which can be:

  • True: will backup to a default location
  • a string or path to backup to
  • False (default) - will not backup

Through the CLI, you could then pass --backup to back up to the default location, or --backup /some/path for somewhere specific; see the example below. The default location could definitely be up for debate (I just put it as a timestamped folder beside the original for now).
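
For illustration, an invocation might look like this (the script path and the --episodes flag are hypothetical; only --backup is described above):

python lerobot/scripts/remove_episodes.py --repo-id=lerobot/aloha_mobile_cabinet --episodes 2 5 --backup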

Let me know what you think!

@The-Michael-Chen

Great work! Worked very well for me.

@brysonjones

This worked well for me! @bsprenger

This should provide a good base for expanding other capabilities going forward.

@Cadene Any thoughts from you on this?

@liuhuanjim013
Contributor
liuhuanjim013 commented Mar 10, 2025

Thanks for doing this @bsprenger! It worked great for me as well.
I tried it and noticed a few things:

  1. I removed an episode using your script
  2. Checked locally that it was indeed removed. But the remote was not updated (by design? I don't see push_to_hub in the code).
  3. Resumed recording 2 more episodes and confirmed the files online (info.json looks correct; dataset card not updated?). When I tried to resume recording locally, I got a metadata compatibility check failure. Is it because the updated dataset wasn't pushed to the remote? (It turned out that I had forgotten to update robot_devices/robots/configs.py!)

@Cadene
Collaborator
Cadene commented Mar 10, 2025

@brysonjones @bsprenger Really cool work. I have one comment ;)

Did you measure the time it takes?

What I had in mind is to not modify the hf_dataset.
It's OK if the episode indices of a dataset are not consecutive numbers; the same goes for the task indices.
As a result, deleting an episode will scale to any dataset size.

What do you think?

@imstevenpmwork added the enhancement label (Suggestions for new features or improvements) on Mar 10, 2025
@bsprenger force-pushed the feat/remove-episodes branch from 22da733 to 426fbc4 on March 11, 2025 11:15
@bsprenger
Contributor Author

@Cadene Ah, now I understand the motivation for why the dataset v2 split each episode up into individual files!

Indeed, it's way faster to not reindex and to simply filter out the removed episode files, and it should scale well for large datasets. I modified my commit accordingly.
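
Conceptually, the new approach is just file filtering; a minimal sketch, assuming the v2 layout where each episode has its own files (e.g. data/chunk-000/episode_000015.parquet):

from pathlib import Path

def drop_episode_files(root: Path, episodes_to_remove: set[int]) -> None:
    # Sketch only: delete each dropped episode's data/video files in place,
    # rather than rewriting and reindexing the whole dataset.
    for ep in sorted(episodes_to_remove):
        for path in root.rglob(f"episode_{ep:06d}.*"):
            path.unlink()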

@liuhuanjim013 Thanks for testing! Indeed, I did not automatically update the remote. I thought it would be better to push the dataset intentionally afterwards, once you're satisfied that the removal is what you wanted. I think there exists a script to do that? Remi, do you have thoughts on this?

@brysonjones

@Cadene Yeah, I think this makes sense!

@bsprenger @liuhuanjim013 Perhaps this could be another flag on the script to just push to the hub, similar to how the dataset collection script has that flag?

@liuhuanjim013
Contributor

+1 to the flag idea!

@bsprenger force-pushed the feat/remove-episodes branch from 426fbc4 to 59b97da on March 14, 2025 15:23
@bsprenger
Contributor Author

@brysonjones @liuhuanjim013 I added some flags for pushing to the hub. Let me know what you think!

I also added a few lines to update the dataset card.

@liuhuanjim013
Contributor
liuhuanjim013 commented Mar 15, 2025

@bsprenger I tried it and checked the files online (I tried to remove episodes 15-17 from a dataset of 50 episodes):

  1. episodes.jsonl, episodes_stats.jsonl, and info.json all seem to be updated
  2. the dataset card also seems to be updated with the correct video numbers
  3. the removed videos are still in the videos folders

Then I tried to visualize the dataset locally:

  1. The total number of videos is indeed reduced (from 50 to 47)
  2. But I could still see the links to the removed episodes: when clicking on them, I got an "Internal Server Error" (I removed episodes 15-17; the links are still there)
  3. The links to the last 3 episodes (episodes 47-49) are gone

Could you take a look and see if you have the same problem?

@bsprenger force-pushed the feat/remove-episodes branch from 59b97da to 7fe463b on March 16, 2025 15:53
@bsprenger
Contributor Author

@liuhuanjim013 Thanks for catching these! I reproduced them and fixed them.

Hub Issue:
I initially overlooked that pushing doesn’t automatically remove files missing from the local dataset. This is now fixed.

Instead of a full sync function, I explicitly remove the target episodes' data/video files from the hub. I prefer this explicit approach over a full sync, which could introduce too much hidden behaviour. However, this means that if your local and remote folders are currently out of sync, you may need to delete the remote and re-upload the local copy once. After that, the removal script should keep them aligned.
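
For reference, deleting specific files from a hub repo can be done with huggingface_hub's HfApi.delete_file; a sketch with illustrative paths, assuming the per-episode file layout:

from huggingface_hub import HfApi

api = HfApi()
# Remove one deleted episode's parquet file from the dataset repo; the script
# would do this for every data/video file of each removed episode.
api.delete_file(
    path_in_repo="data/chunk-000/episode_000015.parquet",
    repo_id="lerobot/aloha_mobile_cabinet",
    repo_type="dataset",
)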

Visualization Scripts:
The issue stemmed from the scripts' assumption that episode indices are sequential, which is now fixed. I’ve updated all instances I could find, but if you spot any more, let me know!

@bsprenger
Contributor Author

@liuhuanjim013 did you get a chance to try this? 🙂

@liuhuanjim013
Contributor

@bsprenger Tested on a different repo and it works, so the latest version should be good for anyone to try!
For the one that had trouble, I'll need to look further to see how to recover it.

@bsprenger
Contributor Author

Hey @Cadene, any thoughts on this? 🙂 any more feedback to get this merged?

@liuhuanjim013
Contributor

I tested on the old (broken) dataset and did the following to fix it:

  1. removed (renamed) the dataset on Hugging Face
  2. used the latest remove-episodes script to remove a new episode (e.g. 19) and pushed to the hub

One thing I noticed using the visualization script: when using the down arrow key to go to the next episode, I would get a 500 Internal Server Error when navigating from the episode (e.g. 14) just before a removed one (e.g. 15); it tries to load the removed episode and errors out. The main page does not show the removed episodes (e.g. 15, 16, 17, 19).

@liuhuanjim013
Contributor

> One thing I noticed using the visualization script: when using the down arrow key to go to the next episode, I would get a 500 Internal Server Error […]

Added a fix (#943).

@Chojins
Chojins commented Apr 13, 2025

I tried to delete an episode (306 from Chojins/chess_game_001_blue_stereo) and it appeared to work, but now visualising the dataset shows the following error on episode 306 (presumably the old episode 307):

Internal Server Error
The server encountered an internal error and was unable to complete your request. Either the server is overloaded or there is an error in the application.

I tried recording another episode to see if that would fix it but get the following error when trying to push to the hub:

Traceback (most recent call last):
  File "/home/jacob/lerobot/lerobot/scripts/control_robot.py", line 393, in <module>
    control_robot()
  File "/home/jacob/lerobot/lerobot/configs/parser.py", line 227, in wrapper_inner
    response = fn(cfg, *args, **kwargs)
  File "/home/jacob/lerobot/lerobot/scripts/control_robot.py", line 378, in control_robot
    record(robot, cfg.control)
  File "/home/jacob/lerobot/lerobot/common/robot_devices/utils.py", line 42, in wrapper
    raise e
  File "/home/jacob/lerobot/lerobot/common/robot_devices/utils.py", line 38, in wrapper
    return func(robot, *args, **kwargs)
  File "/home/jacob/lerobot/lerobot/scripts/control_robot.py", line 322, in record
    dataset.save_episode()
  File "/home/jacob/lerobot/lerobot/common/datasets/lerobot_dataset.py", line 897, in save_episode
    assert len(video_files) == self.num_episodes * len(self.meta.video_keys)
AssertionError

Can I repair my dataset somehow?

This commit introduces a remove_episodes function/CLI tool that removes
specific episodes from a dataset, automatically updating all required
data, video, and metadata. Optionally, the script pushes the modified
dataset to the hub (True by default).

The function removes episodes safely: if a failure occurs at any point
during the process, the original dataset is preserved. Additionally, the
original dataset can optionally be backed up in case you need to revert.

When episodes are removed from a LeRobotDataset, the remaining
episode indices are no longer sequential, which causes indexing errors
in get_episode_data(). This happens because episode_data_index tensors
are always indexed sequentially, while the episode indices can be
arbitrary. This commit introduces a helper function to make the
conversion.
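
The helper amounts to mapping an arbitrary episode index to its sequential position; a minimal sketch of the idea (not the committed code):

def to_position(remaining_episodes: list[int], episode_index: int) -> int:
    # episode_data_index is packed sequentially, so in a dataset whose
    # remaining episodes are [0, 4, 7], episode 7 lives at position 2.
    return remaining_episodes.index(episode_index)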
@bsprenger force-pushed the feat/remove-episodes branch from eacb1fc to 997133c on April 13, 2025 15:43
@bsprenger
Contributor Author

Hi @Chojins, I assume you are using the online visualization tool. Indeed, it will show an error, because the online tool needs the changes from this PR to be merged in order to work.

If you use this PR's branch and run the visualizer locally, it should work:

python lerobot/scripts/visualize_dataset_html.py --repo-id=chojins/chess_game_001_blue_stereo

When I run this, it appears that the episode was correctly removed!

However, thanks to you I realized that there was another issue with the remove_episodes script: it was pushing to the hub but not retagging the dataset. This means that if you deleted the local cache it would not re-download correctly, and other people would not be able to download it correctly either.

I updated the script yesterday to fix this going forward, but you will need to repair your dataset on the hub. Here's how you can do it:

from huggingface_hub import HfApi

hub_api = HfApi()
# We have to delete the old tag, which points to an older commit of the dataset
hub_api.delete_tag("chojins/chess_game_001_blue_stereo", tag="v2.1", repo_type="dataset")

# Then we create a new tag pointing to the most recent commit (i.e. the one that deleted the episodes)
hub_api.create_tag("chojins/chess_game_001_blue_stereo", tag="v2.1", revision="main", repo_type="dataset")

@imstevenpmwork added the dataset label (Issues regarding data inputs, processing, or datasets) on Apr 17, 2025
@robinhyg
robinhyg commented Apr 25, 2025

Hi @bsprenger, thanks very much for the great work. I ran into a new question while trying your script. If I use the remove script to delete some episodes from the head or middle of a dataset, can I still resume collecting new episodes on top of this reduced dataset (using python control_robot.py --control.type=record --control.resume=true)? I observed that deleting episodes reduces the values of meta.info.total_episodes and meta.info.total_frames, and these two parameters determine the global index values of newly added episodes when recording resumes. This can lead to index overlap (in parquet file names, meta.episodes, and the "episode_index" and "index" keys) with previous episodes.

For example, if I have a dataset with 10 episodes and delete episode_index=1,2,3, the dataset's meta.episodes becomes [0, 4, 5, 6, 7, 8, 9] and meta.info.total_episodes=7. When a new episode is added to this reduced dataset, its episode_index=meta.info.total_episodes=7, which overlaps with the existing episode_index=7 in the dataset (see the sketch below). This overlap shows up in the parquet file name, meta.episodes, and the "episode_index" and "index" keys, and leads to visualization or training failures. Is there any way to resolve this overlap? I often encounter situations where I need to clean the dataset and then supplement it with additional new data.
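
A small numeric sketch of the collision described above (hypothetical values matching the example):

remaining = [0, 4, 5, 6, 7, 8, 9]   # 10 episodes recorded, episodes 1-3 removed
total_episodes = len(remaining)     # 7

# On resume, the recorder derives the next index from the running total:
next_episode_index = total_episodes     # 7, but episode 7 already exists
assert next_episode_index in remaining  # the overlap described above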
