Add Cuda/NVDEC decoding (torchcodec gpu) by jadechoghari · Pull Request #913 · huggingface/lerobot · GitHub

Add Cuda/NVDEC decoding (torchcodec gpu) #913


Draft
wants to merge 5 commits into main

Conversation

@jadechoghari (Collaborator) commented Mar 28, 2025

What this does

This PR adds CUDA decoding to the training script.
We compared it against torchcodec-cpu using the lerobot video benchmark. Results:

Backend         Timestamps Mode   Load Time (ms)   MSE        PSNR    SSIM
torchcodec-cpu  1_frame           12.80            0.000056   44.50   0.9947
torchcodec-cpu  2_frames          7.31             0.000057   44.50   0.9948
torchcodec-cpu  6_frames          5.16             0.000061   44.39   0.9946
torchcodec-gpu  1_frame           271.54           0.002487   26.04   0.4754
torchcodec-gpu  2_frames          121.25           0.002492   26.04   0.4746
torchcodec-gpu  6_frames          40.70            0.002495   26.03   0.4746

Note: GPU decoding with torchcodec pays upfront costs: CUDA context initialization (roughly constant) and frame transfer back to memory (scales with resolution). While GPU decoding (via NVDEC) is faster per frame than single-threaded CPU decoding, it only wins when init + transfer < total CPU decode time. As video resolution increases, the GPU wins more often. However, modern CPUs with many threads can still outperform GPU decoding because of the limited number of NVDEC units (typically fewer than 10). This tradeoff matters when choosing backend defaults or optimizing for throughput.
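
To make the tradeoff concrete, here is a minimal timing sketch (not part of this PR; the video path and the requested timestamps are placeholders) that measures decoder construction plus frame extraction on CPU and GPU with torchcodec's VideoDecoder:

import time

import torch
from torchcodec.decoders import VideoDecoder

VIDEO_PATH = "video.mp4"                      # placeholder path
TIMESTAMPS = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5]   # six requested frames, mirroring the 6_frames mode above

def time_decode(device: str) -> float:
    start = time.perf_counter()
    decoder = VideoDecoder(VIDEO_PATH, device=device)   # on GPU, the decoding context is initialized here
    decoder.get_frames_played_at(TIMESTAMPS)            # decode + (on GPU) frame transfer
    if device == "cuda":
        torch.cuda.synchronize()                        # wait for pending GPU work before stopping the clock
    return (time.perf_counter() - start) * 1000

print(f"cpu : {time_decode('cpu'):.2f} ms")
if torch.cuda.is_available():
    print(f"cuda: {time_decode('cuda'):.2f} ms")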

What's left

Compare training speed, investigate when the CPU becomes a bottleneck and CUDA decoding becomes a win, and test training speed on the DOT Policy (to be added soon): #739
WandB logs and evals can be found here

@jadechoghari jadechoghari marked this pull request as draft March 28, 2025 10:45
@imstevenpmwork imstevenpmwork added the enhancement (Suggestions for new features or improvements) and performance (Issues aimed at improving speed or resource usage) labels Mar 28, 2025
@jadechoghari (Collaborator, Author) commented Apr 3, 2025

Things to note:
To use torchcodec on GPU you need to:

  1. Install torchcodec >= 0.2.1 with CUDA enabled. The index URL below corresponds to CUDA Toolkit 12.6; it should match the CUDA version you used when installing PyTorch (if you installed PyTorch with pip):
     pip install "torchcodec>=0.2.1" --index-url=https://download.pytorch.org/whl/cu126

  2. Install the NVIDIA Performance Primitives library: conda install -c nvidia libnpp

  3. Make sure FFmpeg (preferably version 7) is installed so NVDEC can work: conda install ffmpeg=7 -c conda-forge
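
Once those three steps are done, a quick sanity check along these lines (a sketch on my side, with a placeholder video path) confirms the CUDA-enabled build can actually decode on the GPU:

import torch
from torchcodec.decoders import VideoDecoder

assert torch.cuda.is_available(), "CUDA device not visible to PyTorch"

decoder = VideoDecoder("video.mp4", device="cuda")   # placeholder video path
frame = decoder[0]                                   # first frame, decoded via NVDEC
print(frame.shape, frame.device)                     # expect a C x H x W tensor on cuda:0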

Also, the GPU option will be inferred during training by checking dataset.video_backend. To easily test it, add this line in the training script right after dataset = make_dataset(cfg): dataset.video_backend = "torchcodec-gpu" (see the sketch below).
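
For reference, a minimal sketch of that override (cfg, make_dataset, and the surrounding lines are the existing code in lerobot's training script; only the last line is new):

# Inside lerobot's training script, right after the dataset is created
# (cfg is the training config object already available there):
dataset = make_dataset(cfg)                 # existing line
dataset.video_backend = "torchcodec-gpu"    # added line: force CUDA/NVDEC decoding during training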

@jadechoghari jadechoghari marked this pull request as ready for review April 3, 2025 14:15
@jadechoghari jadechoghari requested a review from Cadene April 3, 2025 14:44
@Cadene (Collaborator) commented Apr 3, 2025

This is wonderful work @jadechoghari! Offloading a percentage of the decoding to the GPU will be critical with a higher number of frames as context and higher resolutions. Since that's not yet the case, I would suggest leaving this PR open as a draft. What do you think?

@jadechoghari (Collaborator, Author)

sure! let's keep it open as a draft and gather more signals — definitely worth having it ready as things scale in the future 👀

@jadechoghari jadechoghari marked this pull request as draft April 3, 2025 15:38
Labels
enhancement (Suggestions for new features or improvements), performance (Issues aimed at improving speed or resource usage)
3 participants