Add Cuda/NVDEC decoding (torchcodec gpu) #913
Conversation
Things to note:
Also, the GPU option will be inferred during training by checking the
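A minimal sketch of how that inference could work, assuming the backend is picked from the training device (the `select_decoding_backend` helper and the backend names here are hypothetical, not the PR's actual code):

```python
import torch


def select_decoding_backend(device: str) -> str:
    """Hypothetical helper: pick a video decoding backend from the training device.

    Use GPU decoding only when the policy trains on CUDA and a CUDA runtime is
    actually available; otherwise fall back to CPU decoding.
    """
    if device.startswith("cuda") and torch.cuda.is_available():
        return "torchcodec-gpu"
    return "torchcodec-cpu"


# Example: a config that trains on GPU would also decode on GPU.
print(select_decoding_backend("cuda"))  # "torchcodec-gpu" on a CUDA machine
print(select_decoding_backend("cpu"))   # "torchcodec-cpu"
```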
This is wonderful work @jadechoghari! Offloading a percentage of the decoding to the GPU will be critical with a higher number of frames as context and higher resolution. Since that's not the case yet, I would suggest leaving this PR open as a draft. What do you think?
sure! let's keep it open as a draft and gather more signals — definitely worth having it ready as things scale in the future 👀
What this does
This PR adds CUDA decoding to the training script.
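For reference, decoding frames on the GPU with torchcodec looks roughly like the sketch below (the video path and timestamps are illustrative, and torchcodec must be installed with NVDEC/CUDA support):

```python
import torch
from torchcodec.decoders import VideoDecoder

# Decode via NVDEC when CUDA is available; decoded frames come back as CUDA
# tensors, so no extra host-to-device copy is needed before the forward pass.
device = "cuda" if torch.cuda.is_available() else "cpu"
decoder = VideoDecoder("episode_000000.mp4", device=device)  # illustrative path

frame = decoder[0]  # uint8 tensor of shape (C, H, W) on `device`
batch = decoder.get_frames_played_at(seconds=[0.0, 0.5])  # FrameBatch with .data, .pts_seconds
print(frame.device, batch.data.shape)
```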
We compared it against torchcodec-cpu using the LeRobot video benchmark. See results here:
Note: GPU decoding with torchcodec pays upfront costs: CUDA context initialization (roughly constant) and frame transfer (scales with resolution). While GPU decoding (via NVDEC) is faster per frame than single-threaded CPU decoding, it is only a win when `init + transfer < total CPU decode time`. As video resolution increases, the GPU wins more often. However, modern CPUs with many threads can still outperform GPU decoding because GPUs have a limited number of NVDEC units (typically fewer than 10). This tradeoff matters when choosing backend defaults or optimizing for throughput.
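A rough way to measure that tradeoff on your own videos is sketched below (assuming a CUDA-enabled torchcodec build; the path and frame indices are illustrative, and results will vary with resolution, codec, and the number of NVDEC units):

```python
import time

import torch
from torchcodec.decoders import VideoDecoder


def time_decode(path: str, device: str, indices: list[int]) -> tuple[float, float]:
    """Return (init_seconds, decode_seconds) for one decoder on one device."""
    t0 = time.perf_counter()
    decoder = VideoDecoder(path, device=device)  # context init is the upfront cost on CUDA
    init_s = time.perf_counter() - t0

    t0 = time.perf_counter()
    for i in indices:
        _ = decoder[i]  # includes the frame transfer for CUDA-decoded frames
    if device == "cuda":
        torch.cuda.synchronize()
    return init_s, time.perf_counter() - t0


path = "episode_000000.mp4"  # illustrative path
indices = list(range(0, 100, 5))
for device in ("cpu", "cuda"):
    init_s, decode_s = time_decode(path, device, indices)
    print(f"{device}: init {init_s:.3f}s, decode {decode_s:.3f}s, total {init_s + decode_s:.3f}s")
```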
What's left
Compare training speed, investigate further when the CPU becomes a bottleneck and CUDA decoding becomes a win, and test training speed on the DOT policy (to be added soon): #739
WandB logs and evals can be found here