Add Cuda/NVDEC decoding (torchcodec gpu) #913
Conversation
Things to note:
Also, the GPU option will be inferred during training by checking the
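A minimal sketch of how that inference could work, assuming the backend is picked from the training device (the `select_decoding_backend` helper and the backend names here are hypothetical, not the PR's actual code):

```python
import torch


def select_decoding_backend(device: str) -> str:
    """Hypothetical helper: pick a video decoding backend from the training device.

    Use GPU decoding only when the policy trains on CUDA and a CUDA runtime is
    actually available; otherwise fall back to CPU decoding.
    """
    if device.startswith("cuda") and torch.cuda.is_available():
        return "torchcodec-gpu"
    return "torchcodec-cpu"


# Example: a config that trains on GPU would also decode on GPU.
print(select_decoding_backend("cuda"))  # "torchcodec-gpu" on a CUDA machine
print(select_decoding_backend("cpu"))   # "torchcodec-cpu"
```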
This is wonderful work @jadechoghari! Offloading a percentage of the decoding to the GPU will be critical with a higher number of frames as context and higher resolution. Since that's not the case yet, I would suggest leaving this PR open as a draft. What do you think?
sure! let's keep it open as a draft and gather more signals — definitely worth having it ready as things scale in the future 👀
What this does
This PR adds CUDA decoding to the training script.
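For reference, decoding frames on the GPU with torchcodec looks roughly like the sketch below (the video path and timestamps are illustrative, and torchcodec must be installed with NVDEC/CUDA support):

```python
import torch
from torchcodec.decoders import VideoDecoder

# Decode via NVDEC when CUDA is available; decoded frames come back as CUDA
# tensors, so no extra host-to-device copy is needed before the forward pass.
device = "cuda" if torch.cuda.is_available() else "cpu"
decoder = VideoDecoder("episode_000000.mp4", device=device)  # illustrative path

frame = decoder[0]  # uint8 tensor of shape (C, H, W) on `device`
batch = decoder.get_frames_played_at(seconds=[0.0, 0.5])  # FrameBatch with .data, .pts_seconds
print(frame.device, batch.data.shape)
```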
We compared it against torchcodec-cpu using the LeRobot video benchmark. See results here:
Note: GPU decoding with torchcodec pays upfront costs: CUDA context initialization (roughly constant) and frame transfer (scales with resolution). While GPU decoding (via NVDEC) is faster per frame than single-threaded CPU decoding, it is only a win when `init + transfer < total CPU decode time`. As video resolution increases, the GPU wins more often. However, modern CPUs with many threads can still outperform GPU decoding because GPUs have a limited number of NVDEC units (typically fewer than 10). This tradeoff matters when choosing backend defaults or optimizing for throughput.
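A rough way to measure that tradeoff on your own videos is sketched below (assuming a CUDA-enabled torchcodec build; the path and frame indices are illustrative, and results will vary with resolution, codec, and the number of NVDEC units):

```python
import time

import torch
from torchcodec.decoders import VideoDecoder


def time_decode(path: str, device: str, indices: list[int]) -> tuple[float, float]:
    """Return (init_seconds, decode_seconds) for one decoder on one device."""
    t0 = time.perf_counter()
    decoder = VideoDecoder(path, device=device)  # context init is the upfront cost on CUDA
    init_s = time.perf_counter() - t0

    t0 = time.perf_counter()
    for i in indices:
        _ = decoder[i]  # includes the frame transfer for CUDA-decoded frames
    if device == "cuda":
        torch.cuda.synchronize()
    return init_s, time.perf_counter() - t0


path = "episode_000000.mp4"  # illustrative path
indices = list(range(0, 100, 5))
for device in ("cpu", "cuda"):
    init_s, decode_s = time_decode(path, device, indices)
    print(f"{device}: init {init_s:.3f}s, decode {decode_s:.3f}s, total {init_s + decode_s:.3f}s")
```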
What's left
Compare training speed, investigate further when the CPU becomes a bottleneck and CUDA decoding becomes a win, and test training speed on the DOT policy (to be added soon): #739
WandB logs and evals can be found here