[REQ] Use 8B/16B vectorized loads in CUDA kernels for performance · Issue #712 · NVIDIA/warp

Open
adenzler-nvidia opened this issue May 8, 2025 · 0 comments
@adenzler-nvidia (Contributor)

Description

We should add a fast path that loads wider types through float2/float4 (8-byte/16-byte vectorized) loads, as this can give a substantial performance boost for memory-bound kernels.
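
A minimal sketch of the kind of fast path this refers to, in plain CUDA. The kernel and names below are illustrative, not Warp's actual generated code:

```cuda
#include <stdint.h>
#include <cuda_runtime.h>

// Illustrative only: copy a float array using 16-byte float4 loads/stores
// when both pointers are 16-byte aligned, falling back to scalar accesses
// otherwise. Generated kernels could take a similar fast path.
__global__ void copy_vectorized(const float* __restrict__ src,
                                float* __restrict__ dst, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Fast path: pointers 16-byte aligned and n divisible by 4, so each
    // thread moves 4 floats with a single float4 transaction.
    if (reinterpret_cast<uintptr_t>(src) % 16 == 0 &&
        reinterpret_cast<uintptr_t>(dst) % 16 == 0 && n % 4 == 0) {
        int n4 = n / 4;
        if (i < n4) {
            reinterpret_cast<float4*>(dst)[i] =
                reinterpret_cast<const float4*>(src)[i];
        }
    } else {
        // Scalar fallback for unaligned or ragged inputs.
        if (i < n) {
            dst[i] = src[i];
        }
    }
}
```

On the fast path, each float4 access compiles to a single 128-bit load/store instruction instead of four 32-bit ones, which is where the bandwidth win comes from.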

Context

We observed good speedups in Mujoco-Warp by doing this, and we would like it to be available for any type that is properly aligned, so that users can build padded data types on top of it (see the sketch below).
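
As a concrete (hypothetical) example of such a padded type: a vec3 padded to 16 bytes can be fetched with a single float4 load instead of three scalar loads:

```cuda
// Hypothetical padded vec3: 12 bytes of payload plus 4 bytes of padding.
// alignas(16) guarantees every element of an array of padded_vec3 sits on
// a 16-byte boundary, so it can be read with one float4 (16B) load.
struct alignas(16) padded_vec3 {
    float x, y, z;
    float pad;  // unused; keeps the 16-byte size and alignment
};
static_assert(sizeof(padded_vec3) == 16, "one float4 per element");
```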

adenzler-nvidia added this to the 1.8.0 milestone on May 8, 2025
adenzler-nvidia self-assigned this on May 8, 2025
adenzler-nvidia added the feature request label on May 8, 2025