feat: Enable vLLM cudagraphs #498
base: main
Conversation
Signed-off-by: Jimmy Zhang <jiemingz@nvidia.com>
Force-pushed from de91c35 to e84ff82
Signed-off-by: Jimmy Zhang <133159885+jiemingz@users.noreply.github.com>
@jiemingz can you also add a timing plot to the MR description showing the benefits of enabling CUDA graphs vs. not?
There is a unit test failure here caused by the missing `eager` key: @jiemingz
Addresses: !186
Generation throughput shows a ~3% speedup for Llama-8B on 4 nodes.
What does this PR do ?
Enables vLLM CUDA graphs during generation, yielding a ~3% generation throughput speedup for Llama-8B on 4 nodes.
Issues
Addresses !186
Usage
# Add a code snippet demonstrating how to use this
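A minimal sketch of how a generation config flag could map onto vLLM's `enforce_eager` engine argument (in vLLM, `enforce_eager=True` disables CUDA graph capture, so enabling CUDA graphs means leaving it off). The helper name and config shape here are illustrative assumptions, not this PR's actual interface:

```python
def build_vllm_engine_kwargs(use_cuda_graphs: bool) -> dict:
    """Hypothetical helper: translate a high-level CUDA-graphs toggle
    into vLLM engine kwargs.

    vLLM captures CUDA graphs for decode by default; passing
    enforce_eager=True forces eager-mode execution and skips capture.
    """
    return {"enforce_eager": not use_cuda_graphs}


# Enabling CUDA graphs leaves enforce_eager off:
print(build_vllm_engine_kwargs(True))   # {'enforce_eager': False}
# Disabling them forces eager mode:
print(build_vllm_engine_kwargs(False))  # {'enforce_eager': True}
```

In a real setup these kwargs would be forwarded to the vLLM engine constructor, e.g. `LLM(model=..., **build_vllm_engine_kwargs(True))`.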
Before your PR is "Ready for review"
Pre checks:
Additional Information