feat: performance improvement and Qwen3 support #60
base: main
Conversation
… into feature/openai_api
  return tensor_dtype;
}

inline size_t torch_dtype_size(int dtype) {
We might not use a tensor item every time, so constructing a tensor just to query its itemsize() could be unnecessarily expensive. A static lookup avoids that cost:
inline size_t torch_dtype_size(int dtype) {
  // Element byte width per dtype; mirrors itemsize() without allocating a tensor.
  switch (dtype) {
    case DTYPE_FLOAT32:
      return 4;
    case DTYPE_FLOAT16:
      return 2;
    case DTYPE_BFLOAT16:
      return 2;
    case DTYPE_FP8_E4M3FN:
      return 1;
    default:
      throw std::invalid_argument("Unknown dtype in torch_dtype_size()");
  }
}
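For context, a hedged sketch of how a caller might use this helper when sizing a buffer; the function name below is hypothetical, and DTYPE_BFLOAT16 is the constant from the snippet above.

#include <cstddef>

// Hypothetical caller: size a staging buffer for num_elements bf16 values
// without materializing a tensor first.
inline size_t staging_buffer_bytes(size_t num_elements) {
  return num_elements * torch_dtype_size(DTYPE_BFLOAT16);
}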
// std::endl; TORCH_CHECK(output.is_contiguous(), "Output tensor must be
// contiguous"); TORCH_CHECK(w1.is_contiguous() && w2.is_contiguous() &&
// w3.is_contiguous(), "Weight tensors must be contiguous");
// TORCH_CHECK(hidden.is_contiguous(), "Hidden tensor must be contiguous");
Just wondering—was there a specific reason for removing this?
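For reference, a minimal sketch of what re-enabling these checks could look like; the tensor names come from the commented-out lines above, while the wrapper function itself is hypothetical.

#include <torch/torch.h>

// Hypothetical guard, mirroring the checks commented out in this diff.
inline void check_fused_mlp_inputs(const torch::Tensor& output,
                                   const torch::Tensor& w1,
                                   const torch::Tensor& w2,
                                   const torch::Tensor& w3,
                                   const torch::Tensor& hidden) {
  TORCH_CHECK(output.is_contiguous(), "Output tensor must be contiguous");
  TORCH_CHECK(w1.is_contiguous() && w2.is_contiguous() && w3.is_contiguous(),
              "Weight tensors must be contiguous");
  TORCH_CHECK(hidden.is_contiguous(), "Hidden tensor must be contiguous");
}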
Pull Request Overview
This PR adds support for the Qwen3 MoE model and implements several performance improvements: overlapped expert copying, fused kernels, CUDA graph support, and refined memory allocators.
- Added Qwen3MoeForCausalLM to model mappings and constants
- Refactored expert modules with a DECLARE_MODULE macro and introduced MoEMLP using CUDA graphs (a generic capture/replay sketch follows this list)
- Overhauled caching allocators and fused MLP kernels for reduced overhead
- Updated examples, documentation, and CI workflows for Ubuntu 22.04
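Since the MoEMLP item mentions CUDA graphs, here is a generic, hedged sketch of the capture-and-replay pattern using the plain CUDA runtime API; it is not this PR's actual implementation, and the kernel is a placeholder.

#include <cuda_runtime.h>

// Placeholder kernel standing in for the fused MoE MLP work.
__global__ void dummy_mlp_kernel(float* x, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) x[i] *= 2.0f;
}

int main() {
  const int n = 1 << 20;
  float* d_x = nullptr;
  cudaMalloc(&d_x, n * sizeof(float));

  cudaStream_t stream;
  cudaStreamCreate(&stream);

  // Capture the launch sequence into a graph once...
  cudaGraph_t graph;
  cudaGraphExec_t graph_exec;
  cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
  dummy_mlp_kernel<<<(n + 255) / 256, 256, 0, stream>>>(d_x, n);
  cudaStreamEndCapture(stream, &graph);
  // cudaGraphInstantiateWithFlags requires CUDA 11.4 or newer.
  cudaGraphInstantiateWithFlags(&graph_exec, graph, 0);

  // ...then replay it, avoiding per-launch CPU overhead on every step.
  for (int step = 0; step < 100; ++step) {
    cudaGraphLaunch(graph_exec, stream);
  }
  cudaStreamSynchronize(stream);

  cudaGraphExecDestroy(graph_exec);
  cudaGraphDestroy(graph);
  cudaStreamDestroy(stream);
  cudaFree(d_x);
  return 0;
}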
Reviewed Changes
Copilot reviewed 45 out of 45 changed files in this pull request and generated 3 comments.
Summary per file:
File | Description |
---|---|
moe_infinity/common/constants.py | Added Qwen3 model to imports and mappings |
examples/interface_example.py | Switched to chat template and cleaned dataset loading |
core/parallel/expert_module.h | Refactored expert modules with macros and new fields |
core/memory/caching_allocator.h | Introduced templated caching allocator (see the sketch after this table) |
core/model/fused_mlp.{h,cu} | Added fused MLP CUDA kernel and launcher |
.github/workflows/* | Upgraded Ubuntu runner from 20.04 to 22.04 |
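As a rough illustration of the templated caching allocator mentioned above (the general technique, not this PR's actual core/memory/caching_allocator.h interface), a minimal version keeps freed blocks in per-size free lists and reuses them instead of hitting the backing allocator again.

#include <cstdlib>
#include <unordered_map>
#include <vector>

// Hypothetical backend policy: plain malloc/free.
struct MallocBackend {
  static void* allocate(size_t bytes) { return std::malloc(bytes); }
  static void deallocate(void* ptr) { std::free(ptr); }
};

// Minimal sketch of a size-bucketed caching allocator.
template <typename Backend = MallocBackend>
class CachingAllocator {
 public:
  void* allocate(size_t bytes) {
    auto& bucket = free_lists_[bytes];
    if (!bucket.empty()) {            // reuse a cached block if one fits
      void* ptr = bucket.back();
      bucket.pop_back();
      return ptr;
    }
    return Backend::allocate(bytes);  // otherwise fall through to the backend
  }

  void deallocate(void* ptr, size_t bytes) {
    free_lists_[bytes].push_back(ptr);  // cache the block for later reuse
  }

  ~CachingAllocator() {
    for (auto& entry : free_lists_)
      for (void* ptr : entry.second) Backend::deallocate(ptr);
  }

 private:
  std::unordered_map<size_t, std::vector<void*>> free_lists_;
};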
Comments suppressed due to low confidence (1)
core/parallel/expert_dispatcher.h:49
- [nitpick] The default num_threads was reduced from 8 to 1, which may degrade parallel throughput. If this is intentional, please document the rationale or expose it as a configurable parameter.
explicit ExpertDispatcher(int num_experts, int num_layers, int dtype, int expert_type, int num_threads = 1);
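If the lower default is intentional, one option is to pass the thread count explicitly at the call site; a sketch, with all argument values being placeholders.

// Hypothetical call site; argument values are placeholders, not this PR's defaults.
ExpertDispatcher dispatcher(/*num_experts=*/64,
                            /*num_layers=*/24,
                            /*dtype=*/DTYPE_BFLOAT16,
                            /*expert_type=*/0,
                            /*num_threads=*/8);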
Description
Major changes for performance improvement and Qwen3 support.
Motivation
Type of Change
Checklist