
feat: performance improvement and Qwen3 support #60


Open: drunkcoding wants to merge 40 commits into main

Conversation

drunkcoding (Contributor)

Description

Major changes for performance improvement

Motivation

  • Support the latest Qwen3 MoE model
  • Overlap hidden-states gather with expert copy (see the sketch after this list)
  • Reduce torch kernel launch overhead
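
As a rough illustration of the overlap idea, here is a minimal CUDA sketch using two streams and an event; the function and buffer names are illustrative assumptions, not this PR's actual code:

#include <cuda_runtime.h>

// Illustrative only: enqueue the expert weight copy on its own stream so it
// can proceed while hidden-state gathering runs on the compute stream.
void load_expert_overlapped(const void* host_weights, void* device_weights,
                            size_t weight_bytes, cudaStream_t copy_stream,
                            cudaStream_t compute_stream, cudaEvent_t copy_done) {
  // Requires pinned host memory for the copy to truly overlap with compute.
  cudaMemcpyAsync(device_weights, host_weights, weight_bytes,
                  cudaMemcpyHostToDevice, copy_stream);
  cudaEventRecord(copy_done, copy_stream);

  // ... hidden-state gather kernels run on compute_stream in the meantime ...

  // The expert's computation waits for its weights without blocking the host.
  cudaStreamWaitEvent(compute_stream, copy_done, 0);
}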

Type of Change

  • Bug fix
  • New feature
  • Breaking change
  • Documentation update

Checklist

  • I have read the CONTRIBUTION guide.
  • I have updated the tests (if applicable).
  • I have updated the documentation (if applicable).

drunkcoding requested a review from lausannel on May 12, 2025.
lausannel requested a review from Copilot on June 2, 2025.

return tensor_dtype;
}

inline size_t torch_dtype_size(int dtype) {
Collaborator (inline comment on the diff above):

We might not use a tensor item every time, so constructing a tensor just to query its itemsize() might be unnecessarily expensive. Suggested change:

// Map the project's dtype enum to its element size in bytes directly,
// avoiding the cost of materializing a tensor just to call itemsize().
inline size_t torch_dtype_size(int dtype) {
  switch (dtype) {
    case DTYPE_FLOAT32:
      return 4;
    case DTYPE_FLOAT16:
      return 2;
    case DTYPE_BFLOAT16:
      return 2;
    case DTYPE_FP8_E4M3FN:
      return 1;
    default:
      throw std::invalid_argument("Unknown dtype in torch_dtype_size()");
  }
}
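
For contrast, here is a minimal sketch of the pattern the comment cautions against, assuming the LibTorch C++ API (the PR's original code is not shown on this page):

#include <torch/torch.h>

// Correct but wasteful: a throwaway tensor is constructed on every call
// just so its element size can be read back.
inline size_t dtype_size_via_tensor() {
  return torch::empty({1}, torch::dtype(torch::kFloat16)).element_size();
}

If the project's dtype enum maps cleanly to at::ScalarType, LibTorch's c10::elementSize provides the same lookup without a hand-rolled switch.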

// TORCH_CHECK(output.is_contiguous(), "Output tensor must be contiguous");
// TORCH_CHECK(w1.is_contiguous() && w2.is_contiguous() && w3.is_contiguous(),
//             "Weight tensors must be contiguous");
// TORCH_CHECK(hidden.is_contiguous(), "Hidden tensor must be contiguous");
Collaborator (inline comment on the diff above):

Just wondering: was there a specific reason for removing this?

drunkcoding requested a review from Copilot on June 14, 2025.

Copilot AI left a comment:

Pull Request Overview

This PR adds support for the Qwen3 MoE model and implements several performance improvements: overlapped expert copying, fused kernels, CUDA graph support, and refined memory allocators.

  • Added Qwen3MoeForCausalLM to model mappings and constants
  • Refactored expert modules with a DECLARE_MODULE macro and introduced MoEMLP using CUDA graphs (see the sketch after this list)
  • Overhauled caching allocators and fused MLP kernels for reduced overhead
  • Updated examples, documentation, and CI workflows for Ubuntu 22.04
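
On the CUDA graph point, here is a minimal sketch of the capture-and-replay pattern that reduces per-kernel launch overhead, assuming the CUDA 12 runtime API; this is not the PR's MoEMLP implementation:

#include <cuda_runtime.h>

// Illustrative only: record a fixed kernel sequence once, then replay the
// whole graph with a single launch per step instead of many small launches.
void run_with_cuda_graph(cudaStream_t stream, int steps) {
  cudaGraph_t graph;
  cudaGraphExec_t graph_exec;

  // Capture: kernels launched on `stream` are recorded rather than executed.
  cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
  // ... launch the MLP kernel sequence on `stream` here ...
  cudaStreamEndCapture(stream, &graph);

  // Instantiate once (CUDA 12 three-argument signature), replay many times.
  cudaGraphInstantiate(&graph_exec, graph, 0);
  for (int i = 0; i < steps; ++i) {
    cudaGraphLaunch(graph_exec, stream);
  }
  cudaStreamSynchronize(stream);

  cudaGraphExecDestroy(graph_exec);
  cudaGraphDestroy(graph);
}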

Reviewed Changes

Copilot reviewed 45 out of 45 changed files in this pull request and generated 3 comments.

Summary per file:

  • moe_infinity/common/constants.py: Added Qwen3 model to imports and mappings
  • examples/interface_example.py: Switched to chat template and cleaned dataset loading
  • core/parallel/expert_module.h: Refactored expert modules with macros and new fields
  • core/memory/caching_allocator.h: Introduced templated caching allocator (sketched after this list)
  • core/model/fused_mlp.{h,cu}: Added fused MLP CUDA kernel and launcher
  • .github/workflows/*: Upgraded Ubuntu runner from 20.04 to 22.04
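
For context on the caching allocator entry, here is a minimal sketch of the general technique, assuming a size-keyed free list; the names and structure are illustrative, not the contents of core/memory/caching_allocator.h:

#include <cstddef>
#include <mutex>
#include <unordered_map>

// Illustrative caching allocator: instead of returning freed blocks to the
// underlying allocator, keep them in a size-keyed free list and reuse them,
// amortizing expensive cudaMalloc/cudaFree-style calls.
template <typename Backend>
class CachingAllocator {
 public:
  void* allocate(size_t bytes) {
    std::lock_guard<std::mutex> lock(mu_);
    auto it = free_blocks_.find(bytes);
    if (it != free_blocks_.end()) {
      void* ptr = it->second;          // reuse a cached block of the same size
      free_blocks_.erase(it);
      return ptr;
    }
    return Backend::allocate(bytes);   // cache miss: fall through to backend
  }

  void deallocate(void* ptr, size_t bytes) {
    std::lock_guard<std::mutex> lock(mu_);
    free_blocks_.emplace(bytes, ptr);  // cache the block instead of freeing
  }

 private:
  std::mutex mu_;
  std::unordered_multimap<size_t, void*> free_blocks_;
};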
Comments suppressed due to low confidence (1)

core/parallel/expert_dispatcher.h:49

  • [nitpick] The default num_threads was reduced from 8 to 1, which may degrade parallel throughput. If this is intentional, please document the rationale or expose it as a configurable parameter (one possible approach is sketched below).
explicit ExpertDispatcher(int num_experts, int num_layers, int dtype, int expert_type, int num_threads = 1);
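
One possible way to make the thread count configurable, as a sketch (the MOE_DISPATCHER_THREADS variable name is a made-up assumption, not an existing knob):

#include <cstdlib>
#include <string>

// Hypothetical helper: read the dispatcher thread count from the environment
// so deployments can override the conservative default of 1.
inline int default_dispatcher_threads() {
  const char* env = std::getenv("MOE_DISPATCHER_THREADS");  // assumed name
  return env != nullptr ? std::stoi(env) : 1;
}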

lausannel changed the title from "Performance improvement and QWen3 Support" to "feat: performance improvement and Qwen3 support" on Jun 16, 2025.