This repository contains experiments comparing different gating strategies in Mixture-of-Experts (MoE) models, with a focus on sparse and attention-based routing mechanisms.
Install the required dependencies:
```bash
pip install -r requirements.txt
```
The main dependencies include:
- PyTorch
- Transformers
- PEFT
- Accelerate
- Datasets
- Evaluate
- ROUGE score
- Entmax (for sparse activation functions)
- main_experiment.ipynb: Primary notebook for running experiments with the various gating strategies (recommended)
- basline.ipynb: Initial experimental attempts at the standard MoE gating mechanisms
- flops_comparison.ipynb: Notebook for comparing computational efficiency (FLOPs) across different routing methods (top-k vs. sparsemax)
The experiments should be run through the main_experiment.ipynb Jupyter notebook, which contains optimized code for implementing and evaluating the different gating strategies.
To run the baseline gating experiments (as described in the paper):
This is the standard MoE gating mechanism that applies a linear projection to compute expert scores, followed by softmax normalization and top-k selection.
```python
# In main_experiment.ipynb
router_type = "linear"
norm_type = "softmax"
top_k = 2  # Set to the desired k value
```
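For reference, here is a minimal sketch of what this configuration corresponds to; the function name and tensor shapes are illustrative, not the notebook's actual implementation:

```python
import torch
import torch.nn.functional as F

def linear_topk_route(hidden, gate_weight, top_k=2):
    """Illustrative linear router: project, softmax, then keep only the top-k experts.

    hidden:      (tokens, d_model) token representations
    gate_weight: (d_model, n_experts) learned gating projection
    """
    logits = hidden @ gate_weight                    # (tokens, n_experts) expert scores
    probs = F.softmax(logits, dim=-1)                # normalize over experts
    topk_vals, topk_idx = probs.topk(top_k, dim=-1)  # keep the k highest-scoring experts
    routing = torch.zeros_like(probs).scatter_(-1, topk_idx, topk_vals)
    routing = routing / routing.sum(dim=-1, keepdim=True)  # renormalize retained weights
    return routing, topk_idx
```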
This applies softmax over all expert scores without top-k filtering, so all experts are activated and contribute to the final output.
```python
# In main_experiment.ipynb
router_type = "linear"
norm_type = "softmax"
top_k = None  # No top-k filtering - all experts are used
```
This replaces softmax with sparsemax on the linear gating outputs, enabling differentiable sparse routing without explicit top-k filtering.
```python
# In main_experiment.ipynb
router_type = "linear"
norm_type = "sparsemax"
top_k = None  # Sparsemax naturally produces sparse outputs
```
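A minimal sketch of sparsemax routing using the entmax package (illustrative names; the notebook's code may differ):

```python
import torch
from entmax import sparsemax  # pip install entmax

def linear_sparsemax_route(hidden, gate_weight):
    """Illustrative sparsemax router: many expert weights come out exactly zero,
    so no explicit top-k step is needed."""
    logits = hidden @ gate_weight      # (tokens, n_experts) expert scores
    return sparsemax(logits, dim=-1)   # sparse probability distribution over experts
```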
This router computes expert relevance scores via scaled dot-product attention between the input and learned expert embeddings; the scores are passed through softmax, followed by top-k selection.
```python
# In main_experiment.ipynb
router_type = "attention"
norm_type = "softmax"
top_k = 2  # Set to the desired k value
```
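A minimal sketch of the attention-based scoring, assuming learned expert embeddings of the same width as the hidden states (names and shapes are illustrative):

```python
import math
import torch
import torch.nn.functional as F

def attention_topk_route(hidden, expert_emb, top_k=2):
    """Illustrative attention router: score experts by scaled dot-product attention
    between token representations and learned expert embeddings.

    hidden:     (tokens, d_model) token representations
    expert_emb: (n_experts, d_model) learned expert embeddings
    """
    scores = hidden @ expert_emb.t() / math.sqrt(hidden.size(-1))  # scaled dot-product
    probs = F.softmax(scores, dim=-1)
    topk_vals, topk_idx = probs.topk(top_k, dim=-1)
    routing = torch.zeros_like(probs).scatter_(-1, topk_idx, topk_vals)
    return routing / routing.sum(dim=-1, keepdim=True), topk_idx
```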
Similar to the above, but without top-k filtering. All experts are softly weighted according to attention scores and contribute to the output.
```python
# In main_experiment.ipynb
router_type = "attention"
norm_type = "softmax"
top_k = None  # No top-k filtering - all experts are used
```
This applies sparsemax instead of softmax to the attention-derived scores, promoting sparse yet differentiable expert activation.
```python
# In main_experiment.ipynb
router_type = "attention"
norm_type = "sparsemax"
top_k = None  # Sparsemax naturally produces sparse outputs
```
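This variant only swaps the normalizer; a hedged sketch under the same assumptions as the attention router above:

```python
import math
import torch
from entmax import sparsemax

def attention_sparsemax_route(hidden, expert_emb):
    """Illustrative variant: sparsemax over attention-derived expert scores gives
    exact zeros for irrelevant experts without an explicit top-k step."""
    scores = hidden @ expert_emb.t() / math.sqrt(hidden.size(-1))
    return sparsemax(scores, dim=-1)
```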
The experiments are designed to run in a GPU-enabled environment. The code automatically configures the appropriate GPU usage.
The experiments use the SAMSum dataset for summarization tasks. The dataset is automatically loaded from Hugging Face Datasets.
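For example, the dataset can be loaded roughly as follows (assuming the samsum dataset id on the Hugging Face Hub; the notebook may use a different id or preprocessing):

```python
from datasets import load_dataset

# Illustrative loading; newer datasets versions may require trust_remote_code=True.
samsum = load_dataset("samsum")
example = samsum["train"][0]
print(example["dialogue"][:200])
print(example["summary"])
```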
The base model is google/switch-base-8, which is a Switch Transformer with 8 experts per layer. The experiments modify the routing mechanism of this model. We also scale to google/switch-base-16, which has 16 experts per layer.
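A hedged sketch of loading the base checkpoint with Transformers (the notebook then replaces or wraps the model's routers):

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "google/switch-base-8"  # or "google/switch-base-16" for the 16-expert variant
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)  # Switch Transformer seq2seq model
```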
The experiments are evaluated using ROUGE scores, which measure the overlap between the generated and reference summaries.
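A minimal sketch of ROUGE scoring with the evaluate library (toy strings shown in place of model outputs):

```python
import evaluate

rouge = evaluate.load("rouge")
scores = rouge.compute(
    predictions=["the cat sat on the mat"],       # generated summaries
    references=["a cat was sitting on the mat"],  # reference summaries
)
print(scores)  # rouge1 / rouge2 / rougeL / rougeLsum scores
```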
After running each experiment, the results are displayed in the notebook, showing the performance metrics for each gating strategy.