PyTorch MoE

PyTorch MoE is a modular implementation of a Mixture-of-Experts transformer using PyTorch. It is designed for flexibility, extensibility, and performance. The architecture focuses on sparse expert routing, efficient attention, and scalable tokenization.

Features

  • Sparse expert activation with top-k routing per token
  • Rotary embeddings and grouped-query attention
  • Attention with KV caching for autoregressive use cases (see the sketch after this list)
  • Expert capacity limits with residual fallback on overflow
  • Parallelized BPE tokenizer with deterministic merges
  • Modular PyTorch components for experimentation
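
A minimal sketch of how attention with a KV cache can work during autoregressive decoding. This is not the repository's API; the function name and the (keys, values) tuple cache layout are assumptions made for illustration.

import torch
import torch.nn.functional as F

# Illustrative sketch only: append this step's keys/values to a cache and attend
# over the whole prefix. `step_attend` and the (k, v) tuple cache are hypothetical.
def step_attend(q, k_new, v_new, kv_cache=None):
    # q, k_new, v_new: (batch, heads, 1, head_dim) for one decoding step
    if kv_cache is None:
        k, v = k_new, v_new
    else:
        k = torch.cat([kv_cache[0], k_new], dim=2)   # grow along the sequence axis
        v = torch.cat([kv_cache[1], v_new], dim=2)
    out = F.scaled_dot_product_attention(q, k, v)    # attends over all cached positions
    return out, (k, v)                               # updated cache for the next step

Each decoding step then reuses the returned cache instead of recomputing keys and values for the whole prefix.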

Components

MoEBlock
Implements token-to-expert routing. Each token selects the k experts with the highest router scores. Expert capacity is capped; tokens that overflow an expert fall back to a residual pass-through.
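
A rough sketch of this routing scheme under assumed shapes and names; the function name, the first-come capacity cap, and the expert modules below are illustrative, not the repository's MoEBlock.

import torch
import torch.nn as nn

# Illustrative routing sketch, not the repo's MoEBlock. Each token picks its top-k
# experts by router score; an expert accepts at most `capacity` tokens (first come,
# first served here), and overflow tokens keep only the residual pass-through.
def route(x, router, experts, k=2, capacity=4):
    # x: (num_tokens, d_model); router: nn.Linear(d_model, num_experts)
    weights, chosen = router(x).softmax(-1).topk(k, dim=-1)        # (tokens, k) each
    out = x.clone()                                                # residual pass-through
    for e, expert in enumerate(experts):
        token_idx, slot = (chosen == e).nonzero(as_tuple=True)
        token_idx, slot = token_idx[:capacity], slot[:capacity]    # drop overflow tokens
        if token_idx.numel():
            w = weights[token_idx, slot].unsqueeze(-1)
            out[token_idx] = out[token_idx] + w * expert(x[token_idx])
    return out

experts = [nn.Sequential(nn.Linear(64, 128), nn.GELU(), nn.Linear(128, 64)) for _ in range(4)]
y = route(torch.randn(10, 64), nn.Linear(64, 4), experts)          # (10, 64)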

TransformerBlock
Layer block combining rotary attention with either a dense feed-forward network or a sparse MoE layer. Query and key/value head counts are configured independently, which is what enables grouped-query attention.
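
A condensed sketch of how such a block might be wired; the class and argument names are assumptions, and rotary encoding is omitted for brevity. Grouped-query attention shows up as fewer key/value heads than query heads, repeated to match before attention.

import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative block wiring, not the repo's TransformerBlock: pre-norm attention
# with fewer key/value heads than query heads (grouped-query attention), followed
# by either a dense feed-forward layer or a sparse MoE layer.
class Block(nn.Module):
    def __init__(self, d_model=256, n_q_heads=8, n_kv_heads=2, ff_or_moe=None):
        super().__init__()
        self.hd = d_model // n_q_heads
        self.n_q, self.n_kv = n_q_heads, n_kv_heads
        self.wq = nn.Linear(d_model, n_q_heads * self.hd, bias=False)
        self.wk = nn.Linear(d_model, n_kv_heads * self.hd, bias=False)
        self.wv = nn.Linear(d_model, n_kv_heads * self.hd, bias=False)
        self.wo = nn.Linear(n_q_heads * self.hd, d_model, bias=False)
        self.norm1, self.norm2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)
        # ff_or_moe: any module mapping (batch, seq, d_model) -> the same shape
        self.mlp = ff_or_moe if ff_or_moe is not None else nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model))

    def forward(self, x):
        B, T, _ = x.shape
        h = self.norm1(x)
        q = self.wq(h).view(B, T, self.n_q, self.hd).transpose(1, 2)
        k = self.wk(h).view(B, T, self.n_kv, self.hd).transpose(1, 2)
        v = self.wv(h).view(B, T, self.n_kv, self.hd).transpose(1, 2)
        k = k.repeat_interleave(self.n_q // self.n_kv, dim=1)   # each KV head serves
        v = v.repeat_interleave(self.n_q // self.n_kv, dim=1)   # n_q // n_kv query heads
        a = F.scaled_dot_product_attention(q, k, v, is_causal=True)   # rotary omitted here
        x = x + self.wo(a.transpose(1, 2).reshape(B, T, -1))
        return x + self.mlp(self.norm2(x))

y = Block()(torch.randn(2, 16, 256))   # (2, 16, 256)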

MoeTransformer
Stacked transformer with shared embeddings. The encode path (token ids to hidden states) and the decode path (hidden states to vocabulary logits) are exposed through tied embedding parameters.
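
A minimal sketch of weight tying between the input embedding and the output projection; the class name and layout are illustrative, not the repository's MoeTransformer.

import torch
import torch.nn as nn

# Illustrative weight-tying sketch, not the repo's MoeTransformer: the embedding
# matrix is reused as the output projection, so encoding (ids -> hidden states)
# and decoding (hidden states -> logits) share one set of parameters.
class TinyLM(nn.Module):
    def __init__(self, vocab_size=1000, d_model=256, blocks=()):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.blocks = nn.ModuleList(blocks)           # e.g. a stack of MoE/dense blocks
        self.norm = nn.LayerNorm(d_model)
        self.head = nn.Linear(d_model, vocab_size, bias=False)
        self.head.weight = self.embed.weight          # tie: one matrix, two roles

    def forward(self, tokens):                        # tokens: (batch, seq) of ids
        x = self.embed(tokens)                        # encode
        for blk in self.blocks:
            x = blk(x)
        return self.head(self.norm(x))                # decode to vocab logits

logits = TinyLM()(torch.randint(0, 1000, (2, 16)))    # (2, 16, 1000)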

Tokenizer
Multiprocess BPE vocabulary builder. Compatible with GPT-style byte-level encoding. Outputs human-readable vocab and merge rules.
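
A toy sketch of the parallel part of BPE vocabulary building: worker processes count adjacent byte pairs in corpus shards and the counts are merged, so the most frequent pair becomes the next merge rule. The regex is a simplified stand-in for the GPT-4 pattern, and none of these names come from the repository.

import re
from collections import Counter
from multiprocessing import Pool

# Toy sketch of parallel BPE pair counting, not the repo's tokenizer. Text is
# pre-segmented with a GPT-style regex, then byte-pair counts from each worker
# are summed; the most frequent pair becomes the next deterministic merge.
SPLIT = re.compile(r"\w+|\s+|[^\w\s]+")   # simplified stand-in for the GPT-4 pattern

def count_pairs(chunk):
    counts = Counter()
    for piece in SPLIT.findall(chunk):
        b = piece.encode("utf-8")
        counts.update(zip(b, b[1:]))      # adjacent byte pairs within one piece
    return counts

if __name__ == "__main__":
    chunks = ["low lower lowest", "new newer newest"]   # stand-in corpus shards
    with Pool(2) as pool:
        total = sum(pool.map(count_pairs, chunks), Counter())
    print(total.most_common(1))           # the pair to merge next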

Usage

Tokenizer

python tokenizer/parallel_bpe.py -w 8 -n 300 -d data_cache/

Saves merges.txt and vocabs.txt under data_cache/.
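
To inspect the output, something like the following could work, assuming merges.txt stores one whitespace-separated pair per line; the actual format is defined by the script.

# Sketch only: peek at the learned merge rules, assuming merges.txt holds one
# whitespace-separated pair per line (check the script for the exact format).
with open("data_cache/merges.txt", encoding="utf-8") as f:
    merges = [tuple(line.split()) for line in f if line.strip()]
print(len(merges), "merges; first few:", merges[:5])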

Install

pip install -r requirements.txt

Notes

  • Top-k expert selection is differentiable and batch-aware
  • Attention uses rotary position encoding on keys and queries (a minimal sketch follows this list)
  • Grouped-query attention reduces KV head duplication
  • BPE uses regex-based segmentation compatible with GPT-4 patterns
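
A minimal rotary position embedding (RoPE) sketch, assuming an even head dimension and interleaved channel pairs; it is not the repository's implementation, which may lay out the rotated channels differently.

import torch

# Minimal rotary position embedding (RoPE) sketch, not the repo's implementation:
# each interleaved pair of channels in q and k is rotated by a position-dependent
# angle, so attention scores depend on relative position.
def rotary(x, base=10000.0):
    # x: (batch, heads, seq, head_dim) with an even head_dim
    *_, t, d = x.shape
    inv_freq = base ** (-torch.arange(0, d, 2, dtype=torch.float32) / d)
    ang = torch.arange(t, dtype=torch.float32)[:, None] * inv_freq   # (seq, d/2)
    cos, sin = ang.cos(), ang.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

q = rotary(torch.randn(1, 8, 16, 64))   # applied to queries...
k = rotary(torch.randn(1, 2, 16, 64))   # ...and keys before computing attention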

Work in Progress

This repo does not yet support end-to-end training. Missing features include:

  • Dataset pipeline and loss function
  • Optimizer and training loop
  • Generation/sampling logic
  • Flash attention or fused kernels

License

MIT
