Super Muon

People complain that Muon can't do tensor parallelism efficiently, because Newton-Schulz orthogonalizes the whole matrix and that normally means gathering the shards first. But here is the thing: you can simply shard the matrix and run Newton-Schulz on the local copies. This repo is adapted from the Kimi toy example.
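As a rough illustration of the idea, here is a minimal NumPy sketch that splits a gradient matrix into row shards and orthogonalizes each shard on its own. It is not the code in optims: the function names, the shard count, and the use of float64 NumPy instead of PyTorch are all assumptions made for the example. The iteration coefficients are the ones used in the public Muon reference code.

```python
import numpy as np

def newton_schulz(G, steps=5, eps=1e-7):
    """Approximately orthogonalize G with the quintic Newton-Schulz iteration."""
    a, b, c = 3.4445, -4.7750, 2.0315  # coefficients from the public Muon reference code
    X = G / (np.linalg.norm(G) + eps)  # Frobenius norm keeps every singular value <= 1
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T  # iterate on the wide orientation so X @ X.T is the small Gram matrix
    for _ in range(steps):
        A = X @ X.T
        B = b * A + c * (A @ A)
        X = a * X + B @ X
    return X.T if transposed else X

def sharded_newton_schulz(G, num_shards, axis=0):
    """Hypothetical helper: orthogonalize each shard independently, no communication."""
    shards = np.array_split(G, num_shards, axis=axis)
    local = [newton_schulz(s) for s in shards]  # in real tensor parallel, one shard per rank
    # Concatenating is only for inspection here; each rank would keep its own piece.
    return np.concatenate(local, axis=axis)

rng = np.random.default_rng(0)
grad = rng.standard_normal((64, 64))  # stand-in for a momentum/gradient matrix
update = sharded_newton_schulz(grad, num_shards=4)
```

Each shard ends up approximately orthogonal on its own, which is not the same as orthogonalizing the full matrix. The results sections are where the repo checks how much that matters in training.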

Run Command

torchrun \
    --rdzv_id=$SLURM_JOB_ID \
    --rdzv-backend=c10d \
    --nnodes=1 \
    --nproc-per-node=1 \
    --rdzv_endpoint=$MASTER_ADDR:$MASTER_PORT \
    moonlight_toy_train.py --model qwen --optimizer sharded_muon --dataset openwebtext-100k --hidden_size 896 --lr 1e-3 --run_name sharded_muon

Preliminary Result

Sharded Muon Results

You can get away with no all-gather

MudamW, because why not

SignMuon, aka Mu-Lion

You can communicate the gradients in binary for the all-gather
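One way to read "in binary" is one sign bit per element. The sketch below packs the signs of a local tensor into bytes before they would be sent, then unpacks them on the receiving side. Whether the repo packs bits exactly this way is an assumption, and `pack_signs` / `unpack_signs` are hypothetical names. The example uses NumPy and leaves out the actual collective call.

```python
import numpy as np

def pack_signs(x):
    """Keep only the sign of each element, stored as 1 bit (zero counts as +1)."""
    bits = (x.ravel() >= 0).astype(np.uint8)
    return np.packbits(bits), x.shape

def unpack_signs(packed, shape):
    """Rebuild a +1/-1 float32 tensor from the packed sign bits."""
    n = int(np.prod(shape))
    bits = np.unpackbits(packed, count=n)  # count drops the padding bits
    return (bits.astype(np.float32) * 2.0 - 1.0).reshape(shape)

rng = np.random.default_rng(0)
local_update = rng.standard_normal((16, 64)).astype(np.float32)  # stand-in for one rank's shard

packed, shape = pack_signs(local_update)   # this byte buffer is what would be all-gathered
restored = unpack_signs(packed, shape)
```

For a float32 tensor this cuts the payload by 32x: 1024 elements take 128 bytes instead of 4096.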

kyleliang919/Super_Muon

Folders and files

Latest commit

History

Repository files navigation

Super Muon

Run Command

Preliminary Result

Sharded Muon Results

MudamW, because why not

SignMuon Aka Mu-Lion

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages