kyleliang919/Super_Muon

Super Muon

People complain that Muon cannot do tensor parallelism efficiently. But here is the thing: you can simply shard the weight matrix and run Newton-Schulz on the local copies. This repo is adapted from the Kimi toy example.
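To make the idea concrete, here is a minimal single-process sketch of running Newton-Schulz per shard. It is not the code from this repo: the function names `newton_schulz` and `sharded_newton_schulz`, the choice to split along rows, and the use of `torch.chunk` to stand in for tensor-parallel ranks are all assumptions made for illustration. The quintic coefficients are the ones commonly used in public Muon implementations.

```python
import torch


def newton_schulz(G: torch.Tensor, steps: int = 5, eps: float = 1e-7) -> torch.Tensor:
    """Approximately orthogonalize a 2-D matrix with a quintic Newton-Schulz iteration."""
    a, b, c = 3.4445, -4.7750, 2.0315  # coefficients commonly used for Muon
    X = G / (G.norm() + eps)  # scale so every singular value is at most 1
    transposed = X.size(0) > X.size(1)
    if transposed:
        X = X.T  # iterate on the wide orientation so X @ X.T is the small Gram matrix
    for _ in range(steps):
        A = X @ X.T
        B = b * A + c * (A @ A)
        X = a * X + B @ X
    return X.T if transposed else X


def sharded_newton_schulz(G: torch.Tensor, num_shards: int, dim: int = 0) -> torch.Tensor:
    """Hypothetical helper: orthogonalize each shard independently.

    Each chunk plays the role of the slice one tensor-parallel rank would hold,
    so no rank ever needs the full matrix (no all-gather).
    """
    shards = torch.chunk(G, num_shards, dim=dim)
    return torch.cat([newton_schulz(s) for s in shards], dim=dim)


torch.manual_seed(0)
grad = torch.randn(64, 128)  # stand-in for a momentum or gradient matrix
update = sharded_newton_schulz(grad, num_shards=4)
print(update.shape)
```

Note that orthogonalizing each shard separately is not the same as orthogonalizing the full matrix and then slicing it; the sketch only shows what "Newton-Schulz on the local copies" computes.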

Run Command

torchrun \
    --rdzv_id=$SLURM_JOB_ID \
    --rdzv-backend=c10d \
    --nnodes=1 \
    --nproc-per-node=1 \
    --rdzv_endpoint=$MASTER_ADDR:$MASTER_PORT \
    moonlight_toy_train.py --model qwen --optimizer sharded_muon --dataset openwebtext-100k --hidden_size 896 --lr 1e-3 --run_name sharded_muon

Preliminary Result

Sharded Muon Results

You can get away with no all-gather.

[Figure: result]

MudamW, because why not

[Figure: result]

SignMuon, aka Mu-Lion

You can communicate the gradient in binary for the all-gather.

[Figure: result]
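As a rough illustration of what sending a tensor "in binary" could look like, the sketch below packs the sign of each element into one bit, so the payload is 1/32 the size of a float32 tensor. This is an assumption about the general idea, not the repo's implementation: `pack_signs` and `unpack_signs` are hypothetical helpers, zeros are mapped to +1, and the actual collective call is left out.

```python
import math

import torch


def pack_signs(x: torch.Tensor) -> torch.Tensor:
    """Pack sign bits (1 for x >= 0, 0 for x < 0) into a uint8 tensor, 8 signs per byte."""
    bits = (x.flatten() >= 0).to(torch.int64)
    pad = (-bits.numel()) % 8  # pad with zero bits up to a multiple of 8
    bits = torch.cat([bits, bits.new_zeros(pad)])
    weights = torch.tensor([1, 2, 4, 8, 16, 32, 64, 128], dtype=torch.int64)
    return (bits.view(-1, 8) * weights).sum(dim=1).to(torch.uint8)


def unpack_signs(packed: torch.Tensor, shape) -> torch.Tensor:
    """Recover a +1/-1 float tensor of the given shape from packed sign bits."""
    shifts = torch.arange(8, dtype=torch.int64)
    bits = (packed.to(torch.int64).unsqueeze(1) >> shifts) & 1
    numel = math.prod(shape)
    signs = bits.flatten()[:numel].to(torch.float32) * 2 - 1
    return signs.view(shape)


g = torch.randn(4, 6)
payload = pack_signs(g)            # this is what would be sent over the wire
restored = unpack_signs(payload, g.shape)
print(payload.numel(), "bytes for", g.numel(), "elements")
```

In a distributed run, `payload` would be the tensor handed to the all-gather, and each rank would call `unpack_signs` on what it receives.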
