Open-source framework for simulating distributed training methods. Instead of training across multiple physical ranks, we simulate the distributed training process by running multiple nodes on a single machine.
- CPU
- CUDA
- MPS (copy operations are CPU-bound)
Install with core dependencies only:

```shell
pip install --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ exogym
```
Optional extras are available: `wandb`, `gpt`, `demo`, `examples`, `all`, `dev`. For example:

```shell
pip install exogym[demo]
```
To install for development:

```shell
git clone https://github.com/exo-explore/gym.git exogym
cd exogym
pip install -e ".[dev]"
```
MNIST comparison of DDP, DiLoCo, and SPARTA:

```shell
python run/mnist.py
```

NanoGPT Shakespeare DiLoCo:

```shell
python run/nanogpt_diloco.py --dataset shakespeare
```
```python
from exogym import LocalTrainer
from exogym.strategy import DiLoCoStrategy

train_dataset, val_dataset = ...
model = ...  # model.forward() expects a batch, and returns a scalar loss

trainer = LocalTrainer(model, train_dataset, val_dataset)

# Strategy for optimization & communication
strategy = DiLoCoStrategy(
    inner_optim='adam',
    H=100
)

trainer.fit(
    strategy=strategy,
    num_nodes=4,
    device='mps'
)
```
- `Trainer`: Builds the simulation environment. The `Trainer` spawns multiple `TrainNode` instances, connects them together, and starts the training run.
- `TrainNode`: A single node (rank) running its own training loop. At each train step, instead of calling `optim.step()`, it calls `strategy.step()`.
- `Strategy`: Abstract class for an optimization strategy, which defines both how the nodes communicate with each other and how model weights are updated. Typically, a gradient strategy will include an optimizer as well as a communication step. Sometimes (e.g. DeMo), the optimizer step is commingled with the communication.
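To make the division of labor concrete, here is a hedged, self-contained sketch of the DiLoCo-style pattern a strategy implements: each node takes local optimizer steps, and every `H` steps the nodes synchronize. This is not the actual `Strategy` API; the names, the toy loss, and the plain parameter averaging (real DiLoCo applies an outer optimizer to the pseudo-gradient) are all illustrative simplifications:

```python
import torch

H = 5          # communicate every H local steps
NUM_NODES = 2
STEPS = 20

# Each simulated node holds its own copy of a single scalar parameter.
params = [torch.tensor([1.0 * (rank + 1)], requires_grad=True)
          for rank in range(NUM_NODES)]
optims = [torch.optim.SGD([p], lr=0.1) for p in params]

def loss_fn(p):
    # Toy quadratic loss, minimized at p = 0.
    return (p ** 2).sum()

for step in range(1, STEPS + 1):
    # Inner step: each node optimizes locally, no communication.
    for p, opt in zip(params, optims):
        opt.zero_grad()
        loss_fn(p).backward()
        opt.step()
    # Every H steps, average parameters across nodes (the "communication").
    if step % H == 0:
        with torch.no_grad():
            mean = torch.stack([p.data for p in params]).mean(dim=0)
            for p in params:
                p.data.copy_(mean)

# After a sync step, all node copies agree.
assert all(torch.allclose(p.data, params[0].data) for p in params)
```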
EXO Gym uses PyTorch multiprocessing to spawn a subprocess per node; the subprocesses communicate with each other using regular collective operations such as `all_reduce`.
The model is expected in a form that takes a `batch` (the same format as the `dataset` outputs) and returns a scalar loss over the entire batch. This keeps the training loop agnostic to the format of the data (e.g. masked LM training doesn't have a clear `x`/`y` split).
Recall that when we call `trainer.fit()`, a single `Dataset` is instantiated. The `dataset` object is passed to every subprocess, and a `DistributedSampler` selects which datapoints are sampled per node (so that each datapoint is assigned to exactly one node). If the dataset is entirely loaded into memory, this memory will be duplicated per node - be careful not to run out of memory! Larger datasets should be lazily loaded.
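To see how `DistributedSampler` partitions datapoints across ranks, here is a small standalone check with a toy in-memory dataset (the dataset class and sizes are illustrative):

```python
from torch.utils.data import Dataset, DistributedSampler

class RangeDataset(Dataset):
    """Toy dataset whose items are just their own indices."""

    def __len__(self):
        return 8

    def __getitem__(self, idx):
        return idx

dataset = RangeDataset()
num_nodes = 2

# One sampler per simulated rank; shuffle=False makes the split visible.
shards = [
    list(DistributedSampler(dataset, num_replicas=num_nodes, rank=r, shuffle=False))
    for r in range(num_nodes)
]

# Between them, the ranks cover each datapoint exactly once.
assert sorted(shards[0] + shards[1]) == list(range(8))
```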
Alternatively, in place of the dataset object, pass a function with the following signature:

```python
def dataset_factory(rank: int, num_nodes: int, train_dataset: bool) -> torch.utils.data.Dataset
```

This function is called within each rank to build that rank's dataset. Instead of every node storing the whole dataset and subsampling datapoints, each node loads only the datapoints it needs.
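A minimal sketch of such a factory, assuming an interleaved shard per rank (the dataset class, corpus sizes, and sharding scheme are illustrative assumptions):

```python
import torch
from torch.utils.data import Dataset

class ShardDataset(Dataset):
    """Holds only the datapoints this rank is responsible for."""

    def __init__(self, rank: int, num_nodes: int, train: bool):
        total = 1000 if train else 100  # pretend corpus size
        # Interleaved sharding; contiguous chunks would work equally well.
        self.indices = list(range(rank, total, num_nodes))

    def __len__(self):
        return len(self.indices)

    def __getitem__(self, i):
        # In practice, read item self.indices[i] lazily from disk / memmap
        # here, so the full dataset is never resident in one process.
        return torch.tensor(self.indices[i])

def dataset_factory(rank: int, num_nodes: int, train_dataset: bool) -> Dataset:
    return ShardDataset(rank, num_nodes, train_dataset)

ds0 = dataset_factory(0, 4, True)
ds1 = dataset_factory(1, 4, True)
assert len(ds0) == 250          # each of 4 ranks holds a quarter
assert ds0[0].item() == 0 and ds1[0].item() == 1
```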