Mistral: A strong and cool northwesterly wind that builds as it moves, bringing good health and clear skies.
A framework for transparent and accessible large-scale language model training, built with Hugging Face 🤗. It includes tools and helpful scripts for incorporating new pre-training datasets, various schemes for single-node and distributed training (including on cloud providers such as GCP), and, importantly, scripts for evaluation.
Visit our Read the Docs for the full documentation.
A Propulsion Endeavor 🚀
Mistral is built to facilitate transparent and accessible training. To help reach this goal, we will hold community meetings twice a month, where we'll give updates on where we're at and what we're working on and, more importantly, hear from you about how we can help and possibly work together.
We would love for folks from academia, other community efforts, and industry to join - all are welcome. The first meeting will be on Monday, August 30th at 4 PM PT.
We'll post future dates (and times - which we hope to vary throughout the day to engage folks across time zones) after the first meeting!
The dependencies for Mistral can be installed using Conda. Note that the provided environment assumes CUDA 11.0 is installed; you may need to adjust the environment YAML file depending on your setup.
git clone https://github.com/stanford-crfm/mistral.git
cd mistral
conda env create -f environments/environment-gpu.yaml # Choose CUDA kernel based on the hardware!
If you are training on the CPU only, run conda env create -f environments/environment-cpu.yaml instead.
First, make sure to update conf/tutorial-gpt2-micro.yaml with the directories where you want to store the Hugging Face cache and model runs.
# Artifacts & Caching
artifacts:
cache_dir: /path/to/artifacts
run_dir: /path/to/runs
Next, make sure that /path/to/mistral is on your PYTHONPATH.
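If you prefer to handle this from within Python (for example, in a notebook or script) rather than via the shell environment, a minimal sketch - the path below is a placeholder for your local clone:

import sys

# Placeholder path: point this at your local clone of the repository.
MISTRAL_ROOT = "/path/to/mistral"
if MISTRAL_ROOT not in sys.path:
    sys.path.insert(0, MISTRAL_ROOT)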
For single-node, single-GPU training, run:
conda activate mistral
cd mistral
CUDA_VISIBLE_DEVICES=0 python train.py --config conf/tutorial-gpt2-micro.yaml --nnodes 1 --nproc_per_node 1 --training_arguments.fp16 true --training_arguments.per_device_train_batch_size 2 --run_id tutorial-gpt2-micro
Modify /job/hostfile in the following way:
<Hostname of first machine> slots=<Number of GPUs>
<Hostname of second machine> slots=<Number of GPUs>
...
<Hostname of the nth machine> slots=<Number of GPUs>
Below is an example hostfile where we train on machine1 and machine2 with 8 GPUs each:
machine1 slots=8
machine2 slots=8
To start distributed training, run:
conda activate mistral
cd mistral
deepspeed --num_gpus 8 --num_nodes 2 --master_addr machine1 train.py --config conf/tutorial-gpt2-micro.yaml --nnodes 2 --nproc_per_node 8 --training_arguments.fp16 true --training_arguments.per_device_train_batch_size 4 --training_arguments.deepspeed conf/deepspeed/z1-conf.json --run_id tutorial-gpt2-micro-multi-node > tutorial-gpt2-micro-multi-node.out 2> tutorial-gpt2-micro-multi-node.err
Note: You may need to adjust your batch size depending on the memory capacity of your GPUs.
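As a rough guide, the global batch size grows with the number of devices. A small illustrative calculation (not a Mistral API; it assumes no gradient accumulation unless you configure it):

# Illustrative only: relate the per-device batch size to the global batch size.
def global_batch_size(per_device: int, gpus_per_node: int, nodes: int,
                      grad_accum_steps: int = 1) -> int:
    return per_device * gpus_per_node * nodes * grad_accum_steps

# Single-node command above: 2 sequences/device x 1 GPU x 1 node = 2 per step.
print(global_batch_size(per_device=2, gpus_per_node=1, nodes=1))

# Multi-node command above: 4 sequences/device x 8 GPUs x 2 nodes = 64 per step.
print(global_batch_size(per_device=4, gpus_per_node=8, nodes=2))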
If you are interested in training a model on Google Cloud, check out our Google Cloud + Kubernetes Tutorial.
Model checkpoints will be stored in the directory specified by artifacts.run_dir. An example checkpoint might be in /path/to/runs/tutorial-gpt2-micro/checkpoint-1000.
Mistral stores model checkpoints in the Hugging Face format, so models can be loaded and used in the same manner as if one had trained the model with Hugging Face.
For instance, to generate text with 🤗 Transformers (you will need to clone the transformers repo):
conda activate mistral
cd transformers/examples/text-generation
python run_generation.py --model_type=gpt2 --model_name_or_path=/path/to/runs/tutorial-gpt2-micro/checkpoint-1000
Or, to load the model in Python code (make sure /path/to/mistral is on your PYTHONPATH):
from src.models.mistral_gpt2 import MistralGPT2LMHeadModel
model = MistralGPT2LMHeadModel.from_pretrained("/path/to/runs/tutorial-gpt2-micro/checkpoint-1000")
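Once loaded, the model can be used like any other Hugging Face GPT-2 model. A minimal generation sketch, assuming the checkpoint uses the standard GPT-2 vocabulary so the stock tokenizer can be paired with it:

from transformers import GPT2TokenizerFast

# `model` is the MistralGPT2LMHeadModel loaded in the snippet above.
# Assumption: the standard GPT-2 tokenizer matches the checkpoint's vocabulary.
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

inputs = tokenizer("Hello world, my name is", return_tensors="pt")
outputs = model.generate(**inputs, max_length=50, do_sample=True, top_p=0.9)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))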
The Propulsion team has trained 5 GPT-2 Medium models and 5 GPT-2 Small models on the OpenWebText corpus, as found in 🤗 datasets.
Checkpoints can be loaded as Hugging Face models. For each model, we provide checkpoints at 100k, 200k, 300k and 400k steps.
We have also stored over 600 checkpoints for each model, subject to the following checkpoint schedule:
- Every 10 Steps, for the first 0 - 100 Steps.
- Every 50 Steps, from 100 - 2000 Steps.
- Every 100 Steps, from 2000 - 20,000 Steps.
- Every 1000 Steps, from 20,000 - 400,000 Steps.
This comes out to 610 checkpoints per run, taking up ~22TB for all 10 models (making it pretty expensive to host!). If you are interested in acquiring these additional checkpoints, please file an issue or contact Laurel (lorr1) and Sidd (skaramcheti) at their @cs.stanford.edu email addresses, and we'll be happy to figure out a cost-effective way to share them.
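Because the released checkpoints are in the standard Hugging Face format, a downloaded checkpoint directory can also be loaded with the stock 🤗 Transformers classes, without the Mistral source tree. A minimal sketch (the local path is a placeholder for wherever you extracted a checkpoint):

from transformers import GPT2LMHeadModel

# Placeholder path: point this at an extracted checkpoint directory from the tables below.
model = GPT2LMHeadModel.from_pretrained("/path/to/downloaded/checkpoint-400000")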
GPT-2 Medium
Run | Type | Checkpoint (steps) | Size | Link |
---|---|---|---|---|
Arwen | GPT-2 Medium | 400000 | 4.9G | download |
Arwen | GPT-2 Medium | 300000 | 4.9G | download |
Arwen | GPT-2 Medium | 200000 | 4.9G | download |
Arwen | GPT-2 Medium | 100000 | 4.9G | download |
Beren | GPT-2 Medium | 400000 | 4.9G | download |
Beren | GPT-2 Medium | 300000 | 4.9G | download |
Beren | GPT-2 Medium | 200000 | 4.9G | download |
Beren | GPT-2 Medium | 100000 | 4.9G | download |
Celebrimbor | GPT-2 Medium | 400000 | 4.9G | download |
Celebrimbor | GPT-2 Medium | 300000 | 4.9G | download |
Celebrimbor | GPT-2 Medium | 200000 | 4.9G | download |
Celebrimbor | GPT-2 Medium | 100000 | 4.9G | download |
Durin | GPT-2 Medium | 400000 | 4.9G | download |
Durin | GPT-2 Medium | 300000 | 4.9G | download |
Durin | GPT-2 Medium | 200000 | 4.9G | download |
Durin | GPT-2 Medium | 100000 | 4.9G | download |
Eowyn | GPT-2 Medium | 400000 | 4.9G | download |
Eowyn | GPT-2 Medium | 300000 | 4.9G | download |
Eowyn | GPT-2 Medium | 200000 | 4.9G | download |
Eowyn | GPT-2 Medium | 100000 | 4.9G | download |
GPT-2 Small
Run | Type | Checkpoint (steps) | Size | Link |
---|---|---|---|---|
Alias | GPT-2 Small | 400000 | 1.8G | download |
Alias | GPT-2 Small | 300000 | 1.8G | download |
Alias | GPT-2 Small | 200000 | 1.8G | download |
Alias | GPT-2 Small | 100000 | 1.8G | download |
Battlestar | GPT-2 Small | 400000 | 1.8G | download |
Battlestar | GPT-2 Small | 300000 | 1.8G | download |
Battlestar | GPT-2 Small | 200000 | 1.8G | download |
Battlestar | GPT-2 Small | 100000 | 1.8G | download |
Caprica | GPT-2 Small | 400000 | 1.8G | download |
Caprica | GPT-2 Small | 300000 | 1.8G | download |
Caprica | GPT-2 Small | 200000 | 1.8G | download |
Caprica | GPT-2 Small | 100000 | 1.8G | download |
Darkmatter | GPT-2 Small | 400000 | 1.8G | download |
Darkmatter | GPT-2 Small | 300000 | 1.8G | download |
Darkmatter | GPT-2 Small | 200000 | 1.8G | download |
Darkmatter | GPT-2 Small | 100000 | 1.8G | download |
Expanse | GPT-2 Small | 400000 | 1.8G | download |
Expanse | GPT-2 Small | 300000 | 1.8G | download |
Expanse | GPT-2 Small | 200000 | 1.8G | download |
Expanse | GPT-2 Small | 100000 | 1.8G | download |
To ask questions, report issues, or request features, please use the GitHub Issue Tracker. Before creating a new issue, please make sure to search for existing issues that may solve your problem.
Please see the following page for information on contributing.