For context, read this blog article: Training a Large Language Model With Metaflow, Featuring Dolly
This repository trains Dolly, a large language model recently announced by Databricks Labs.
Please visit the original repository to learn more about Dolly's origins, and fair use.
The main contributions of this repository are:
- Reproduce the Dolly training process on Outerbounds platform.
- A Metaflow flow that runs Dolly training on multiple GPUs using Outerbounds platform, or your own Metaflow deployment.
- A
@gpu_profile
decorator that you can reuse for any Metaflow task, to monitor GPU utilization. - A Streamlit app that lets you interact with different versions of Dolly when testing.
Dolly - and therefore this repository - is intended exclusively for research purposes and is not licensed for commercial use due to its dependency on the Alpaca Dataset. You will need to create your own instruction tuning dataset to use this repository for commercial applications.
This code should be run on a GPU. We tested it in two environments:
- AWS
p3dn.24xlarge
EC2 instance with 8 NVIDIA V100 GPUs- Deep Learning AMI (Ubuntu 20.04) with id
ami-0a39ed2b865d65970
(release notes here) - smaller
p3
instances worked too, but less reliably and efficiently
- Deep Learning AMI (Ubuntu 20.04) with id
- Coreweave node with 3 NVIDIA A100 GPUs
- Ubuntu 22.04
- NVIDIA driver version 515.105.01
- CUDA Version 11.7
In general, we found it is best to have at least 3 A100 GPUs. You will also need a large amount of CPU memory, as the training process requires significant RAM as deepspeed shares the model state across GPUs.
python -m venv env
source env/bin/activate
pip install -r requirements.txt
If you have access to the Outerbounds platform, install the outerbounds
package to connect to your organization's deployment.
pip install -U outerbounds
After installing outerbounds
, find and run the command like outerbounds configure <YOUR KEY>
in your platform onboarding documentation.
If you do not have access to the Outerbounds platform and want to run on a Metaflow deployment you manage, you can install open-source Metaflow normally (or add it to the requirements.txt
file).
pip install metaflow
To get started with your own deployment, follow our guides for engineers and/or reach out in our community Slack for help.
python train_dolly.py run
python train_dolly.py card view train
If you want to look at where this information comes from, you can look at the my_decorators.py
file, which defines the @gpu_profile
decorator.
This decorator currently assumes that you have nvidia-smi
installed on the machine where the train step runs, and therefore that you are running on NVIDIA GPUs.
Now you can make a prediction using the trained model. To do this, you can run the app.py
Streamlit app.
This will launch a web app that allows you to interact with Dolly.
streamlit run app.py
Although you can run the above streamlit app locally if you have the GPUs to make inference times reasonable, you may want to run it on a remote instance.
To set up the Streamlit server on a remote instance, and interact with it from your laptop, you can:
- Set up a remote instance, such as on AWS EC2. Similar to during model training, you will want to select an instance with GPUs, such as a
p3.16xlarge
instance on AWS. As during training, we use theami-0a39ed2b865d65970
deep learning AMI. Add a security role allowing your IP address to read from TCP port 8501, which is where Streamlit runs. - Make a
ssh
connection to your EC2 instance. - Install GitHub CLI tool by copy and pasting this in the terminal.
type -p curl >/dev/null || (sudo apt update && sudo apt install curl -y)
curl -fsSL https://cli.github.com/packages/githubcli-archive-keyring.gpg | sudo dd of=/usr/share/keyrings/githubcli-archive-keyring.gpg \
&& sudo chmod go+r /usr/share/keyrings/githubcli-archive-keyring.gpg \
&& echo "deb [arch=$(dpkg --print-architecture) signed-by=/usr/share/keyrings/githubcli-archive-keyring.gpg] https://cli.github.com/packages stable main" | sudo tee /etc/apt/sources.list.d/github-cli.list > /dev/null \
&& sudo apt update \
&& sudo apt install gh -y
- Log in with
gh auth login
- Clone this repository with
gh repo clone outerbounds/dolly-ops && cd dolly-ops
- Create an environment with
mamba create -n dolly python=3.9 -y && mamba init && source ~/.bashrc && mamba activate dolly
- Install the
requirements.txt
file withpip install -r requirements.txt
- Install Metaflow with
pip install metaflow
if you are not on Outerbounds platform. If you are on Outerbounds platform, install Metaflow withpip install -U outerbounds
and then use yourouterbounds configure <YOUR KEY>
to connect to your organization's deployment. Make sure your Metaflow config matches the one used during training. - Run the Streamlit app with
streamlit run app.py
. - From your a web browser on your laptop, open the
External URL
that is printed in the terminal. Then you can interact with the models. Note it takes a few minutes to download the models the first time you try to load each one, since the models are 10s of GBs.