This repository contains the tokenizer training code. Code for other aspects of the project (e.g. evals, model scaling, data processing, wandb, train configs) will be added soon!
First, clone the project with:

```bash
git clone --recurse-submodules https://github.com/PythonNut/superbpe.git
```
We use a custom fork of huggingface/tokenizers, which conflicts with the original. Because of this, we recommend always installing this project in its own virtual environment.

If you use conda:

```bash
conda create -n superbpe python=3.12 rust
conda activate superbpe
pip install -r requirements.txt
```
If you use venv instead, you will need to install Rust and Python 3.12 yourself. Then, you can do:

```bash
python3.12 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```
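After installing, a quick sanity check that the tokenizers package (i.e. the custom fork) imports cleanly from your environment is the following; the exact version string you see depends on the fork:

```python
# Sanity check: the custom tokenizers fork should import cleanly.
import tokenizers

print(tokenizers.__version__)
```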
Our tokenizer training data is available on HuggingFace at [UW/olmo-mix-1124-subset-p99](https://huggingface.co/datasets/UW/olmo-mix-1124-subset-p99). You can download it using huggingface-cli (after logging into your HuggingFace account):

```bash
mkdir olmo-mix-1124-subset-p99
cd olmo-mix-1124-subset-p99
huggingface-cli download UW/olmo-mix-1124-subset-p99 --repo-type dataset --local-dir .
```
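If you prefer doing this from Python, the huggingface_hub library offers an equivalent; this sketch assumes you are already authenticated (e.g. via huggingface-cli login):

```python
from huggingface_hub import snapshot_download

# Download the full dataset snapshot into a local directory.
snapshot_download(
    repo_id="UW/olmo-mix-1124-subset-p99",
    repo_type="dataset",
    local_dir="olmo-mix-1124-subset-p99",
)
```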
Training a SuperBPE tokenizer involves two stages:
- Stage 1: Learn subwords by enforcing whitespace pretokenization (equivalent to regular BPE training).

```bash
python -m train_tokenizer \
    --output_dir tokenizers/olmo2_bpe \
    --corpus_dir olmo-mix-1124-subset-p99/train \
    --num_bytes $((10**10)) \
    --vocab_size 200000 \
    --do_whitespace_pretokenization true
```
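Once stage 1 finishes, you can sanity-check the resulting BPE tokenizer before moving on. This is a minimal sketch, assuming the training script writes a tokenizer.json into the `--output_dir`:

```python
from tokenizers import Tokenizer

# Load the stage-1 (regular BPE) tokenizer and encode a test sentence.
tok = Tokenizer.from_file("tokenizers/olmo2_bpe/tokenizer.json")
enc = tok.encode("The quick brown fox jumps over the lazy dog.")
# With whitespace pretokenization enforced, no token should cross a space.
print(enc.tokens)
```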
- Stage 2: Learn superwords by resuming tokenizer training, but this time skipping the whitespace pretokenization step.

```bash
orig_tokenizer_dir=tokenizers/olmo2_bpe
num_inherit_merges=180000
output_dir=tokenizers/olmo2_superbpe

mkdir -p $output_dir

# Inherit the first $num_inherit_merges merges from the BPE tokenizer.
head -n $num_inherit_merges $orig_tokenizer_dir/merges.txt > $output_dir/merges.txt

# meta.json records the training files used in stage 1, so stage 2 trains on the same data.
cp $orig_tokenizer_dir/meta.json $output_dir/meta.json

python -m train_tokenizer \
    --output_dir $output_dir \
    --vocab_size 200000 \
    --do_whitespace_pretokenization false
```
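To see what stage 2 bought you, compare the two tokenizers side by side: the SuperBPE tokenizer should produce superwords that span whitespace, and typically fewer tokens for the same text. As above, this assumes each output directory contains a tokenizer.json:

```python
from tokenizers import Tokenizer

bpe = Tokenizer.from_file("tokenizers/olmo2_bpe/tokenizer.json")
superbpe = Tokenizer.from_file("tokenizers/olmo2_superbpe/tokenizer.json")

text = "by the way, the quick brown fox jumps over the lazy dog"
for name, tok in [("BPE", bpe), ("SuperBPE", superbpe)]:
    enc = tok.encode(text)
    # SuperBPE should merge common multi-word sequences into single tokens.
    print(f"{name}: {len(enc.ids)} tokens -> {enc.tokens}")
```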
After tokenizer training, you need to update the `decoder` field in `tokenizer.json` to make sure it looks like this:

```json
"decoder": {
  "type": "ByteLevel",
  "add_prefix_space": true,
  "trim_offsets": true,
  "use_regex": true
}
```
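One way to apply this edit is with a small script using the standard json module (adjust the path to point at your tokenizer directory):

```python
import json

path = "tokenizers/olmo2_superbpe/tokenizer.json"  # adjust as needed
with open(path) as f:
    config = json.load(f)

# Overwrite the decoder with the required ByteLevel settings.
config["decoder"] = {
    "type": "ByteLevel",
    "add_prefix_space": True,
    "trim_offsets": True,
    "use_regex": True,
}

with open(path, "w") as f:
    json.dump(config, f, indent=2, ensure_ascii=False)
```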