This repository contains the tokenizer training code. Code for other aspects of the project (e.g. evals, model scaling, data processing, wandb, train configs) will be added soon!
First, clone the project with:

```bash
git clone --recurse-submodules https://github.com/PythonNut/superbpe.git
```
We use a custom fork of huggingface/tokenizers, which conflicts with the original. Because of this, we recommend always installing this project in its own virtual environment.

If you use conda:

```bash
conda create -n superbpe python=3.12 rust
conda activate superbpe
pip install -r requirements.txt
```
If you use venv instead, you will need to install Rust and Python 3.12 yourself. Then, you can do:

```bash
python3.12 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```
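After installing, a quick sanity check that the tokenizers package (i.e. the custom fork) imports cleanly from your environment is the following; the exact version string you see depends on the fork:

```python
# Sanity check: the custom tokenizers fork should import cleanly.
import tokenizers

print(tokenizers.__version__)
```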
Our tokenizer training data is available on HuggingFace at [UW/olmo-mix-1124-subset-p99](https://huggingface.co/datasets/UW/olmo-mix-1124-subset-p99). You can download it using huggingface-cli (after logging into your HuggingFace account):

```bash
mkdir olmo-mix-1124-subset-p99
cd olmo-mix-1124-subset-p99
huggingface-cli download UW/olmo-mix-1124-subset-p99 --repo-type dataset --local-dir .
```
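If you prefer doing this from Python, the huggingface_hub library offers an equivalent; this sketch assumes you are already authenticated (e.g. via huggingface-cli login):

```python
from huggingface_hub import snapshot_download

# Download the full dataset snapshot into a local directory.
snapshot_download(
    repo_id="UW/olmo-mix-1124-subset-p99",
    repo_type="dataset",
    local_dir="olmo-mix-1124-subset-p99",
)
```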
Training a SuperBPE tokenizer involves two stages:
- Stage 1: Learn subwords by enforcing whitespace pretokenization (equivalent to regular BPE training).

```bash
python -m train_tokenizer \
    --output_dir tokenizers/olmo2_bpe \
    --corpus_dir olmo-mix-1124-subset-p99/train \
    --num_bytes $((10**10)) \
    --vocab_size 200000 \
    --do_whitespace_pretokenization true
```
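Once stage 1 finishes, you can sanity-check the resulting BPE tokenizer before moving on. This is a minimal sketch, assuming the training script writes a tokenizer.json into the `--output_dir`:

```python
from tokenizers import Tokenizer

# Load the stage-1 (regular BPE) tokenizer and encode a test sentence.
tok = Tokenizer.from_file("tokenizers/olmo2_bpe/tokenizer.json")
enc = tok.encode("The quick brown fox jumps over the lazy dog.")
# With whitespace pretokenization enforced, no token should cross a space.
print(enc.tokens)
```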
- Stage 2: Learn superwords by resuming tokenizer training, but this time skipping the whitespace pretokenization step.

```bash
orig_tokenizer_dir=tokenizers/olmo2_bpe
num_inherit_merges=180000
output_dir=tokenizers/olmo2_superbpe

mkdir -p $output_dir

# Inherit the first $num_inherit_merges merges from the BPE tokenizer.
head -n $num_inherit_merges $orig_tokenizer_dir/merges.txt > $output_dir/merges.txt

# meta.json records the training files used in stage 1, so stage 2 trains on the same data.
cp $orig_tokenizer_dir/meta.json $output_dir/meta.json

python -m train_tokenizer \
    --output_dir $output_dir \
    --vocab_size 200000 \
    --do_whitespace_pretokenization false
```
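To see what stage 2 bought you, compare the two tokenizers side by side: the SuperBPE tokenizer should produce superwords that span whitespace, and typically fewer tokens for the same text. As above, this assumes each output directory contains a tokenizer.json:

```python
from tokenizers import Tokenizer

bpe = Tokenizer.from_file("tokenizers/olmo2_bpe/tokenizer.json")
superbpe = Tokenizer.from_file("tokenizers/olmo2_superbpe/tokenizer.json")

text = "by the way, the quick brown fox jumps over the lazy dog"
for name, tok in [("BPE", bpe), ("SuperBPE", superbpe)]:
    enc = tok.encode(text)
    # SuperBPE should merge common multi-word sequences into single tokens.
    print(f"{name}: {len(enc.ids)} tokens -> {enc.tokens}")
```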
After tokenizer training, you need to update the `decoder` field in `tokenizer.json` to make sure it looks like this:

```json
"decoder": {
  "type": "ByteLevel",
  "add_prefix_space": true,
  "trim_offsets": true,
  "use_regex": true
}
```
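One way to apply this edit is with a small script using the standard json module (adjust the path to point at your tokenizer directory):

```python
import json

path = "tokenizers/olmo2_superbpe/tokenizer.json"  # adjust as needed
with open(path) as f:
    config = json.load(f)

# Overwrite the decoder with the required ByteLevel settings.
config["decoder"] = {
    "type": "ByteLevel",
    "add_prefix_space": True,
    "trim_offsets": True,
    "use_regex": True,
}

with open(path, "w") as f:
    json.dump(config, f, indent=2, ensure_ascii=False)
```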