
SuperBPE: Space Travel for Language Models


This repository contains the tokenizer training code. Code for other aspects of the project (e.g. evals, model scaling, data processing, wandb, train configs) will be added soon!

Setup

First, clone the project with:

git clone --recurse-submodules https://github.com/PythonNut/superbpe.git

We use a custom fork of huggingface/tokenizers which conflicts with the original. Because of this, we recommend always installing this project in its own virtual environment.

Setup virtual environment

Using conda

conda create -n superbpe python=3.12 rust
conda activate superbpe
pip install -r requirements.txt

Using venv

You will need to install Rust and Python 3.12. Then, you can do:

python3.12 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
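
Whichever environment you use, you can check that the custom tokenizers fork from this repository (rather than an upstream install elsewhere on your system) is the build Python resolves. The snippet below is a minimal sketch that only prints the installed version and location:

# Sanity check (sketch): confirm which `tokenizers` build the active
# environment imports. It should resolve from this project's virtual
# environment, not from a system-wide upstream install.
import tokenizers

print("tokenizers version:", tokenizers.__version__)
print("installed at:", tokenizers.__file__)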

Data download

Our tokenizer training data is available on HuggingFace as the UW/olmo-mix-1124-subset-p99 dataset. You can download it using huggingface-cli (after logging into your HuggingFace account) with:

mkdir olmo-mix-1124-subset-p99
cd olmo-mix-1124-subset-p99
huggingface-cli download UW/olmo-mix-1124-subset-p99 --repo-type dataset --local-dir .
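
If you prefer to download from Python instead of the CLI, the equivalent call through huggingface_hub is sketched below. It targets the same dataset repo as the command above; the local directory name simply matches the path used by the training commands later.

# Sketch: download the training data via huggingface_hub instead of the CLI.
# Assumes you are already logged into your HuggingFace account.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="UW/olmo-mix-1124-subset-p99",
    repo_type="dataset",
    local_dir="olmo-mix-1124-subset-p99",
)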

Tokenizer training

Training a SuperBPE tokenizer involves two stages:

  1. Stage 1: Learn subwords by enforcing whitespace pretokenization (equivalent to regular BPE training).
python -m train_tokenizer \
    --output_dir tokenizers/olmo2_bpe \
    --corpus_dir olmo-mix-1124-subset-p99/train \
    --num_bytes $((10**10)) \
    --vocab_size 200000 \
    --do_whitespace_pretokenization true
  2. Stage 2: Learn superwords by resuming tokenizer training, but this time skip the whitespace pretokenization step (a quick sanity check of the resulting tokenizer is sketched after these commands).
orig_tokenizer_dir=tokenizers/olmo2_bpe
num_inherit_merges=180000
output_dir=tokenizers/olmo2_superbpe

mkdir -p $output_dir

# inherit the first num_inherit_merges from the BPE tokenizer
head -n $num_inherit_merges $orig_tokenizer_dir/merges.txt > $output_dir/merges.txt

# specifies the same training files used in stage 1
cp $orig_tokenizer_dir/meta.json $output_dir/meta.json

python -m train_tokenizer \
    --output_dir $output_dir \
    --vocab_size 200000 \
    --do_whitespace_pretokenization false
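
Once both stages have finished, you can do a quick sanity check on the result. The sketch below assumes the stage 2 run wrote a tokenizer.json into the output directory used above (tokenizers/olmo2_superbpe); it loads the tokenizer, reports the vocabulary size, and encodes a sample sentence so you can see whether some tokens span whitespace.

# Sketch: load the stage 2 tokenizer and inspect its output.
from tokenizers import Tokenizer

tok = Tokenizer.from_file("tokenizers/olmo2_superbpe/tokenizer.json")
print("vocab size:", tok.get_vocab_size())

# With SuperBPE, some tokens ("superwords") can span whitespace, so the
# encoding may contain fewer tokens than whitespace-pretokenized BPE would.
enc = tok.encode("By the way, I am a fan of the Milky Way.")
print(enc.tokens)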

After tokenizer training, update the decoder field in the generated tokenizer.json so that it looks like this:

"decoder": {
    "type": "ByteLevel",
    "add_prefix_space": true,
    "trim_offsets": true,
    "use_regex": true
}
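
If you would rather apply this change programmatically than edit the file by hand, a minimal sketch using the standard json module is shown below (the path assumes the stage 2 output directory from above):

# Sketch: patch the decoder field of the trained tokenizer.json in place.
import json

path = "tokenizers/olmo2_superbpe/tokenizer.json"
with open(path) as f:
    tokenizer_json = json.load(f)

tokenizer_json["decoder"] = {
    "type": "ByteLevel",
    "add_prefix_space": True,
    "trim_offsets": True,
    "use_regex": True,
}

with open(path, "w") as f:
    json.dump(tokenizer_json, f, indent=2)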
