This project was created to get familiar with the Rust language and the Axum framework. Using tokenization as the service logic is just for fun; such things are better placed in a crate.
The project is also the first step toward a full-fledged implementation of an LLM on the Transformer architecture. Some of the logic from this service (namely Byte-level BPE) will be reused there.
The project is a web server for text tokenization, built with the Rust language and the Axum framework.
Support for the following features has been added:
- Tokenization by words (`words`)
- Tokenization by characters (`chars`)
- Tokenization by the BPE method (without training)
- Tokenization by the Byte-level BPE method (with training)
For the BPE method (without training), a pre-trained dictionary of tokens is used as the vocab. You can read more at BPEmb.
Function `tokenize`:
Tokenizes (segments) the received text by breaking it into sentences and then into words, processing them, and assembling them into a final result.
Function `split_into_sentences`:
Splits the text into sentences and applies the marker function to each sentence.
Function `tokenize_sentence_with_markers`:
Adds start/end markers to each sentence and then tokenizes the individual words in the sentence (see the sketch below).
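A minimal sketch of these two stages, assuming a naive punctuation-based splitter and the `<s>`/`</s>` markers that appear in the examples below (the real functions are likely more careful):

```rust
/// Naive sentence splitter: splits on '.', '!' and '?' (a sketch, not the real logic).
fn split_into_sentences(text: &str) -> Vec<String> {
    text.split_inclusive(&['.', '!', '?'][..])
        .map(str::trim)
        .filter(|s| !s.is_empty())
        .map(str::to_string)
        .collect()
}

/// Wraps a sentence in start/end markers and splits it into words.
/// The real code would run tokenize_word on each word instead of keeping it whole.
fn tokenize_sentence_with_markers(sentence: &str) -> Vec<String> {
    let mut tokens = vec!["<s>".to_string()];
    tokens.extend(sentence.split_whitespace().map(str::to_string));
    tokens.push("</s>".to_string());
    tokens
}
```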
Function `tokenize_word`:
- Checks the incoming word (text) for emptiness; if it is empty, returns an empty vector, otherwise continues processing.
- Converts the word (text) into a vector of characters.
- Loops through all possible subword lengths, from the largest to the smallest, looking for the longest match in the dictionary.
- If a match is found, splits the word into three parts: the part to the left, the part to the right, and the found subword (the candidate). The left and right parts are recursively tokenized.
- If no subword is found, returns the unknown-word token (a sketch follows).
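A sketch of this greedy longest-match lookup (a `String`-keyed vocabulary is assumed here for simplicity; the id values are not used during splitting):

```rust
use std::collections::HashMap;

const UNKNOWN_TOKEN: &str = "<unk>";

/// Greedy longest-match subword tokenization (sketch).
fn tokenize_word(word: &str, vocab: &HashMap<String, u32>) -> Vec<String> {
    // An empty word yields an empty token list.
    if word.is_empty() {
        return Vec::new();
    }
    // Work on a character vector so slicing is Unicode-safe.
    let chars: Vec<char> = word.chars().collect();
    // Try subword lengths from longest to shortest.
    for len in (1..=chars.len()).rev() {
        for start in 0..=(chars.len() - len) {
            let candidate: String = chars[start..start + len].iter().collect();
            if vocab.contains_key(&candidate) {
                // Split into left part, matched subword, right part;
                // the left and right parts are tokenized recursively.
                let left: String = chars[..start].iter().collect();
                let right: String = chars[start + len..].iter().collect();
                let mut tokens = tokenize_word(&left, vocab);
                tokens.push(candidate);
                tokens.extend(tokenize_word(&right, vocab));
                return tokens;
            }
        }
    }
    // No known subword: emit the unknown-word token.
    vec![UNKNOWN_TOKEN.to_string()]
}
```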
- `vocab`: A dictionary that maps tokens (as byte sequences) to their identifiers.
- `reverse_vocab`: A reverse dictionary that maps identifiers back to tokens.
- `merges`: A list of token pairs that were merged during training.
- `unk_id`: The identifier for unknown tokens, used in encoding and decoding (see the sketch below).
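Put together, these fields suggest a structure along the following lines (a sketch; the actual definition in the project may differ):

```rust
use std::collections::HashMap;

/// Byte-level BPE tokenizer state (sketch mirroring the field list above).
struct ByteLevelBpe {
    /// Tokens (byte sequences) -> identifiers.
    vocab: HashMap<Vec<u8>, u32>,
    /// Identifiers -> tokens.
    reverse_vocab: HashMap<u32, Vec<u8>>,
    /// Token pairs merged during training, in merge order.
    merges: Vec<(Vec<u8>, Vec<u8>)>,
    /// Identifier for unknown tokens.
    unk_id: u32,
}
```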
Function `train`:
Builds the token dictionary and the merge list up to the given dictionary size (`vocab_size`).
The training process includes:
- Converting the input text to bytes and initializing the tokens as individual bytes.
- Copying the current `vocab` and `reverse_vocab` dictionaries and determining the next available ID for new tokens.
- Collecting all unique bytes and adding them to the dictionary if they are not already present.
- While the dictionary is smaller than the given `vocab_size`, finding the most frequent pair of tokens and merging it into a new token.
- Updating the tokens by replacing the found pairs with the new token; when there are no more pairs to merge, the loop breaks.
- Saving the updated dictionary and reverse dictionary in the structure (a simplified sketch follows this list).
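A simplified, self-contained version of that loop (it builds a fresh dictionary instead of copying an existing one, maintains only the forward `vocab`, and breaks ties between equally frequent pairs arbitrarily):

```rust
use std::collections::HashMap;

/// Byte-level BPE training (simplified sketch).
fn train(text: &str, vocab_size: usize) -> (HashMap<Vec<u8>, u32>, Vec<(Vec<u8>, Vec<u8>)>) {
    // 1. Convert the input to bytes; every byte starts as its own token.
    let mut tokens: Vec<Vec<u8>> = text.bytes().map(|b| vec![b]).collect();

    // 2. Seed the dictionary: id 0 for <unk>, then every unique byte.
    let mut vocab: HashMap<Vec<u8>, u32> = HashMap::new();
    let mut next_id: u32 = 0;
    vocab.insert(b"<unk>".to_vec(), next_id);
    next_id += 1;
    for tok in &tokens {
        if !vocab.contains_key(tok) {
            vocab.insert(tok.clone(), next_id);
            next_id += 1;
        }
    }

    // 3. Merge the most frequent adjacent pair until vocab_size is reached.
    let mut merges: Vec<(Vec<u8>, Vec<u8>)> = Vec::new();
    while vocab.len() < vocab_size {
        let mut counts: HashMap<(Vec<u8>, Vec<u8>), usize> = HashMap::new();
        for pair in tokens.windows(2) {
            *counts.entry((pair[0].clone(), pair[1].clone())).or_insert(0) += 1;
        }
        // Nothing left to merge: stop early.
        let Some(((a, b), _)) = counts.into_iter().max_by_key(|(_, c)| *c) else {
            break;
        };
        let merged = [a.as_slice(), b.as_slice()].concat();
        vocab.insert(merged.clone(), next_id);
        next_id += 1;
        merges.push((a.clone(), b.clone()));

        // 4. Replace every occurrence of the pair with the merged token.
        let mut out: Vec<Vec<u8>> = Vec::with_capacity(tokens.len());
        let mut i = 0;
        while i < tokens.len() {
            if i + 1 < tokens.len() && tokens[i] == a && tokens[i + 1] == b {
                out.push(merged.clone());
                i += 2;
            } else {
                out.push(tokens[i].clone());
                i += 1;
            }
        }
        tokens = out;
    }
    // The reverse dictionary would be maintained alongside vocab in the real code.
    (vocab, merges)
}
```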
Function `encode`:
- Converts the input text to bytes and initializes the tokens as individual bytes.
- Applies the merge rules to combine tokens into longer sequences.
- Converts each token to its identifier from the `vocab` dictionary; if a token is not in the vocabulary, `unk_id` is used.
- Returns a vector of token identifiers (sketched below).
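A sketch of the encoding path under the same assumptions:

```rust
use std::collections::HashMap;

/// Byte-level BPE encoding (simplified sketch).
fn encode(
    text: &str,
    vocab: &HashMap<Vec<u8>, u32>,
    merges: &[(Vec<u8>, Vec<u8>)],
    unk_id: u32,
) -> Vec<u32> {
    // 1. Start from individual bytes.
    let mut tokens: Vec<Vec<u8>> = text.bytes().map(|b| vec![b]).collect();
    // 2. Apply the learned merge rules in training order.
    for (a, b) in merges {
        let mut out = Vec::with_capacity(tokens.len());
        let mut i = 0;
        while i < tokens.len() {
            if i + 1 < tokens.len() && &tokens[i] == a && &tokens[i + 1] == b {
                out.push([a.as_slice(), b.as_slice()].concat());
                i += 2;
            } else {
                out.push(tokens[i].clone());
                i += 1;
            }
        }
        tokens = out;
    }
    // 3. Map each token to its id, falling back to unk_id.
    tokens
        .iter()
        .map(|t| vocab.get(t).copied().unwrap_or(unk_id))
        .collect()
}
```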
Function `decode`:
- Initializes an empty byte vector.
- For each identifier in the sequence, finds the corresponding token in the `reverse_vocab` dictionary; if no token is found, uses `UNKNOWN_TOKEN`.
- Appends the token bytes to the resulting vector.
- Converts the byte vector to a string and returns it (sketched below).
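And the matching decode sketch; `from_utf8_lossy` is used here because merged byte tokens can land on UTF-8 boundaries:

```rust
use std::collections::HashMap;

const UNKNOWN_TOKEN: &[u8] = b"<unk>";

/// Byte-level BPE decoding (simplified sketch).
fn decode(ids: &[u32], reverse_vocab: &HashMap<u32, Vec<u8>>) -> String {
    let mut bytes: Vec<u8> = Vec::new();
    for id in ids {
        // Look up the token; fall back to UNKNOWN_TOKEN if the id is unknown.
        match reverse_vocab.get(id) {
            Some(tok) => bytes.extend_from_slice(tok),
            None => bytes.extend_from_slice(UNKNOWN_TOKEN),
        }
    }
    // Convert the collected bytes to a string (lossily, in case of invalid UTF-8).
    String::from_utf8_lossy(&bytes).into_owned()
}
```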
- Tokenization (segmentation) is the process of breaking text into individual parts (words, characters, etc.).
- BPE (Byte Pair Encoding) is an algorithm used in Natural Language Processing (NLP) for tokenization (segmenting text into smaller units).
- Byte-level BPE is a variant of BPE that uses bytes instead of characters as the basic unit of a token (see the short example below).
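A quick illustration of the byte/character distinction in plain Rust:

```rust
fn main() {
    // "é" is a single character but two bytes in UTF-8, so byte-level BPE
    // starts from [0xC3, 0xA9] where character-level BPE would see ['é'].
    let s = "é";
    assert_eq!(s.chars().count(), 1); // one character
    assert_eq!(s.len(), 2);           // str::len counts bytes
    println!("{:?}", s.as_bytes());   // [195, 169]
}
```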
- Method: `POST`
- Description: Route for simple tokenization of text by words or characters.
- Parameters:
  - `text` (mandatory): Text to be tokenized.
  - `method` (mandatory): The method to be used for tokenization (`chars` and `words` are available).

Request:
{
  "method": "words",
  "text": "Hello, world! This is a test."
}

Response:
{
  "tokens": ["Hello,", "world!", "This", "is", "a", "test."]
}
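Such a request can be sent with curl; the port and route path below are placeholders, since they depend on how the server is configured:

```bash
curl -X POST http://localhost:3000/tokenize \
  -H 'Content-Type: application/json' \
  -d '{"method": "words", "text": "Hello, world! This is a test."}'
```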
- Method: `POST`
- Description: Route for BPE-based tokenization (using a ready-made dictionary).
- Parameters:
  - `text` (mandatory): Text to be tokenized.

Request:
{
  "text": "Hello, world! This is a test."
}

Response:
{
  "tokens": ["<s>", "▁hello", "▁world", "</s>", "<s>", "▁this", "▁is", "▁a", "▁test", "</s>"]
}
- Method: `POST`
- Description: Route for Byte-level BPE dictionary training.
- Parameters:
  - `size` (mandatory): Dictionary size limit.
  - `text` (mandatory): The corpus of text to train on.

Request:
{
  "size": 30,
  "text": "Hello, world! This is a test."
}

Response:
{
  "vocab_size": 30,
  "vocab": {"t": 16, "H": 5, "e": 17, "o": 13, "a": 1, "T": 6, "! ": 21, "r": 15, "o,": 22, "i": 10, "!": 8, "st.": 29, "h": 12, "is": 18, ".": 3, "l": 2, "is is ": 25, "d": 9, " ": 11, "<unk>": 0, "orl": 23, "w": 14, "o, ": 27, "a ": 24, "s": 7, "t.": 28, "ll": 26, "is ": 19, "or": 20, ",": 4}
}
- Method: `POST`
- Description: Route for tokenization (vector representation) using the Byte-level BPE method.
- Parameters:
  - `text` (mandatory): Text to be tokenized.

Request:
{
  "text": "This is a test!"
}

Response:
{
  "tokens": [6, 12, 25, 24, 16, 17, 7, 16, 8]
}
- Method: `POST`
- Description: Route for converting a tokenized (vector) representation back into text.
- Parameters:
  - `tokens` (mandatory): Vector representation of tokens.

Request:
{
  "tokens": [6, 12, 25, 24, 16, 17, 7, 16, 8]
}

Response:
{
  "text": "This is a test!"
}
- To install Rust on Unix-like systems (macOS, Linux, ...), run the command below in the terminal. After the download is complete, you will have the latest stable version of Rust for your platform, as well as the latest version of Cargo.
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
- Run the following command in the terminal to verify the installation.
cargo --version
- Open the project and run the following commands.
Check that the code compiles (without running it):
cargo check
Build and run the project (in release mode, with optimizations):
cargo run --release
UPD: If you are on Windows, see the instructions here.
Don't forget to leave a ⭐ if you found this project useful.