This project was created to get familiar with the Rust language and the Axum framework. Using tokenization as the service logic is just for fun; such things are better placed in a crate.
The project is also the first step toward a full-fledged implementation of an LLM on the Transformer architecture. Some of the logic from this service (namely Byte-level BPE) will be reused there.
The project is a web server for text tokenization, built with the Rust language and the Axum framework.
Support for the following features has been added:
- Tokenization by words (`words`)
- Tokenization by characters (`chars`)
- Tokenization by the BPE method (without training)
- Tokenization by the Byte-level BPE method (with training)
For the BPE method (without training), a pre-trained dictionary of tokens is used as the vocab. You can read more at BPEmb.
Function `tokenize`:
Tokenizes (segments) the received text by breaking it into sentences and then into words, processing them, and assembling them into a final result.
Function `split_into_sentences`:
Splits the text into sentences and applies the marker function to each sentence.
Function `tokenize_sentence_with_markers`:
Adds start/end markers to each sentence and then tokenizes the individual words in the sentence (see the sketch below).
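A minimal sketch of these two stages, assuming a naive punctuation-based splitter and the `<s>`/`</s>` markers that appear in the examples below (the real functions are likely more careful):

```rust
/// Naive sentence splitter: splits on '.', '!' and '?' (a sketch, not the real logic).
fn split_into_sentences(text: &str) -> Vec<String> {
    text.split_inclusive(&['.', '!', '?'][..])
        .map(str::trim)
        .filter(|s| !s.is_empty())
        .map(str::to_string)
        .collect()
}

/// Wraps a sentence in start/end markers and splits it into words.
/// The real code would run tokenize_word on each word instead of keeping it whole.
fn tokenize_sentence_with_markers(sentence: &str) -> Vec<String> {
    let mut tokens = vec!["<s>".to_string()];
    tokens.extend(sentence.split_whitespace().map(str::to_string));
    tokens.push("</s>".to_string());
    tokens
}
```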
Function `tokenize_word`:
- Checks the incoming word (text) for emptiness; if it is empty, returns an empty vector, otherwise continues processing.
- Converts the word (text) into a vector of characters.
- Loops through all possible subword lengths, from the largest to the smallest, looking for the longest match in the dictionary.
- If a match is found, splits the word into three parts: the part to the left, the part to the right, and the found subword (the candidate). The left and right parts are recursively tokenized.
- If no subword is found, returns the unknown-word token (a sketch follows).
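A sketch of this greedy longest-match lookup (a `String`-keyed vocabulary is assumed here for simplicity; the id values are not used during splitting):

```rust
use std::collections::HashMap;

const UNKNOWN_TOKEN: &str = "<unk>";

/// Greedy longest-match subword tokenization (sketch).
fn tokenize_word(word: &str, vocab: &HashMap<String, u32>) -> Vec<String> {
    // An empty word yields an empty token list.
    if word.is_empty() {
        return Vec::new();
    }
    // Work on a character vector so slicing is Unicode-safe.
    let chars: Vec<char> = word.chars().collect();
    // Try subword lengths from longest to shortest.
    for len in (1..=chars.len()).rev() {
        for start in 0..=(chars.len() - len) {
            let candidate: String = chars[start..start + len].iter().collect();
            if vocab.contains_key(&candidate) {
                // Split into left part, matched subword, right part;
                // the left and right parts are tokenized recursively.
                let left: String = chars[..start].iter().collect();
                let right: String = chars[start + len..].iter().collect();
                let mut tokens = tokenize_word(&left, vocab);
                tokens.push(candidate);
                tokens.extend(tokenize_word(&right, vocab));
                return tokens;
            }
        }
    }
    // No known subword: emit the unknown-word token.
    vec![UNKNOWN_TOKEN.to_string()]
}
```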
- `vocab`: A dictionary that maps tokens (as byte sequences) to their identifiers.
- `reverse_vocab`: A reverse dictionary that maps identifiers back to tokens.
- `merges`: A list of token pairs that were merged during training.
- `unk_id`: The identifier for unknown tokens, used in encoding and decoding (see the sketch below).
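Put together, these fields suggest a structure along the following lines (a sketch; the actual definition in the project may differ):

```rust
use std::collections::HashMap;

/// Byte-level BPE tokenizer state (sketch mirroring the field list above).
struct ByteLevelBpe {
    /// Tokens (byte sequences) -> identifiers.
    vocab: HashMap<Vec<u8>, u32>,
    /// Identifiers -> tokens.
    reverse_vocab: HashMap<u32, Vec<u8>>,
    /// Token pairs merged during training, in merge order.
    merges: Vec<(Vec<u8>, Vec<u8>)>,
    /// Identifier for unknown tokens.
    unk_id: u32,
}
```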
Function `train`:
Builds the token dictionary and the merge list up to the given dictionary size (`vocab_size`).
The training process includes:
- Converting the input text to bytes and initializing the tokens as individual bytes.
- Copying the current `vocab` and `reverse_vocab` dictionaries and determining the next available ID for new tokens.
- Collecting all unique bytes and adding them to the dictionary if they are not already present.
- While the dictionary is smaller than the given `vocab_size`, finding the most frequent pair of tokens and merging it into a new token.
- Updating the tokens by replacing the found pairs with the new token; when there are no more pairs to merge, the loop breaks.
- Saving the updated dictionary and reverse dictionary in the structure (a simplified sketch follows this list).
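A simplified, self-contained version of that loop (it builds a fresh dictionary instead of copying an existing one, maintains only the forward `vocab`, and breaks ties between equally frequent pairs arbitrarily):

```rust
use std::collections::HashMap;

/// Byte-level BPE training (simplified sketch).
fn train(text: &str, vocab_size: usize) -> (HashMap<Vec<u8>, u32>, Vec<(Vec<u8>, Vec<u8>)>) {
    // 1. Convert the input to bytes; every byte starts as its own token.
    let mut tokens: Vec<Vec<u8>> = text.bytes().map(|b| vec![b]).collect();

    // 2. Seed the dictionary: id 0 for <unk>, then every unique byte.
    let mut vocab: HashMap<Vec<u8>, u32> = HashMap::new();
    let mut next_id: u32 = 0;
    vocab.insert(b"<unk>".to_vec(), next_id);
    next_id += 1;
    for tok in &tokens {
        if !vocab.contains_key(tok) {
            vocab.insert(tok.clone(), next_id);
            next_id += 1;
        }
    }

    // 3. Merge the most frequent adjacent pair until vocab_size is reached.
    let mut merges: Vec<(Vec<u8>, Vec<u8>)> = Vec::new();
    while vocab.len() < vocab_size {
        let mut counts: HashMap<(Vec<u8>, Vec<u8>), usize> = HashMap::new();
        for pair in tokens.windows(2) {
            *counts.entry((pair[0].clone(), pair[1].clone())).or_insert(0) += 1;
        }
        // Nothing left to merge: stop early.
        let Some(((a, b), _)) = counts.into_iter().max_by_key(|(_, c)| *c) else {
            break;
        };
        let merged = [a.as_slice(), b.as_slice()].concat();
        vocab.insert(merged.clone(), next_id);
        next_id += 1;
        merges.push((a.clone(), b.clone()));

        // 4. Replace every occurrence of the pair with the merged token.
        let mut out: Vec<Vec<u8>> = Vec::with_capacity(tokens.len());
        let mut i = 0;
        while i < tokens.len() {
            if i + 1 < tokens.len() && tokens[i] == a && tokens[i + 1] == b {
                out.push(merged.clone());
                i += 2;
            } else {
                out.push(tokens[i].clone());
                i += 1;
            }
        }
        tokens = out;
    }
    // The reverse dictionary would be maintained alongside vocab in the real code.
    (vocab, merges)
}
```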
Function `encode`:
- Converts the input text to bytes and initializes the tokens as individual bytes.
- Applies the merge rules to combine tokens into longer sequences.
- Converts each token to its identifier from the `vocab` dictionary; if a token is not in the vocabulary, `unk_id` is used.
- Returns a vector of token identifiers (sketched below).
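A sketch of the encoding path under the same assumptions:

```rust
use std::collections::HashMap;

/// Byte-level BPE encoding (simplified sketch).
fn encode(
    text: &str,
    vocab: &HashMap<Vec<u8>, u32>,
    merges: &[(Vec<u8>, Vec<u8>)],
    unk_id: u32,
) -> Vec<u32> {
    // 1. Start from individual bytes.
    let mut tokens: Vec<Vec<u8>> = text.bytes().map(|b| vec![b]).collect();
    // 2. Apply the learned merge rules in training order.
    for (a, b) in merges {
        let mut out = Vec::with_capacity(tokens.len());
        let mut i = 0;
        while i < tokens.len() {
            if i + 1 < tokens.len() && &tokens[i] == a && &tokens[i + 1] == b {
                out.push([a.as_slice(), b.as_slice()].concat());
                i += 2;
            } else {
                out.push(tokens[i].clone());
                i += 1;
            }
        }
        tokens = out;
    }
    // 3. Map each token to its id, falling back to unk_id.
    tokens
        .iter()
        .map(|t| vocab.get(t).copied().unwrap_or(unk_id))
        .collect()
}
```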
Function `decode`:
- Initializes an empty byte vector.
- For each identifier in the sequence, finds the corresponding token in the `reverse_vocab` dictionary; if no token is found, uses `UNKNOWN_TOKEN`.
- Appends the token bytes to the resulting vector.
- Converts the byte vector to a string and returns it (sketched below).
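And the matching decode sketch; `from_utf8_lossy` is used here because merged byte tokens can land on UTF-8 boundaries:

```rust
use std::collections::HashMap;

const UNKNOWN_TOKEN: &[u8] = b"<unk>";

/// Byte-level BPE decoding (simplified sketch).
fn decode(ids: &[u32], reverse_vocab: &HashMap<u32, Vec<u8>>) -> String {
    let mut bytes: Vec<u8> = Vec::new();
    for id in ids {
        // Look up the token; fall back to UNKNOWN_TOKEN if the id is unknown.
        match reverse_vocab.get(id) {
            Some(tok) => bytes.extend_from_slice(tok),
            None => bytes.extend_from_slice(UNKNOWN_TOKEN),
        }
    }
    // Convert the collected bytes to a string (lossily, in case of invalid UTF-8).
    String::from_utf8_lossy(&bytes).into_owned()
}
```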
- Tokenization (segmentation) is the process of breaking text into individual parts (words, characters, etc.).
- BPE (Byte Pair Encoding) is an algorithm used in Natural Language Processing (NLP) for tokenization (segmenting text into smaller units).
- Byte-level BPE is a variant of BPE that uses bytes instead of characters as the basic unit of a token (see the short example below).
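A quick illustration of the byte/character distinction in plain Rust:

```rust
fn main() {
    // "é" is a single character but two bytes in UTF-8, so byte-level BPE
    // starts from [0xC3, 0xA9] where character-level BPE would see ['é'].
    let s = "é";
    assert_eq!(s.chars().count(), 1); // one character
    assert_eq!(s.len(), 2);           // str::len counts bytes
    println!("{:?}", s.as_bytes());   // [195, 169]
}
```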
- Method: `POST`
- Description: Route for simple tokenization of text by words or characters.
- Parameters:
  - `text` (mandatory): Text to be tokenized.
  - `method` (mandatory): The method to be used for tokenization (`chars` and `words` are available).

Request:
{
  "method": "words",
  "text": "Hello, world! This is a test."
}

Response:
{
  "tokens": ["Hello,", "world!", "This", "is", "a", "test."]
}
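Such a request can be sent with curl; the port and route path below are placeholders, since they depend on how the server is configured:

```bash
curl -X POST http://localhost:3000/tokenize \
  -H 'Content-Type: application/json' \
  -d '{"method": "words", "text": "Hello, world! This is a test."}'
```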
- Method: `POST`
- Description: Route for BPE-based tokenization (using a ready-made dictionary).
- Parameters:
  - `text` (mandatory): Text to be tokenized.

Request:
{
  "text": "Hello, world! This is a test."
}

Response:
{
  "tokens": ["<s>", "▁hello", "▁world", "</s>", "<s>", "▁this", "▁is", "▁a", "▁test", "</s>"]
}
- Method: `POST`
- Description: Route for Byte-level BPE dictionary training.
- Parameters:
  - `size` (mandatory): Dictionary size limit.
  - `text` (mandatory): The corpus of text to train on.

Request:
{
  "size": 30,
  "text": "Hello, world! This is a test."
}

Response:
{
  "vocab_size": 30,
  "vocab": {"t": 16, "H": 5, "e": 17, "o": 13, "a": 1, "T": 6, "! ": 21, "r": 15, "o,": 22, "i": 10, "!": 8, "st.": 29, "h": 12, "is": 18, ".": 3, "l": 2, "is is ": 25, "d": 9, " ": 11, "<unk>": 0, "orl": 23, "w": 14, "o, ": 27, "a ": 24, "s": 7, "t.": 28, "ll": 26, "is ": 19, "or": 20, ",": 4}
}
- Method: `POST`
- Description: Route for tokenization (vector representation) using the Byte-level BPE method.
- Parameters:
  - `text` (mandatory): Text to be tokenized.

Request:
{
  "text": "This is a test!"
}

Response:
{
  "tokens": [6, 12, 25, 24, 16, 17, 7, 16, 8]
}
- Method: `POST`
- Description: Route for converting a tokenized (vector) representation back into text.
- Parameters:
  - `tokens` (mandatory): Vector representation of tokens.

Request:
{
  "tokens": [6, 12, 25, 24, 16, 17, 7, 16, 8]
}

Response:
{
  "text": "This is a test!"
}
- To install Rust on Unix-like systems (macOS, Linux, ...), run the command below in the terminal. After the download is complete, you will have the latest stable version of Rust for your platform, as well as the latest version of Cargo.
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
- Run the following command in the terminal to verify the installation.
cargo --version
- Open the project and run the following commands.
Check that the code compiles (without running it):
cargo check
Build and run the project (in release mode, with optimizations):
cargo run --release
UPD: If you are on Windows, see the instructions here.
Don't forget to leave a ⭐ if you found this project useful.