DDT is a CLI written in Python (with a Go rewrite planned if I have time) that scans a directory and counts the number of tokens per file, subdivided by filetype. It is useful for figuring out how difficult it will be for a Large Language Model to hold the entirety of a given set of files in its context window.
> **Warning:** While functional, DDT is still in its very early stages and is not considered stable. Versioning will follow pre-v0 semantic versioning, where minor version updates indicate non-backwards-compatible changes in addition to new features. Patch versions will focus on bugfixes, non-breaking feature improvements, and early access to new features behind a special `--experimental {command}` flag.
> **Note:** Binary distribution is not working yet; please follow "Building from source" below.
- Install with pipx: `pipx install ddt`
- Run: `ddt /path/to/target`

Alternatively, install with pip:

- Install with pip: `pip install ddt`
- Run: `ddt /path/to/target`
To build from source:

- Clone the repository.
- Using the `uv` package manager, run `uv build`
- Run `pipx install dist/ddt-x.y.z-py3-none-any.whl`, where x.y.z is the version you built.
- Run `ddt /path/to/target`
Alternatively, for an editable install in a virtual environment:

- Set up a Python virtual environment with `python3 -m venv .venv`
- Enter the virtual environment with `source .venv/bin/activate`
- Run `python -m pip install -e .`
- Run `ddt /path/to/target`
- Remember to run `deactivate` when you're done to leave the Python venv!
```
usage: Tokenizer [-h] [-c CONFIG] [-v] [-g] [-d] [-s] [-i] [-r] [-m MODEL] [-o OUTPUT] [--json | --html] [--exclude EXCLUDE | --include INCLUDE] directory

Crawls a given directory, counts the number of tokens per filetype in the project and returns a per-type total and grand total

positional arguments:
  directory             the relative or absolute path to the directory you wish to scan

options:
  -h, --help            show this help message and exit
  -v, --verbose         set to increase logging to console
  -g, --include-gitignore
                        include files and directories found in the .gitignore file
  -d, --include-dotfiles
                        include files and directories beginning with a dot (.)
  -s, --include-symlinks
                        include files and directories symlinked from outside the target directory
  -i, --include-images  include image files found within the directory
  -r, --resolve-paths   resolve relative file paths to their absolute location
  -m, --model MODEL     specify a model to use for token approximation. default is 'gpt-4o'
  -o, --output OUTPUT   redirect output from STDOUT to a file at the location specified
  --json                save the results of the scan to a JSON file
  --html                save the results of the scan to an HTML file
  --exclude EXCLUDE     specify file formats to ignore when counting. this flag may be set multiple times for multiple entries. cannot be set if including files
  --include INCLUDE     specify file formats to include when counting. this flag may be set multiple times for multiple entries. cannot be set if excluding files

Made with <3 by 0x4D5352
```
Beyond the traditional `-h` and `-v` flags, DDT can be configured in a number of ways:

You can exclude filetypes by passing in one or more `--exclude FILETYPE` flags, or you can include only specific filetypes with one or more `--include FILETYPE` flags. You cannot specify both; see the examples below.
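For example (the exact spelling of a filetype argument, e.g. `md` vs `.md`, is an assumption; check the results against your files):

```sh
# Count everything except Markdown and plain-text files:
ddt /path/to/target --exclude md --exclude txt

# Count only Python files:
ddt /path/to/target --include py
```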
To save your output, pass the `-o` or `--output` flag followed by a filename such as `out.json`. To save the file in a structured format, pass one of the corresponding flags:

- `--json`
- `--html`
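For example, combining the two flags as described above:

```sh
# Write the scan results as structured JSON to out.json:
ddt /path/to/target -o out.json --json
```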
DDT ignores dotfiles (e.g. `.env` files and the `.git` directory), respects the `.gitignore` file, discards symlinks, and does not tokenize images. If you wish to alter any of these defaults, use the provided options, shown together in the example after this list:

- `--include-dotfiles`
- `--include-gitignore`
- `--include-symlinks`
- `--include-images`
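For instance, to opt back in to all four at once:

```sh
# Scan everything DDT skips by default:
ddt /path/to/target --include-dotfiles --include-gitignore --include-symlinks --include-images
```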
DDT works with both relative and absolute file paths. If you wish to force the output to print absolute file paths, pass the `-r` or `--resolve-paths` flag.
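For example (the target path here is illustrative):

```sh
# Scan a sibling directory but report absolute paths:
ddt ../some-project -r
```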
The default tokenizer algorithm is based on GPT-4o. If you need to check the token counts for older models, such as gpt-4 or text-davinci-003, pass the corresponding model name to the `-m` or `--model` flag. If the model you wish to test is not listed, reference the model list from this OpenAI Cookbook.
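For example, to approximate counts with the gpt-4 tokenizer instead of the default:

```sh
ddt /path/to/target -m gpt-4
```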
Large Language Models, despite the name, do not understand language like you or me. Instead, they understand an encoding (technically an embedding) of chunks of text called tokens. A token can be a single letter like 'a', or it can be a word like 'the'. Some tokens include leading spaces (e.g. ' be') to preserve sentence structure. On average, a token works out to about 0.75 of an English word.
If you'd like to see examples of how text is broken up into tokens, check out OpenAI's Token Calculator in their API reference documentation.
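As a rough illustration, here is how a tokenizer such as OpenAI's `tiktoken` library splits a sentence (whether DDT uses tiktoken internally is an assumption; this only shows the kind of vocabulary the counts are based on):

```python
import tiktoken

# Load the vocabulary used by gpt-4o, DDT's default model.
enc = tiktoken.encoding_for_model("gpt-4o")

tokens = enc.encode("Some tokens include leading spaces.")
print(len(tokens))                        # number of tokens in the sentence
print([enc.decode([t]) for t in tokens])  # the text chunk each token represents
```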
Multimodal models have the ability to parse images as well as text. The calculation is based on how many 512x512 tiles can fit into a scaled version of the input image. To read more about how this process works, check out this article by Tee Kai Feng.
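A minimal sketch of that tile arithmetic, assuming the commonly published GPT-4o rules (fit within 2048x2048, scale the shortest side to 768, count 512x512 tiles; the 85-base/170-per-tile constants come from OpenAI's pricing notes and may not match DDT's internals exactly):

```python
import math

def image_tokens(width: int, height: int) -> int:
    """Estimate vision tokens for one image. An assumed heuristic,
    not necessarily DDT's exact implementation."""
    # 1. Scale the image down (never up) to fit within a 2048x2048 square.
    scale = min(1.0, 2048 / max(width, height))
    width, height = width * scale, height * scale
    # 2. Scale down again so the shortest side is at most 768 pixels.
    scale = min(1.0, 768 / min(width, height))
    width, height = width * scale, height * scale
    # 3. Count the 512x512 tiles needed to cover the scaled image.
    tiles = math.ceil(width / 512) * math.ceil(height / 512)
    # 4. 85 base tokens plus 170 per tile (published pricing constants).
    return 85 + 170 * tiles

print(image_tokens(1920, 1080))  # 1080p screenshot -> 6 tiles -> 1105 tokens
```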
For a Large Language Model, the context window represents "active knowledge": information immediately relevant to the autocompletion algorithm LLMs use to generate text. These tokens influence how the LLM uses the data it has been trained on to predict the next token, which gets added to the context and included in the next prediction. When the context window is full, the model will begin behaving in unintended ways:
- The LLM might simply run out of tokens and stop generating text.
- The LLM might "forget" the oldest bit of information and behave strangely.
- The LLM might hallucinate information that no longer appears in the data.
- LLM agents might lose functionality or send malformed input to their actions.
Context windows vary in size, with current models handling anywhere from 1,000 to 1,000,000 tokens. Even with the larger windows, it is common to limit the maximum number of tokens to reduce compute requirements and increase speed.
For reference: curl is approximately 1,750,000 tokens.
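A quick back-of-the-envelope check, using only the figures quoted above:

```python
curl_tokens = 1_750_000      # approximate token count for curl, per the text above
largest_window = 1_000_000   # upper end of the context-window range mentioned above

print(curl_tokens * 0.75)            # ~1,312,500 English words at ~0.75 words per token
print(curl_tokens / largest_window)  # 1.75: curl alone overflows even the largest window
```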