Tokens and characters · Issue #203 · second-state/chat-with-chatgpt

Tokens and characters #203


Closed

juntao opened this issue Mar 31, 2023 · 19 comments

Comments

@juntao
Member
juntao commented Mar 31, 2023

What is the context length in number of characters for GPT-3?

Member

GPT-3, the third generation of the Generative Pre-trained Transformer model developed by OpenAI, has a maximum context length of 2048 tokens, which corresponds to roughly 8,000 characters of English text (a common rule of thumb is about four characters per token). The exact number of characters that fits varies with the text itself, since different words and symbols map to different numbers of tokens.
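
As a back-of-the-envelope check (the four-characters-per-token figure is only a heuristic, not an exact rule):

# Rough conversion between the GPT-3 context window and characters,
# using the common heuristic of ~4 characters of English per token.
CHARS_PER_TOKEN = 4      # heuristic average, not an exact value
context_tokens = 2048    # context window of the original GPT-3 models
print(context_tokens * CHARS_PER_TOKEN)  # ~8192 characters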

@juntao
Member Author
juntao commented Mar 31, 2023

How do you compute the number of tokens from a paragraph of ASCII text?

Member

To compute the number of tokens from a paragraph of ASCII text, you can follow these steps:

  1. Split the paragraph into individual words, phrases or symbols.
  2. Remove any unwanted characters such as punctuation marks or special symbols that are not relevant to the meaning of the text.
  3. Assign each remaining word, phrase or symbol to a token.

Here is an example code snippet in Python that splits a string of ASCII text and counts the number of tokens:

import re

text = "This is a sample paragraph of ASCII text. It contains a few sentences."

# Split the text into individual words and remove punctuation/symbols
tokens = re.findall(r'\b\w+\b', text)

# Print the number of tokens
print(len(tokens))

In this code, the regular expression \b\w+\b matches runs of word characters, which skips punctuation and other symbols, and the len() function counts the resulting tokens. Note that this is a simple word-level count rather than the count produced by GPT-3's own tokenizer; the sketch below shows the difference.
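
GPT models segment text with a byte-pair-encoding tokenizer rather than a regex, so the count the model actually sees is usually somewhat higher. Here is a minimal comparison sketch using OpenAI's open-source tiktoken library (not part of the answer above, shown only as an illustration):

import re
import tiktoken  # pip install tiktoken

text = "This is a sample paragraph of ASCII text. It contains a few sentences."

# Rough word-level count, as in the snippet above
word_count = len(re.findall(r'\b\w+\b', text))

# Token count under the BPE encoding used by the original GPT-3 models
enc = tiktoken.get_encoding("r50k_base")
bpe_count = len(enc.encode(text))

print(word_count, bpe_count)  # the BPE count is typically somewhat higher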

@juntao
Member Author
juntao commented Mar 31, 2023

Do duplicated words count as one token or multiple tokens?

Member

It depends on what you are counting. If you are counting unique token types (effectively building a vocabulary), then a duplicated word contributes only one entry, because it is the same sequence of characters each time.

However, when measuring the length of an input, as a language model does, tokens are counted per occurrence: every appearance of a word or phrase consumes tokens, regardless of whether it appeared earlier in the text.

So the answer depends on the specific task and how the tokens are defined for that task.

@juntao
Member Author
juntao commented Mar 31, 2023

Do you mean that in GPT-3 duplicated words are counted as multiple tokens?

Member

In GPT-3, duplicated words count as multiple tokens, because token counts are taken over occurrences: every appearance of a word in the input consumes tokens.

GPT-3 uses a technique called byte-pair encoding (BPE) to segment words into subwords, and each subword is assigned a distinct token. For example, the word "running" might be segmented into the subwords "run" and "ning", each of which counts as one token.

BPE keeps the vocabulary at a manageable size by encoding rare or out-of-vocabulary words as sequences of subwords, which can increase the number of tokens relative to a simple word count. Either way, a word that appears several times is tokenized each time it appears, so duplicates add to the total.
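
A quick way to see this in practice is with OpenAI's tiktoken library, which implements the GPT BPE encodings (a sketch, not part of the original answer): repeating a phrase roughly doubles its token count, because every occurrence is encoded.

import tiktoken

enc = tiktoken.get_encoding("r50k_base")  # encoding used by the original GPT-3 models

phrase = "the quick brown fox"
once = enc.encode(phrase)
twice = enc.encode(phrase + " " + phrase)

# Each occurrence of the phrase consumes tokens, so the count roughly doubles.
print(len(once), len(twice))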

@juntao
Member Author
juntao commented Mar 31, 2023

Can you describe what BPE is?

Member

Byte pair encoding (BPE) is a subword tokenization technique, originally borrowed from data compression, that is used in natural language processing (NLP) to keep a model's vocabulary at a fixed, manageable size while still being able to represent any input string.

In NLP-style BPE, every word is first split into individual characters (or bytes). The most frequent pair of adjacent symbols in the training corpus is then merged into a single new symbol, and this merge step is repeated until the vocabulary reaches a chosen size. Frequently occurring words and word pieces end up as single merged symbols, while rarer words remain split into several shorter subword units.

By segmenting words into subwords, BPE enables a model to handle rare or unseen words that were not present in the training vocabulary, and ultimately enables more effective language modeling.

Here is an example of how BPE can be used to segment a word:

Suppose we have the following text corpus containing the word "running":

"Running is a great way to stay active and healthy. Runners enjoy running in the morning."

Here is roughly how BPE would proceed:

  1. Split every word into individual characters, so "running" becomes r u n n i n g.
  2. Count all pairs of adjacent symbols across the corpus and merge the most frequent pair into a single new symbol, for example i n -> in, giving r u n n in g.
  3. Repeat the merge step until the vocabulary reaches the desired size; further merges might produce in g -> ing, u n -> un, and r un -> run, leaving "running" segmented as run n ing.

After these merges, the word "running" is represented by the subwords "run", "n", and "ing" (the exact segmentation depends on which merges the corpus supports). Each subword is treated as a distinct token by a natural language processing model such as GPT-3.
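
For a concrete picture of the merge loop, here is a minimal sketch of BPE vocabulary learning on a made-up toy corpus (plain Python for illustration, not OpenAI's actual implementation):

import re
from collections import Counter

def pair_counts(vocab):
    """Count adjacent symbol pairs across all words, weighted by word frequency."""
    counts = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            counts[(a, b)] += freq
    return counts

def merge(pair, vocab):
    """Replace every occurrence of the symbol pair with a single merged symbol."""
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    return {pattern.sub("".join(pair), word): freq for word, freq in vocab.items()}

# Toy corpus: each word is a space-separated sequence of symbols, with a frequency.
vocab = {"r u n": 4, "r u n n i n g": 2, "s i n g": 3, "k i n g": 3}

for step in range(5):
    counts = pair_counts(vocab)
    best = max(counts, key=counts.get)
    vocab = merge(best, vocab)
    print(f"merge {step + 1}: {best[0]} + {best[1]} -> {''.join(best)}")

print(vocab)  # 'running' ends up split into subwords such as 'run', 'n', 'ing'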

@juntao
Member Author
juntao commented Mar 31, 2023

For an average English text, how many tokens would 1000 ASCII characters map to?

Member

The number of tokens that 1000 ASCII characters map to in an average English text varies with the specific text and how tokens are defined. As a rough estimate, English prose averages about four to five characters per token (a typical word is about five characters plus a separating space, and common words are usually single tokens), so 1000 ASCII characters of ordinary English correspond to roughly 200-250 tokens.

This is a rough estimate, and the actual number of tokens can vary with factors such as punctuation, the share of uncommon words (which are split into several subword tokens), and the presence of abbreviations, acronyms, or emoticons. Additionally, if the text contains a significant amount of non-English characters or words, the number of tokens may differ considerably from this estimate.

@juntao
Member Author
juntao commented Mar 31, 2023

How about an average computer program source code file? How many tokens would 1000 ASCII characters of source code map to?

Member

The number of tokens that 1000 ASCII characters of source code would map to depends on various factors such as the specific programming language, syntax, and coding style. However, here are some rough estimates based on common programming languages:

  • Python: In Python, 1000 ASCII characters would contain around 150-200 tokens, assuming an average token length of 5-7 characters. This can vary depending on the syntax, indentation, and the use of comments or white space in the code.

  • Java: For Java code, 1000 ASCII characters would map to around 130-150 tokens, assuming an average token length of 6-8 characters. This can vary depending on the style of code and the use of comments, white space, or long class names.

  • C/C++: In C or C++, 1000 ASCII characters would yield around 120-140 tokens, assuming an average token length of 7-9 characters. This can vary depending on the coding style, the use of comments or white space, and the inclusion of longer function, variable, or structure names.

Again, these are rough estimates that treat identifiers and keywords as single tokens, and the actual number can vary depending on the specific code file and how tokens are defined. A BPE tokenizer such as GPT-3's also counts every operator, bracket, and piece of punctuation, and splits long identifiers into several subwords, so the token count it reports for source code is usually higher than these word-based figures.

@juntao
Member Author
juntao commented Mar 31, 2023

In a conversation thread with ChatGPT, does the number of tokens only count the current question and answer? Or does it include previous messages in the conversation?

Member

In a conversation with ChatGPT or the underlying API, the token limit applies to everything the model sees in a single request plus the response it generates. That includes any previous messages from the conversation that are sent along as context, not just the most recent question and answer.

The model itself is stateless, so a chat application resends some or all of the earlier messages with each new request in order to give the model the conversation context. As the accumulated messages approach the context window, older messages have to be truncated or summarized, otherwise the model can no longer see them.

Overall, how much history is resent, and therefore how many tokens a turn consumes, depends on the design of the chatbot or application.
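
If you want to track how much of the window a conversation is using, one rough approach (a sketch, not an exact formula; the per-message overhead varies by model) is to sum the token counts of the messages you are about to send using OpenAI's tiktoken library:

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # encoding used by the gpt-3.5/gpt-4 chat models

conversation = [
    {"role": "user", "content": "What is the context length of GPT-3?"},
    {"role": "assistant", "content": "The original GPT-3 models have a 2048-token context window."},
    {"role": "user", "content": "How do you compute the number of tokens?"},
]

# Rough count: sum the content tokens of every message that will be sent.
# The API adds a few extra tokens of per-message formatting overhead on top of this.
total = sum(len(enc.encode(m["content"])) for m in conversation)
print(total)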

@juntao
Member Author
juntao commented Mar 31, 2023

Does OpenAI have a tool to compute the number of tokens from input text?

Member

Yes. OpenAI provides a web-based Tokenizer tool (https://platform.openai.com/tokenizer) that shows how a piece of text is split into tokens, and it publishes an open-source Python library called tiktoken that implements the same byte-pair encodings its models use, so you can count tokens programmatically.

For example:

import tiktoken  # pip install tiktoken

# r50k_base is the encoding used by the original GPT-3 (davinci) models;
# tiktoken.encoding_for_model(...) looks up the encoding for a specific model name.
enc = tiktoken.get_encoding("r50k_base")

text = "This is an example sentence to tokenize."
tokens = enc.encode(text)
print(len(tokens))  # number of BPE tokens the model would see

Here enc.encode() returns the list of integer token IDs for the input string, and len() gives the token count.

Note that the actual number of tokens depends on which encoding the target model uses, so pick the encoding (or model name) that matches your model.

@juntao
Member Author
juntao commented Mar 31, 2023

That's very helpful. Thank you.

Member

You're welcome! Glad to be of help. If you have any other questions or concerns, feel free to ask!

@juntao juntao closed this as completed Mar 31, 2023