fix the introduction of an additional space and split words in subtokens #1198
base: develop
Conversation
Copilot reviewed 1 out of 1 changed files in this pull request and generated no comments.
Codecov Report: All modified and coverable lines are covered by tests ✅

Additional details and impacted files:

```
@@           Coverage Diff            @@
##           develop    #1198   +/-  ##
========================================
  Coverage    68.62%   68.62%
========================================
  Files          161      161
  Lines        15966    15966
========================================
  Hits         10957    10957
  Misses        5009     5009
========================================
```

Flags with carried forward coverage won't be shown.
```diff
@@ -1311,8 +1311,8 @@ def _prepare_params(
         if stream_first:
             words = chunk_str_rep.split()
             if words:
-                yield words[0]
                 for word in words[1:]:
+                # yield words[0]
```
This adds an extra space before the first word, no?
I think the previous logic was good here.
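For context, here is a minimal sketch of the previous `stream_first` logic this comment refers to. The surrounding `_prepare_params` code and the loop body are not visible in the hunk, so the standalone generator below, including the `yield f" {word}"` line, is a hypothetical reconstruction: yielding the first word bare and prefixing only the later words with a space is what avoids a leading space in the streamed output.

```python
def stream_words(chunk_str_rep: str):
    """Hypothetical standalone version of the previous stream_first logic."""
    words = chunk_str_rep.split()
    if words:
        yield words[0]            # first word emitted without a separator
        for word in words[1:]:
            yield f" {word}"      # assumed: later words re-add the separator themselves


print("".join(stream_words("Hello streaming world")))
# 'Hello streaming world'  -> no extra space before the first word
```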
```diff
@@ -1297,7 +1297,7 @@ def _prepare_params(
         )

         async for chunk_list, chunk_str_rep in buffer_strategy(streaming_handler):
-            chunk_str = " ".join(chunk_list)
+            chunk_str = "".join(chunk_list)
```
Good catch! I assume all tokenizers already include spaces, other whitespace, and special characters inside some of the chunks, so this should work for any LLM provider.
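As a quick illustration of that assumption (the chunk strings below are made up, not captured from a real provider): streamed sub-tokens usually carry their own leading whitespace, so joining them with an extra `" "` both splits words that arrive as multiple sub-tokens and duplicates spaces, while `"".join` reconstructs the text exactly as it was sent.

```python
# Hypothetical buffered chunks: "unbelievable" arrives as two sub-tokens,
# and the following words carry their own leading space.
chunk_list = ["unbeliev", "able", " results", " today"]

print(" ".join(chunk_list))  # 'unbeliev able  results  today' -> split word, doubled spaces
print("".join(chunk_list))   # 'unbelievable results today'    -> text reconstructed as sent
```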
Description
Fix a problem in which words composed of sub-tokens are split back into their base sub-tokens.
@Pouyanpi
Related Issue(s)
Solves issue #1197.
Checklist