fix the introduction of an additional space and split words in subtokens by andompesta · Pull Request #1198 · NVIDIA/NeMo-Guardrails

fix the introduction of an additional space and split words in subtokens #1198


Draft
andompesta wants to merge 2 commits into base: develop
Conversation

andompesta (Contributor)

Description

Fix a problem in which words composed of sub-tokens are split back into their base sub-tokens; a minimal sketch of the symptom follows below.
@Pouyanpi
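
For illustration, here is a minimal sketch of the symptom. The chunk values are hypothetical stand-ins for what a provider's streaming tokenizer might emit, not actual NeMo Guardrails output:

```python
# Hypothetical sub-token chunks as a streaming LLM might emit them.
# Tokenizers usually encode word boundaries inside the tokens themselves
# (note the leading space in " wor").
chunks = ["Hel", "lo", " wor", "ld"]

# Joining with a space splits "Hello" and "world" into their sub-tokens
# and introduces an extra space at the original word boundary.
print(" ".join(chunks))  # "Hel lo  wor ld"

# Plain concatenation preserves the original text.
print("".join(chunks))   # "Hello world"
```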

Related Issue(s)

Solves issue #1197.

Checklist

  • [x] I've read the CONTRIBUTING guidelines.
  • [x] I've updated the documentation if applicable.
  • [ ] I've added tests if applicable.
  • [x] @mentions of the person or team responsible for reviewing proposed changes.

@andompesta marked this pull request as draft on May 16, 2025 08:46
@Pouyanpi added the bug (Something isn't working) label on May 16, 2025
@Pouyanpi added this to the v0.14.0 milestone on May 16, 2025
@Pouyanpi requested review from Pouyanpi and Copilot on May 16, 2025 12:22
@Copilot (Copilot AI) left a comment


Copilot reviewed 1 out of 1 changed files in this pull request and generated no comments.

@codecov-commenter

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 68.62%. Comparing base (85400a5) to head (2acf809).

Additional details and impacted files
@@           Coverage Diff            @@
##           develop    #1198   +/-   ##
========================================
  Coverage    68.62%   68.62%           
========================================
  Files          161      161           
  Lines        15966    15966           
========================================
  Hits         10957    10957           
  Misses        5009     5009           
Flag      Coverage Δ
python    68.62% <100.00%> (ø)

Flags with carried forward coverage won't be shown.

Files with missing lines                Coverage Δ
nemoguardrails/rails/llm/llmrails.py    87.12% <100.00%> (ø)

@@ -1311,8 +1311,8 @@ def _prepare_params(
             if stream_first:
                 words = chunk_str_rep.split()
                 if words:
-                    yield words[0]
-                    for word in words[1:]:
+                    # yield words[0]
Collaborator


This adds an extra space before the first word, no?
I think the previous logic was good here.
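
For context, a small sketch of the two behaviours the comment contrasts. This is illustrative only and not the actual `llmrails.py` code; in particular, the space-prefixing in the loop body is an assumption about the surrounding implementation:

```python
def yield_words_first_bare(chunk_str_rep: str):
    # Previous logic: emit the first word as-is, prefix only the rest.
    words = chunk_str_rep.split()
    if words:
        yield words[0]
        for word in words[1:]:
            yield f" {word}"

def yield_words_all_prefixed(chunk_str_rep: str):
    # If every word is prefixed, the streamed text starts with a space.
    for word in chunk_str_rep.split():
        yield f" {word}"

print(repr("".join(yield_words_first_bare("Hello world"))))    # 'Hello world'
print(repr("".join(yield_words_all_prefixed("Hello world"))))  # ' Hello world'
```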

@@ -1297,7 +1297,7 @@ def _prepare_params(
             )

             async for chunk_list, chunk_str_rep in buffer_strategy(streaming_handler):
-                chunk_str = " ".join(chunk_list)
+                chunk_str = "".join(chunk_list)
Collaborator


Good catch! I assume all tokenizers include spaces, other whitespace, and special characters within some of the chunks, so this should work for any LLM provider.
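
A quick check of that assumption with a few hypothetical chunking styles (the lists below are made-up examples, not captured provider output): as long as each chunk carries its own whitespace, plain concatenation reproduces the text, while inserting a space between chunks splits sub-token words and doubles existing spaces.

```python
# Hypothetical chunk lists; actual content depends on the provider's tokenizer.
examples = [
    ["Hel", "lo", " wor", "ld"],          # sub-word tokens, space embedded in a token
    ["Hello", " ", "world", "!"],         # whitespace emitted as its own chunk
    ["Hello world", " and", " goodbye"],  # multi-word chunks
]

for chunk_list in examples:
    print(repr(" ".join(chunk_list)), "->", repr("".join(chunk_list)))
# " ".join breaks words apart or doubles spaces; "".join preserves the original text.
```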

Labels: bug (Something isn't working)
Projects: None yet
Development: Successfully merging this pull request may close these issues.
4 participants