fix the introduction of an additional space and split words in subtokens #1198
base: develop
Conversation
Copilot reviewed 1 out of 1 changed files in this pull request and generated no comments.
Codecov Report: All modified and coverable lines are covered by tests ✅

Additional details and impacted files:

```
@@           Coverage Diff            @@
##           develop    #1198   +/-  ##
========================================
  Coverage    68.62%   68.62%
========================================
  Files          161      161
  Lines        15966    15966
========================================
  Hits         10957    10957
  Misses        5009     5009
========================================
```

Flags with carried forward coverage won't be shown.
```diff
@@ -1311,8 +1311,8 @@ def _prepare_params(
         if stream_first:
             words = chunk_str_rep.split()
             if words:
-                yield words[0]
                 for word in words[1:]:
+                # yield words[0]
```
This adds an extra space before the first word, no?
I think the previous logic was good here.
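For context, here is a minimal sketch of the previous `stream_first` logic this comment refers to. The surrounding `_prepare_params` code and the loop body are not visible in the hunk, so the standalone generator below, including the `yield f" {word}"` line, is a hypothetical reconstruction: yielding the first word bare and prefixing only the later words with a space is what avoids a leading space in the streamed output.

```python
def stream_words(chunk_str_rep: str):
    """Hypothetical standalone version of the previous stream_first logic."""
    words = chunk_str_rep.split()
    if words:
        yield words[0]            # first word emitted without a separator
        for word in words[1:]:
            yield f" {word}"      # assumed: later words re-add the separator themselves


print("".join(stream_words("Hello streaming world")))
# 'Hello streaming world'  -> no extra space before the first word
```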
```diff
@@ -1297,7 +1297,7 @@ def _prepare_params(
         )

         async for chunk_list, chunk_str_rep in buffer_strategy(streaming_handler):
-            chunk_str = " ".join(chunk_list)
+            chunk_str = "".join(chunk_list)
```
Good catch! I assume all tokenizers already include spaces, other whitespace, and special characters inside some of the chunks, so this should work for any LLM provider.
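As a quick illustration of that assumption (the chunk strings below are made up, not captured from a real provider): streamed sub-tokens usually carry their own leading whitespace, so joining them with an extra `" "` both splits words that arrive as multiple sub-tokens and duplicates spaces, while `"".join` reconstructs the text exactly as it was sent.

```python
# Hypothetical buffered chunks: "unbelievable" arrives as two sub-tokens,
# and the following words carry their own leading space.
chunk_list = ["unbeliev", "able", " results", " today"]

print(" ".join(chunk_list))  # 'unbeliev able  results  today' -> split word, doubled spaces
print("".join(chunk_list))   # 'unbelievable results today'    -> text reconstructed as sent
```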
Description
Fix a problem in which words composed of sub-tokens are split back into their base sub-tokens.
@Pouyanpi
Related Issue(s)
Solves issue #1197.
Checklist