fix: `RecursiveSplitter` bug in the case when the recursive chunking is triggered by davidsbatista · Pull Request #9316 · deepset-ai/haystack · GitHub

fix: RecursiveSplitter bug in the case when the recursive chunking is triggered #9316


Merged: 4 commits into main, Apr 30, 2025


davidsbatista (Contributor) commented Apr 28, 2025

Related Issues

Proposed Changes

  • Removed the line that accumulated a running count when a split_text is longer than split_length and recursive chunking is triggered.

  • Added a new test case for RecursiveDocumentSplitter that verifies its behavior in this case.
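
For readers unfamiliar with recursive chunking, here is a toy sketch of the control flow the first bullet refers to. It is not the code from this PR or from Haystack: the function name `chunk_text`, the character-based length, and the way separators are kept are all assumptions made for illustration. The point it tries to show is that when an oversized piece is handed to the recursive call, the running length of the current chunk is left untouched rather than incremented.

```python
from typing import List


def chunk_text(text: str, separators: List[str], split_length: int) -> List[str]:
    """Toy recursive chunker (hypothetical, lengths measured in characters)."""
    if len(text) <= split_length:
        return [text]
    if not separators:
        # No separators left: fall back to fixed-size character chunks.
        return [text[i:i + split_length] for i in range(0, len(text), split_length)]

    sep, remaining = separators[0], separators[1:]
    if sep not in text:
        return chunk_text(text, remaining, split_length)

    # Keep each separator attached to the piece that precedes it.
    parts = text.split(sep)
    pieces = [p + sep for p in parts[:-1]]
    if parts[-1]:
        pieces.append(parts[-1])

    chunks: List[str] = []
    current: List[str] = []
    current_length = 0
    for piece in pieces:
        if len(piece) > split_length:
            # Flush what has been collected so far, then recurse.
            if current:
                chunks.append("".join(current))
                current, current_length = [], 0
            chunks.extend(chunk_text(piece, remaining, split_length))
            # Deliberately no `current_length += len(piece)` here: the piece
            # was already emitted by the recursive call, so counting it again
            # would make the next chunk look fuller than it is.
        elif current_length + len(piece) > split_length:
            chunks.append("".join(current))
            current, current_length = [piece], len(piece)
        else:
            current.append(piece)
            current_length += len(piece)
    if current:
        chunks.append("".join(current))
    return chunks


example = "abcdefghijklmnop qrs\n\ntu vw"
result = chunk_text(example, ["\n\n", " "], split_length=10)
```

In this sketch, adding the piece's length to `current_length` after the recursive call would cause the next small piece to be flushed early, or an empty chunk to be emitted, which is the kind of symptom a cumulative count can produce.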

How did you test it?

  • Added the new test function test_run_complex_text_with_multiple_separators() to the existing test file
  • Ran the test suite to verify that the new test passes, in addition to the CI tests

Checklist

  • I have read the contributors guidelines and the code of conduct
  • I have updated the related issue with new insights and changes
  • I added unit tests and updated the docstrings
  • I've used one of the conventional commit types for my PR title: fix:, feat:, build:, chore:, ci:, docs:, style:, refactor:, perf:, test: and added ! in case the PR includes breaking changes.
  • I documented my code
  • I ran pre-commit hooks and fixed any issue

@davidsbatista davidsbatista changed the title from "initial import" to "fix: RecursiveSplitter bug in the case where a split_text is found to be longer than split_length and recursive chunking is triggered" Apr 28, 2025
@davidsbatista davidsbatista changed the title from "fix: RecursiveSplitter bug in the case where a split_text is found to be longer than split_length and recursive chunking is triggered" to "fix: RecursiveSplitter bug in the case when the recursive chunking is triggered" Apr 28, 2025
coveralls (Collaborator) commented Apr 28, 2025

Pull Request Test Coverage Report for Build 14727388496

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • 2 unchanged lines in 1 file lost coverage.
  • Overall coverage decreased (-0.001%) to 90.499%

Files with Coverage Reduction | New Missed Lines | %
components/preprocessors/recursive_splitter.py | 2 | 90.67%

Totals
Change from base Build 14709162118: -0.001%
Covered Lines: 10878
Relevant Lines: 12020

💛 - Coveralls

@davidsbatista davidsbatista marked this pull request as ready for review April 29, 2025 09:00
@davidsbatista davidsbatista requested review from a team as code owners April 29, 2025 09:00
@davidsbatista davidsbatista requested review from dfokina, Amnah199 and julian-risch and removed request for a team April 29, 2025 09:00
julian-risch (Member) left a comment

LGTM! 👍

@@ -426,7 +426,7 @@ def test_run_separator_exists_but_split_length_too_small_fall_back_to_character_
     result = splitter.run(documents=[doc])
     assert len(result["documents"]) == 10
     for doc in result["documents"]:
-        if re.escape(doc.content) not in ["\ "]:
+        if re.escape(doc.content) not in ["\\ "]:
Member


I remember seeing a warning about a regex. I assume this helps to get rid of that warning. 👍

Contributor Author


yes, that's it 👍🏽
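
As a small illustration of the warning discussed in this thread (a sketch, not part of the PR): in Python, `"\ "` contains an unrecognized escape sequence, which the compiler reports as a DeprecationWarning or SyntaxWarning depending on the Python version, while `"\\ "` spells the same two-character string without a warning. The helper names `compile_warnings` and `literal_value` below are hypothetical.

```python
import re
import warnings


def compile_warnings(source: str) -> list:
    """Compile an expression and return the categories of warnings emitted."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        compile(source, "<example>", "eval")
    return [w.category for w in caught]


def literal_value(source: str) -> str:
    """Evaluate a string literal with warnings silenced."""
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")
        return eval(source)


# Source text of the two literals from the diff, written as raw strings
# so that this example itself does not trigger the warning.
OLD_LITERAL = r'"\ "'    # "\ "  -> invalid escape sequence
NEW_LITERAL = r'"\\ "'   # "\\ " -> escaped backslash followed by a space

old_warnings = compile_warnings(OLD_LITERAL)
new_warnings = compile_warnings(NEW_LITERAL)

# Both literals evaluate to the same string: a backslash and a space,
# which is also what re.escape produces for a single space.
same_value = literal_value(OLD_LITERAL) == literal_value(NEW_LITERAL)
matches_escape = re.escape(" ") == literal_value(NEW_LITERAL)
```

Since both spellings produce the same value, the change in the diff should leave the test's behavior as it was and only silence the warning.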

@julian-risch julian-risch removed the request for review from Amnah199 April 30, 2025 10:20
@davidsbatista davidsbatista merged commit 201becd into main Apr 30, 2025
24 checks passed
@davidsbatista davidsbatista deleted the fixing-splitting-bug-accumulative-count branch April 30, 2025 11:03
Successfully merging this pull request may close these issues.

Bug found in RecursiveDocumentSplitter