Lazy sentence split and tokenization by mh-northlander · Pull Request #231 · WorksApplications/Sudachi · GitHub

Lazy sentence split and tokenization #231

Merged: 8 commits merged on Jun 26, 2024

Conversation

mh-northlander (Collaborator) commented Jun 24, 2024

The current tokenization methods analyze the whole input text at once, which may cause an OOM with long input.
This PR adds methods for lazy analysis, which split and tokenize sentences on demand.
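To illustrate the idea of lazy analysis, here is a minimal sketch of an iterator that pulls text from a `Reader` in small chunks and yields one sentence at a time, so only the unread tail of the current chunk stays in memory. It is not the code from this PR: the class name `LazySentences`, the helper `collect`, and the rule "a sentence ends at `。` or a newline" are all assumptions made for the example.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// Hypothetical sketch, not Sudachi's API: yields sentences lazily from a Reader.
class LazySentences implements Iterator<String> {
    private final Reader reader;
    private final char[] chunk;                       // chunkSize must be > 0
    private final StringBuilder pending = new StringBuilder();
    private boolean eof = false;

    LazySentences(Reader reader, int chunkSize) {
        this.reader = reader;
        this.chunk = new char[chunkSize];
    }

    // Index just past the first sentence terminator in pending, or -1 if none.
    // Assumption: a sentence ends at '。' or '\n'.
    private int sentenceEnd() {
        for (int i = 0; i < pending.length(); i++) {
            char c = pending.charAt(i);
            if (c == '。' || c == '\n') {
                return i + 1;
            }
        }
        return -1;
    }

    @Override
    public boolean hasNext() {
        // Read more input only while no complete sentence is buffered.
        while (sentenceEnd() < 0 && !eof) {
            try {
                int n = reader.read(chunk);
                if (n < 0) {
                    eof = true;
                } else {
                    pending.append(chunk, 0, n);
                }
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
        return pending.length() > 0;
    }

    @Override
    public String next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        int end = sentenceEnd();
        if (end < 0) {
            end = pending.length();                   // trailing text without a terminator
        }
        String sentence = pending.substring(0, end);
        pending.delete(0, end);
        return sentence;
    }

    // Convenience helper for trying the iterator on an in-memory string.
    static List<String> collect(String text, int chunkSize) {
        List<String> out = new ArrayList<>();
        Iterator<String> it = new LazySentences(new StringReader(text), chunkSize);
        while (it.hasNext()) {
            out.add(it.next());
        }
        return out;
    }
}
```

In a real tokenizer each yielded sentence would be analyzed and discarded before the next chunk is read, which is what keeps memory use bounded for long input.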

WIP:

  • add test
  • IOTools.readAsMuchAsCan may split a surrogate pair at the end of the buffer.
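The surrogate pair item can be illustrated with a small sketch. When text is read into a fixed-size `char` buffer, the last `char` may be a high surrogate whose low surrogate has not been read yet. One common remedy, assumed here and not necessarily the fix adopted in this PR, is to hold that trailing high surrogate back and prepend it to the next chunk. The class `SurrogateSafeChunks` and its methods are hypothetical names, and `IOTools.readAsMuchAsCan` itself is not reproduced.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: chunked reading that never ends a chunk on a high surrogate.
class SurrogateSafeChunks {
    // Number of chars in buf[0..len) that can be emitted without splitting a pair.
    static int safeLength(char[] buf, int len) {
        if (len > 0 && Character.isHighSurrogate(buf[len - 1])) {
            return len - 1;                           // hold back the dangling high surrogate
        }
        return len;
    }

    static List<String> chunks(Reader reader, int chunkSize) throws IOException {
        if (chunkSize < 2) {
            throw new IllegalArgumentException("chunkSize must be at least 2");
        }
        List<String> out = new ArrayList<>();
        char[] buf = new char[chunkSize];
        int carried = 0;                              // 0 or 1 held-back chars at buf[0]
        while (true) {
            int n = reader.read(buf, carried, chunkSize - carried);
            if (n < 0) {
                if (carried > 0) {
                    out.add(new String(buf, 0, carried)); // unpaired surrogate at end of input
                }
                return out;
            }
            int len = carried + n;
            int safe = safeLength(buf, len);
            if (safe > 0) {
                out.add(new String(buf, 0, safe));
            }
            carried = len - safe;
            if (carried > 0) {
                buf[0] = buf[len - 1];                // move the held-back char to the front
            }
        }
    }

    // Convenience wrapper for in-memory strings; no checked exception to handle.
    static List<String> chunksOf(String text, int chunkSize) {
        try {
            return chunks(new StringReader(text), chunkSize);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

For example, with `chunkSize` 2 and the input `"a\uD867\uDE3Db"`, a naive reader would return `"a\uD867"` as its first chunk, cutting the supplementary character in half; the sketch defers the high surrogate so the pair stays together in the second chunk.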

mh-northlander (Collaborator, Author) commented

SonarCloud fails because SentenceSplittingLazyAnalysis duplicates code from SentenceSplittingAnalysis, but I want to keep both classes as they are. I think we will eventually replace SentenceSplittingAnalysis with SentenceSplittingLazyAnalysis.

Maybe we can replace it now, although that would change how the IOException is thrown.
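As a general illustration of why switching to a lazy implementation can change how an I/O failure surfaces, consider the sketch below. An eager method can declare `throws IOException` and fail at the call site, while `Iterator.hasNext()` and `next()` cannot throw checked exceptions, so a lazy version typically fails later, during iteration, with a wrapped `UncheckedIOException`. This is an assumption about the kind of change meant here, not a description of the Sudachi classes; `IoErrorTiming` and its methods are hypothetical.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.Iterator;
import java.util.NoSuchElementException;

// Hypothetical sketch comparing when an I/O failure becomes visible.
class IoErrorTiming {
    // A reader that always fails, standing in for a broken input source.
    static Reader failingReader() {
        return new Reader() {
            @Override
            public int read(char[] buf, int off, int len) throws IOException {
                throw new IOException("simulated read failure");
            }

            @Override
            public void close() {
            }
        };
    }

    // Eager style: all input is read up front, so the checked exception
    // is thrown by the call itself.
    static String eagerRead(Reader r) throws IOException {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[1024];
        int n;
        while ((n = r.read(buf)) >= 0) {
            sb.append(buf, 0, n);
        }
        return sb.toString();
    }

    // Lazy style: nothing is read until iteration starts, and the failure
    // has to be wrapped because Iterator methods declare no checked exceptions.
    static Iterator<String> lazyRead(Reader r) {
        return new Iterator<String>() {
            private String buffered;
            private boolean done;

            @Override
            public boolean hasNext() {
                if (buffered == null && !done) {
                    try {
                        char[] buf = new char[1024];
                        int n = r.read(buf);
                        if (n < 0) {
                            done = true;
                        } else {
                            buffered = new String(buf, 0, n);
                        }
                    } catch (IOException e) {
                        throw new UncheckedIOException(e);
                    }
                }
                return buffered != null;
            }

            @Override
            public String next() {
                if (!hasNext()) {
                    throw new NoSuchElementException();
                }
                String s = buffered;
                buffered = null;
                return s;
            }
        };
    }

    // Records where each style reports the failure.
    static String describe() {
        StringBuilder log = new StringBuilder();
        try {
            eagerRead(failingReader());
        } catch (IOException e) {
            log.append("eager:IOException;");
        }
        Iterator<String> it = lazyRead(failingReader()); // no exception yet
        log.append("lazy-created;");
        try {
            it.hasNext();
        } catch (UncheckedIOException e) {
            log.append("lazy:UncheckedIOException");
        }
        return log.toString();
    }
}
```

Callers that currently catch `IOException` around the eager call would need to handle the failure at the point of iteration instead, which is the kind of behavioral difference worth weighing before replacing one class with the other.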

@mh-northlander changed the title from "WIP: Lazy sentence split and tokenization" to "Lazy sentence split and tokenization" Jun 25, 2024
@mh-northlander mh-northlander requested a review from kazuma-t June 25, 2024 02:44

Quality Gate failed

Failed conditions
3.9% Duplication on New Code (required ≤ 3%)


@mh-northlander mh-northlander merged commit 26e731b into develop Jun 26, 2024
3 of 4 checks passed
@mh-northlander mh-northlander deleted the feature/lazy-tokenize-sentences branch June 26, 2024 01:27
@azagniotov

Hello Team, @kazuma-t and @mh-northlander, I see a lot of good work here. Are there any updates on when a new version of Sudachi will be released to Maven Central? 😃


3 participants