revamp default space tokenization - review the ((newline)) thing · Issue #33 · eole-nlp/eole · GitHub

revamp default space tokenization - review the ((newline)) thing #33


Open
vince62s opened this issue Jun 14, 2024 · 0 comments
Labels
enhancement New feature or request help wanted Extra attention is needed

Comments

@vince62s
Contributor

We inherited space tokenization from onmt-py, which splits all text on the single space character " " (and only on that character; before onmt-py 3.4 it split on all Python whitespace).

However, it would be better to rely on the tokenizer to split the text into tokens.

That would make it easier to handle multiple spaces, multiple tabs, and line breaks (\r, \n, etc.).
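
To make the difference concrete, here is a minimal sketch in plain Python (it does not use any eole or onmt-py code). It compares splitting on " " only, splitting on all Python whitespace, and a lossless regex split. The regex split is only a hypothetical stand-in for "let the tokenizer see the raw string"; it is not the project's tokenizer.

```python
import re

text = "Hello  world\tfoo\r\nbar"

# Split on the single space character only: the double space yields an
# empty token, and the tab and CRLF stay glued inside the last token.
space_only = text.split(" ")

# Split on any Python whitespace: runs collapse cleanly, but the original
# tab and line break are lost and cannot be restored afterwards.
any_whitespace = text.split()

# Illustrative lossless split (an assumption, not eole's tokenizer):
# keep whitespace runs as their own pieces so the text round-trips.
pieces = re.findall(r"\s+|\S+", text)

print(space_only)      # ['Hello', '', 'world\tfoo\r\nbar']
print(any_whitespace)  # ['Hello', 'world', 'foo', 'bar']
print(pieces)          # ['Hello', '  ', 'world', '\t', 'foo', '\r\n', 'bar']
print("".join(pieces) == text)  # True
```

Neither `split` variant both separates tokens cleanly and preserves the original whitespace, which is the motivation for handing the raw string to the tokenizer.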

This would require reviewing all transforms, because at the moment they receive a list of tokens (they should instead receive strings or streams).
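
As a rough illustration of what string-based transforms could look like, here is a sketch under stated assumptions: `normalize_spaces`, `lowercase`, and `apply_transforms` are hypothetical names invented for this example and do not correspond to the existing transform classes.

```python
import re
from typing import Callable, List

# Hypothetical string-level transforms: each takes and returns a raw string.
def normalize_spaces(text: str) -> str:
    # Collapse runs of spaces and tabs, but leave line breaks untouched.
    return re.sub(r"[ \t]+", " ", text)

def lowercase(text: str) -> str:
    return text.lower()

def apply_transforms(text: str, transforms: List[Callable[[str], str]]) -> str:
    # Apply the transforms in order; tokenization would happen after this.
    for transform in transforms:
        text = transform(text)
    return text

raw = "Hello \t World\nSecond  LINE"

# What a token-list transform would receive after a split on " ":
legacy_tokens = raw.split(" ")
print(legacy_tokens)  # ['Hello', '\t', 'World\nSecond', '', 'LINE']

# What the string-level pipeline produces from the same input:
result = apply_transforms(raw, [normalize_spaces, lowercase])
print(repr(result))   # 'hello world\nsecond line'
```

With the token-list input, each transform has to cope with empty tokens and tokens containing tabs or line breaks; with string input, whitespace handling can live in one place before tokenization.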

@vince62s vince62s added enhancement New feature or request help wanted Extra attention is needed labels Jun 14, 2024