revamp default space tokenization - review the ((newline)) thing #33

vince62s · 2024-06-14T16:10:09Z

We inherited space tokenization from onmt-py, which splits all text on the single space character " " (only that character, whereas before onmt-py 3.4 it split on all Python whitespace).
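To make the difference concrete, here is a minimal sketch using only Python's built-in `str.split`. It is not the onmt-py code, and the sample string is made up for illustration.

```python
text = "Hello  world\tfoo\nbar"

# Splitting on the single space character only: two consecutive spaces
# produce an empty token, and tabs/newlines stay glued inside a token.
space_only = text.split(" ")

# Splitting on any run of Python whitespace (str.split with no argument).
any_whitespace = text.split()

print(space_only)      # ['Hello', '', 'world\tfoo\nbar']
print(any_whitespace)  # ['Hello', 'world', 'foo', 'bar']
```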

However, it would be better to rely on the tokenizer to split the text into tokens.

This would make it easier to handle multiple spaces, multiple tabs, and line breaks (\r, \n, etc.).
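As a rough illustration of why these cases are awkward with whitespace splitting, the sketch below compares a lossy split with a split that keeps whitespace runs. The regex is only a stand-in for a tokenizer that sees the raw string. It is an assumption for the example, not the project's tokenizer.

```python
import re

text = "a  b\t\tc\r\nd"

# Whitespace splitting discards the separators, so the original text
# cannot be rebuilt from the tokens.
lossy_tokens = text.split()
lossy_roundtrip = " ".join(lossy_tokens)

# Stand-in for a tokenizer (NOT a real subword model): keep each run of
# whitespace as its own piece so nothing is thrown away.
pieces = re.findall(r"\s+|\S+", text)
lossless_roundtrip = "".join(pieces)

print(lossy_roundtrip == text)     # False
print(lossless_roundtrip == text)  # True
```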

This would require reviewing all transforms, because at the moment they receive a list of tokens (they should instead receive strings or streams).
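A hedged sketch of what that interface change could look like. Both classes and their `apply` methods are hypothetical names invented for this example and do not mirror the actual transform API. The point is only that a string-based transform still sees the original whitespace.

```python
from typing import List


class TokenPrefixTransform:
    """Hypothetical transform in the current style: receives a token list."""

    def __init__(self, prefix: str) -> None:
        self.prefix = prefix

    def apply(self, tokens: List[str]) -> List[str]:
        return [self.prefix] + tokens


class StringPrefixTransform:
    """Hypothetical transform in the proposed style: receives the raw string."""

    def __init__(self, prefix: str) -> None:
        self.prefix = prefix

    def apply(self, text: str) -> str:
        return self.prefix + " " + text


raw = "line one\n\tline  two"

# Current style: the text is already split on " " before the transform runs.
token_result = TokenPrefixTransform("<en>").apply(raw.split(" "))

# Proposed style: the transform works on the untouched string, and the
# tokenizer splits it afterwards.
string_result = StringPrefixTransform("<en>").apply(raw)
```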

vince62s added the enhancement and help wanted labels on Jun 14, 2024
