We inherited space tokenization from onmt-py, which splits all text on the single space character " " (only this character, whereas onmt-py before 3.4 split on all Python whitespace).
However, it would be better to rely on the tokenizer to split the text into tokens.
That would make it easier to handle multiple spaces, multiple tabs, and linebreaks (\r, \n, etc.).
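To illustrate the difference between the two splitting behaviours mentioned above, here is a minimal sketch using only Python's built-in `str.split`. The sample string is made up for illustration, and this is not the actual onmt-py code.

```python
# Illustrative input with a double space, a tab, and a Windows linebreak.
text = "Hello  world\tfoo\r\nbar"

# Splitting on the single space character only: consecutive spaces
# produce empty strings, and tabs/linebreaks stay inside the tokens.
space_only = text.split(" ")

# Splitting on any whitespace (str.split with no argument): runs of
# whitespace collapse, so the original spacing cannot be recovered.
any_whitespace = text.split()

print(space_only)      # ['Hello', '', 'world\tfoo\r\nbar']
print(any_whitespace)  # ['Hello', 'world', 'foo', 'bar']
```

Neither variant keeps enough information to handle multiple spaces, tabs, and linebreaks cleanly, which is the motivation for letting the tokenizer see the raw text.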
Relying on the tokenizer would require reviewing all transforms, because at the moment they receive a list of tokens (they would instead need to receive strings or streams).
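As a rough sketch of what this interface change could mean for a transform, the two functions below contrast a token-list input with a raw-string input. Both `normalize_tokens` and `normalize_string` are hypothetical names invented for this example; they are not part of any existing transform API.

```python
import re
from typing import List


def normalize_tokens(tokens: List[str]) -> List[str]:
    """Hypothetical transform in the current style: receives pre-split tokens."""
    # The only whitespace trace left is empty tokens from repeated spaces;
    # tabs and linebreaks remain hidden inside the tokens.
    return [tok for tok in tokens if tok]


def normalize_string(text: str) -> str:
    """Hypothetical transform in the proposed style: receives the raw string."""
    # Normalize "\r\n" and lone "\r" to "\n", then collapse runs of
    # spaces and tabs into a single space.
    text = re.sub(r"\r\n?", "\n", text)
    return re.sub(r"[ \t]+", " ", text)


raw = "Hello  world\t\tfoo\r\nbar"
from_tokens = normalize_tokens(raw.split(" "))
from_string = normalize_string(raw)

print(from_tokens)        # ['Hello', 'world\t\tfoo\r\nbar']
print(repr(from_string))  # 'Hello world foo\nbar'
```

With string input the transform can decide for itself how to treat tabs and linebreaks, whereas with token input that decision has already been made by the upstream split.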