Pretraining with News Crawls by WMT 19 · Issue #31 · getalp/Flaubert · GitHub

Pretraining with News Crawls by WMT 19 #31

Open
divkakwani opened this issue Jul 10, 2020 · 1 comment

Comments

divkakwani commented Jul 10, 2020

I have a query regarding your training corpus.

The News Crawl corpora that you use are shuffled and de-duplicated. However, other models like BERT and RoBERTa are trained on non-shuffled corpora in which each document is demarcated by an empty line. With this un-shuffled form, when you create pre-training instances, segment A and segment B end up containing contiguous sentences. But in your case, the segments will contain non-contiguous sentences, right?
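
For concreteness, here is a minimal sketch of what I mean (hypothetical code, not your actual preprocessing): with a document-delimited corpus, segment pairs come from adjacent sentences of the same document, whereas with a shuffled one-sentence-per-line corpus like News Crawl, consecutive lines are unrelated.

```python
# Hypothetical sketch, not Flaubert's preprocessing code.

def read_documents(path):
    """Yield documents as lists of sentences; documents are separated by blank lines."""
    doc = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                doc.append(line)
            elif doc:
                yield doc
                doc = []
    if doc:
        yield doc

def contiguous_pairs(documents):
    """Document-delimited corpus: segments A and B are adjacent sentences of the same document."""
    for doc in documents:
        for i in range(0, len(doc) - 1, 2):
            yield doc[i], doc[i + 1]

def shuffled_pairs(path):
    """Shuffled one-sentence-per-line corpus: consecutive lines are unrelated, so A and B are non-contiguous."""
    with open(path, encoding="utf-8") as f:
        sentences = [l.strip() for l in f if l.strip()]
    for i in range(0, len(sentences) - 1, 2):
        yield sentences[i], sentences[i + 1]
```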

So my question is: what is your opinion on having non-contiguous sentences in the segments? Does it hurt the performance of MLM or of downstream tasks?

formiel (Contributor) commented Jul 19, 2020

Hi @divkakwani,

Sorry for the late reply. That is a great remark.

If Next Sentence Prediction (NSP) is included, then using individual (non-contiguous) sentences hurts the performance (as shown by the RoBERTa paper). In Flaubert, however, we didn't use NSP, so I guess the difference (if any) is small, but obviously we need experiments to confirm this. Furthermore, please note that News Crawl is the only corpus that was shuffled (representing about 12% of the total data), so the impact shouldn't be big.
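
To illustrate why I expect the difference to be small without NSP (a rough sketch only, not our actual preprocessing code): with an MLM-only objective there are no A/B segment pairs to construct, so sentences can simply be packed into fixed-length token blocks, and whether adjacent sentences are contiguous in the source matters much less than it would for NSP.

```python
# Rough illustration only (not the actual Flaubert pipeline).

def pack_into_blocks(sentences, tokenize, block_size=256):
    """Concatenate tokenized sentences into fixed-length blocks for MLM training."""
    buffer = []
    for sent in sentences:
        buffer.extend(tokenize(sent))
        while len(buffer) >= block_size:
            yield buffer[:block_size]
            buffer = buffer[block_size:]
    if buffer:  # last, possibly shorter block
        yield buffer

# Example with a trivial whitespace tokenizer:
blocks = list(pack_into_blocks(["Une phrase .", "Une autre phrase ."],
                               tokenize=str.split, block_size=4))
```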
