Pretraining with News Crawls by WMT 19 · Issue #31 · getalp/Flaubert · GitHub

Pretraining with News Crawls by WMT 19 #31

Open
divkakwani opened this issue Jul 10, 2020 · 1 comment

Comments

divkakwani commented Jul 10, 2020

I have a query regarding your training corpus.

The News Crawl corpora that you use are shuffled and de-duplicated. However, other models like BERT and RoBERTa are trained on non-shuffled corpora in which each document is demarcated by an empty line. With this un-shuffled form, when you create pre-training instances, segment A and segment B end up containing contiguous sentences. But in your case, the segments will contain non-contiguous sentences, right?
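
For concreteness, here is a minimal sketch of what I mean (hypothetical code, not your actual preprocessing): with a document-delimited corpus, segment pairs come from adjacent sentences of the same document, whereas with a shuffled one-sentence-per-line corpus like News Crawl, consecutive lines are unrelated.

```python
# Hypothetical sketch, not Flaubert's preprocessing code.

def read_documents(path):
    """Yield documents as lists of sentences; documents are separated by blank lines."""
    doc = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                doc.append(line)
            elif doc:
                yield doc
                doc = []
    if doc:
        yield doc

def contiguous_pairs(documents):
    """Document-delimited corpus: segments A and B are adjacent sentences of the same document."""
    for doc in documents:
        for i in range(0, len(doc) - 1, 2):
            yield doc[i], doc[i + 1]

def shuffled_pairs(path):
    """Shuffled one-sentence-per-line corpus: consecutive lines are unrelated, so A and B are non-contiguous."""
    with open(path, encoding="utf-8") as f:
        sentences = [l.strip() for l in f if l.strip()]
    for i in range(0, len(sentences) - 1, 2):
        yield sentences[i], sentences[i + 1]
```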

So my question is: what is your opinion on having non-contiguous sentences in the segments? Does it hurt the performance of MLM or of downstream tasks?

formiel (Contributor) commented Jul 19, 2020

Hi @divkakwani,

Sorry for the late reply. That is a great remark.

If Next Sentence Prediction (NSP) is included, then using individual (non-contiguous) sentences hurts the performance (as shown by the RoBERTa paper). In Flaubert, however, we didn't use NSP, so I guess the difference (if any) is small, but obviously we need experiments to confirm this. Furthermore, please note that News Crawl is the only corpus that was shuffled (representing about 12% of the total data), so the impact shouldn't be big.
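
To illustrate why I expect the difference to be small without NSP (a rough sketch only, not our actual preprocessing code): with an MLM-only objective there are no A/B segment pairs to construct, so sentences can simply be packed into fixed-length token blocks, and whether adjacent sentences are contiguous in the source matters much less than it would for NSP.

```python
# Rough illustration only (not the actual Flaubert pipeline).

def pack_into_blocks(sentences, tokenize, block_size=256):
    """Concatenate tokenized sentences into fixed-length blocks for MLM training."""
    buffer = []
    for sent in sentences:
        buffer.extend(tokenize(sent))
        while len(buffer) >= block_size:
            yield buffer[:block_size]
            buffer = buffer[block_size:]
    if buffer:  # last, possibly shorter block
        yield buffer

# Example with a trivial whitespace tokenizer:
blocks = list(pack_into_blocks(["Une phrase .", "Une autre phrase ."],
                               tokenize=str.split, block_size=4))
```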
