I have a query regarding your training corpus.
The News Crawl corpora that you use are both shuffled and de-duplicated. However, the corpora used by other models such as BERT and RoBERTa are not shuffled, and each document within the corpus is demarcated with an empty line. With that unshuffled form, the pre-training instances you create will contain contiguous sentences in segment A and segment B. In your case, however, the segments will contain non-contiguous sentences, right?
So my question is: what is your opinion on having non-contiguous sentences in the segments? Does it hurt the performance of MLM, or of downstream tasks?
If Next Sentence Prediction (NSP) is included, then using individual (non-contiguous) sentences hurts the performance (as shown by the RoBERTa paper). In Flaubert, however, we didn't use NSP, so I guess the difference (if any) is small, but obviously we need experiments to confirm this. Furthermore, please note that News Crawl is the only corpus that was shuffled (representing about 12% of the total data), so the impact shouldn't be big.
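To make the contrast concrete, here is a minimal sketch (hypothetical function names, not code from either repository) of how segment pairs come out of a BERT-style document-demarcated corpus versus a shuffled line-per-sentence corpus like News Crawl:

```python
# Sketch only: illustrates why one corpus format yields contiguous
# segment pairs and the other does not. The pairing scheme here is
# simplified (one sentence per segment, no length packing).
import random

def segments_from_documents(raw_text):
    """BERT-style corpus: documents separated by blank lines.
    Each A/B pair is drawn from adjacent sentences of one document,
    so segments are contiguous."""
    pairs = []
    for doc in raw_text.strip().split("\n\n"):
        sents = doc.splitlines()
        for i in range(0, len(sents) - 1, 2):
            pairs.append((sents[i], sents[i + 1]))  # contiguous A/B
    return pairs

def segments_from_shuffled(sentences, seed=0):
    """News Crawl-style corpus: sentences shuffled and de-duplicated,
    so any pairing produces non-contiguous, unrelated A/B segments."""
    rng = random.Random(seed)
    sents = list(sentences)
    rng.shuffle(sents)
    return [(sents[i], sents[i + 1]) for i in range(0, len(sents) - 1, 2)]

doc_corpus = "A1\nA2\nA3\nA4\n\nB1\nB2"
print(segments_from_documents(doc_corpus))
# pairs stay within a document: (A1, A2), (A3, A4), (B1, B2)
print(segments_from_shuffled(["s1", "s2", "s3", "s4"]))
# pairs are arbitrary: neighbors in the shuffled list, not in any text
```

Without NSP, the model never has to judge whether B follows A, which is why the contiguity of the pairs matters less in a pure-MLM setup.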