❓ The question
Hi, and thank you for your work on this project!
I'm looking for a fully transparent language model and would like to confirm some details about the pretraining data, particularly the Reddit data, since my project involves evaluating natural language generation and I'm concerned about potential data contamination.
From what I understand:
- OLMo1 was pretrained on the Dolma dataset, which includes Reddit data sourced via the PushShift API as part of its broader composition (alongside Common Crawl and others).
- OLMo2, however, was pretrained on DCLM-BASELINE, which appears to rely solely on Common Crawl.
Here are my questions:
- Is my understanding of how OLMo1 and OLMo2 differ in their use of Reddit pretraining data accurate?
- If so, could you share why Reddit was omitted from OLMo2’s pretraining dataset? Reddit data doesn't seem to be explicitly filtered out of either Dolma's Common Crawl subset or DCLM-BASELINE, so could the decision stem from potential overlap between Dolma's Common Crawl subset and its PushShift Reddit subset? (See the rough sketch below for the kind of overlap check I have in mind.)
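
For context, here is a minimal sketch of the overlap check I'm thinking of: counting how many documents in a Common Crawl derived shard carry a reddit.com source URL. The shard path and the `metadata`/`url` field names are just placeholders, since I'm not sure of the exact JSONL schema used by Dolma or DCLM-BASELINE.

```python
# Placeholder sketch: estimate the share of Reddit-origin pages in a
# Common Crawl derived JSONL shard. The "metadata"/"url" field names
# and the shard filename are assumptions about the schema, not the
# actual Dolma/DCLM layout.
import gzip
import json
from urllib.parse import urlparse

def reddit_fraction(jsonl_gz_path: str, max_docs: int = 100_000) -> float:
    """Return the fraction of sampled documents whose source URL is on reddit.com."""
    total = 0
    reddit = 0
    with gzip.open(jsonl_gz_path, "rt", encoding="utf-8") as f:
        for line in f:
            if total >= max_docs:
                break
            doc = json.loads(line)
            url = (doc.get("metadata") or {}).get("url", "")
            host = urlparse(url).netloc.lower()
            total += 1
            if host.endswith("reddit.com"):
                reddit += 1
    return reddit / total if total else 0.0

# Example (hypothetical shard name):
# print(reddit_fraction("dclm_baseline_shard_000.jsonl.gz"))
```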
Thanks in advance for your help!