Clarification on Reddit Data Usage in OLMo1 vs OLMo2 Pretraining #824
Open
@LeoLaugier

❓ The question

Hi, and thank you for your work on this project!

I'm looking for a fully transparent language model and would like to confirm some details about the pretraining data, particularly the Reddit data (for my project, I'm concerned about potential data contamination in evaluations of natural language generation capabilities).

From what I understand:

  • OLMo1 was pretrained on the Dolma dataset, which includes Reddit data sourced from the Pushshift API as part of its broader composition (alongside Common Crawl and other sources).
  • OLMo2, however, was pretrained on DCLM-BASELINE, which appears to rely solely on Common Crawl. (A sketch for spot-checking the provenance of both corpora follows this list.)
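
If it helps to verify provenance directly, here is a minimal sketch of how one might stream a few records from each corpus off the Hugging Face Hub and inspect where they came from. The hub IDs (`allenai/dolma`, `mlfoundations/dclm-baseline-1.0`) and the presence of a `source` field are assumptions on my part, not something I've confirmed against the current dataset cards:

```python
# Hedged sketch: stream a handful of documents from each corpus and print
# their provenance. Hub IDs and field names are assumptions, not verified.
from datasets import load_dataset

def peek(hub_id: str, n: int = 5) -> None:
    """Print the keys and (if present) the 'source' field of the first n records."""
    ds = load_dataset(hub_id, split="train", streaming=True)
    for i, record in enumerate(ds):
        if i >= n:
            break
        # Dolma's documented record format includes a 'source' field; other
        # corpora may expose provenance under different keys.
        print(hub_id, sorted(record.keys()), record.get("source", "<no source field>"))

peek("allenai/dolma")                    # assumed hub ID for Dolma
peek("mlfoundations/dclm-baseline-1.0")  # assumed hub ID for DCLM-BASELINE
```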

Here are my questions:

  1. Is my understanding of the differences in pretraining data between OLMo1 and OLMo2 accurate with respect to the use of Reddit?
  2. If so, could you share why Reddit was omitted from OLMo2’s pretraining dataset? Reddit data doesn't seem to be explicitly filtered out of either Dolma's Common Crawl subset or DCLM-BASELINE, so could potential overlap between Dolma's Common Crawl subset and its Pushshift Reddit subset have motivated this decision? (See the overlap-check sketch after this list.)
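
To make question 2 concrete, here is a toy sketch of the kind of exact-overlap check I have in mind between the two Dolma subsets. The file paths are placeholders, and a serious contamination check would presumably use fuzzy deduplication (e.g., MinHash) over the full corpora rather than exact hashing of normalized text:

```python
# Toy sketch of an exact-overlap check between two JSONL document dumps.
# Paths are placeholders; the "text" field follows Dolma's documented format.
import hashlib
import json

def text_hashes(path: str) -> set[str]:
    """Hash the whitespace-normalized, lowercased text of each JSONL document."""
    hashes = set()
    with open(path, encoding="utf-8") as f:
        for line in f:
            doc = json.loads(line)
            normalized = " ".join(doc["text"].split()).lower()
            hashes.add(hashlib.sha256(normalized.encode("utf-8")).hexdigest())
    return hashes

cc_hashes = text_hashes("dolma_common_crawl_sample.jsonl")   # placeholder path
reddit_hashes = text_hashes("dolma_pushshift_sample.jsonl")  # placeholder path
overlap = cc_hashes & reddit_hashes
print(f"{len(overlap)} exact duplicates out of {len(reddit_hashes)} Reddit docs")
```

Even a small nonempty intersection here would suggest the two subsets duplicate content, which is what makes me wonder whether that drove the decision.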

Thanks in advance for your help!

Labels: type/question
