Determine where to release datasets · Issue #6 · LLNL/dapper

Closed · nightlark opened this issue Dec 5, 2024 · 5 comments

nightlark (Collaborator) commented Dec 5, 2024

We need to find somewhere to host the datasets that get created. Ideally, downloads should be available over HTTP(S) without requiring users to log in.

Options include:

  • GitHub (versioning for a bucket of files)
  • Zenodo (versioning for a bucket of files)
  • HuggingFace (versioning for a bucket of files)
  • Kaggle (confusing)
  • Oregon State University Open Source Lab (FTP directory/mirror)
  • LC Green Data Oasis (needs renewing every year)
  • LLNL DSI Open Data Initiative/UCSD
nightlark commented

Regardless of dataset host: have a version.<something> file with a list of published datasets, and links to the full URL for each.
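As a sketch of what that manifest could look like (the file name versions.json and every field below are hypothetical, since the format was deliberately left as version.<something>), a client could resolve a dataset's download URL like this:

```python
import json
import urllib.request

# Hypothetical manifest location -- the real file name and format are
# still undecided ("version.<something>" above).
MANIFEST_URL = "https://example.org/dapper/versions.json"

# The code below assumes a manifest shaped like:
# {
#   "datasets": [
#     {"name": "pypi", "version": "2025-02-27",
#      "url": "https://example.org/dapper/pypi.db.zip"}
#   ]
# }

def dataset_url(name: str) -> str:
    """Return the full download URL for a published dataset, by name."""
    with urllib.request.urlopen(MANIFEST_URL) as resp:
        manifest = json.load(resp)
    for entry in manifest["datasets"]:
        if entry["name"] == name:
            return entry["url"]
    raise KeyError(f"no published dataset named {name!r}")
```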

nightlark commented Feb 27, 2025

LLNL DSI Open Data Initiative/UCSD: Sounds interesting, at least as an index to help people discover the datasets. We need more information on what dataset hosting at UCSD looks like, and should fill out the provided README and metadata templates before meeting with UCSD in 1-2 weeks.

nightlark commented

Next steps:

  • Early next week: pick GitHub or HuggingFace to upload the current version of the datasets to
  • Continue waiting to see if the meeting with UCSD gets scheduled

nightlark commented

I did a download speed comparison: HuggingFace was the fastest, and Zenodo was fairly slow but has the best retention policy (the lifetime of CERN, i.e. at least another 1-2 decades). A rough timing sketch follows the list.

  • HuggingFace: 40-50 MB/s
  • GitHub Releases: ~30 MB/s
  • Zenodo: 10 MB/s
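For reference, a minimal sketch of how a comparison like this can be reproduced (the URL is a placeholder; substitute the same dataset file hosted on each mirror, and note this reports average throughput in MB/s, i.e. 10^6 bytes per second):

```python
import time
import urllib.request

def download_speed_mb_s(url: str, chunk_size: int = 1 << 20) -> float:
    """Stream a file and return average download throughput in MB/s."""
    start = time.monotonic()
    total_bytes = 0
    with urllib.request.urlopen(url) as resp:
        while True:
            chunk = resp.read(chunk_size)
            if not chunk:
                break
            total_bytes += len(chunk)
    return total_bytes / (time.monotonic() - start) / 1e6

# Placeholder URL -- point this at the same dataset file on each host.
print(f"{download_speed_mb_s('https://example.org/dapper/pypi.db.zip'):.1f} MB/s")
```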

I created organizations/communities on both HuggingFace and Zenodo, with the idea that HuggingFace will be used as the main download mirror, and Zenodo for long-term archiving/citing/linking to the datasets publicly (Zenodo also asks for more metadata when uploading a file).

Send me emails/usernames to get invited to the corresponding orgs, or if they have a "request to join" button, click that.

The one lingering question I have is how to name the repositories created: nuget, ubuntu, debian, pypi? Or (for the distros specifically) ubuntu-noble, debian-sid, to also capture which major version the dataset was created from? In a sense, each major version of Ubuntu has its own separate package repository, whereas nuget and pypi just dump everything under a single index.

nightlark commented

For uploaded datasets, take the sqlite database (e.g. pypi.db) and toss it in a zip file (pypi.db.zip) with no subfolders.
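A minimal sketch of that packaging step, assuming the pypi.db naming from the example (passing arcname keeps the archive flat, with no subfolders):

```python
import zipfile
from pathlib import Path

def package_dataset(db_path: str) -> Path:
    """Zip a dataset's sqlite file (e.g. pypi.db -> pypi.db.zip), no subfolders."""
    src = Path(db_path)
    dest = src.with_name(src.name + ".zip")
    with zipfile.ZipFile(dest, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        # arcname=src.name drops any parent directories, so the archive
        # holds just pypi.db at the top level.
        zf.write(src, arcname=src.name)
    return dest

package_dataset("pypi.db")  # writes pypi.db.zip alongside pypi.db
```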
