Determine where to release datasets · Issue #6 · LLNL/dapper

Closed · nightlark opened this issue Dec 5, 2024 · 5 comments

nightlark (Collaborator) commented Dec 5, 2024

We need to find somewhere to host the datasets that get created. Ideally, downloads should be available over HTTP(S) without requiring users to log in.

Options include:

  • GitHub (versioning for a bucket of files)
  • Zenodo (versioning for a bucket of files)
  • HuggingFace (versioning for a bucket of files)
  • Kaggle (confusing)
  • Oregon State University Open Source Lab (FTP directory/mirror)
  • LC Green Data Oasis (needs renewing every year)
  • LLNL DSI Open Data Initiative/UCSD
nightlark commented

Regardless of dataset host: have a version.<something> file with a list of published datasets, and links to the full URL for each.
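As a sketch of what that manifest could look like (the file name versions.json and every field below are hypothetical, since the format was deliberately left as version.<something>), a client could resolve a dataset's download URL like this:

```python
import json
import urllib.request

# Hypothetical manifest location -- the real file name and format are
# still undecided ("version.<something>" above).
MANIFEST_URL = "https://example.org/dapper/versions.json"

# The code below assumes a manifest shaped like:
# {
#   "datasets": [
#     {"name": "pypi", "version": "2025-02-27",
#      "url": "https://example.org/dapper/pypi.db.zip"}
#   ]
# }

def dataset_url(name: str) -> str:
    """Return the full download URL for a published dataset, by name."""
    with urllib.request.urlopen(MANIFEST_URL) as resp:
        manifest = json.load(resp)
    for entry in manifest["datasets"]:
        if entry["name"] == name:
            return entry["url"]
    raise KeyError(f"no published dataset named {name!r}")
```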

nightlark commented Feb 27, 2025

LLNL DSI Open Data Initiative/UCSD: Sounds interesting, at least as an index to help people discover the datasets. We need more information on what dataset hosting at UCSD looks like, and should fill out the provided README and metadata templates before meeting with UCSD in 1-2 weeks.

nightlark commented

Next steps:

  • Early next week: pick GitHub or HuggingFace to upload the current version of the datasets to
  • Continue waiting to see if the meeting with UCSD gets scheduled

nightlark commented

I did a download speed comparison: HuggingFace was the fastest, and Zenodo was fairly slow but has the best retention policy (the lifetime of CERN, i.e. at least another 1-2 decades). A rough timing sketch follows the list.

  • HuggingFace: 40-50 MB/s
  • GitHub Releases: ~30 MB/s
  • Zenodo: 10 MB/s
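For reference, a minimal sketch of how a comparison like this can be reproduced (the URL is a placeholder; substitute the same dataset file hosted on each mirror, and note this reports average throughput in MB/s, i.e. 10^6 bytes per second):

```python
import time
import urllib.request

def download_speed_mb_s(url: str, chunk_size: int = 1 << 20) -> float:
    """Stream a file and return average download throughput in MB/s."""
    start = time.monotonic()
    total_bytes = 0
    with urllib.request.urlopen(url) as resp:
        while True:
            chunk = resp.read(chunk_size)
            if not chunk:
                break
            total_bytes += len(chunk)
    return total_bytes / (time.monotonic() - start) / 1e6

# Placeholder URL -- point this at the same dataset file on each host.
print(f"{download_speed_mb_s('https://example.org/dapper/pypi.db.zip'):.1f} MB/s")
```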

I created organizations/communities on both HuggingFace and Zenodo, with the idea that HuggingFace will be used as the main download mirror, and Zenodo for long-term archiving/citing/linking to the datasets publicly (Zenodo also asks for more metadata when uploading a file).

Send me emails/usernames to get invited to the corresponding orgs, or if they have a "request to join" button, click that.

The one lingering question I have is how to name the repositories created: nuget, ubuntu, debian, pypi? Or (for the distros specifically) ubuntu-noble, debian-sid, to also capture which major version the dataset was created from? In a sense, each major version of Ubuntu has its own separate package repository, whereas nuget and pypi just dump everything under a single index.

nightlark commented

For uploaded datasets, take the sqlite database (e.g. pypi.db) and toss it in a zip file (pypi.db.zip) with no subfolders.
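A minimal sketch of that packaging step, assuming the pypi.db naming from the example (passing arcname keeps the archive flat, with no subfolders):

```python
import zipfile
from pathlib import Path

def package_dataset(db_path: str) -> Path:
    """Zip a dataset's sqlite file (e.g. pypi.db -> pypi.db.zip), no subfolders."""
    src = Path(db_path)
    dest = src.with_name(src.name + ".zip")
    with zipfile.ZipFile(dest, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        # arcname=src.name drops any parent directories, so the archive
        # holds just pypi.db at the top level.
        zf.write(src, arcname=src.name)
    return dest

package_dataset("pypi.db")  # writes pypi.db.zip alongside pypi.db
```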
