Determine where to release datasets #6
Comments
Regardless of dataset host -- have a |
LLNL DSI Open Data Initiative/UCSD: Sounds interesting, at least as an index to help people discover the datasets. Need to get more information on what dataset hosting at UCSD looks like. Should fill out the provided README and metadata templates before meeting with UCSD in 1-2 weeks.
Next steps:
|
I ran a download speed comparison: HuggingFace was the fastest, while Zenodo was fairly slow but has the best retention policy (the lifetime of CERN, so at least another 1-2 decades).
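For anyone who wants to repeat the comparison, here is a minimal sketch of one way to time a download. It is not the script used for the numbers above; the function names `measure_download` and `compare_mirrors` and any URLs passed in are hypothetical, and it only uses the Python standard library.

```python
import time
import urllib.request


def measure_download(url: str, chunk_size: int = 1 << 20) -> tuple[int, float]:
    """Stream `url` to completion and return (bytes_read, elapsed_seconds)."""
    start = time.perf_counter()
    total = 0
    with urllib.request.urlopen(url) as resp:
        # Read in chunks so large datasets are never held in memory at once.
        while chunk := resp.read(chunk_size):
            total += len(chunk)
    return total, time.perf_counter() - start


def compare_mirrors(mirrors: dict[str, str]) -> dict[str, float]:
    """Return throughput in MB/s for each mirror name -> URL pair (hypothetical helper)."""
    results = {}
    for name, url in mirrors.items():
        size, elapsed = measure_download(url)
        # Guard against a zero elapsed time on very small files.
        results[name] = size / 1e6 / max(elapsed, 1e-9)
    return results
```

Call `compare_mirrors` with a dict mapping a mirror name to the direct download URL of the same file on each host. A single run is noisy, so repeating it a few times at different times of day gives a fairer picture.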
I created organizations/communities on both HuggingFace and Zenodo, with the idea that HuggingFace will be the main download mirror and Zenodo the long-term archive for citing and publicly linking to the datasets (Zenodo also asks for more metadata when uploading a file). Send me your emails/usernames to get invited to the corresponding orgs, or, if they have a request-to-join button, click that:
The one lingering question I have is how to name the repositories: nuget, ubuntu, debian, pypi? Or, for the distros specifically, ubuntu-noble and debian-sid, to also capture which major version the dataset was created from? In a sense each major version of Ubuntu has its own separate package repository, whereas nuget and pypi just dump everything under a single index.
For uploaded datasets, take the SQLite database (e.g. pypi.db) and put it in a zip file (pypi.db.zip) with no subfolders.
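A minimal sketch of that packaging step, assuming Python's standard `zipfile` module is acceptable; the function name `package_dataset` is hypothetical and not part of the project.

```python
import zipfile
from pathlib import Path


def package_dataset(db_path: str) -> Path:
    """Zip a SQLite database next to itself, e.g. pypi.db -> pypi.db.zip."""
    db = Path(db_path)
    archive = db.with_name(db.name + ".zip")
    with zipfile.ZipFile(archive, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        # arcname=db.name stores the file at the archive root, so the zip
        # contains no subfolders regardless of where the database lives.
        zf.write(db, arcname=db.name)
    return archive
```

Passing `arcname` explicitly is the important part: without it, `zipfile` records the path as given, which can introduce directories into the archive.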
We need to find somewhere to host the datasets that get created. Ideally the download should be available over HTTP(S), and not require users to log in.
Options include:
- Kaggle (confusing)
- LC Green Data Oasis (needs renewing every year)