Create Debian/Ubuntu package dataset #4
Comments
Update or write the script so that it outputs results as a SQLite database, with an index on the file name for fast lookups.
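For reference, here is a minimal sketch of what an indexed SQLite output could look like. The table name `package_files`, its columns, and the helper names are assumptions for illustration, not the project's actual schema.

```python
import sqlite3


def build_db(rows, path=":memory:"):
    """Create a SQLite database from (file_name, full_path, package) tuples.

    The schema here is hypothetical; the real script may use other columns.
    """
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS package_files ("
        "file_name TEXT NOT NULL, "
        "full_path TEXT NOT NULL, "
        "package TEXT NOT NULL)"
    )
    conn.executemany("INSERT INTO package_files VALUES (?, ?, ?)", rows)
    # Index the lookup key so queries by file name avoid a full table scan.
    conn.execute(
        "CREATE INDEX IF NOT EXISTS idx_file_name ON package_files (file_name)"
    )
    conn.commit()
    return conn


def lookup(conn, file_name):
    """Return the sorted, de-duplicated packages that ship this file name."""
    cur = conn.execute(
        "SELECT DISTINCT package FROM package_files "
        "WHERE file_name = ? ORDER BY package",
        (file_name,),
    )
    return [row[0] for row in cur]
```

Creating the index after the bulk insert is usually faster than maintaining it row by row during loading, which matters for a dataset of this size.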
1/16 - Regenerate the dataset using the normalized file name function to produce the quick-lookup file name key (ran into an issue with a Haskell file name normalization corner case).
1/23 - Steven updated the dataset using the fixed normalization function (fixed in #52).
2/13 - Steven generated a SQLite database from the Debian bookworm Contents file. About 25% of files in Debian don't appear in Ubuntu; it's not clear whether Ubuntu is lagging behind or choosing not to pull certain packages. Open question: which of the packages those files belong to don't appear in Ubuntu?
Ubuntu has 4x as many files as Debian.
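One way to approach the open question above is to parse both Contents files and diff the package sets. The sketch below assumes the usual Contents line layout (a path, whitespace, then a comma-separated list of `section/package` locations) and uses the plain base name as a stand-in for the project's normalization function; all helper names are hypothetical.

```python
import posixpath


def parse_contents_line(line):
    """Split one Contents line into (path, [package, ...]).

    Paths may contain spaces, so split once from the right: the last
    whitespace-separated field is the location list.
    """
    parts = line.rstrip("\n").rsplit(None, 1)
    if len(parts) != 2:
        return None  # blank or malformed line
    path, locations = parts
    # Each location looks like "section/package"; keep only the package name.
    packages = [loc.rsplit("/", 1)[-1] for loc in locations.split(",")]
    return path, packages


def file_names_by_package(lines):
    """Map each package name to the set of base file names it ships."""
    result = {}
    for line in lines:
        parsed = parse_contents_line(line)
        if parsed is None:
            continue
        path, packages = parsed
        name = posixpath.basename(path)  # stand-in for the real normalization
        for pkg in packages:
            result.setdefault(pkg, set()).add(name)
    return result


def packages_missing_from(debian, ubuntu):
    """Return the sorted package names present in Debian but not in Ubuntu."""
    return sorted(set(debian) - set(ubuntu))
```

Comparing the files owned by the resulting packages against the 25% figure would show how much of the gap comes from whole packages being absent rather than from version differences.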
Once #6 is done, upload the current Ubuntu dataset, and consider also uploading the Debian dataset.
Name Ubuntu/Debian dataset releases after the distro's major version, e.g. ubuntu-24.04.db or ubuntu-noble.db.
@nightlark: create the repos for this on HF and Zenodo.
Steven uploaded datasets for 3 Debian versions (bookworm, bullseye, buster) and 3 Ubuntu versions (jammy, focal, noble) to HuggingFace (https://huggingface.co/dapper-datasets).
Create a package dataset for Debian/Ubuntu that maps file names to the package(s) that could have installed the files. Relates to #5 and #8 for determining how file names should be normalized to use as a "key" for lookups. We should also consider how we may want to split up the dataset into smaller chunks based on how it will be used (e.g. only includes, only binary files, etc).
Some potential sources of data for this are: