Description
Docker Hub publishes a relatively limited set of official Docker images. Grab the image IDs for all released versions of them so that, when we see a saved Docker image and reconstruct the Dockerfile it was created from, we can map the base image ID to an official image (+ version!). This can be a Python script under the dataset-generation subfolder that makes requests to the endpoints described below: it first grabs a list of all repositories in the library namespace, then grabs all the tags for each one. The end result is a list of sha256 hashes mapped to "name:tag" pairs.
The one downside is that, as new versions come out, the dataset will need to be updated regularly.
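Since the dataset has to be refreshed over time, one option is to merge each new crawl into the existing file rather than overwrite it. Tags such as latest move to new digests, and older saved images may still reference the old ones. The sketch below is only an illustration: it assumes the dataset is stored as a JSON object mapping each digest to a list of "name:tag" strings, and the function names and file layout are hypothetical.

```python
import json
from pathlib import Path


def merge_mappings(existing, fresh):
    """Union two digest -> ["name:tag", ...] mappings without dropping old digests."""
    merged = {digest: set(refs) for digest, refs in existing.items()}
    for digest, refs in fresh.items():
        merged.setdefault(digest, set()).update(refs)
    return {digest: sorted(refs) for digest, refs in sorted(merged.items())}


def update_dataset(path, fresh):
    """Merge `fresh` into the JSON file at `path` and return the newly seen digests."""
    path = Path(path)  # hypothetical dataset location chosen by the caller
    existing = json.loads(path.read_text()) if path.exists() else {}
    merged = merge_mappings(existing, fresh)
    path.write_text(json.dumps(merged, indent=2) + "\n")
    return sorted(set(merged) - set(existing))
```

Returning the list of newly seen digests makes it easy to log what each refresh added.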
Relevant API endpoints for this are:
- https://hub.docker.com/v2/repositories/library/?page_size=100 -- returns up to 100 repositories from library, the "default" Docker Hub namespace. If next is neither null nor an empty string, it holds the URL of the next page of items. For each entry in the results list, the "name" is needed for a subsequent query that lists the published tags and their hashes.
-
{
  "count": 177,
  "next": "https://hub.docker.com/v2/repositories/library/?page=2&page_size=100",
  "previous": null,
  "results": [
    {
      "name": "centos",
      "namespace": "library",
      "repository_type": "image",
      "status": 1,
      "status_description": "active",
      "description": "DEPRECATED; The official build of CentOS.",
      "is_private": false,
      "star_count": 7763,
      "pull_count": 1168131338,
      "last_updated": "2022-12-09T19:13:54.287062Z",
      "last_modified": "2024-10-16T13:48:34.145251Z",
      "date_registered": "2013-04-30T20:54:08Z",
      "affiliation": "",
      "media_types": ["application/vnd.docker.container.image.v1+json", "application/vnd.docker.distribution.manifest.list.v2+json"],
      "content_types": ["image"],
      "categories": [],
      "storage_size": 25907042872
    },
  ]
}
-
- https://hub.docker.com/v2/repositories/library/centos/tags?page_size=100 -- returns up to 100 tags for centos from library, the "default" Docker Hub namespace. Pagination works the same way as for the previous endpoint. The main information we care about here is a mapping to the tag name from two kinds of digest hash: the digest of the tag itself, and the digest of each entry in its images list (one per operating system/architecture). Each entry in the results list appears to correspond to one tag, which may list multiple images for different architectures/OSes, as shown.
-
{
  "count": 49,
  "next": null,
  "previous": null,
  "results": [
    {
      "creator": 7,
      "id": 2107,
      "images": [
        {
          "architecture": "amd64",
          "features": "",
          "variant": null,
          "digest": "sha256:a1801b843b1bfaf77c501e7a6d3f709401a1e0c83863037fa3aab063a7fdb9dc",
          "os": "linux",
          "os_features": "",
          "os_version": null,
          "size": 83518086,
          "status": "active",
          "last_pulled": "2025-01-22T00:23:40.260217Z",
          "last_pushed": "2021-09-15T18:38:28.495635Z"
        },
        {
          "architecture": "arm64",
          "features": "",
          "variant": "v8",
          "digest": "sha256:65a4aad1156d8a0679537cb78519a17eb7142e05a968b26a5361153006224fdc",
          "os": "linux",
          "os_features": "",
          "os_version": null,
          "size": 83941353,
          "status": "active",
          "last_pulled": "2025-01-22T00:34:28.042215Z",
          "last_pushed": "2021-09-15T17:56:15.896953Z"
        },
      ],
      "last_updated": "2021-09-15T18:38:56.608195Z",
      "last_updater": 1156886,
      "last_updater_username": "doijanky",
      "name": "latest",
      "repository": 54,
      "full_size": 83518086,
      "v2": true,
      "tag_status": "active",
      "tag_last_pulled": "2025-01-22T00:34:28.042215Z",
      "tag_last_pushed": "2021-09-15T18:38:56.608195Z",
      "media_type": "application/vnd.docker.distribution.manifest.list.v2+json",
      "content_type": "image",
      "digest": "sha256:a27fd8080b517143cbbbab9dfb7c8571c40d67d534bbdee55bd6c473f432b177"
    },
  ]
}
-
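The two endpoints above could be combined roughly as follows. This is a minimal sketch, not a finished script: it assumes the response shapes shown in the samples, uses only the standard library, and every function name in it (fetch_json, iter_results, tag_digests, build_mapping) is made up for illustration. The fetch parameter exists so the crawl logic can be exercised with canned pages instead of live requests.

```python
import json
from urllib.request import urlopen

HUB = "https://hub.docker.com/v2/repositories/library"


def fetch_json(url):
    """Download one page and decode it as JSON."""
    with urlopen(url, timeout=30) as resp:
        return json.load(resp)


def iter_results(url, fetch=fetch_json):
    """Yield every entry of `results`, following `next` until it is null or empty."""
    while url:
        page = fetch(url)
        yield from page.get("results", [])
        url = page.get("next")


def tag_digests(tag):
    """Return the digests of one tag entry: the tag's own digest plus one per image."""
    digests = [tag.get("digest")]
    digests += [image.get("digest") for image in tag.get("images", [])]
    # Assumption: a digest may be missing on some entries, so drop empty values
    # and duplicates while keeping the original order.
    return list(dict.fromkeys(d for d in digests if d))


def build_mapping(fetch=fetch_json):
    """Map each digest to the sorted list of "name:tag" strings that reference it."""
    mapping = {}
    for repo in iter_results(f"{HUB}/?page_size=100", fetch):
        name = repo["name"]
        for tag in iter_results(f"{HUB}/{name}/tags?page_size=100", fetch):
            for digest in tag_digests(tag):
                mapping.setdefault(digest, set()).add(f"{name}:{tag['name']}")
    return {digest: sorted(refs) for digest, refs in mapping.items()}
```

Each digest maps to a list rather than a single pair because several tags can point at the same digest. Calling build_mapping() with no arguments performs the full crawl over the network.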
(Note: the API described above is different from the API used to query other registries, such as the one for Windows containers hosted by Microsoft at https://mcr.microsoft.com/, which uses https://mcr.microsoft.com/v2/_catalog as its starting endpoint. That case can be handled in a follow-on task, if/when we identify an alternate container host that has popular base images for Docker containers.)
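For that follow-on task, a starting point might look like the sketch below. It assumes the host follows the standard registry HTTP API, where _catalog returns a "repositories" list and each repository has a tags/list endpoint returning "name" and "tags"; whether a given host exposes both would need to be checked. The helper names are hypothetical, and "windows/servercore" in the comment is only an example repository name.

```python
from urllib.parse import quote


def catalog_url(host):
    """URL of the repository catalog on a registry-API host."""
    return f"https://{host}/v2/_catalog"


def tags_url(host, repository):
    """URL of the tag list for one repository on a registry-API host."""
    # Repository names can contain slashes (e.g. "windows/servercore"); keep them.
    return f"https://{host}/v2/{quote(repository, safe='/')}/tags/list"


def repositories_from_catalog(payload):
    """Extract repository names from a decoded _catalog response."""
    return list(payload.get("repositories", []))


def refs_from_tag_list(payload):
    """Turn a decoded tags/list response into "name:tag" strings."""
    name = payload["name"]
    # Assumption: "tags" may be null for a repository without tags.
    return [f"{name}:{tag}" for tag in payload.get("tags") or []]
```

Unlike the Docker Hub responses, a tag list of this form carries names only, so the digests would need a further request per tag.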