There is a relatively limited set of official Docker images published on Docker Hub. Grab the image IDs for all released versions of them, so that when we see a saved Docker image and reconstruct the Dockerfile it was created from, we can map the base image ID to an official image (+ version!). This can be a Python script under the dataset-generation subfolder that makes requests to the endpoints listed below: it first grabs a list of all repositories in the library namespace, then grabs all the tags for each one. The end result is a list of digest hashes (sha256 on Docker Hub) mapped to "repository:tag" pairs.
The one downside is that the dataset will need to be regenerated regularly as new versions come out.
Relevant API endpoints for this are:
https://hub.docker.com/v2/repositories/library/?page_size=100 -- returns 100 items from the "default" Docker Hub namespace, library. If "next" is not null or an empty string, it gives the URL of the next page of items. For each entry in the "results" list, the "name" is needed for a subsequent query that lists the published tags and their hashes.
https://hub.docker.com/v2/repositories/library/centos/tags?page_size=100 -- returns 100 tags for centos from the "default" Docker Hub namespace, library. Pagination works the same way as for the previous endpoint. The main info we care about here is a mapping from two kinds of digest hash to the tag name: the digest of each entry in the tag's images list (one per operating system/architecture) and the digest of the tag itself. Each entry in the results list appears to correspond to one tag, which may list multiple images for different architectures/OSes.
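The two endpoints above could be combined roughly as follows. This is only a sketch under stated assumptions, not the final script: the JSON field names `images` and `digest` are inferred from the description above rather than checked against API documentation, and the helper names (`fetch_json`, `iter_results`, `build_mapping`) are hypothetical. Because several tags can point at the same digest (for example a version tag and `latest`), each digest maps to a set of "repository:tag" labels instead of a single one.

```python
import json
from urllib.request import urlopen

# Listing endpoint for the "library" namespace, as given in the issue text.
BASE = "https://hub.docker.com/v2/repositories/library/"


def fetch_json(url):
    """Fetch a URL and decode the JSON body (hypothetical helper)."""
    with urlopen(url) as resp:
        return json.load(resp)


def iter_results(url, fetch=fetch_json):
    """Yield every entry of "results", following "next" until it is null or empty."""
    while url:
        page = fetch(url)
        yield from page.get("results", [])
        url = page.get("next")


def build_mapping(fetch=fetch_json, page_size=100):
    """Return {digest: set of "repository:tag" labels} for the library namespace.

    Assumes each tag entry has a "name", an optional tag-level "digest", and an
    optional "images" list whose items each carry their own "digest".
    """
    mapping = {}
    for repo in iter_results(f"{BASE}?page_size={page_size}", fetch):
        name = repo["name"]
        tags_url = f"{BASE}{name}/tags?page_size={page_size}"
        for tag in iter_results(tags_url, fetch):
            label = f"{name}:{tag['name']}"
            digests = [tag.get("digest")]
            digests += [img.get("digest") for img in tag.get("images") or []]
            for digest in digests:
                if digest:  # skip entries with a missing or null digest
                    mapping.setdefault(digest, set()).add(label)
    return mapping
```

Calling `build_mapping()` with no arguments would query the live API for every repository and tag, which means many requests; the `fetch` parameter is there so that a cache, rate limiter, or test stub can be swapped in.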
(Note: the API described above is different from the API used to query other registries, such as the one for Windows containers hosted by Microsoft at https://mcr.microsoft.com/, which uses https://mcr.microsoft.com/v2/_catalog as its starting endpoint. That case can be handled in a follow-on task, if/when we identify an alternate container host that has popular base images for Docker containers.)
nightlark changed the title from "Dataset mapping of official Docker image IDs" to "Create dataset mapping of official Docker image IDs" on Jan 22, 2025