Create dataset mapping of official Docker image IDs · Issue #50 · LLNL/dapper · GitHub
Create dataset mapping of official Docker image IDs #50

Open
nightlark opened this issue Jan 22, 2025 · 3 comments
@nightlark
Collaborator
nightlark commented Jan 22, 2025

There is a relatively limited set of official Docker images published on Docker Hub. Grab the image IDs for all released versions of them, so that when we see a saved Docker image and reconstruct the Dockerfile it was created from, we can map the base image ID to an official image (+ version!). This can be a Python script under the dataset-generation subfolder that makes requests to the endpoints described below: it first grabs a list of all repositories in the library namespace, then grabs all the tags for each one, with the end result being a list of sha256 digests mapped to "<repository>:<tag>" pairs.

One downside is that the dataset will need to be updated regularly as new versions come out.

Relevant API endpoints for this are:

  • https://hub.docker.com/v2/repositories/library/?page_size=100 -- returns up to 100 repositories per page from the "default" Docker Hub namespace (library). If next is neither null nor an empty string, it gives the URL of the next page of items. For each entry in the results list, the "name" is needed for a subsequent query to get the list of published tags and their hashes.
    • {
        "count":177,
        "next":"https://hub.docker.com/v2/repositories/library/?page=2&page_size=100",
        "previous":null,
        "results":
            [
                {
                    "name":"centos",
                    "namespace":"library",
                    "repository_type":"image",
                    "status":1,
                    "status_description":"active",
                    "description":"DEPRECATED; The official build of CentOS.",
                    "is_private":false,
                    "star_count":7763,
                    "pull_count":1168131338,
                    "last_updated":"2022-12-09T19:13:54.287062Z",
                    "last_modified":"2024-10-16T13:48:34.145251Z",
                    "date_registered":"2013-04-30T20:54:08Z",
                    "affiliation":"",
                    "media_types":["application/vnd.docker.container.image.v1+json","application/vnd.docker.distribution.manifest.list.v2+json"],
                    "content_types":["image"],
                    "categories":[],
                    "storage_size":25907042872
                },
            ]
      }
  • https://hub.docker.com/v2/repositories/library/centos/tags?page_size=100 -- returns up to 100 tags per page for centos from the "default" Docker Hub namespace (library). Pagination works the same way as for the previous endpoint. The main info we care about here is a mapping from digest hashes to the tag name: both the digest of the tag itself and the digests in its images list (one per operating system/architecture). Each entry in the results list appears to correspond to one tag, which may list multiple images for different architectures/OSes, as shown below.
    • {
        "count":49,
        "next":null,
        "previous":null,
        "results":[
          {
            "creator":7,
            "id":2107,
            "images": [
              {
                "architecture":"amd64",
                "features":"",
                "variant":null,
                "digest":"sha256:a1801b843b1bfaf77c501e7a6d3f709401a1e0c83863037fa3aab063a7fdb9dc",
                "os":"linux",
                "os_features":"",
                "os_version":null,
                "size":83518086,
                "status":"active",
                "last_pulled":"2025-01-22T00:23:40.260217Z",
                "last_pushed":"2021-09-15T18:38:28.495635Z"
              },
              {
                "architecture":"arm64",
                "features":"",
                "variant":"v8",
                "digest":"sha256:65a4aad1156d8a0679537cb78519a17eb7142e05a968b26a5361153006224fdc",
                "os":"linux",
                "os_features":"",
                "os_version":null,
                "size":83941353,
                "status":"active",
                "last_pulled":"2025-01-22T00:34:28.042215Z",
                "last_pushed":"2021-09-15T17:56:15.896953Z"
              },
            ],
            "last_updated":"2021-09-15T18:38:56.608195Z",
            "last_updater":1156886,
            "last_updater_username":"doijanky",
            "name":"latest",
            "repository":54,
            "full_size":83518086,
            "v2":true,
            "tag_status":"active",
            "tag_last_pulled":"2025-01-22T00:34:28.042215Z",
            "tag_last_pushed":"2021-09-15T18:38:56.608195Z",
            "media_type":"application/vnd.docker.distribution.manifest.list.v2+json",
            "content_type":"image",
            "digest":"sha256:a27fd8080b517143cbbbab9dfb7c8571c40d67d534bbdee55bd6c473f432b177"
          },
        ]
      }
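A minimal sketch of how the two endpoints above could be combined, assuming the response fields shown in the examples ("results", "next", "name", "digest", "images"). The function names (`paginate`, `digests_for_tag`, `build_mapping`) and the injectable `fetch` parameter are hypothetical and not part of the repository. Because several tags can point at the same digest, each digest maps to a set of "<repository>:<tag>" strings rather than a single one.

```python
import json
from urllib.request import urlopen

# Base URL of the "library" namespace, taken from the endpoints listed above.
HUB = "https://hub.docker.com/v2/repositories/library/"


def fetch_json(url):
    """Fetch a URL and decode the JSON body (default fetcher, hits the network)."""
    with urlopen(url) as resp:
        return json.load(resp)


def paginate(url, fetch=fetch_json):
    """Yield every entry of "results", following "next" until it is null or empty."""
    while url:
        page = fetch(url)
        yield from page.get("results", [])
        url = page.get("next")


def digests_for_tag(repo, tag):
    """Yield (digest, "repo:tag") for the tag digest and each per-platform image digest."""
    label = f"{repo}:{tag['name']}"
    if tag.get("digest"):
        yield tag["digest"], label
    for image in tag.get("images") or []:
        if image.get("digest"):
            yield image["digest"], label


def build_mapping(fetch=fetch_json, page_size=100):
    """Return {digest: {"repo:tag", ...}} for every repository in the library namespace."""
    mapping = {}
    for repo in paginate(f"{HUB}?page_size={page_size}", fetch):
        name = repo["name"]
        for tag in paginate(f"{HUB}{name}/tags?page_size={page_size}", fetch):
            for digest, label in digests_for_tag(name, tag):
                mapping.setdefault(digest, set()).add(label)
    return mapping
```

Calling `build_mapping()` with no arguments would issue real requests to Docker Hub; passing a stub `fetch` makes it possible to exercise the pagination logic offline. Rate limiting and retries are left out of this sketch.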

(Note: the API described above is different from the API used to query other registries, such as the one for Windows containers hosted by Microsoft at https://mcr.microsoft.com/, which uses https://mcr.microsoft.com/v2/_catalog as its starting endpoint. That case can be handled in a follow-on task, if/when we identify an alternate container host that has popular base images for Docker containers.)

@nightlark nightlark changed the title Dataset mapping of official Docker image IDs Create dataset mapping of official Docker image IDs Jan 22, 2025
@nightlark
Collaborator Author

Another endpoint that could be useful after getting information on all the tags is:

Essentially, it is returning what is in https://github.com/docker-library/busybox/blob/master/latest/glibc/amd64/image-manifest.json for the "mediaType": "application/vnd.oci.image.layer.v1.tar+gzip" in layers.

It looks like if there is a way to fetch the config with "mediaType": "application/vnd.oci.image.config.v1+json", that might give something like https://github.com/docker-library/busybox/blob/master/latest/glibc/amd64/image-config.json, which defines the rootfs and gives a digest for the uncompressed layer.
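As a rough illustration of how the two documents relate, the sketch below pairs each compressed layer digest from an image manifest with the uncompressed digest listed at the same position in the image config. It assumes the OCI layout, where the manifest has a "layers" list and the config has "rootfs" with a "diff_ids" list in the same order. The function name `pair_layers` is hypothetical, and how the manifest and config are actually fetched is not covered here.

```python
def pair_layers(manifest, config):
    """Pair compressed layer digests (manifest) with uncompressed diff_ids (config).

    Assumes both lists are ordered from the base layer upward, as in the
    OCI image manifest and image config formats.
    """
    layers = manifest.get("layers", [])
    diff_ids = config.get("rootfs", {}).get("diff_ids", [])
    if len(layers) != len(diff_ids):
        raise ValueError("manifest layers and config diff_ids differ in length")
    return [
        {
            "compressed": layer["digest"],
            "uncompressed": diff_id,
            "media_type": layer.get("mediaType"),
        }
        for layer, diff_id in zip(layers, diff_ids)
    ]
```

The length check is there so that a manifest and config from different images fail loudly instead of producing mismatched pairs.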

@nightlark nightlark self-assigned this Jan 23, 2025
@nightlark
Collaborator Author

2/20: Finished scraping all Docker Hub images ("library"/"_" namespace) published prior to 1/22.

@nightlark
Collaborator Author

2/27: Still need to write code that scrapes all tags newly added since the last scrape, and convert the scraped data into a SQLite database.
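One possible shape for those two remaining pieces, as a sketch only: the table name `image_digests`, its columns, and the helper names below are all assumptions, not the project's actual schema. It uses the "tag_last_pushed", "digest", and "images" fields from the tags response shown earlier, and relies on `INSERT OR IGNORE` so that re-running a scrape does not duplicate rows.

```python
import sqlite3
from datetime import datetime

# Hypothetical schema: one row per (digest, repository, tag) combination.
SCHEMA = """
CREATE TABLE IF NOT EXISTS image_digests (
    digest      TEXT NOT NULL,
    repository  TEXT NOT NULL,
    tag         TEXT NOT NULL,
    last_pushed TEXT,
    PRIMARY KEY (digest, repository, tag)
)
"""


def parse_ts(value):
    """Parse a timestamp such as 2021-09-15T18:38:56.608195Z into an aware datetime."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def newly_pushed(tags, since):
    """Return the tags whose tag_last_pushed is later than the `since` timestamp."""
    cutoff = parse_ts(since)
    return [
        t for t in tags
        if t.get("tag_last_pushed") and parse_ts(t["tag_last_pushed"]) > cutoff
    ]


def store_tags(conn, repository, tags):
    """Insert the tag digest and every per-platform image digest; existing rows are kept."""
    conn.execute(SCHEMA)
    rows = []
    for tag in tags:
        digests = [tag.get("digest")]
        digests += [img.get("digest") for img in tag.get("images") or []]
        rows += [
            (d, repository, tag["name"], tag.get("tag_last_pushed"))
            for d in digests if d
        ]
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT OR IGNORE INTO image_digests VALUES (?, ?, ?, ?)", rows
        )


def lookup(conn, digest):
    """Return the sorted "repository:tag" names recorded for a digest."""
    cur = conn.execute(
        "SELECT repository || ':' || tag FROM image_digests WHERE digest = ? ORDER BY 1",
        (digest,),
    )
    return [row[0] for row in cur]
```

An incremental run could then fetch the tags for each repository, filter them with `newly_pushed(tags, last_scrape_time)`, and pass the result to `store_tags`. Note that a tag re-pushed with a new digest would add a row while the old digest's row remains, which seems desirable for identifying older base images.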
