feat: run more things in parallel #3636
Conversation
Signed-off-by: Keith Zantow <kzantow@gmail.com>
I pulled this, built it locally, and then tested it with a few containers. Maybe I'm doing it wrong, but I don't see a positive difference. Is it only going to benefit certain use cases or container types? I used three different containers, and ran syft v1.19.0 and v1.19.0 with this patch (v1.19.0-pfh) with increasing levels of parallelism.
Is there a better test I could do? These are the images I used:
docker.io/nextcloud:latest
docker.io/opensearchproject/opensearch:latest
docker.io/pytorch/pytorch:latest
@popey -- hmm... It's probably worth comparing apples to apples -- testing this vs. …
Ok, I'll re-run with the new update, and larger degree of parallelism. What's your definition of "small" in image terms? The ones I'm currently using are this kinda size...
For some reason, I didn't look at the image names 🤦 This change is a lot less about the number of files and more about the total bytes to process -- what are the sizes in GB? Uncompressed sizes are:
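As an aside, the "degree of parallelism" being discussed here is driven by syft's configuration. A minimal sketch of what that might look like (the top-level `parallelism` key and the `.syft.yaml` location are assumptions based on syft's documented config layout; check the docs for your version):

```yaml
# .syft.yaml (sketch; key name assumed)
# Upper bound on concurrent cataloging/hashing workers.
parallelism: 8
```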
Are these too small? I went for something a little bigger:
docker.io/nextcloud:latest
docker.io/opensearchproject/opensearch:latest
docker.io/pytorch/pytorch:latest
docker.io/huggingface/transformers-all-latest-torch-nightly-gpu:latest
Hm, this is weird. I see the pfh PR has better times when using high parallelism, but I'm not quite sure why the v1.19.0 runs are faster than the pfh ones to start with!?
Are you still using the …
v1.19.0-pfh is this PR - rebuilt a couple of hours ago, after this PR was updated. Maybe poorly named, it's just this PR.
@popey right, so it does not include all the other changes on …
Maybe. I'm more looking at this from a user perspective: what will 1.20 (or whatever it's called) look like compared to 1.19?
I re-ran my tests on this PR using measure-syft. It ran against this PR and main five times each. The summary is below, and specific details from the logs are further down. Looks great!

Syft Performance Test Results

Date: 2025-02-07 15:31:21

Results

Logs snippets

Main

feat/parallelize-file-hashing
Looking forward to a faster syft 🚀
Description
This PR implements concurrent cataloging at the file level for many catalogers, and plumbs the parallelism config through to the file hasher.
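The file-level concurrency described above can be sketched as a bounded worker pool over the files to be hashed. This is a minimal illustration only: `digestAll` and its shape are hypothetical, and syft's actual implementation differs.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sync"
)

// digestAll hashes each input concurrently, allowing at most `parallelism`
// hashing operations to run at once (hypothetical helper; not syft's API).
func digestAll(inputs [][]byte, parallelism int) []string {
	results := make([]string, len(inputs))
	sem := make(chan struct{}, parallelism) // semaphore bounding concurrency
	var wg sync.WaitGroup
	for i, data := range inputs {
		wg.Add(1)
		go func(i int, data []byte) {
			defer wg.Done()
			sem <- struct{}{}        // acquire a worker slot
			defer func() { <-sem }() // release it when done
			sum := sha256.Sum256(data)
			results[i] = hex.EncodeToString(sum[:])
		}(i, data)
	}
	wg.Wait()
	return results
}

func main() {
	files := [][]byte{[]byte("a"), []byte("b"), []byte("c")}
	for _, d := range digestAll(files, 2) {
		fmt.Println(d)
	}
}
```

Because hashing is CPU- and I/O-bound per file, the win here scales with total bytes processed rather than file count, which matches the discussion above about large images benefiting most.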
This is related to: #3266
This PR parallelizes the following things:
On large images, this results in significant performance improvements. Performance is highly dependent on image contents, but one example is:
nvcr.io/nvidia/pytorch:24.08-py3. Using a locally downloaded tar of this image, here is a comparison:

Syft 1.21:

This PR:
For this image, notable approximate runtime improvements:
Fixes: #3683
Type of change
Checklist: