feat: run more things in parallel #3636
Conversation
Signed-off-by: Keith Zantow <kzantow@gmail.com>
I pulled this, built it locally, and then tested it with a few containers. Maybe I'm doing it wrong, but I don't see a positive difference. Is it only going to benefit certain use cases or container types? I used three different containers, and ran syft v1.19.0 and v1.19.0 with this patch (v1.19.0-pfh) with increasing levels of parallelism.
Is there a better test I could do? These are the images I used:
docker.io/nextcloud:latest
docker.io/opensearchproject/opensearch:latest
docker.io/pytorch/pytorch:latest
@popey -- hmm... It's probably worth comparing apples to apples -- testing this vs. …
Ok, I'll re-run with the new update, and larger degree of parallelism. What's your definition of "small" in image terms? The ones I'm currently using are this kinda size...
For some reason, I didn't look at the image names 🤦 This change is a lot less about the number of files and more about the total bytes to process -- what are the sizes in GB? Uncompressed sizes are:
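As an aside, the "degree of parallelism" being discussed here is driven by syft's configuration. A minimal sketch of what that might look like (the top-level `parallelism` key and the `.syft.yaml` location are assumptions based on syft's documented config layout; check the docs for your version):

```yaml
# .syft.yaml (sketch; key name assumed)
# Upper bound on concurrent cataloging/hashing workers.
parallelism: 8
```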
Are these too small? I went for something a little bigger:
docker.io/nextcloud:latest
docker.io/opensearchproject/opensearch:latest
docker.io/pytorch/pytorch:latest
docker.io/huggingface/transformers-all-latest-torch-nightly-gpu:latest
Hm, this is weird. I see the pfh PR has better times when using high parallelism, but I'm not quite sure why the v1.19.0 runs are faster than the pfh ones to start with!?
Are you still using the …
v1.19.0-pfh is this PR - rebuilt a couple of hours ago, after this PR was updated. Maybe poorly named, it's just this PR.
@popey right, so it does not include all the other changes on …
Maybe. I'm more looking at this from a user perspective: what will 1.20 (or whatever it's called) look like compared to 1.19?
I re-ran my tests on this PR using measure-syft. It ran against this PR and main five times each. The summary is below, and specific details from the logs are further down. Looks great!

Syft Performance Test Results

Date: 2025-02-07 15:31:21

Results

Logs snippets

Main

feat/parallelize-file-hashing
Looking forward to a faster syft 🚀
Description
This PR implements concurrent cataloging at the file level for many catalogers, and plumbs the parallelism config through to the file hasher.
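The file-level concurrency described above can be sketched as a bounded worker pool over the files to be hashed. This is a minimal illustration only: `digestAll` and its shape are hypothetical, and syft's actual implementation differs.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sync"
)

// digestAll hashes each input concurrently, allowing at most `parallelism`
// hashing operations to run at once (hypothetical helper; not syft's API).
func digestAll(inputs [][]byte, parallelism int) []string {
	results := make([]string, len(inputs))
	sem := make(chan struct{}, parallelism) // semaphore bounding concurrency
	var wg sync.WaitGroup
	for i, data := range inputs {
		wg.Add(1)
		go func(i int, data []byte) {
			defer wg.Done()
			sem <- struct{}{}        // acquire a worker slot
			defer func() { <-sem }() // release it when done
			sum := sha256.Sum256(data)
			results[i] = hex.EncodeToString(sum[:])
		}(i, data)
	}
	wg.Wait()
	return results
}

func main() {
	files := [][]byte{[]byte("a"), []byte("b"), []byte("c")}
	for _, d := range digestAll(files, 2) {
		fmt.Println(d)
	}
}
```

Because hashing is CPU- and I/O-bound per file, the win here scales with total bytes processed rather than file count, which matches the discussion above about large images benefiting most.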
This is related to: #3266
This PR parallelizes the following things:
On large images, this results in significant performance improvements. Performance is highly dependent on image contents, but one example is:
nvcr.io/nvidia/pytorch:24.08-py3. Using a locally downloaded tar of this image, here is a comparison:

Syft 1.21:

This PR:
For this image, notable approximate runtime improvements:
Fixes: #3683
Type of change
Checklist: