8000 Parallel import scanning (python) by Peter554 · Pull Request #198 · seddonym/grimp · GitHub

Parallel import scanning (python) #198


Merged · 5 commits merged into seddonym:master on Apr 8, 2025

Conversation

@Peter554 (Contributor) commented on Apr 4, 2025:

This PR uses CPU parallelism to accelerate import scanning. For large codebases this should make building the graph much faster (on machines with multiple CPU cores).

On my machine (8 cores):

  • Before: Building graph for Kraken without cache took 46s.
  • After: Building graph for Kraken without cache takes 14s.

We hope to move more and more of this graph building code into rust, so this change may not live very long. From my perspective the change can still be valuable though:

  • It's a small change, so little effort.
  • It allows for a fairer comparison when moving this code to rust (is the speed up we later achieve simply due to parallelism, or something else more rust specific?).

A previous iteration by @duzumaki is here: #142. Compared to that PR:

  • Benchmarks suggest building without the cache is much faster, but building with the cache is a little slower #198 (comment)

@Peter554 Peter554 force-pushed the parallel-import-scanning branch 2 times, most recently from 2c2fa5d to 2352163 Compare April 4, 2025 08:18
codspeed-hq bot commented on Apr 4, 2025:

CodSpeed Instrumentation Performance Report

Merging #198 will improve performance by up to ×43

Comparing Peter554:parallel-import-scanning (ab2b66a) with master (23a1e85)

Summary

⚡ 3 improvements
✅ 19 untouched benchmarks

Benchmarks breakdown

Benchmark BASE HEAD Change
test_build_django_from_cache_a_few_misses[15] 375.6 ms 173 ms ×2.2
test_build_django_from_cache_a_few_misses[350] 5,790.1 ms 295.7 ms ×20
test_build_django_uncached 6,147.9 ms 143.4 ms ×43

@Peter554 Peter554 force-pushed the parallel-import-scanning branch from 2352163 to bfc829a Compare April 4, 2025 08:52
@Peter554 Peter554 marked this pull request as ready for review April 4, 2025 09:09
@Peter554 Peter554 force-pushed the parallel-import-scanning branch from bfc829a to ee3f43c Compare April 4, 2025 09:10
Comment on lines 65 to 68
    def __reduce__(self):
        return SourceSyntaxError, (self.filename, self.lineno, self.text)
seddonym (Owner):
Probably worth adding a comment to explain why this is necessary.
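For context, here is a minimal sketch (not Grimp's actual class; the field names are assumed from the snippet above) of why a custom `__reduce__` is needed when an exception must cross process boundaries. Without it, the default exception pickling calls the class with `self.args`, which does not match the constructor signature:

```python
import pickle


class SourceSyntaxError(Exception):
    """Sketch of an exception with extra constructor arguments."""

    def __init__(self, filename: str, lineno: int, text: str) -> None:
        # Only a single message reaches Exception.__init__, so self.args
        # does not contain enough information for the default exception
        # pickling to rebuild the instance.
        super().__init__(f"Syntax error in {filename}, line {lineno}: {text}")
        self.filename = filename
        self.lineno = lineno
        self.text = text

    def __reduce__(self):
        # Tell pickle to call the class with the original constructor
        # arguments, so the error survives the round trip between a
        # worker process and the parent.
        return SourceSyntaxError, (self.filename, self.lineno, self.text)


err = pickle.loads(pickle.dumps(SourceSyntaxError("pkg/foo.py", 3, "import )")))
```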

@Peter554 (Contributor) commented on Apr 4, 2025:
[Screenshot: CodSpeed benchmark results, 2025-04-04]

@seddonym This doesn't seem believable ☝️ - I think it may have something to do with https://docs.codspeed.io/instruments/cpu/#system-calls

Benchmarks on my machine look more believable:

------- benchmark 'test_build_django_from_cache': 2 tests -------
Name (time in ms)                                  Mean
-----------------------------------------------------------------
test_build_django_from_cache (master)     36.0710 (1.0)
test_build_django_from_cache (NOW)              72.1804 (2.00)
-----------------------------------------------------------------

-------- benchmark 'test_build_django_uncached': 2 tests ---------
Name (time in ms)                                   Mean
------------------------------------------------------------------
test_build_django_uncached (NOW)                251.8374 (1.0)
test_build_django_uncached (master)     1,064.6040 (4.23)
------------------------------------------------------------------

@seddonym (Owner) left a comment:

This is fantastic, thank you! A real game changer.

I would like to address the degraded performance when the cache is warm - but it looks like that is really easily done. I'll open a separate PR to benchmark using the cache for small numbers of changes, and we can rebase off this to check there isn't degraded performance there too.

I'm not yet sold on introducing joblib - would prefer not to introduce third party dependencies unless there's a really strong reason. Could we try this using the standard library instead?
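For illustration, a hypothetical standard-library-only sketch of the fan-out (the helper names and the toy `scan_chunk` payload here are invented, not the PR's code), using `concurrent.futures`:

```python
from concurrent.futures import ProcessPoolExecutor


def scan_chunk(chunk: list[str]) -> dict[str, str]:
    # Stand-in for scanning a chunk of module files for their imports.
    return {module: f"imports of {module}" for module in chunk}


def scan_in_parallel(chunks: list[list[str]], max_workers: int) -> dict[str, str]:
    results: dict[str, str] = {}
    # executor.map pickles each chunk, runs scan_chunk in a worker
    # process, and yields the per-chunk results in order.
    with ProcessPoolExecutor(max_workers=max_workers) as executor:
        for chunk_result in executor.map(scan_chunk, chunks):
            results.update(chunk_result)
    return results


if __name__ == "__main__":
    print(scan_in_parallel([["a", "b"], ["c"]], max_workers=2))
```

The trade-off discussed later in the thread is that `ProcessPoolExecutor`/`multiprocessing` worked but benchmarked slower than joblib for this workload.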


for found_package in found_packages:
    for module_file in found_package.module_files:
        module = module_file.module
modules_files_to_scan = {
seddonym (Owner):

Should this be module_files_to_scan?

@@ -184,3 +203,21 @@ def _is_external(module: Module, found_packages: Set[FoundPackage]) -> bool:
module.is_descendant_of(package_module) or module == package_module
for package_module in package_modules
)


def _create_chunks(module_files: List[ModuleFile], *, n_chunks: int) -> List[List[ModuleFile]]:
seddonym (Owner):

🐼 module_files could be Collection[ModuleFile] for more flexibility.

seddonym (Owner), also on _create_chunks:
🐼 Always sceptical when I see a list 😉 .

Maybe tuple[tuple[ModuleFile], ...] would be a slightly better return type as it's immutable?

Also considered frozenset[frozenset[ModuleFile]] but I wonder if hashing might incur a penalty here.

def _scan_imports(
    import_scanner: AbstractImportScanner,
    exclude_type_checking_imports: bool,
    module_files: List[ModuleFile],
seddonym (Owner):
Could be Iterable[ModuleFile] for greater flexibility.

for module_file in modules_files_to_scan.difference(imports_by_module_file):
    imports_by_module_file[module_file] = import_scanner.scan_for_imports(
        module_file.module, exclude_type_checking_imports=exclude_type_checking_imports
remaining_modules_files_to_scan = list(
seddonym (Owner):
I think the _scan_packages function has tipped over into becoming too long now - would you mind breaking it up a bit?

chunked_remaining_modules_files_to_scan = _create_chunks(
    remaining_modules_files_to_scan, n_chunks=N_CPUS
)
with parallel_config(n_jobs=N_CPUS):
seddonym (Owner):
When the cache is full, remaining_modules_files_to_scan is empty - if that's the case it's a waste of time to spin up these processes (and it noticeably degrades performance). At the very least we should wrap this in if remaining_modules_files_to_scan. I tried that locally and it recovered the degraded performance for a fully populated cache.

Also, n_jobs should be min(N_CPUS, len(remaining_module_files_to_scan)), right? I tried that locally and it improves the situation if I've only changed one file. It would be good to add a benchmark for that scenario as I don't think we currently measure it.

I'm not sure what the threshold is at which multiprocessing becomes worthwhile, but we could possibly improve on the logic a little by having a non-parallel track for small numbers of changes too. That said, I tried the approach of spinning up only one extra process for one changed module and it wasn't noticeably slower, so it's possibly not worth it.

@Peter554 Peter554 force-pushed the parallel-import-scanning branch from ee3f43c to 3f6fdc3 Compare April 4, 2025 16:30
@seddonym (Owner) commented on Apr 4, 2025:

Oh - just remembered one thing, would you mind adding a line to the CHANGELOG too?

@Peter554 Peter554 force-pushed the parallel-import-scanning branch 3 times, most recently from e6de21e to f16dff9 Compare April 4, 2025 19:27
@Peter554 (Contributor) left a comment:

@seddonym I've added a pure multiprocessing version as a fixup now. It does seem to be a bit slower.

  • When running tests locally I notice the difference in the overall time of the test suite.
  • When running benchmarks I see that it's quite a bit slower than joblib. Benchmarks on my machine:
------- benchmark 'test_build_django_from_cache':  --------------
Name (time in ms)                                         Mean
-----------------------------------------------------------------
test_build_django_from_cache (joblib)                     37.9565
test_build_django_from_cache (master)                     38.1530
test_build_django_from_cache (multiprocessing)            39.2882
-----------------------------------------------------------------

-------- benchmark 'test_build_django_uncached':  ----------------
Name (time in ms)                                         Mean
------------------------------------------------------------------
test_build_django_uncached (joblib)                       250.2343
test_build_django_uncached (multiprocessing)              421.2143
test_build_django_uncached (master)                     1,066.4864
------------------------------------------------------------------

I think it's likely okay to use joblib here:

  • joblib has 4000 stars on GitHub, so isn't going anywhere soon.
  • joblib has no dependencies beyond Python and emphasises robustness over features (see "design choices": https://joblib.readthedocs.io/en/stable/why.html#design-choices).
  • Worst case, if the dependency breaks we can simply remove it again and things get a bit slower.

So I'd suggest I remove that fixup commit and we go with the joblib version.

What do you think?

@Peter554 Peter554 requested a review from seddonym April 4, 2025 19:44
@Peter554 Peter554 force-pushed the parallel-import-scanning branch 6 times, most recently from 64c9f7d to 49ef093 Compare April 7, 2025 12:45
Peter554 added 3 commits April 8, 2025 13:36
This is needed to ensure that the error can be sent between processes.

This ensures that the test `test_syntax_error_includes_module` still passes
after using multiprocessing.
This is helpful, to avoid having to pass the large cache object
to the multiprocessing code.
@Peter554 Peter554 force-pushed the parallel-import-scanning branch 2 times, most recently from 613f0cd to 813881d Compare April 8, 2025 11:43
@Peter554 Peter554 force-pushed the parallel-import-scanning branch from 813881d to ab2b66a Compare April 8, 2025 14:31
@seddonym (Owner) left a comment:

Great stuff, thanks so much for this.

As discussed elsewhere, we should probably add an optimization for small numbers of modules where it's not worth multiprocessing. That will speed up the functional tests, if nothing else.

@@ -186,3 +182,38 @@ def _is_external(module: Module, found_packages: Set[FoundPackage]) -> bool:
module.is_descendant_of(package_module) or module == package_module
for package_module in package_modules
)


def _read_imports_from_cache(
seddonym (Owner):

Thanks for breaking this up, much easier to read.

seddonym (Owner):

🤔 I wonder what the overhead of this is. Probably tiny, but it might make sense to calculate it at runtime, only if we need it.

Maybe worth tweaking next time we're in this file.

@seddonym seddonym merged commit 30e36bc into seddonym:master Apr 8, 2025
18 checks passed
@@ -14,6 +15,8 @@
from ..domain.valueobjects import DirectImport, Module
from .config import settings

N_CPUS = multiprocessing.cpu_count()
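If the import-time `cpu_count()` call above ever proved costly, one way to defer it (a sketch, not the merged code) would be a cached function:

```python
import multiprocessing
from functools import cache


@cache
def n_cpus() -> int:
    # Computed on first call rather than at import time; cached so
    # repeated calls don't re-query the OS.
    return multiprocessing.cpu_count()
```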