Multiprocessing on CPU slows down stanza processing #552
OK, I am not certain this really is a bug, but it is very weird.

I am trying to switch an existing NLP pipeline from SpaCy to Stanza. Since I want to apply this pipeline to a large number of (relatively short) texts, I am already using multiprocessing: I run k processes in a pool, each process reads the same input file, and process number j of k only handles the lines whose index modulo k equals j.

If I do this with SpaCy, I get a speed-up as I increase the number of processes. My server has 16 CPUs, so I get a speedup up to about 14 processes (the exact point also depends on the relative overhead of loading k copies of the pipeline).

With Stanza, there is a small speedup with 2 processes; beyond that, overall processing time gets slower than with a single process. How is this even possible?
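For concreteness, here is a minimal sketch of this sharding scheme (a reconstruction, not the exact script from this thread; it assumes the German model is downloaded and a file texts.txt with one text per line, as mentioned in the comments below):

```python
# Sketch: k worker processes, each with its own Stanza pipeline; worker j
# handles only the lines whose index is congruent to j modulo k.
import time
from multiprocessing import Pool

import stanza

K = 4  # number of worker processes

def worker(j):
    # Each process loads its own copy of the pipeline.
    nlp = stanza.Pipeline(lang="de", use_gpu=False, verbose=False)
    n_tokens = 0
    with open("texts.txt", encoding="utf-8") as f:
        for i, line in enumerate(f):
            if i % K != j:  # this process only takes its own shard
                continue
            doc = nlp(line.strip())
            n_tokens += doc.num_tokens
    return n_tokens

if __name__ == "__main__":
    start = time.time()
    with Pool(K) as pool:
        counts = pool.map(worker, range(K))
    print(f"{sum(counts)} tokens in {time.time() - start:.1f}s with {K} processes")
```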
Comments
This seems to be less of an issue when I run with the GPU enabled. Here are some numbers from experimenting with this on a server with 16 cores, 64 GB RAM, and an RTX 2080 Ti GPU.

Running on the CPU on 500 documents, 5 repetitions:

Running on the GPU on 500 documents, 5 repetitions:
The program I used to debug this needs the German model and a file texts.txt with one text per line; each line is used as the text of a Stanza document.
For comparison, here is the result with SpaCy on the CPUs.

The bottom line seems to be that Stanza, apart from being a lot slower than SpaCy (but better, which is why I still want to use it), already seems to use multiprocessing internally when running on the CPU, so additional multiprocessing quickly hits a dead end. Is this correct?
I can't think of anything we did that would make multiprocessing less useful, but torch uses multiple threads when operations are done on the CPU. That probably explains the difference. Of course, it also suggests that when on the GPU, there is a limit to how much benefit you'll get from multiprocessing before the GPU is hot enough to cook an egg.

FWIW, the next version (hopefully by the end of the year) will have some performance improvements, although not a factor of 10x, I'm afraid.
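If torch's CPU threading is indeed the cause, a common mitigation (my suggestion, not something tested in this thread) is to cap each worker at a single intra-op thread so that k workers don't oversubscribe the 16 cores:

```python
# Sketch: limit intra-op parallelism per worker so k processes don't
# oversubscribe the cores. torch.set_num_threads() is PyTorch's
# documented knob for this.
import os

# OpenMP reads this when its runtime initializes, so set it before
# importing torch in each worker process.
os.environ.setdefault("OMP_NUM_THREADS", "1")

import torch

def init_worker():
    torch.set_num_threads(1)  # one intra-op thread per process

# Usage with the pool from the sketch above:
#   with Pool(K, initializer=init_worker) as pool: ...
```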
I will try to have a look at the underlying torch code and how it works with multiprocessing depending on whether or not the GPU is used.
It could be that the PyTorch binaries are compiled with OpenMP, which uses CPU parallelism out of the box; that would let you hit the ceiling of your CPU's processing power before the number of explicit threads/processes matches the number of cores/hyperthreads. If so, experimenting with plain PyTorch would show a similar effect. (SpaCy has its own neural net infrastructure, so it could behave differently.)
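You can check how a given PyTorch build was compiled and how many threads it will use; on reasonably recent versions:

```python
import torch

print(torch.get_num_threads())            # current intra-op thread count
print(torch.__config__.parallel_info())   # OpenMP/MKL build and thread settings
```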
I think there is not much that can be done here. The more important way to tackle performance is proper streaming and batching. Some of that has been addressed already; see #550. From my POV this can be closed.
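For reference, the batching direction looks roughly like this in newer Stanza releases, which accept a list of Document objects so the pipeline can batch across documents (check that your installed version supports it):

```python
import stanza

nlp = stanza.Pipeline(lang="de", use_gpu=False, verbose=False)
texts = ["Das ist ein Test.", "Noch ein kurzer Text."]

# Wrapping raw strings in Document objects lets one pipeline call
# process all documents, amortizing per-call overhead.
in_docs = [stanza.Document([], text=t) for t in texts]
out_docs = nlp(in_docs)
print([doc.num_tokens for doc in out_docs])
```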
Forced restart cleared the Jupyter kernel, so might as well add the FlagEmbedding library to create embeddings for retrieval. Read a few discussions on how slow Stanza's pipeline is. One thing is that I think there is a strong benefit to using lemmas for tokens, which requires many of the pipeline's processors; unfortunately, that looks like the slow part of processing. Saw this discussion on [using multiprocessing combined with Stanza](stanfordnlp/stanza#552), and it seems that multiprocessing with the GPU doesn't yield much of a performance increase but does make the GPU work much harder. However, the discussion reminded me that I had turned off the GPU for Stanza in an attempt to use BERTopic with Stanza. Will re-enable and re-time.