Multiprocessing on CPU slows down stanza processing · Issue #552 · stanfordnlp/stanza · GitHub
Multiprocessing on CPU slows down stanza processing #552

Closed

johann-petrak opened this issue Dec 6, 2020 · 8 comments
@johann-petrak
johann-petrak commented Dec 6, 2020

OK, I am not certain this really is a bug, but it is very weird.

I am trying to change an existing NLP pipeline from using SpaCy to using Stanza instead.
Since I want to apply this pipeline to a large number of (relatively short) texts, I am already using multiprocessing: I run k processes in a pool, each process reads the same input file, and process number j of the k processes only handles the lines whose index i satisfies i mod k = j.
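The line-partitioning scheme described above can be sketched as follows (a minimal illustration, not the attached script; `lines_for_worker` is a hypothetical helper name):

```python
def lines_for_worker(lines, j, k):
    """Worker j of k handles only the lines whose index i satisfies i % k == j."""
    return [line for i, line in enumerate(lines) if i % k == j]

# Worker 1 of 3 over ten lines gets indices 1, 4, 7:
# lines_for_worker(list(range(10)), 1, 3) -> [1, 4, 7]
```

Together the k workers cover every line exactly once, so no coordination between processes is needed beyond the shared input file.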

If I do this with SpaCy, I get a speed-up when I increase the number of processes. My server has 16 CPUs, so I see speedups up to roughly 14 processes (the exact point also depends on the relative overhead of loading k copies of the pipeline).

With Stanza, there is a small speedup with 2 processes, but beyond that the overall processing time becomes slower than with a single process. How is this even possible?

@johann-petrak
Author

This seems to be less of an issue when I run with the GPU enabled. Here are some numbers from experimenting with this on a server with 16 cores, 64G RAM, and an RTX 2080Ti GPU:

Running on the CPU on 500 documents, 5 repetitions:

  • n=1: 51.9 stdev=0.3
  • n=2: 32.0 stdev=0.3
  • n=3: 33.6 stdev=0.4
  • n=4: 46.4 stdev=0.6 (avg 95% utilization of all 16 CPUs)
  • n=5: 82.2 stdev=1.3 (avg 97% utilization of all 16 CPUs)
  • n=6: 146.4 stdev=2.3 (avg 98% utilization of all 16 CPUs)
    (no swap space used in any of the experiments)

Running on the GPU on 500 documents, 5 repetitions:

  • n=1: 27.9 stdev=0.08
  • n=2: 17.6 stdev=0.2
  • n=3: 14.9 stdev=0.07
  • n=4: 13.9 stdev=0.04
  • n=5: 13.8 stdev=0.04
  • n=6: 13.9 stdev=0.09 (getting close to 100% GPU mem at times, 6 CPUs used)
  • n=7: 14.2 stdev=0.04 (getting close to 100% GPU mem at times, 7 CPUs used)

@johann-petrak
Author

The program I used to debug this needs the German model and a file texts.txt with one text per line; each line is used as the text to create a stanza document.

debug-stanza-mp.py.zip

@johann-petrak
Author

For comparison, here are the results with SpaCy on the CPUs.
SpaCy on the CPU is about an order of magnitude faster than Stanza on the GPU, so for this comparison I used 1000 documents instead of 500:

  • n=1: 3.6 stdev=0.05
  • n=2: 2.4 stdev=0.001
  • n=3: 2.0 stdev=0.002
  • n=4: 1.8 stdev=0.002
  • n=5: 1.7 stdev=0.003
  • n=6: 1.6 stdev=0.004
  • n=7: 1.6 stdev=0.002
  • n=8: 1.6 stdev=0.001
  • n=9: 2.0 stdev=0.07

@johann-petrak
Author

The bottom line seems to be that Stanza, while a lot slower than SpaCy (but more accurate, which is why I still want to use it), already uses parallelism internally when running on the CPU, so additional multiprocessing quickly hits a dead end. Is this correct?

@johann-petrak johann-petrak changed the title Multiprocessing slows down stanza processing Multiprocessing on CPU slows down stanza processing Dec 6, 2020
@AngledLuffa
Collaborator
AngledLuffa commented Dec 6, 2020 via email

@johann-petrak
Author

I will try to have a look at the underlying torch code and how that works with multiprocessing depending on whether or not the GPU is used.

@qipeng
Collaborator
qipeng commented Dec 29, 2020

It could be that the PyTorch binaries are compiled with OpenMP, which uses CPU parallelism out of the box. That would let you hit the ceiling of your CPU's processing power before the number of explicit threads/processes matches the number of cores/hyperthreads. If so, experimenting with plain PyTorch would show a similar effect. (SpaCy has its own neural net infrastructure, so it could behave differently.)
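If OpenMP oversubscription is indeed the cause, the usual workaround is to limit each worker process to a single intra-op thread before the model library is loaded. A minimal sketch (the environment variables are standard OpenMP/MKL knobs, not Stanza-specific):

```python
import os

# Must be set before importing torch/stanza in each worker process,
# so k workers use roughly k cores in total instead of k * num_cores
# threads fighting each other for the same CPUs.
os.environ["OMP_NUM_THREADS"] = "1"
os.environ["MKL_NUM_THREADS"] = "1"

# Equivalently, after importing torch (assuming an OpenMP build):
#   import torch
#   torch.set_num_threads(1)
```

With this in place, scaling the process pool should behave more like the SpaCy numbers above, since each worker no longer spawns its own thread pool.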

@johann-petrak
Author

I think there is not much that can be done here. The more important way to tackle performance is proper streaming and batching; some of this has already been addressed, see #550.
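Independent of the thread-count issue, grouping texts into batches before handing them to the pipeline amortizes per-call overhead. The Stanza-side bulk API is discussed in #550; the grouping itself is trivial (sketched here with a hypothetical `batches` helper):

```python
def batches(items, size):
    """Yield successive chunks of at most `size` items."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

# Five texts in batches of two -> chunks of sizes 2, 2, 1.
```

Each chunk can then be passed to the pipeline in one call (or joined into a single document), instead of invoking the full pipeline once per short text.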

I guess from my POV this can be closed.

AaronWChen added a commit to AaronWChen/MeaLeon that referenced this issue Mar 15, 2024
Forced restart cleared jupyter kernel, so might as well add FlagEmbedding library to create embeddings for retrieval.

Read a few discussions on how slow Stanza is with the pipeline. One thing is that I think there is a strong benefit to using lemmas for tokens, which requires many of the processors in the pipeline. Unfortunately, that looks like the slow part of processing.

Saw this discussion on [using multiprocessing combined with Stanza](stanfordnlp/stanza#552), and it seems like multiprocessing with the GPU doesn't yield much of a performance increase but does make the GPU work much harder. However, the discussion reminded me that I turned off the GPU with Stanza in an attempt to use BERTopic with Stanza. Will re-enable and retime.