Multiprocessing on CPU slows down stanza processing · Issue #552 · stanfordnlp/stanza · GitHub
Multiprocessing on CPU slows down stanza processing #552

Closed

johann-petrak opened this issue Dec 6, 2020 · 8 comments
@johann-petrak
johann-petrak commented Dec 6, 2020

OK, I am not certain this really is a bug, but it is very weird.

I am trying to change an existing NLP pipeline from using SpaCy to using Stanza instead.
Since I want to apply this pipeline to a large number of (relatively short) texts, I am already using multiprocessing: I run k processes in a pool, each process reads the same input file, and process number j of the k processes only handles the lines whose index i satisfies i mod k = j.
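The line-partitioning scheme described above can be sketched as follows (a minimal illustration, not the attached script; `lines_for_worker` is a hypothetical helper name):

```python
def lines_for_worker(lines, j, k):
    """Worker j of k handles only the lines whose index i satisfies i % k == j."""
    return [line for i, line in enumerate(lines) if i % k == j]

# Worker 1 of 3 over ten lines gets indices 1, 4, 7:
# lines_for_worker(list(range(10)), 1, 3) -> [1, 4, 7]
```

Together the k workers cover every line exactly once, so no coordination between processes is needed beyond the shared input file.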

If I do this with SpaCy, I get a speed-up when I increase the number of processes. My server has 16 CPUs, so I see speedups up to roughly 14 processes (the exact point also depends on the relative overhead of loading k copies of the pipeline).

With Stanza, there is a small speedup with 2 processes, but beyond that the overall processing time becomes slower than with a single process. How is this even possible?

@johann-petrak
Author

This seems to be less of an issue when I run with the GPU enabled. Here are some numbers from experimenting with this on a server with 16 cores, 64G RAM, and an RTX 2080Ti GPU:

Running on the CPU on 500 documents, 5 repetitions:

  • n=1: 51.9 stdev=0.3
  • n=2: 32.0 stdev=0.3
  • n=3: 33.6 stdev=0.4
  • n=4: 46.4 stdev=0.6 (avg 95% utilization of all 16 CPUs)
  • n=5: 82.2 stdev=1.3 (avg 97% utilization of all 16 CPUs)
  • n=6: 146.4 stdev=2.3 (avg 98% utilization of all 16 CPUs)
    (no swap space used in any of the experiments)

Running on the GPU on 500 documents, 5 repetitions:

  • n=1: 27.9 stdev=0.08
  • n=2: 17.6 stdev=0.2
  • n=3: 14.9 stdev=0.07
  • n=4: 13.9 stdev=0.04
  • n=5: 13.8 stdev=0.04
  • n=6: 13.9 stdev=0.09 (getting close to 100% GPU mem at times, 6 CPUs used)
  • n=7: 14.2 stdev=0.04 (getting close to 100% GPU mem at times, 7 CPUs used)

@johann-petrak
Author

The program I used to debug this needs the German model and a file texts.txt with one text per line; each line is used as the text to create a stanza document.

debug-stanza-mp.py.zip

@johann-petrak
Author

For comparison, here are the results with SpaCy on the CPUs.
SpaCy on the CPU is about an order of magnitude faster than Stanza on the GPU, so for this comparison I used 1000 documents instead of 500:

  • n=1: 3.6 stdev=0.05
  • n=2: 2.4 stdev=0.001
  • n=3: 2.0 stdev=0.002
  • n=4: 1.8 stdev=0.002
  • n=5: 1.7 stdev=0.003
  • n=6: 1.6 stdev=0.004
  • n=7: 1.6 stdev=0.002
  • n=8: 1.6 stdev=0.001
  • n=9: 2.0 stdev=0.07

@johann-petrak
Author

The bottom line seems to be that Stanza, while a lot slower than SpaCy (but more accurate, which is why I still want to use it), already uses parallelism internally when running on the CPU, so additional multiprocessing quickly hits a dead end. Is this correct?

@johann-petrak johann-petrak changed the title Multiprocessing slows down stanza processing Multiprocessing on CPU slows down stanza processing Dec 6, 2020
@AngledLuffa
Collaborator
AngledLuffa commented Dec 6, 2020 via email

@johann-petrak
Author

I will try to have a look at the underlying torch code and how that works with multiprocessing depending on whether or not the GPU is used.

@qipeng
Collaborator
qipeng commented Dec 29, 2020

It could be that the PyTorch binaries are compiled with OpenMP, which uses CPU parallelism out of the box. That would let you hit the ceiling of your CPU's processing power before the number of explicit threads/processes matches the number of cores/hyperthreads. If so, experimenting with plain PyTorch would show a similar effect. (SpaCy has its own neural net infrastructure, so it could behave differently.)
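If OpenMP oversubscription is indeed the cause, the usual workaround is to limit each worker process to a single intra-op thread before the model library is loaded. A minimal sketch (the environment variables are standard OpenMP/MKL knobs, not Stanza-specific):

```python
import os

# Must be set before importing torch/stanza in each worker process,
# so k workers use roughly k cores in total instead of k * num_cores
# threads fighting each other for the same CPUs.
os.environ["OMP_NUM_THREADS"] = "1"
os.environ["MKL_NUM_THREADS"] = "1"

# Equivalently, after importing torch (assuming an OpenMP build):
#   import torch
#   torch.set_num_threads(1)
```

With this in place, scaling the process pool should behave more like the SpaCy numbers above, since each worker no longer spawns its own thread pool.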

@johann-petrak
Author

I think there is not much that can be done here. The more important way to tackle performance is proper streaming and batching; some of this has already been addressed, see #550.
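Independent of the thread-count issue, grouping texts into batches before handing them to the pipeline amortizes per-call overhead. The Stanza-side bulk API is discussed in #550; the grouping itself is trivial (sketched here with a hypothetical `batches` helper):

```python
def batches(items, size):
    """Yield successive chunks of at most `size` items."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

# Five texts in batches of two -> chunks of sizes 2, 2, 1.
```

Each chunk can then be passed to the pipeline in one call (or joined into a single document), instead of invoking the full pipeline once per short text.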

I guess from my POV this can be closed.

AaronWChen added a commit to AaronWChen/MeaLeon that referenced this issue Mar 15, 2024
Forced restart cleared jupyter kernel, so might as well add FlagEmbedding library to create embeddings for retrieval.

Read a few discussions on how slow Stanza is with the pipeline. One thing is that I think there is a strong benefit to using lemmas for tokens, which requires many of the processors in the pipeline. Unfortunately, that looks like the slow part of processing.

Saw this discussion on [using multiprocessing combined with Stanza](stanfordnlp/stanza#552), and it seems like multiprocessing with the GPU doesn't yield much of a performance increase but does make the GPU work much harder. However, the discussion reminded me that I turned off the GPU with Stanza in an attempt to use BERTopic with Stanza. Will re-enable and retime.