Pipeline using "genia" throws CUDA error when multiprocessed #615
Can you maybe help us skip steps 2 & 3 here? |
@AngledLuffa sorry, I don't get what you mean. If I skip steps 2 & 3, everything works. If you need an example of how to run the pipeline in parallel, see #552. |
Clearly my explanation was lacking. What I would like is something I can just download and not have to edit so I can reproduce the error. I'm sure we'll look at it soon either way, but it will be a lot faster if the initial level of work on our end to see the bug is lower.
|
I created a script to reproduce the error. The inline paste lost most of its formatting, but the skeleton was roughly:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({"id": np.arange(1000000), ...})

class tokenize_lemmatize():
    def tokenize_df(df, use_gpu):
        ...

    def tokenize_df_parallel(df):
        ...

pd.set_option('display.max_columns', 3)
print(data)
```

However, I noticed that the error is only caused if you call […] |
Sorry, the style is weird. Here is the file: stanza_mp_bug.zip |
According to this, CUDA operations with torch cannot be used with fork-based multiprocessing: https://pytorch.org/docs/master/notes/multiprocessing.html. Perhaps try the multiprocessing mechanism described there? |
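(For illustration, a minimal sketch of the spawn-based pattern those notes describe — the chunking, the function names, and the per-worker pipeline are assumptions, not code from the thread:)

```python
import torch.multiprocessing as mp
import stanza

def tokenize_chunk(texts):
    # Build the pipeline inside the worker, so CUDA is initialized in the
    # spawned process itself rather than inherited from a forked parent.
    nlp = stanza.Pipeline(lang='en', package='genia', processors='tokenize',
                          use_gpu=True, logging_level='WARN')
    return [nlp(t) for t in texts]

if __name__ == '__main__':
    chunks = [["Some biomedical text."], ["More biomedical text."]]
    with mp.get_context('spawn').Pool(processes=2) as pool:
        results = pool.map(tokenize_chunk, chunks)
```

(Note that each spawned worker loads its own copy of the model, which costs GPU memory and startup time.)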
Thanks for your suggestion. I moved the line […] and the CUDA error is gone. The bigger issue is speed: no matter whether I use the Python or the PyTorch implementation of multiprocessing, the code runs slower than on a single core. The more processes are used (at maximum core_count/2), the longer it takes (~20-70% longer). The GPU is never at its maximum and most of the CPU cores are not running, yet it is much slower, which is really frustrating. I don't know what the problem is; maybe it's getting the data onto the GPU and back... |
There's actually a new batch processing mechanism which does the batch processing faster. It's been added since the 1.2.0 release. You could try pip installing from the dev branch: 5d2d39e

What is the structure of the data you are trying to process? For example, lots of documents with a single sentence each? The bulk_process mechanism should speed it up a lot, if that's the case.
|
Ok, I'm gonna have a look at it. Our data is stored internally as a pandas dataframe, for instance: […]
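(The example dataframe was lost in the page capture; a minimal illustration of the layout being described, with assumed column names:)

```python
import pandas as pd

df = pd.DataFrame({
    "id": [0, 1, 2],
    "text": ["First document.", "Second document.", "Third document."],
})
# Goal: a new column (say df["tokenized"]) holding the tokenized text of each row.
```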
Obviously, not every row has the same text. We need to tokenize each row. The tokenized result of each text is stored in another column. |
That sounds like exactly the kind of thing the bulk_process mechanism will help with.

This should install it, I think:

`python3 -m pip install --no-cache-dir git+https://github.com/stanfordnlp/stanza.git@5d2d39ec822c65cb5f60d547357ad8b821683e3c`
|
From the description it really sounds like exactly what we are looking for. Could you give an example of how to use the bulk_process mechanism? |
Currently, if you pass in a list of Documents it should automatically do it. If not, please let us know.
|
The command you provided does not work for me: […] |
Ok, I installed the […], but I cannot get it running. If I understand correctly, if I pass a list of Documents to the Pipeline, it automatically calls the bulk_process function. So I need to create Documents from the pandas dataframe. The documentation on this does not help a lot: https://stanfordnlp.github.io/stanza/data_conversion.html. I tried […], which gives me an error in […].
|
Currently, this should work:
```
nlp([Document([], text=doccontent) for doccontent in docs])
```
|
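(For reference, a sketch of how that call composes with the dataframe illustrated earlier — the `Document` import path follows the stanza data-conversion docs, and `df` is the assumed dataframe from above:)

```python
import stanza
from stanza.models.common.doc import Document

nlp = stanza.Pipeline(lang='en', package='genia', processors='tokenize')

texts = df["text"].tolist()
# Passing a list of Documents lets the pipeline take the bulk_process path.
docs = nlp([Document([], text=t) for t in texts])
df["tokenized"] = [
    [token.text for sentence in doc.sentences for token in sentence.tokens]
    for doc in docs
]
```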
It works very well. I did a quick speed comparison, and the bulk_process mechanism is 3-4 times faster than processing the entries one after another (including the time for converting the texts to Documents and the results back to a list of texts; most of the time is spent on tokenizing, though). There seems to be a sweet spot using bulks of size ~1,000. Do you think multiprocessing the bulk_process mechanism can give additional speed improvements? Because multiprocessing the standard tokenizer makes it slower... |
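(A sketch of the batching described above — splitting the texts into bulks of ~1,000 and feeding each bulk to the pipeline as a list of Documents; `texts` and `nlp` are reused from the earlier sketch:)

```python
from stanza.models.common.doc import Document

def bulks(seq, size=1000):
    # Yield successive slices of `size` items.
    for i in range(0, len(seq), size):
        yield seq[i:i + size]

results = []
for bulk in bulks(texts, size=1000):
    # Each call hands the pipeline one bulk as a list of Documents.
    results.extend(nlp([Document([], text=t) for t in bulk]))
```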
Might be worth a try, but I can't promise anything
|
I just saw that you do not use the PyTorch DataLoader to feed the text to the model. Is there a specific reason why you use your own implementation of a data loader? (I guess it's this one: https://github.com/stanfordnlp/stanza/blob/master/stanza/models/tokenization/data.py.) PyTorch DataLoaders are much faster, as they can load data in multiple worker processes. Using a non-PyTorch data loader might be the major reason why it is so slow. |
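(For reference, a minimal sketch of the multiprocessed loading being suggested — plain PyTorch, not stanza code; the dataset wrapper and batch size are illustrative:)

```python
from torch.utils.data import DataLoader, Dataset

class TextDataset(Dataset):
    """Wraps a list of raw text strings."""
    def __init__(self, texts):
        self.texts = texts

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        return self.texts[idx]

# num_workers > 0 makes the DataLoader fetch batches in separate worker processes.
loader = DataLoader(TextDataset(texts), batch_size=32, num_workers=4)
for batch in loader:
    pass  # `batch` is a list of up to 32 strings to hand to the model
```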
As with many of these things, the answer is we haven't done it yet. It's on our list, though.

The last time I profiled the tokenizer, inference was much more expensive than data loading, although perhaps that has changed with the new bulk process operator.

If you can provide a patch, we'll definitely look, and otherwise it's on our todo list.
|
Describe the bug
Hi, for tokenizing a very large database of ~20M biomedical texts, we tried to parallelize the tokenization with GPU support and multiprocessing. The same code as in #552 was used; however, when using the "genia" package, CUDA throws a runtime error when initializing the pipeline:
`RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method`
As it works with other languages/packages, I guess this can be fixed to enable multiprocessing with GPU support using genia. Without GPU support, the code runs about 10x slower.
To Reproduce
Steps to reproduce the behavior:
`stanza.Pipeline(lang='en', package='genia', processors='tokenize', use_gpu=True, logging_level='WARN')`
`pool.apply_async(function, dataset)`
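(A sketch reconstructing the failing pattern from the two steps above — the function and data are illustrative; the default fork start method on Linux is what triggers the error:)

```python
import multiprocessing
import stanza

# Step 1: CUDA is initialized here, in the parent process...
nlp = stanza.Pipeline(lang='en', package='genia', processors='tokenize',
                      use_gpu=True, logging_level='WARN')

def function(dataset):
    return [nlp(text) for text in dataset]

if __name__ == '__main__':
    # Step 2: the default start method on Linux is fork...
    with multiprocessing.Pool(processes=2) as pool:
        result = pool.apply_async(function, (["Some biomedical text."],))
        # ...so touching CUDA in the forked child raises
        # "Cannot re-initialize CUDA in forked subprocess".
        result.get()
```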
Environment: